BAYESIAN METHODOLOGIES FOR GENOMIC DATA WITH MISSING COVARIATES
By
ZHEN LI
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2008
© 2008 Zhen Li
To my great parents: Ping Li and Xuemei Tang and my beloved husband, Jian
ACKNOWLEDGMENTS
First of all, I want to express my deepest gratitude to Dr. George Casella, not only
for his patient guidance on numerous academic problems, his time, and the pearls of
wisdom he shared with me, but also for supporting me, encouraging me and inspiring me
always to be better. I also want to thank Dr. Hani Doss, Dr. John Davis, Dr. Gary Peter
and Dr. Rongling Wu for serving on my committee.
I would like to thank my parents, Ping Li and Xuemei Tang, for their endless love,
constant emotional support and for their belief in me. I thank my sister for always being
there for me and loving me.
I could never thank my husband Jian enough for his love and his emotional support.
He has always kept me calm and at peace during those days; without that, I could not
have finished this journey.
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
CHAPTER
1 LITERATURE REVIEW AND PROJECT INTRODUCTION . . . . . . . . . . 11
1.1 Introduction to the Project . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.2 Introduction to Missing Data . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2.1 Patterns of Missing Data . . . . . . . . . . . . . . . . . . . . . . 14
1.2.2 Mechanism of Missing Data . . . . . . . . . . . . . . . . . . . . . 17
1.3 Existing Methods for Missing Data . . . . . . . . . . . . . . . . . . . . . 19
1.3.1 Maximum Likelihood . . . . . . . . . . . . . . . . . . . . . . . . 21
1.3.2 EM Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.3.3 Multiple Imputation . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.4 Bayesian Variable Selection . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.4.1 Semiautomatic Bayesian Variable Selection Method . . . . . . . . 27
1.4.2 Automatic Bayesian Variable Selection Method . . . . . . . . . . 28
1.4.3 Stochastic Search Algorithm . . . . . . . . . . . . . . . . . . . . 30
1.5 Outline of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . 31

2 A HIERARCHICAL BAYESIAN MODEL FOR GENOME-WIDE ASSOCIATION STUDIES OF SNPS WITH MISSING VALUES . . . . . . . . . . . . . 33

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2 Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.2.1 The Model without Ramet Random Effect . . . . . . . . . . . . . 39
2.2.2 The Model with Ramet Random Effect . . . . . . . . . . . . . . 45
2.2.3 Increasing Computation Speed . . . . . . . . . . . . . . . . . . . 50
2.3 Results for Simulated Data . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.4 Results for Loblolly Pine . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.5 Quantifying the Covariance and Variance . . . . . . . . . . . . . . . . . 57
2.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

3 BAYESIAN VARIABLE SELECTION FOR GENOMIC DATA WITH MISSING COVARIATES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.2 Bridge Sampling Extension . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.2.1 General Formula . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.2.2 How to Choose g(β, γ, Zmis, σ2, φ2)? . . . . . . . . . . . . . . . . 92
3.2.3 Comparison with the Simplest Model . . . . . . . . . . . . . . . 94
3.2.4 Marginal Likelihood for mδ(Y ) . . . . . . . . . . . . . . . . . . . 98
3.3 Markov Chain Monte Carlo Property . . . . . . . . . . . . . . . . . . . . 101
3.3.1 Candidate Distribution . . . . . . . . . . . . . . . . . . . . . . . 103
3.3.2 Convergence of Bayes Factors . . . . . . . . . . . . . . . . . . . 104
3.3.3 Ergodicity Property of This M-H Chain . . . . . . . . . . . . . . 105
3.3.3.1 Fixed n, uniformly ergodic convergence to the distribution B(n) . . . . . . . . . . . . . . . . . . . . . . . . . 106
3.3.3.2 Ergodic convergence to B . . . . . . . . . . . . . . . . . 108
3.4 Computation Speed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
3.4.1 Matrix Inversion . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
3.4.1.1 Two columns of parameters for one column of SNPs . . 110
3.4.1.2 Determinant calculation . . . . . . . . . . . . . . . . . . 118
3.4.2 Replace Z with an Average . . . . . . . . . . . . . . . . . . . . . 122
3.5 Simulation and Real Data Analysis . . . . . . . . . . . . . . . . . . . . . 128
3.5.1 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
3.5.2 Real Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 129

4 SUMMARY AND FUTURE WORK . . . . . . . . . . . . . . . . . . . . . . . 132

4.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
4.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
APPENDIX
A ERGODICITY OF GIBBS SAMPLING WHEN UPDATING Z MATRIX BY COLUMNS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

B AN ALGORITHM FOR CALCULATING THE NUMERATOR RELATIONSHIP MATRIX R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
LIST OF TABLES
Table page
1-1 Illustration of univariate non-response . . . . . . . . . . . . . . . . . . . . . . . 14
1-2 Illustration of multivariate missing . . . . . . . . . . . . . . . . . . . . . . . . 15
1-3 Illustration of monotone missing . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1-4 Illustration of general missing data pattern . . . . . . . . . . . . . . . . . . . . 16
1-5 File matching missing data pattern . . . . . . . . . . . . . . . . . . . . . . . . 16
2-1 Illustration of combinations of missing data for one observation. . . . . . . . . . 37
2-2 The percentages of SNP categories in the generated data sets. . . . . . . . . . . 50
2-3 The percentages of correctly imputed SNPs for different probabilities of SNP categories, with 10% missing values. . . . . . . . . . . . . . . . . . . . . . . . 52

2-4 The estimated means of family effects for different data sets with different percentages of missing values. This methodology gives accurate estimates as the percentage of missing values goes up to 20%. . . . . . . . . . . . . . . . . . . 52

2-5 The estimated means of SNP effects for the data set without missing values and for data sets with different percentages of missing values. . . . . . . . . . 53
3-1 Comparison of time spent on inverse calculation using standard software and Miller's method with 2 columns and 2 rows updated. . . . . . . . . . . . . . . 115
3-2 Comparison of time spent in calculation of matrix inversion for different methods. 118
3-3 Records of subset indicators with actual values of γ for Table 3-4 and Table 3-5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

3-4 Bayes factor calculation approximation. The first 20000 iterations are used as burn-in and the following 400 samples are taken for the calculation. . . . . . . 127

3-5 Bayes factor calculation comparisons. The first 40000 iterations are used as burn-in and the following 400 samples are taken for the calculation. . . . . . . 130

3-6 Simulation results of Bayesian variable selection for 15 SNPs and 450 observations, with 10% random missing values, using the Bayes factor estimation formula. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

3-7 Bayes variable selection for the lesion length data set, using the average of the imputed missing SNPs as if observed; 20000 burn-in steps and another 20000 steps as samples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
LIST OF FIGURES
Figure page
2-1 The first trace plot is for the first two family effect parameters for the lesion length data. The second plot is for one of the SNP parameters for the carbon isotope data from Palatka, Florida. The samples are taken after an initial 40000 steps of burn-in. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

2-2 95% confidence intervals for the additive and dominant SNP effects of the 44 SNPs for lesion length. The 22nd SNP has a significant dominant effect with 95% confidence, while the other SNPs are not significant. Other SNPs, such as the 2nd and the 23rd, are approximately significant with 95% confidence. These are good candidates for further biological exploration. . . . . . . . . . . 58

2-3 95% confidence intervals for the additive and dominant SNP effects of the 44 SNPs for the carbon isotope data from Palatka, Florida. The 16th SNP has a significant dominant effect with 95% confidence, while the other SNPs are not significant. Other SNPs, such as the 6th and the 40th, are approximately significant with 95% confidence. These are good candidates for further biological exploration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

2-4 95% confidence intervals for the additive and dominant SNP effects of the 44 SNPs for the carbon isotope data from Cuthbert, Georgia. The 8th, 35th and 36th SNPs have significant dominant effects with 95% confidence, while the other SNPs are not significant. Other SNPs, such as the 13th and the 28th, are approximately significant with 95% confidence. These are good candidates for further biological exploration. . . . . . . . . . . . . . . . . . . . . . . . . . 60
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
BAYESIAN METHODOLOGIES FOR GENOMIC DATA WITH MISSING COVARIATES
By
Zhen Li
December 2008
Chair: George Casella
Major: Statistics
With advancing technology, large single nucleotide polymorphism (SNP) datasets
are easily available. For the ADEPT 2 project, we have candidate SNPs and interesting
phenotypic trait values available, while about 10% of the SNPs are missing.
Standard software packages cannot deal adequately with missing SNP data. For
example, SAS either uses an available case analysis (which employs all the complete cases
for the inference of target parameters) or the procedure MI (or MIANALYZE) where SAS
assumes multivariate normal distributions for all the variables. Some software deletes
the incomplete observations, which is generally unacceptable for datasets with many
SNPs, because it can give biased estimates, or possibly delete all the data. More recently,
single SNP association, linkage disequilibrium based imputation, and haplotype based
imputation have been proposed.
I describe a Bayesian hierarchical model to explain the SNP effects on the phenotypic
traits, incorporating family structure information for the observations. For this
association test, information on the degree of linkage disequilibrium is not required,
and missing SNPs are imputed based on all the available information. We use a Gibbs
sampler to estimate the parameters and prove that updating one SNP at each iteration
still preserves the ergodicity of the Markov chain, while at the same time improving
computation speed.
We also ran a stochastic search algorithm to search for good subsets of variables,
or SNPs. The Bayes factor is used as a model comparison criterion, and a new Bayes
factor approximation formula is proposed. A hybrid Metropolis-Hastings algorithm was
used to search for good models in the model space and is proven to have the ergodic
convergence property. To improve computation speed, a matrix identity was first
applied to avoid direct calculation of matrix inversions and determinants; we then
replaced the imputed missing SNPs with the average of the imputed SNPs, which
substantially increased the computation speed.
CHAPTER 1
LITERATURE REVIEW AND PROJECT INTRODUCTION
1.1 Introduction to the Project
Loblolly pine is an economically and ecologically important tree species in the United
States. It grows in 14 states in the US, from New Jersey in the north to Florida in the
south and Texas in the west. Its annual harvest value is approximately 19 billion dollars,
see McKeever and Howard (1996). Pine species in the southern states produce 58% of
the timber in the US and 15.8% of the world's timber, see Wear and Greis (2002).
Scientists are interested in discovering the relationship between the phenotypic traits
and the complex biological roles and functions of genes in loblolly pine, which could help
to explain the evolution of adaptive traits for other land plants. Right now, researchers
from several universities in the ADEPT 2 project are collaborating to identify alleles
which control wood properties and disease resistance. Another goal of the ADEPT 2
project is to associate allelic variation with phenotypic variation for some subset of genes.
More specifically, there are 4 objectives: 1) to identify 5000 target candidate genes for
wood property and disease resistance traits; 2) to discover alleles for 5000 candidate
genes using a high throughput resequencing pipeline; 3) to estimate the extent of linkage
disequilibrium in regions of a small number of candidate genes; 4) to detect and verify
associations between SNPs in ∼ 1000 candidate genes and a suite of wood property
phenotypes and disease resistance phenotypes.
In the previous ADEPT project, SNP (single nucleotide polymorphism) discovery
was done for 50 genes involved in resistance to disease and response to water deficit. A
single nucleotide polymorphism is a DNA sequence variation occurring when a nucleotide
(A, T, C, or G) differs among individuals of the same species at the same locus. SNPs
may occur in coding or non-coding genes, so researchers are trying to detect, from a
large number of potential SNPs, those which have a significant influence on the
quantitative traits.
From the ADEPT project, there were 61 loblolly pine families from a circular design
with some off-diagonal crossings. The 32 parents for these families were from the
University of Florida and North Carolina State University. Each family contains a
certain number of clones, each clone has a maximum of 5 ramets, and the numbers of
clones in the families are not necessarily equal. The details of the experiment can be
found in Kayihan et al. (2005). Following the terminology used in this project, two
ramets within a clone share exactly the same genetic information, while two clones
within a family only share the same parents and can be considered siblings.
Loblolly pine is the host of two fungal pathogens, which cause fusiform rust disease
and pitch canker disease. Fusiform rust produces spindle-shaped swellings on the
branches and stems of loblolly pine trees. It forms stem galls, which lead to short
survival times, poor wood properties and slow growth; this disease causes losses of
hundreds of millions of dollars in the southern United States. Pitch canker is also a very
important disease: it is caused by a fungus, produces resinous lesions, and leads to
seedling mortality, decreased growth rates and crown dieback.
The large pitch canker data were recorded at the USDA Forest Service Resistance
Screening Center in Bent Creek, North Carolina, and the smaller pitch canker screen was
conducted at the University of Florida. Both ten-gall and one-gall rust screens were
inoculated in North Carolina with different spore densities (spores/ml). This data set is
called CCLONES and will be referred to by that name below. We will later have another
data set from NCSU, which consists of individual trees not recently mated to each other;
it can serve as a naturally occurring pine population and will be used for SNP
verification at the end of the ADEPT project.
For this dissertation, we have 46 genotyped SNPs in the CCLONES population
for association discovery, and the methods developed in this study will later be applied
to the ∼ 7600 SNPs. The responses are mainly measurements of lesion length for
fusiform rust and pitch canker, where a lesion is an infected or wounded patch of skin.
About 10% of the 46 SNPs have missing values in CCLONES, and on average, each
observation has more than one SNP missing. If we employ the conventional listwise
deletion method to deal with missing data, almost all of the observations will be
deleted and the analysis will be meaningless. Even worse, when we later consider the
dataset with 3000 SNPs, if about 10% of the values are still missing, each observation
will be missing roughly 300 SNPs and the listwise method will delete the entire dataset.
So effective methods, which treat the incomplete data properly, are urgently needed;
compared with conventional remedies such as re-genotyping, they will dramatically save
experimental expense, labor and energy.
The goals of this dissertation are:
• To model the relationship between the phenotypic disease responses of loblolly pine,

such as fusiform rust and pitch canker lesion length, and the genotyped SNPs, while

utilizing the family structure information of the loblolly CCLONES. Methodologies

for dealing with missing data in modeling and association tests are needed to utilize

the available incomplete data sets. As missing data are typical in many data sets

these days, including genetic data sets, this research potentially has wide

applications.

• Furthermore, subsets of SNPs may exist, and they may be interactively responsible

for the phenotypic traits. Another goal is to find the "good subsets" of SNPs

and let the biologists investigate them further.
• The last goal relates to computation. According to a preliminary

analysis, the CCLONES data contain 888 clones, and if we utilize the family

structure by means of the covariance, we are dealing with an 888 × 888 matrix,

which entails many inversions of that large matrix. To select the important SNPs and

impute the missing SNPs (that is, according to the available information, generate

each missing SNP from the probability of what its genotype could be), Markov chain

Monte Carlo is the way to go, and it carries a heavy computational burden;

we need to find ways to run the Markov chains and to speed them up.

Computation is also an issue for the project of selecting subsets of variables.

Solving the computation problem is a critical step toward applying the methods to

thousands of SNPs later.
1.2 Introduction to Missing Data
1.2.1 Patterns of Missing Data
The data sets discussed here are always arranged in a rectangular shape. The rows
correspond to the records for the observations and the columns record the variables or
responses. In this section, we will introduce the patterns of the missing data.
a) Univariate Non-response
Table 1-1. Illustration of univariate non-response
Age  Weight(lb)  Height(feet)  Race
28   160         5.8           white
34   115         6.1           black
55   216         5.4           white
45   230         6.3           *
For data sets with a univariate missing pattern, only one column has missing data
and all the other columns have complete information. I use " ∗ " to denote a missing
value. Suppose we want to know the relationship of Height to Age, Weight, and Race
from the data in this table. If the data set were complete, the most natural method to
employ would be linear regression. In our situation, one might suggest discarding the
last observation and then running the linear regression. But are the data missing
completely at random? Will deleting the incomplete case lead to a biased estimator?
b) Multivariate missing
In surveys, it is usual that the design variables are known before the study, but some
variables will not always be filled in by the people being surveyed. The following is an
example of multivariate missing pattern.
Table 1-2. Illustration of multivariate missing
Location  Immigration  Income  Household number  Working years
Alachua   Yes          53000   5                 10
Alachua   Yes          65000   5                 4
Lee       No           70000   6                 2
Lee       No           *       *                 *
Alachua   No           *       *                 *
In Table 1-2, the county name and immigration status are design variables and can
be filled in before the survey. The remaining columns are questions supposed to be
answered in a questionnaire, and sometimes people do not answer. Again, before
deleting the incomplete cases, we need to ask ourselves: are the data missing completely
at random? Are there any biases in the missing data? If the data are missing completely
at random, there is no harm in simply deleting these incomplete cases.
c) Monotone missing
This kind of missing pattern often happens in longitudinal studies, where subjects
drop out over time, so the columns recorded at later times tend to have more missing
data. For example, Table 1-3 shows a medical experiment in which researchers want to
test the effect of a certain drug on blood pressure:
Table 1-3. Illustration of monotone missing
Weight(lb)  BP in 1 M  BP in 3 M  BP in 6 M  BP in 9 M  BP in year 1
150         145        150        150        156        160
167         111        132        140        145        150
200         167        150        156        160        170
230         170        *          *          *          *
180         210        200        *          *          *
In the above table, "BP in 1 M" means the blood pressure record at one month.
If the observation is missing at 3 months, then the records for the later months tend to
be missing too. The reason for the missing values might be that the subject moved to
a new city, was too sick to continue the experiment, or could not continue for some
other reason. Simply deleting the incomplete cases is likely to introduce bias; here the
mechanism of the missing values needs to be further investigated.
d) General missing pattern
The general missing pattern is a pattern that cannot be put in any special
category. For example, in genetic data analysis, the following data are typical: In this
Table 1-4. Illustration of general missing data pattern
Phenotypic value  female parent  male parent  SNP1  SNP2  SNP3
15.33             11345          13453        gg    tt    *
16.235            11345          13451        *     tc    gc
17.89             11354          11542        gc    *     *
19.31             11453          11671        *     *     cc
20.32             11345          11651        cc    cc    *
example, the first column records the phenotypic value of the trait in which the
researcher is interested. The second and third columns record the parent information,
and the remaining columns hold the SNP information. For microarray data, we
typically have many more columns of SNP information than in the above table. So
listwise deletion is totally inappropriate, since it tends to delete the majority of the
data and waste a great deal of effort. More appropriate methods should be applied.
e) File matching
In this category, only one of the covariates is observed for each observation, so in
the worst case there might be no complete cases at all. Better methods than listwise
deletion definitely need to be employed. The following is an illustration:
Table 1-5. File matching missing data pattern
Age  Job title  Income (dollars)
55   manager    *
40   banker     *
31   *          70000
23   *          35000
49   CEO        *
From the above examples, we see directly that deleting the incomplete cases is quite
dangerous, and understanding the mechanism of the missing data is very important for
conducting valid statistical inference.
1.2.2 Mechanism of Missing Data
The following categorization is due to Little and Rubin (1987). According to their
definition, missing-data mechanisms can be divided into three categories. Suppose there
are missing data on variable Y, and let the variable M denote whether Y is
observed: M = 1 means the value is missing and M = 0 otherwise.
MCAR If whether or not Y is observed is independent of the values of Y itself, and
also independent of any other variables in the data set, we say Y is missing completely
at random (MCAR). When this assumption is satisfied, the complete cases can be
regarded as a subsample from the population, and statistical inference based on the
complete cases is entirely legitimate. The only problem is that the sample size is
decreased and the standard errors are larger than they could be. MCAR is a rather
strong assumption and in most cases it is not satisfied; but when it is, listwise deletion
is straightforward to apply and gives valid statistical inference.
If we denote the observed part of Y by Yobs, and the missing part by Ymis, then
Y = (Yobs, Ymis). Let ξ be the parameter of the missing mechanism. The situation of
missing completely at random can be expressed as
P (M |Yobs, Ymis, ξ) = P (M |ξ).
MAR If the probability that an observation is missing depends on Yobs, but not on
Ymis, then it is missing at random, MAR. This is a more general situation than missing
completely at random since the probability of missing data in MAR could depend on the
observed data while in MCAR it should be independent of both Yobs and Ymis. The precise
formula for MAR is:
P (M |Yobs, Ymis, ξ) = P (M |Yobs, ξ).
NMAR The last category is not missing at random, NMAR. When data are not
missing at random, the probability that a value is missing depends not only on the
observed values but also on the missing value itself. In this situation, we typically need
to model the missing-data mechanism explicitly.
One thing to note is that, strictly speaking, we can never verify that data are MAR:
the missing values are not observed, so it is impossible to check whether the probability
of missingness is independent of the unobserved values. However, MAR is a reasonable
assumption in many cases, and it leads to easier statistical inference than NMAR;
unless there is evidence that the data are not missing at random, we can generally
assume MAR.
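As a small illustration (my own sketch, not part of the dissertation), the following simulation contrasts MCAR with MAR: when the probability that Y is missing depends on a correlated observed variable X, the complete-case mean of Y is biased, while under MCAR it is not. The constants (slope 0.8, logistic missingness in X) are arbitrary choices.

```python
import math
import random

random.seed(1)

# Simulate correlated (X, Y); Y is the variable subject to missingness.
n = 20000
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [2.0 + 0.8 * x + random.gauss(0, 0.5) for x in xs]

def complete_case_mean(p_miss):
    """Mean of Y over the cases kept after deleting 'missing' Y values."""
    kept = [y for x, y in zip(xs, ys) if random.random() >= p_miss(x)]
    return sum(kept) / len(kept)

# MCAR: missingness ignores the data entirely.
mcar = complete_case_mean(lambda x: 0.5)
# MAR: missingness depends on the observed X only.
mar = complete_case_mean(lambda x: 1 / (1 + math.exp(-2 * x)))

print(mcar, mar)  # mcar stays near the true mean 2.0; mar is pulled well below it
```

The MAR run keeps mostly cases with small X, so the complete-case mean of Y is biased downward even though the mechanism never looks at Y itself.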
When the data set is complete, direct procedures like maximum likelihood can be
employed for the statistical inference. We would have

ζ̂ = arg maxζ P (Y |ζ).
Further, we assume that the parameter ζ of the data model and the parameter ξ of
the missing-value mechanism are distinct. From the frequentist perspective, this means
the joint parameter space of (ζ, ξ) is the Cartesian product of the spaces of ζ and ξ;
from the Bayesian perspective, the joint prior of (ζ, ξ) can be factored into the product
of independent priors for ζ and ξ. If both MAR and distinctness hold, we say the
missing data mechanism is ignorable (Little and Rubin, 1987). When distinctness
holds, we have:

P (M,Yobs, Ymis|ζ, ξ) = P (M |Yobs, Ymis, ξ)P (Yobs, Ymis|ζ).

When the data are MAR, P (M |Yobs, Ymis, ξ) = P (M |Yobs, ξ), and the above formula
simplifies to

P (M,Yobs, Ymis|ζ, ξ) = P (M |Yobs, ξ)P (Yobs, Ymis|ζ).
The probability distribution of the observed data (Yobs,M) would be:

P (M,Yobs|ζ, ξ) = ∫ P (M,Yobs, Ymis|ζ, ξ) dYmis    (1–1)
               = ∫ P (M |Yobs, ξ)P (Yobs, Ymis|ζ) dYmis
               = P (M |Yobs, ξ) ∫ P (Yobs, Ymis|ζ) dYmis
               = P (M |Yobs, ξ)P (Yobs|ζ).
Typically, if we do maximum likelihood inference we will have

(ζ̂, ξ̂) = arg maxζ,ξ P (M,Yobs|ζ, ξ),

and under the assumption of ignorable non-response,

ζ̂ = arg maxζ P (Yobs|ζ)   and   ξ̂ = arg maxξ P (M |Yobs, ξ).
If the data are NMAR, the joint distribution of the observed and missing data
conditioned on the parameters will be:

P (M,Yobs, Ymis|ξ, ζ) = P (M |Yobs, Ymis, ξ)P (Yobs, Ymis|ζ).

Based on that, the observed likelihood is:

P (M,Yobs|ξ, ζ) = ∫ P (M |Yobs, Ymis, ξ)P (Yobs, Ymis|ζ) dYmis,

which cannot be simplified further. In this situation, we cannot ignore the missing data
mechanism, and we need to investigate it to get valid statistical inference for the
parameter ζ.
1.3 Existing Methods for Missing Data
Although many methods have been proposed to handle missing data, only a few
have gained widespread popularity. In this section I briefly review the listwise deletion
method and the pairwise deletion method, and pay more attention to the maximum
likelihood method, the Expectation-Maximization (EM) algorithm, and multiple
imputation, since these last three are very general approaches that have been widely
used to handle otherwise difficult missing-data problems.
The listwise deletion method simply deletes the incomplete cases and carries out the
statistical analysis as if there were no missing data. It is the most straightforward
method and the default option in much popular software. When the data are MCAR,
listwise deletion is equivalent to subsampling the population, and the resulting
statistical inference is legitimate, except that we get larger standard errors because of
the smaller number of observations. When the MCAR assumption is violated, listwise
deletion gives biased estimates, since the remaining data are weighted more than they
should be. Another disadvantage of this method is that it tends to lose substantial
information when missing data occur in multiple covariates.
Another method is pairwise deletion, which can be used in linear regression to
estimate the means or the covariance matrix. The idea of pairwise deletion is to use all
the cases that are available for each summary statistic being computed. For example,
suppose we have a bivariate response (Y1, Y2) and each variable has missing data. When
we calculate the mean of Y1, we use all the observed values of Y1, even though the
corresponding observation of Y2 might be missing. When we calculate the covariance of
Y1 and Y2, the cases in which both Y1 and Y2 are observed are used. The biggest
problem with pairwise deletion is that the estimates and standard errors in most
software are biased when the missing values are not MCAR, and the principle of
pairwise deletion is ambiguous in its software implementation: when computing the
covariance of Y1 and Y2, there is no clear direction about which observed values should
be used to calculate the mean of Y1, the complete cases for Y1 or the cases complete in
both Y1 and Y2.
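A minimal sketch of pairwise deletion (my own illustration; the variable names and the 20% MCAR missingness are assumptions) makes the ambiguity concrete: the covariance below centers on the marginal means computed from all available cases, but one could equally recompute the means from only the jointly observed pairs.

```python
import random

random.seed(2)

# Bivariate sample with holes in both variables (None marks a missing value).
pairs = []
for _ in range(1000):
    y1 = random.gauss(0, 1)
    y2 = 0.5 * y1 + random.gauss(0, 1)
    pairs.append((y1 if random.random() > 0.2 else None,
                  y2 if random.random() > 0.2 else None))

# Pairwise deletion: each statistic uses every case available for it.
obs1 = [a for a, b in pairs if a is not None]
obs2 = [b for a, b in pairs if b is not None]
both = [(a, b) for a, b in pairs if a is not None and b is not None]

m1 = sum(obs1) / len(obs1)   # mean of Y1 from all observed Y1
m2 = sum(obs2) / len(obs2)   # mean of Y2 from all observed Y2
# Covariance from jointly observed pairs, centered on the marginal means.
cov12 = sum((a - m1) * (b - m2) for a, b in both) / len(both)

print(m1, m2, cov12)
```

With MCAR missingness, as here, the estimates are consistent; under MAR or NMAR mechanisms the same code would generally be biased.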
Mean substitution is another choice in the literature: it imputes each missing value
with the mean of the completely observed cases. It requires strong assumptions and
gives downward-biased standard errors.
1.3.1 Maximum Likelihood
The maximum likelihood method is a general approach which can handle MAR quite
well. Maximum likelihood estimates have a number of desirable properties, such as
consistency, asymptotic efficiency and asymptotic normality. Consistency means that
the estimators converge to the true values under general conditions; asymptotic
efficiency means that the true standard errors are the smallest among the consistent
estimates; and asymptotic normality implies that the repeated estimates have an
asymptotic normal distribution, with the approximation improving as the sample size
grows.
Maximum likelihood is especially easy when the missing data pattern is monotone.
Consider the two-variable case (Y_1, Y_2) with Y_2 having missing data: Y_1 has n
observations, Y_2 is observed for only m of them, and the remaining n − m values of Y_2
are missing. This is clearly a monotone missing pattern. We can write the observed
likelihood in factored form as

L(\lambda, \mu | Y) = \prod_{j=1}^{m} h(Y_{2j} | Y_{1j}, \lambda) \prod_{j=1}^{n} g(Y_{1j} | \mu),

where h(Y_2 | Y_1, \lambda) is the conditional distribution of Y_2 given Y_1 with parameter
\lambda and g(Y_1 | \mu) is the marginal distribution of Y_1 with parameter \mu. If we can
achieve this factorization, we can maximize the two factors of the likelihood separately.
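The factored maximization can be sketched numerically. The snippet below is a minimal illustration, not from the dissertation, using simulated bivariate data with a monotone pattern; the parameter values (marginal mean 5, intercept 1, slope 0.5) are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate a monotone pattern: Y1 fully observed (n cases), Y2 observed for the first m.
n, m = 200, 120
y1 = rng.normal(5.0, 2.0, n)
y2 = 1.0 + 0.5 * y1 + rng.normal(0.0, 1.0, n)
y2_obs = y2[:m]                      # remaining n - m values of Y2 treated as missing

# Factor 1: the marginal of Y1 uses all n observations.
mu1_hat = y1.mean()

# Factor 2: the conditional of Y2 | Y1 uses only the m complete pairs
# (ordinary least squares maximizes the conditional normal likelihood).
X = np.column_stack([np.ones(m), y1[:m]])
lam_hat, *_ = np.linalg.lstsq(X, y2_obs, rcond=None)  # (intercept, slope)
```

The two maximizations never interact: dropping the incomplete cases from the second factor loses nothing, which is exactly why the monotone pattern is convenient.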
1.3.2 EM Algorithm
Direct factorization of the likelihood is not always possible, which means that
the direct maximum likelihood method has limited application for missing data. The
Expectation-Maximization (EM) algorithm is a very general method for obtaining
maximum likelihood (ML) estimates, and it has gained great popularity since its
introduction by Dempster et al. (1977). The basic idea of EM is to maximize the
likelihood of a difficult incomplete-data problem by repeatedly maximizing a "complete"
data problem. The EM algorithm requires the MAR assumption and is often used to find
the ML estimates of the parameters.
In any incomplete-data problem, the density of the complete data can be factored as:
p(Y |θ) = p(Yobs|θ)p(Ymis|Yobs, θ).
Taking logs of both sides, we have
l(θ|Y ) = l(θ|Yobs) + log p(Ymis|Yobs, θ) + c, (1–2)
where l(θ|Y) is the log likelihood of p(Y|θ), l(θ|Y_obs) is the log likelihood of p(Y_obs|θ),
and c is a constant; p(Y_mis|Y_obs, θ) is called the predictive distribution of the missing
data given θ. Because the second term in Equation 1–2 contains Y_mis, we cannot
maximize Equation 1–2 directly; instead we integrate over the predictive distribution of
Y_mis. Suppose the current realization of θ is θ^{(t)} and the predictive distribution is
p(Y_mis|Y_obs, θ^{(t)}). Integrating Equation 1–2 against this distribution and rearranging
yields

l(θ|Y_obs) = Q(θ|θ^{(t)}) − H(θ|θ^{(t)}) − c,

where

Q(θ|θ^{(t)}) = \int l(θ|Y) p(Y_mis|Y_obs, θ^{(t)}) dY_mis,

and

H(θ|θ^{(t)}) = \int \log p(Y_mis|Y_obs, θ) p(Y_mis|Y_obs, θ^{(t)}) dY_mis. (1–3)
So each iteration of the EM algorithm consists of two steps:

• The expectation step, in which the function Q(θ|θ^{(t)}) is calculated with respect to
the predictive distribution of the missing data p(Y_mis|Y_obs, θ^{(t)}).

• The maximization step, in which θ^{(t+1)} is found by maximizing Q(θ|θ^{(t)}).

The two steps are repeated until the estimates converge. Wu (1983) was the first to prove
that θ^{(t)} → θ under suitable conditions. Dempster et al. (1977) proved that if θ^{(t+1)}
is the value that maximizes Q(θ|θ^{(t)}), then θ^{(t+1)} is always at least as good as θ^{(t)},
in the sense that the observed log likelihood satisfies

l(θ^{(t+1)}|Y_obs) ≥ l(θ^{(t)}|Y_obs). (1–4)
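The two steps can be sketched for the bivariate normal setting above with Y2 partly missing. The simulated data, starting values, and iteration count below are arbitrary illustrations, not part of the dissertation: the E-step fills in the expected sufficient statistics for the missing Y2, and the M-step recomputes the complete-data MLEs.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 300, 180
y1 = rng.normal(0.0, 1.0, n)
y2 = 2.0 + 1.5 * y1 + rng.normal(0.0, 0.8, n)
y2[m:] = np.nan                      # monotone missingness, missing by design (MAR)

mu = np.array([0.0, 0.0])            # initial (mu1, mu2)
S = np.eye(2)                        # initial covariance matrix
for _ in range(50):
    # E-step: predictive mean and variance of a missing Y2 given Y1, current params.
    b1 = S[0, 1] / S[0, 0]
    b0 = mu[1] - b1 * mu[0]
    resid_var = S[1, 1] - S[0, 1] ** 2 / S[0, 0]
    e2 = np.where(np.isnan(y2), b0 + b1 * y1, y2)                # E[Y2 | observed]
    e22 = np.where(np.isnan(y2), e2 ** 2 + resid_var, y2 ** 2)   # E[Y2^2 | observed]
    # M-step: complete-data MLEs computed from the expected sufficient statistics.
    mu = np.array([y1.mean(), e2.mean()])
    S = np.array([
        [np.mean(y1 ** 2) - mu[0] ** 2, np.mean(y1 * e2) - mu[0] * mu[1]],
        [np.mean(y1 * e2) - mu[0] * mu[1], np.mean(e22) - mu[1] ** 2],
    ])
```

After convergence, `mu` and `S` agree with the factored-likelihood MLEs for this monotone pattern, consistent with the guarantee in Equation 1–4 that each sweep never decreases the observed log likelihood.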
Data augmentation, due to Tanner and Wong (1987), is another widely used method,
often employed to explore the posterior distribution in the presence of missing data. It
strongly resembles the EM algorithm and is also composed of two steps. In the first step,
data augmentation simulates the missing data to complete the data, where the EM
algorithm instead computes an expectation; in the second step, data augmentation draws
the parameters from the posterior distribution given the completed data, where the EM
algorithm instead maximizes over the parameters. Data augmentation also assumes MAR.
1.3.3 Multiple Imputation
Multiple imputation has been a popular method for handling general missing data
patterns since it was introduced by Rubin (1978). A considerable number of books have
been devoted to implementing the framework of multiple imputation, such as Analysis
of Incomplete Multivariate Data by Schafer (1997) and Missing Data by Allison (2002).
Generally they assume MAR if the missing data mechanism is not modeled.
The idea of multiple imputation is to complete the data to obtain a usable likelihood,
from which statistical inference is much easier to conduct. To complete the data, random
imputations take the place of the missing values. Since the imputed values are not the
true values of the missing data, we impute the data multiple times so that the random
variation of the imputations centers around the unobserved true values. The multiple
completed data sets are then combined to give estimates.
Suppose the data Y_{(mis)} has missing values and we create K completed data sets
Y_{(C1)}, Y_{(C2)}, ..., Y_{(CK)}. Suppose we perform a regression analysis on each completed
data set Y_{(Cj)} and obtain the estimate β_{(Cj)}; we then average the estimates β_{(Cj)}
to get a better estimate:

\bar{β} = (1/K) \sum_{j=1}^{K} β_{(Cj)}. (1–5)
The within-imputation variance is calculated as

\bar{U} = (1/K) \sum_{j=1}^{K} Var(β_{(Cj)}),

and the between-imputation variance is calculated as

B = (1/(K−1)) \sum_{j=1}^{K} (β_{(Cj)} − \bar{β})(β_{(Cj)} − \bar{β})′.
Combining the within and between variances produces the total variance

Var(\bar{β}) = \bar{U} + (1 + 1/K) B, (1–6)

with corresponding degrees of freedom

V_K = (K − 1) [1 + \bar{U} / ((1 + 1/K) B)]^2.

The statistic based on this combined variance approximately follows a t distribution with
V_K degrees of freedom.
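The combining rules above can be collected into a small helper. This is an illustrative sketch of Equations 1–5 and 1–6 componentwise; the function name `pool` and the toy numbers are my own, and the between-imputation variance uses the K − 1 divisor.

```python
import numpy as np

def pool(estimates, variances):
    """Combine K point estimates and their sampling variances (Rubin's rules).

    estimates : (K, p) array of per-imputation estimates beta_(Cj)
    variances : (K, p) array of per-imputation sampling variances
    Returns the pooled estimate, total variance, and degrees of freedom per component.
    """
    est = np.asarray(estimates, float)
    var = np.asarray(variances, float)
    K = est.shape[0]
    beta_bar = est.mean(axis=0)                      # eq. 1-5: pooled estimate
    U_bar = var.mean(axis=0)                         # within-imputation variance
    B = est.var(axis=0, ddof=1)                      # between-imputation variance
    T = U_bar + (1 + 1 / K) * B                      # eq. 1-6: total variance
    df = (K - 1) * (1 + U_bar / ((1 + 1 / K) * B)) ** 2
    return beta_bar, T, df

# Toy example with K = 5 imputations of a scalar coefficient.
beta_bar, T, df = pool([[1.0], [1.2], [0.9], [1.1], [0.8]],
                       [[0.04], [0.05], [0.04], [0.06], [0.05]])
```

For these numbers the pooled estimate is 1.0 with total variance 0.078 and about 27 degrees of freedom; a confidence interval would then use the t quantile with df degrees of freedom.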
Barnard and Rubin (1999) suggested an adjusted degrees of freedom
V*_K = [1/V_K + 1/\hat{V}_{obs}]^{-1} for small samples, where \hat{V}_{obs} estimates the
observed-data degrees of freedom.
To perform likelihood ratio tests, Li et al. (1991) and Meng and Rubin (1992) suggest the
following approximate likelihood ratio statistic:

D_K = (\bar{β} − β_0)′ Var(\bar{β})^{-1} (\bar{β} − β_0) / (p (1 + r_K)),

where p is the dimension of β, β_0 is the value of β under the null hypothesis,

r_K = (1 + 1/K) tr(B_K Var(\bar{β})^{-1}) / p,

and

B_K = (1/(K−1)) \sum_{j=1}^{K} (β_{(Cj)} − \bar{β})(β_{(Cj)} − \bar{β})′.

This statistic is referred to an F distribution F_{p,w} with p and w degrees of freedom,
where

w = 4 + (p(K − 1) − 4) [1 + (1 − 2/(p(K−1))) / r_K]^2.
This test behaves much like a likelihood ratio test. It carries the strong assumption that
the fraction of missing information is equal for each component; however, simulations by
Li et al. (1991) show that the test is quite robust to violations of this assumption when
K ≥ 3.
One question that needs to be answered is how big K should be. Rubin (1987) gives
the following formula for the efficiency achieved with K completed data sets under a
proportion γ of missing data:

Ef = (1 + γ/K)^{-1}.

From this formula it is easy to calculate the number of completed data sets needed to
achieve a given efficiency. It turns out that for low proportions of missing data, say
γ ≤ 0.3, K = 5 gives reasonably high efficiency, while when large amounts of data are
missing, for example γ = 0.7, K = 10 attains 93.5% efficiency.
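The efficiency formula is a one-liner, and the figures quoted in the text can be checked directly (the helper name `mi_efficiency` is my own):

```python
def mi_efficiency(gamma, K):
    """Relative efficiency of K imputations when a fraction gamma of data is missing,
    per Rubin's (1987) formula Ef = (1 + gamma/K)^(-1)."""
    return 1.0 / (1.0 + gamma / K)

# gamma = 0.7 with K = 10 imputations gives about 93.5% efficiency,
# matching the figure quoted in the text.
eff = mi_efficiency(0.7, 10)
```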
In practice, multiple imputation requires a mechanism to impute the missing data, and
there are a number of ways to do that. Reilley (1993) considers Hot Deck imputation,
which imputes by drawing random samples, with equal weights, from the actual observed
values. This method is easy to employ, and Efron (1994) characterizes the Hot Deck
method as a nonparametric bootstrap. Xie and Paik (1997) proposed a Bayesian version
of the Hot Deck method: instead of drawing the random sample with equal weights from
the observed values, they put a Dirichlet prior on the weights. Another imputation
method assumes all of the covariates are normally distributed, even the categorical
variables; an imputed categorical value is then rounded to the nearest category via the
cumulative distribution function.
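The simple equal-weight Hot Deck idea can be sketched in a few lines; `hot_deck` is a hypothetical helper written for this illustration, not code from any of the cited packages.

```python
import numpy as np

def hot_deck(x, rng=None):
    """Impute NaN entries by drawing uniformly (equal weights) from the observed
    values of the same variable -- the simple Hot Deck of Reilley (1993)."""
    rng = rng or np.random.default_rng()
    x = np.asarray(x, float).copy()
    donors = x[~np.isnan(x)]             # pool of observed "donor" values
    miss = np.isnan(x)
    x[miss] = rng.choice(donors, size=miss.sum(), replace=True)
    return x

completed = hot_deck([1.0, np.nan, 3.0, np.nan, 5.0], np.random.default_rng(0))
```

The Bayesian variant of Xie and Paik (1997) would replace the equal weights by weights drawn from a Dirichlet posterior; repeating the call K times yields the multiple completed data sets used above.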
1.4 Bayesian Variable Selection
Variable selection is a frequently used technique in statistical data analysis, and
Bayesian variable selection has been the subject of substantial research in recent years.
Most of the theoretical developments have been based on normal linear regression. Smith
and Spiegelhalter (1980) proposed a unified approach of prior specification for model
choice and they showed that different prior approaches could lead to a Schwarz-type
criterion or the Akaike Information Criterion. Mitchell and Beauchamp (1988) used a
“spike and slab” type of prior for the candidate variables to be considered. Their approach is
not fully Bayesian: an important parameter (the ratio of the height of the spike to the
height of the slab) is estimated from the data and then treated as known. George and
McCulloch (1993), George and McCulloch (1995), and George and McCulloch (1997)
advocated the Markov chain Monte Carlo approach for the posterior calculation, which
made variable selection over large candidate sets possible. Brown et al. (1998) extended
the application of Bayesian variable selection to multivariate responses and a Markov
chain Monte Carlo algorithm was employed to speed up the computation. Berger and
Pericchi (2001) proposed the objective intrinsic Bayes Factors for model selection. An
objective fully automatic Bayesian procedure was proposed by Casella and Moreno (2006).
They used intrinsic priors to calculate the posterior probabilities and a stochastic search
algorithm was employed. Just a few related papers are listed above and many others exist
in the literature.
In normal linear regression we have a response Y_{n×1} and a set (X_1, ..., X_s) of s
potential explanatory predictors, and we assume Y = Xβ + ε with ε ∼ N(0, σ^2 I). Some
of the β_i, i = 1, ..., s, may not be significantly different from 0, or some predictors may
be highly correlated with one another. The goal of variable selection is to find a subset of
predictors X*_1, ..., X*_r of X_1, ..., X_s such that each corresponding β_i has a practically
significant effect on Y and the full model is simplified to a certain degree. I employ an
index vector γ of length s and let each element of γ indicate whether the corresponding
predictor X_i is included in the selected model: γ_i = 1 means X_i is in the selected
model and γ_i = 0 otherwise. The variable selection problem is then to find a vector γ
according to some criterion.
Many variable selection methods have been based on the Akaike Information Criterion,
AIC (Akaike, 1973), Mallows' C_p (Mallows, 1973), and the Bayesian Information
Criterion, BIC (Schwarz, 1978), and have been applied to many problems when s is
reasonably small. These methods all maximize a penalized sum of squares, differing only
in the penalty term of the objective function.
Suppose X_γ represents the selected predictor matrix indexed by γ, and let
\hat{β}_γ = (X_γ′ X_γ)^{-1} X_γ′ Y denote the least squares estimate of β_γ. I use r to
represent the number of variables in the selected subset. The sum of squares for
regression is

SS_γ = \hat{β}_γ′ X_γ′ X_γ \hat{β}_γ.

AIC, C_p, and BIC are criteria that try to maximize the function

SS_γ / \hat{σ}^2 − C · r, (1–7)

where \hat{σ}^2 is an estimate of σ^2 and C is a penalty chosen by the particular criterion.
Taking C = 2 gives C_p, which is essentially equivalent to AIC; taking C = log n gives BIC.
George and Foster (1994) proposed to take C = 2 log p as the risk inflation criterion,
RIC. The above-mentioned methods have different motivations: unbiased predictive risk
estimation for C_p, expected information distance for AIC, and an asymptotic Bayes
factor for BIC. When the number of available predictors is small, these methods have
been widely applied. When considering problems with hundreds of predictors, however,
each potential predictor can either be included in the selected model or not, so there are
2^s models, a number that can easily exceed available computing power; the criteria
above then become impossible to apply, and new methodologies are needed to handle
hundreds of predictors.
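For small s, criterion (1–7) can be maximized by brute force over all 2^s subsets, which also illustrates why enumeration breaks down for hundreds of predictors. The sketch below uses simulated data; the design, true coefficients, and seed are arbitrary choices for the demonstration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
n, s = 100, 6
X = rng.normal(size=(n, s))
beta_true = np.array([2.0, 0.0, 0.0, -1.5, 0.0, 0.0])   # only X1 and X4 matter
y = X @ beta_true + rng.normal(size=n)

# sigma^2 estimated from the full-model residuals.
resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
sigma2_hat = np.var(resid, ddof=s)

def criterion(gamma, C):
    """SS_gamma / sigma2_hat - C * r for the subset indexed by gamma (eq. 1-7)."""
    idx = np.flatnonzero(gamma)
    if idx.size == 0:
        return 0.0
    Xg = X[:, idx]
    bg = np.linalg.lstsq(Xg, y, rcond=None)[0]
    SS = bg @ (Xg.T @ Xg) @ bg
    return SS / sigma2_hat - C * idx.size

# Exhaustive search over all 2^s subsets -- feasible only because s is tiny here.
best = max(itertools.product([0, 1], repeat=s),
           key=lambda g: criterion(g, C=np.log(n)))   # C = log n gives BIC
```

With s = 6 this loop visits 64 models; at s = 100 it would visit 2^100, which is why the stochastic search methods discussed below are needed.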
1.4.1 Semiautomatic Bayesian Variable Selection Method
A popular hierarchical Bayesian formulation for variable selection is the following.
Take the prior for (β, γ) to be

P(γ_i = 1) = 1 − P(γ_i = 0) = p_i

and

π(β|σ, γ) = π(β|γ) = N_p(0, Q_γ R Q_γ),

where R is the prior correlation matrix and

Q_γ = Diag[b_1 τ_1^2, ..., b_p τ_p^2],

with b_i = 1 when γ_i = 0 and b_i = c when γ_i = 1, where c is a hyperparameter.
The parameter τ_i^2 is the prior variance for the corresponding element of β. Whether the
pre-specified
hyperparameter is big or small reflects the statistician's preference for a saturated or a
parsimonious model, while p_i expresses the prior belief about whether a given predictor
should be included in the model. The values c and p_i are typically set to fixed numbers
according to the statistician's experience; c = 100 and p_i = 1/2 for i = 1, ..., p is a
popular choice. Cui and George (2008) also proposed putting priors on c and p_i and
integrating them out.

In the hierarchical model, the prior for σ^2 needs to be specified too; normally it is
taken to be inverse gamma, IG(µ_γ/2, λ_γ), where µ_γ and λ_γ are hyperparameters.
More detailed discussion about how to choose the hyperparameters c, p_i, µ_γ, and λ_γ is
given in George and McCulloch (1993) and George and McCulloch (1997).
With the above specification, the posterior distribution f(γ|Y) can be calculated, and
the idea is then to rank the models by posterior probability to decide on the ideal sets of
predictors. The actual posterior calculation is a challenge, and methods to address this
are discussed in a later subsection.
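The prior specification above can be made concrete with a single draw from the hierarchy. The values p = 5 and τ_i^2 = 0.01 and the identity choice for R are illustrative assumptions; c = 100 and p_i = 1/2 follow the popular defaults mentioned in the text, and Q_γ is built exactly as Diag[b_i τ_i^2].

```python
import numpy as np

rng = np.random.default_rng(4)
p = 5
c, tau2, p_i = 100.0, 0.01, 0.5      # popular defaults: c = 100, p_i = 1/2
R = np.eye(p)                        # prior correlation matrix (identity for simplicity)

# One draw from the hierarchical prior on (gamma, beta).
gamma = rng.random(p) < p_i                  # gamma_i ~ Bernoulli(p_i)
b = np.where(gamma, c, 1.0)                  # b_i = c if gamma_i = 1, else 1
Q = np.diag(b * tau2)                        # Q_gamma = Diag[b_i * tau_i^2]
beta = rng.multivariate_normal(np.zeros(p), Q @ R @ Q)
```

With these numbers an included coefficient has prior standard deviation c·τ^2 = 1 (the "slab") while an excluded one has 0.01 (the "spike"), so β_i is shrunk essentially to zero whenever γ_i = 0.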
1.4.2 Automatic Bayesian Variable Selection Method
Casella and Moreno (2006) proposed a fully automatic Bayesian variable selection
procedure using intrinsic priors. Let Γ represent the set of realizations of all γ, and let
γ = 1 correspond to the full model with all potential predictors. The Bayes factor is
defined as

B_{γ1} = m_γ(Y, X) / m_1(Y, X),

where m_γ(Y, X) is the marginal distribution of the data under the model indexed by γ,
while m_1(Y, X) is the marginal distribution with all the predictors in the model. It can
be shown that

P(γ|Y, X) = B_{γ1}(Y, X) / (1 + \sum_{γ′ ∈ Γ, γ′ ≠ 1} B_{γ′1}(Y, X)), γ ∈ Γ,

where P(γ|Y, X) is the posterior probability of the model with predictors indexed by γ,
conditioned on the observed Y and X. Suppose that the standard default prior is
π^D(β_γ, σ_γ) = C_γ/σ_γ^2, where C_γ is a constant, and that for two models M_1 and
M_2 the default priors are π_1^D(β_{γ1}, σ_{γ1}) and π_2^D(β_{γ2}, σ_{γ2}). The proposed
intrinsic priors are

π_1^I(β_{γ1}, σ_{γ1}) = π_1^D(β_{γ1}, σ_{γ1})

and

π^I(β_{γ2}, σ_{γ2} | β_{γ1}, σ_{γ1}) = π_2^D(β_{γ2}, σ_{γ2})
E^{M_2}_{x(l) | β_{γ2}, σ_{γ2}} [ f_1(x(l) | β_{γ1}, σ_{γ1}) /
\int f_2(x(l) | β_{γ2}, σ_{γ2}) π_2^D(β_{γ2}, σ_{γ2}) dβ_{γ2} dσ_{γ2} ],
where f_1(x(l) | β_{γ1}, σ_{γ1}) is a normal density and x(l) is a training sample. It was
proved that the intrinsic prior for (β, σ) conditioned on any point (β_γ, σ_γ) is

π^I(β, σ | β_γ, σ_γ) = N_p(β | β_γ, (σ^2 + σ_γ^2) U^{-1}) (1/σ_γ) (1 + σ^2/σ_γ^2)^{-3/2},
with U = Z′Z, where Z is a theoretical design matrix of dimension (p+1) × p.
Furthermore, the unconditional intrinsic prior for (β, σ) can be calculated directly. By
using intrinsic priors, hyperparameters are avoided and automatic priors are achieved.
Going one step further, the Bayes factors can be calculated under the intrinsic priors, and
from them the posterior model probability for each M_γ, γ ∈ Γ, can be computed.

The intrinsic prior approach is derived from the model structure and is free of
hyperparameters (there is no need to consider a range of hyperparameter values), so it
can serve as a default prior; it is currently one of the few fully objective procedures
available.
1.4.3 Stochastic Search Algorithm
Whether one uses intrinsic priors or sets hyperparameters for the priors, the posterior
model probabilities are troublesome to calculate, and it is often unrealistic to compute all
2^p posterior probabilities. Fortunately, a Markov chain Monte Carlo stochastic search
algorithm has been developed and successfully applied.

A general way to implement an MCMC procedure is to run a Gibbs sampler

(β^0, σ^0, γ^0), (β^1, σ^1, γ^1), ..., (β^t, σ^t, γ^t), ...

with

β^{t+1} ∼ f(β^{t+1} | σ^t, γ^t, X, Y),
σ^{t+1} ∼ f(σ^{t+1} | β^{t+1}, γ^t, X, Y),
γ^{t+1} ∼ f(γ^{t+1} | β^{t+1}, σ^{t+1}, X, Y).
After enough burn-in, we can rank the models by their posterior visit frequencies,
counting the models visited, and the highest ranked (or nearly equally highest ranked)
models are selected. Although the Gibbs sequence does not necessarily visit the entire
model space, it normally has a good chance of visiting the best models. Casella and
Moreno (2006) give a more detailed discussion of this point.
Another way to explore the posterior distribution P(γ|Y, X) is to construct a
Metropolis-Hastings Markov chain. Normally the chain will not only visit many of the
models but will also visit the better models more often. It works as follows: at iteration t,
draw a candidate γ′ from a candidate distribution V; accept the candidate, setting
γ^{t+1} = γ′, with probability

min(1, [P(γ′|Y, X) V(γ^t)] / [P(γ^t|Y, X) V(γ′)]),

and otherwise stay at γ^t. Normally the draws from V are independent, and the generated
Markov chain has stationary distribution P(γ|Y, X).
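A toy version of this Metropolis-Hastings search is sketched below. The function `log_post` is invented purely for illustration; in practice it would be log P(γ|Y, X) up to a constant. Because the candidate distribution V is uniform over models, the V terms cancel in the acceptance ratio.

```python
import numpy as np

rng = np.random.default_rng(5)
s = 8

def log_post(gamma):
    """Stand-in for log P(gamma | Y, X): rewards including the first two predictors
    and penalizes model size (purely illustrative)."""
    return 3.0 * (gamma[0] + gamma[1]) - 0.5 * gamma.sum()

gamma = np.zeros(s, dtype=int)
incl = np.zeros(s)
n_iter = 5000
for t in range(n_iter):
    cand = rng.integers(0, 2, s)      # independent candidate draw from uniform V
    # V(gamma)/V(cand) = 1, so the Hastings ratio reduces to the posterior ratio.
    if np.log(rng.random()) < log_post(cand) - log_post(gamma):
        gamma = cand
    incl += gamma                      # accumulate marginal inclusion counts

freq = incl / n_iter                   # posterior inclusion frequencies
```

The chain spends most of its time on models containing the two rewarded predictors, so their inclusion frequencies dominate, which is exactly the ranking-by-visits idea described above.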
Carlin and Chib (1995) used the Gibbs sampler to generate samples from the joint
distribution of model and parameters conditioned on the response. This method is
computationally demanding and works well for a small number of candidate variables.
Dellaportas et al. (2002) proposed a hybrid Gibbs-Metropolis strategy based on Carlin
and Chib's method. Green (1995) suggested the reversible jump MCMC method, which
can be used for model selection across different dimensions. Chib (1995) also proposed
estimating the marginal likelihood using blocks of parameters. These methods generally
work well for moderate numbers of candidate variables, and some have shown good
results in special situations; for example, reversible jump works well for mixture modeling
with an unknown number of components. However, there has been almost no
investigation of variable selection when a certain percentage of the covariate values are
missing.
1.5 Outline of the Dissertation
In the previous sections, I introduced the ADEPT 2 project and described the design
of the loblolly pine population. The patterns and mechanisms of missing data were
reviewed, along with existing methods for handling missing data. Variable selection
methods based on linear regression models, especially Bayesian variable selection
methods, were also reviewed.
In Chapter 2, an association test is proposed to detect SNPs significant for the
target phenotypic traits. As the population structure contains substantial genetic
information and indexes the closeness of two individuals, we use a numerator relationship
matrix to quantify the closeness between any two individuals within the population. A
Bayesian hierarchical model is proposed and the Gibbs sampler is employed to sample the
parameters of the joint likelihood. In order to handle thousands of SNPs toward the end
of the ADEPT 2 project, methods to speed up the computation are proposed: a Gibbs
sampling procedure that samples just one column of SNPs in each cycle instead of all the
columns; and a matrix identity that takes advantage of the updating scheme and avoids
direct matrix inversion. Simulation studies and real data analyses are detailed.
In Chapter 3, a Bayesian model selection method is proposed to select “good” subsets
of variables while properly accounting for the missing data. We use the Bayes factor as
the model selection criterion, and a novel Bayes factor approximation formula is
proposed. A stochastic search algorithm is employed to search the subsets, and the
convergence properties of the search are proved. The Bayes factor approximation formula
potentially has wide application for comparing subsets of different dimensions. The
method is applied to simulated data sets as well as the loblolly data.
In Chapter 4, a summary of the dissertation is given, its contributions are discussed,
and some future work is outlined.
CHAPTER 2
A HIERARCHICAL BAYESIAN MODEL FOR GENOME-WIDE ASSOCIATION STUDIES
OF SNPS WITH MISSING VALUES
2.1 Introduction
All species are made up of DNA sequences. Within a species, all individuals share
most of their DNA, and only a very small percentage of the DNA sequence differs. These
differing nucleotides in the DNA sequence are called single nucleotide polymorphisms, or
SNPs. Human beings share about 99.9% of their DNA sequence, and the remaining 0.1%
makes us different individuals; these differences are genetically responsible for variation in
disease development and drug response among individuals. Loblolly pine has 0.5%
nucleotide diversity, slightly larger than that of soybeans or humans. Among these SNPs,
we are interested in detecting the significant SNPs that affect the disease response of
loblolly pine, especially to pitch canker.
As the technologies develop rapidly, SNP data are becoming cheaper and more easily
available, and scientists have more opportunities to use high-throughput SNP data sets
that were not possible before. Although microarray technology can provide
high-throughput SNP data, with current capabilities typically 5% to 10% of genotypes
are missing, as pointed out by Dai et al. (2006). One challenge scientists face is to use the
available SNP and phenotypic trait information to detect the SNPs that are strongly
associated with the phenotypic traits; at the same time, a statistical method is needed to
properly address the missing SNPs.
Population association methods involving missing genotypic markers have made
much progress in human genetics, especially in the case of tightly linked genotypes; that
is, most attention has been focused on fine-scale molecular regions. The “Phase” package
of Stephens et al. (2001) aims at haplotype reconstruction from genotype data and uses
an EM algorithm to maximize the likelihood of the haplotype frequencies, with an
approximate prior for the conditional haplotype distribution. Scheet and Stephens (2006)
proposed
“Fastphase” for missing genotype imputation and haplotype phasing; they used a hidden
Markov model for the cluster origins of alleles in the haplotypes and the cluster origins of
genotypes. Imputations were taken as the “best guess” under the likelihood, and the
parameters were again estimated using the EM algorithm. These methods were reported
to give accurate estimates for tightly linked markers. Chen and Abecasis (2007) proposed
family-based tests for genome-wide association. They use an identical-by-descent
parameter to measure correlation for the test SNPs, and a kinship coefficient to model
the correlation between siblings. Their method works on one family and one SNP at a
time, and it does not seem fully applicable to complicated population pedigrees or to
simultaneous SNP testing. Servin and Stephens (2007) proposed a Bayesian regression
approach for association testing; the missing genotypes were imputed by “Fastphase”
beforehand, and a mixture prior is used for the SNP effects. The prior for the number of
significant SNPs is set to be small and has some influence on the results. Dai et al. (2006)
used EM, weighted EM, and a nonparametric method (CART) for association studies,
with multiple imputation samples drawn from tree-based algorithms. Roberts et al.
(2007) proposed a fast nearest-neighbor algorithm to infer the missing genotypes.
Marchini et al. (2007) proposed a unifying framework for missing genotype imputation
and association testing based on haplotypes and other available human genomic data
sets. Sun and Kardia (2008) proposed a neural network model for the missing SNPs,
using BIC to choose the predictors in the model and then predicting the missing SNPs
from the chosen model. Balding (2006) gives a detailed review of association studies and
of some missing-genotype methods in human genetics. There is much more literature on
this topic, the above being just a sample. Almost all the literature we found, however,
focuses on human genetic association, where the markers are tightly linked and
haplotype-based inference is dominant. Some papers address the situation of missing
parental information and propose likelihood-based methods; see Weinberg (1999), Martin
et al. (2003), and Boyles et al. (2005).
The Expectation-Maximization algorithm was originally developed for maximum
likelihood estimation with missing data, and some authors have used EM for missing
SNP imputation as well. But in our situation, with many potentially correlated SNPs,
EM is not computationally feasible, and we show why here. For the EM algorithm, we
start with the following model, as in Section 2.2:
Y = Xβ + Zγ + ε (2–1)
Each row Z_i, i = 1, ..., n, of the matrix Z corresponds to the SNP genotype information
of one individual, and for one individual we write

Z_i = (Z_i^o, Z_i^m).
The EM algorithm begins by building the complete-data likelihood, which is the
likelihood function that would be used if the missing data were filled in. When we fill in
the missing data we write
Z_i^* = (Z_i^o, Z_i^m), with Z_i^m now filled in,

and the complete data are (Y, Z^*) with likelihood function

L_C = (1/(2πσ^2)^{n/2}) \prod_{i ∈ I_O} \exp(−(1/(2σ^2))(Y_i − X_i β − Z_i γ)^2)
    × \prod_{i ∈ I_M} \exp(−(1/(2σ^2))(Y_i − X_i β − Z_i^* γ)^2), (2–2)

where I_O indexes the individuals with complete SNP data and I_M indexes the
individuals with missing SNP information.
The observed-data likelihood, which is the function that we eventually use to estimate
the parameters, is based on the observed data only. Where there are missing data, the
complete-data likelihood must be summed over all possible values of the missing data. So
we have
L_O = (1/(2πσ^2)^{n/2}) \prod_{i ∈ I_O} \exp(−(1/(2σ^2))(Y_i − X_i β − Z_i γ)^2)
    × \prod_{i ∈ I_M} \sum_{Z_i^*} \exp(−(1/(2σ^2))(Y_i − X_i β − Z_i^* γ)^2),

where we have no information about the weight of each term in the sum and therefore use
equal weights. The conditional distribution of the missing data Z_i^* is given by the ratio
L_C/L_O:

P(Z_i^*) = \exp(−(1/(2σ^2))(Y_i − X_i β − Z_i^* γ)^2)
         / \sum_{Z_i^*} \exp(−(1/(2σ^2))(Y_i − X_i β − Z_i^* γ)^2), (2–3)

where the sum in the denominator is over all possible realizations of Z_i^*. This is a
discrete distribution on the missing SNP data for each individual. To understand it,
consider one individual.
Suppose that there are g possible genotypes (typically g = 2 or 3) and individual i
has missing data on k SNPs, so that the data for individual i are Z_i = (Z^o, Z^m),
where Z^m has k elements, each of which can be one of g classes. For example, if g = 3
and k = 7, then Z^m can take any of 3^7 = 2187 possible values; Table 2-1 illustrates one
possible value of Z_i^m. In a real data set this can grow out of hand: if there were 12
missing SNPs there would be 531,441 possible values for Z_i^m, and with 20 missing
SNPs the number grows to 3,486,784,401 (about 3.5 billion). If we employed the EM
algorithm, then in every iteration, for each observation, we would need to sum over a
gigantic number of terms to calculate the distribution of the missing data, which is
computationally infeasible. We therefore thought we might run a Gibbs sampler inside
each EM step to generate the means for the missing data and avoid summing over
billions of terms.
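The counts quoted above are just g^k, which can be checked directly (the helper name is my own):

```python
def n_missing_patterns(g, k):
    """Number of possible realizations of a missing block Z_i^m with k missing SNPs,
    each taking one of g genotype classes."""
    return g ** k
```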
The Gibbs sampler for the missing data simulates samples of Z_i^m from the
distribution of each missing element conditioned on the rest of the vector. More of this is
discussed in Section 2.2; here we just give a flavor of why the Gibbs sampler
Table 2-1. Illustration of combinations of missing data for one observation: each of the k
missing SNPs can take any of the g genotypes, and the asterisks mark one possible
realization of Z_i^m.
works. For a particular element Z_{ij}^m, the conditional distribution given the rest of the
vector Z_{i(−j)}^m is

P(Z_{ij}^m = c | Z_{i(−j)}^m)
  = \exp(−(1/(2σ^2))(Y_i − X_i β − Z_i^o γ^o − Z_{i(−j)}^m γ_{(−j)}^m − c γ_j^m)^2)
  / \sum_{ℓ=1}^{3} \exp(−(1/(2σ^2))(Y_i − X_i β − Z_i^o γ^o − Z_{i(−j)}^m γ_{(−j)}^m − c_ℓ γ_j^m)^2), (2–4)

where there are only g = 3 terms in the denominator sum; this is what makes the Gibbs
sampler computationally feasible for us.
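One draw from the full conditional (2–4) can be sketched as follows. The helper name and the genotype coding (−1, 0, 1 for the two homozygotes and the heterozygote) are illustrative assumptions, and `fixed_part` collects everything in the linear predictor except the SNP being updated.

```python
import numpy as np

rng = np.random.default_rng(6)

def sample_missing_snp(y_i, fixed_part, gamma_j, codes, sigma2, rng):
    """Draw one missing SNP code from its full conditional (eq. 2-4).

    fixed_part stands for X_i beta + Z_i^o gamma^o + Z^m_{i,(-j)} gamma^m_{(-j)};
    codes are the g possible genotype codes (here g = 3).
    """
    resid = y_i - fixed_part - np.asarray(codes, float) * gamma_j
    logw = -resid ** 2 / (2.0 * sigma2)
    w = np.exp(logw - logw.max())        # stabilize before normalizing
    w /= w.sum()                         # only g terms in the denominator
    return rng.choice(codes, p=w)

draw = sample_missing_snp(y_i=2.0, fixed_part=0.5, gamma_j=1.0,
                          codes=[-1, 0, 1], sigma2=0.25, rng=rng)
```

Cycling this draw over the missing entries of each individual visits only g terms per update, in contrast to the g^k terms an EM expectation would require.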
For plant association studies, or other agricultural genome association studies, the
whole-genome sequence library might not be complete and the degree of linkage
disequilibrium might not be known. Thus new methodologies are needed.
For imputing missing data, multiple imputation has received much attention since its
proposal by Rubin; see Little and Rubin (1987). It acknowledges the uncertainty due to
missing data and bases inference on multiply imputed data sets, addressing the
variability of the missing data through the multiple imputations. Most methods for large
genetic data sets in the literature, except Dai et al. (2006), use single imputation and
impute the category with the largest probability. Although single imputation tends to be
fast, it generally loses power and can give biased parameter estimates. Unlike single
imputation, we treat the missing SNPs as parameters and impute them at each iteration.
This is less biased than single SNP imputation and more accurately captures the
variation in the data.
In an association study across an entire population, the family pedigree is typically
very important, since it explains the genetic information shared by relatives, which is not
necessarily genotyped. Yu et al. (2005) used a kinship matrix and population structure to
describe quantitative inheritance. Chen and Abecasis (2007) proposed family-based
association testing, but their method accounts for relationships within families and not
between families. We propose to use a numerator relationship matrix to explain the
relationships of individuals both within and between families. The idea behind the
calculation of the numerator relationship matrix is originally due to Henderson (1976);
see also Quaas (1976).
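For concreteness, Henderson's recursive (tabular) method can be sketched as follows, assuming the pedigree is ordered so that parents precede offspring and unknown parents are coded None. This is my own illustrative implementation, not code from Henderson (1976).

```python
import numpy as np

def numerator_relationship(pedigree):
    """Build the numerator relationship matrix A by Henderson's tabular method.

    pedigree: list of (sire, dam) index pairs, with parents listed before their
    offspring and None for an unknown parent.
    """
    n = len(pedigree)
    A = np.zeros((n, n))
    for i, (s, d) in enumerate(pedigree):
        # Diagonal: 1 plus half the relationship between the parents (inbreeding).
        A[i, i] = 1.0 + (0.5 * A[s, d] if s is not None and d is not None else 0.0)
        # Off-diagonals: average of the relationships to the two parents.
        for j in range(i):
            a = 0.5 * ((A[j, s] if s is not None else 0.0) +
                       (A[j, d] if d is not None else 0.0))
            A[i, j] = A[j, i] = a
    return A

# Two unrelated founders and their offspring: parent-offspring relationship is 1/2.
A = numerator_relationship([(None, None), (None, None), (0, 1)])
```

The resulting A plays the role of R in the error covariance σ^2 R of the model below, encoding kinship both within and between families.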
With the motivating CCLONES data set, we propose a Bayesian methodology for
association studies. We model the missing SNPs as parameters and use Gibbs sampling
to sample all the parameters, including the missing SNPs; for the missing SNPs this is
essentially multiple imputation. We take advantage of the family information in the
available family and parent pedigrees. The method is not haplotype based, does not
require the SNPs to be in linkage disequilibrium, and allows the model to find the
correlation if it exists. This property is quite appealing for plant association studies, or
for any association study in which we lack a sequenced genome library or detailed
information about the candidate SNPs. It is especially useful for genome scans and can
point out significant SNPs for further study. Finally, we prove that updating just one
missing SNP per observation in each iteration still achieves the target stationary
distribution. This substantially increases the calculation speed when there are large
numbers of SNPs per observation and gives us the ability to handle large
high-throughput data sets.
This chapter is organized as follows. In Section 2.2 we explain two proposed models
and also propose a method to increase the computation speed. In Section 2.3, simulation
results are given and compared with other results. In Section 2.4, the real data analysis
results are reported. In Section 2.5, we explain a method to categorize the covariance
relationships for any given pedigree situation. Finally, in Section 2.6, we give a
discussion.
2.2 Proposed Method
In this chapter, the response is assumed to be continuous with a normal distribution,
although the method can easily be adapted to a discrete response (using a latent-variable
probit model), as explained in the discussion section. The data set has fully observed
family covariates for all observations, and missing values exist only among the SNPs.
Interest focuses on developing methods to test the relationship between the response and
the SNPs. To quantify the effect of the SNPs, we decompose each SNP into an additive
and a dominant effect: for each SNP, the common homozygous genotype is assigned the
additive effect, the heterozygous genotype the dominant effect, and the mutant
homozygous genotype the negative of the additive effect.

In this section we discuss two models for this data situation. For simplicity of
analysis, we employ one of them throughout the simulations and the real data analysis.
2.2.1 The Model without Ramet Random Effect
First of all, a ramet is the smallest observed individual in the data set; two
individuals are called ramets of each other when they share exactly the same genetic
information, not only the same parents. We later use “clone” to mean a group of ramets
in one family, which differs from the common understanding of “clone”. For this
subsection, and throughout the dissertation, the phenotypic responses are averaged over
the ramets of each clone except where specifically noted.
The model is
Y = Xβ + Zγ + ε (2–5)
where

Y_{n×1} = phenotypic trait,
X_{n×p} = design matrix for family covariates,
β_{p×1} = coefficients for the family effect,
Z_{n×s} = design matrix for SNPs (genotypes),
γ_{s×1} = coefficients of the additive and dominant effects for the SNPs,
ε_{n×1} ∼ N(0, σ²R).
The subscripts of these variables denote the dimensions. The variance-covariance
matrix R is the numerator relationship matrix, which describes the degree of kinship
between different individuals. Details about how to calculate the numerator relationship
matrix R are given later.
Each row of the matrix Z, Z_i, i = 1, . . . , n, corresponds to the SNP genotype
information of one individual. Some of this information may be missing, and we write

Z_i = (Z_i^o, Z_i^m)

where Z_i^o are the observed genotypes for the ith individual, and Z_i^m are the missing
genotypes. Note two things:

1. The values of Z_i^m are not observed. Thus, if ∗ denotes one missing SNP, a possible
Z_i is Z_i = (1, ∗, 0, 0, ∗, ∗, 1).

2. Individuals may have different missing SNPs. So for 2 different individuals, we might
have

Z_i = (1, ∗, 0, 0, ∗, ∗, 1)
Z_i′ = (∗, ∗, 1, 0, 0, 1, 1),

which can make a heavy burden for computation.
For a Bayesian model, we want to put a prior distribution on the parameters. We
put a noninformative uniform prior on β, which essentially leads to least squares
estimation. For γ, we use the normal prior γ ∼ N(0, σ²φ²I). Here φ² is a scale parameter
for the variance and σ² is the variance parameter. For σ² and φ², we use inverted
gamma priors: σ² ∼ IG(a, b) and φ² ∼ IG(c, d), where IG stands for the inverted gamma
distribution, and a, b, c, and d are constants used in the priors. We say σ² has an IG(a, b)
distribution if

f(σ²) = [b^a / Γ(a)] (σ²)^{−(a+1)} exp(−b/σ²).

Following Hobert and Casella (1996), we take a, b, c, d to be specified values that
ensure a proper posterior distribution.
Now let us specify the values of a, b, c, d and make sure that the posterior distribu-
tions are proper. First of all, we can write the model specification as follows
p(Y | β, γ, Z, σ², φ²) ∼ N(Xβ + Zγ, σ²R),  (2–6)
π(β) ∝ 1,
π(γ) ∼ N(0, σ²φ²I),
π(σ²) ∼ IG(a, b),
π(φ²) ∼ IG(c, d).
Let P_X = I − X(X′X)⁻¹X′ and t = rank(P_X Z). According to Theorem 1 of Hobert
and Casella (1996), the following inequalities must be satisfied:

c < 0
q > q − t − 2c
n + 2c + 2a − p > 0
c > −s
2a > −n,

where s is the number of SNPs in the data, q = 2s, p is the dimension of β and n is the
number of observations.
Further simplification leads to the following conditions:

0 > c > −s
a > −n/2
n + 2c + 2a − p > 0.
As the number of observations in the data set is about one thousand, the number
of parameters for the SNPs is about one hundred, and the dimension of β is 61, we take
c = −1, a = 3, b = 1, and d = 1. This specification ensures that the posterior
distributions are proper and that the priors are widely spread and flat.
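As a quick sanity check, the simplified conditions can be verified programmatically. The helper below is hypothetical (not from the dissertation); the numbers are those quoted in the text.

```python
# Check that chosen hyperparameters satisfy the simplified propriety
# conditions:  0 > c > -s,  a > -n/2,  n + 2c + 2a - p > 0.
def priors_are_proper(a, c, n, s, p):
    return (0 > c > -s) and (a > -n / 2) and (n + 2 * c + 2 * a - p > 0)

# Values used in the text: n ~ 1000 observations, s ~ 100 SNPs,
# p = 61 family effects, c = -1, a = 3.
assert priors_are_proper(a=3, c=-1, n=1000, s=100, p=61)
```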
Our data set is from the loblolly pine genome, which, like many other agricultural
genomes, is not fully sequenced, so there is not enough prior information about the
genotype frequencies of each SNP. Also, the physical positions of the SNPs are
not fully recorded and the SNPs may not be tightly linked as clusters of haplotypes. So
noninformative priors are used for the missing SNPs. As the SNPs considered here are
biallelic, each missing SNP has 3 possible genotypes: homozygous, heterozygous and
mutant homozygous. For example, if the two alleles at the locus are A and G, the missing
SNP could be AA, AG, or GG, so the prior assumes the SNP has an equal chance
of being AA, AG, or GG.
For the missing SNPs in the data, we assume that they are missing at random (MAR).
In other words, we assume that the probability that a SNP is missing may depend on
the observed data, such as the phenotypic trait or other observed SNPs, but
is independent of the unobserved information. Under this assumption, we
impute the missing SNPs based on the correlation between SNPs within individuals and
between individuals, and use the phenotypic trait information to improve the imputation.
MAR is a reasonable assumption and not as strict as missing completely at random
(MCAR).
In this model, the covariance matrix R models the covariance between individuals
within the same family and the covariance between individuals across families. Phenotypic
traits of related individuals are alike because they share a large fraction of genetic
material; genotypes of relatives are similar because they share (to some degree) the same
parents or grandparents. The genotyped SNPs may explain some part of the phenotypic
traits, but the untyped genetic information contributes to the phenotypic traits as well,
along with other factors such as environmental effects. Simply using the SNP marker
information without considering the family pedigree or history is not a wise approach.
We expect that incorporating the family structure information will increase the power to
capture the underlying nature of the data set and help to detect the significant SNPs.
In the literature, there are different methods to calculate the relationship matrix, such
as using a co-ancestry matrix, a kinship matrix, etc. The basic idea is to calculate the
probability that 2 individuals share one gene or SNP passed down from the same
ancestor. Some of the methods use pairwise calculations and thus do not guarantee a
positive definite relationship matrix, which is usually not satisfactory when the relation-
ship matrix is used as a covariance matrix. We use the recursive calculation method due
to Henderson (1976). This method gives a numerator relationship matrix which quantifies
the probability of sharing one gene from the same ancestor, based on the known family
pedigree and parent pedigree in the population. Calculating this relationship matrix gives
a value of 0.5 for two siblings within the same family sharing the same copy of a gene
from one ancestor, and a value of 0.25 for two individuals who share one parent. For the
loblolly pine data set, there are a total of 9 categories of relatedness; details are given in
Appendix B. Notice that even if the family pedigree or parent pedigree is not available,
we can still use the proposed method to calculate the relationship matrix. Section 2.5 is
dedicated to categorizing the relationships for any pedigree history.
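Henderson's recursive construction can be sketched in a few lines. This is a minimal illustration (the function name and pedigree encoding are assumptions), and it reproduces the 0.5 full-sib and 0.25 half-sib values quoted above.

```python
import numpy as np

# Minimal sketch of Henderson's (1976) recursion for the numerator
# relationship matrix.  pedigree[i] = (f, m) gives the parent indices of
# individual i (None for an unknown parent); individuals must be ordered
# so that parents come before their offspring.
def numerator_relationship(pedigree):
    n = len(pedigree)
    A = np.zeros((n, n))
    for i, (f, m) in enumerate(pedigree):
        for j in range(i):
            # relationship with an earlier individual: average over parents
            aj = 0.0
            if f is not None:
                aj += 0.5 * A[j, f]
            if m is not None:
                aj += 0.5 * A[j, m]
            A[i, j] = A[j, i] = aj
        # diagonal: 1 plus half the parents' relationship (inbreeding term)
        A[i, i] = 1.0
        if f is not None and m is not None:
            A[i, i] += 0.5 * A[f, m]
    return A

# Two unrelated founders (0, 1) and two full-sib offspring (2, 3):
ped = [(None, None), (None, None), (0, 1), (0, 1)]
A = numerator_relationship(ped)
# A[2, 3] == 0.5: full sibs of unrelated parents are related by 0.5.
```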
As in Equation 2–6, we have the following model specification
p(Y | β, γ, Z, σ², φ²) ∼ N(Xβ + Zγ, σ²R),
π(β) ∝ 1,
π(γ) ∼ N(0, σ²φ²I),
π(σ²) ∼ IG(a, b),
π(φ²) ∼ IG(c, d).
So the conditionals are

β ∼ N((X′R⁻¹X)⁻¹X′R⁻¹(Y − Zγ), σ²(X′R⁻¹X)⁻¹)  (2–7)

γ ∼ N((Z′R⁻¹Z + I/φ²)⁻¹Z′R⁻¹(Y − Xβ), σ²(Z′R⁻¹Z + I/φ²)⁻¹)

σ² ∼ (σ²)^{−(n/2+s/2+a+1)} exp(−[(Y − Xβ − Zγ)′R⁻¹(Y − Xβ − Zγ) + |γ|²/φ² + 2b] / (2σ²))

φ² ∼ (φ²)^{−(s/2+c+1)} exp(−(|γ|²/σ² + 2d) / (2φ²)).
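The four conditionals above can be iterated directly. The following is a minimal sketch of one possible implementation of a single Gibbs cycle on made-up data (all dimensions and data are illustrative, and R is taken as the identity for brevity).

```python
import numpy as np

# One Gibbs cycle for (beta, gamma, sigma2, phi2), following the full
# conditionals: normals for beta and gamma, inverted gammas for the
# variance parameters (drawn as scale / Gamma).
rng = np.random.default_rng(0)
n, p, s = 30, 2, 4
a, b, c, d = 3.0, 1.0, -1.0, 1.0          # hyperparameters chosen in the text
X = rng.normal(size=(n, p))
Z = rng.normal(size=(n, s))
R_inv = np.eye(n)                         # inverse relationship matrix (identity here)
Y = X @ np.array([1.0, -1.0]) + rng.normal(size=n)

def gibbs_cycle(beta, gamma, sigma2, phi2):
    # beta | rest: generalized-least-squares normal
    V = np.linalg.inv(X.T @ R_inv @ X)
    beta = rng.multivariate_normal(V @ X.T @ R_inv @ (Y - Z @ gamma), sigma2 * V)
    # gamma | rest: normal with ridge-like shrinkage through I / phi2
    V = np.linalg.inv(Z.T @ R_inv @ Z + np.eye(s) / phi2)
    gamma = rng.multivariate_normal(V @ Z.T @ R_inv @ (Y - X @ beta), sigma2 * V)
    # sigma2 | rest: inverted gamma
    e = Y - X @ beta - Z @ gamma
    scale = (e @ R_inv @ e + gamma @ gamma / phi2 + 2 * b) / 2
    sigma2 = scale / rng.gamma(n / 2 + s / 2 + a)
    # phi2 | rest: inverted gamma
    phi2 = ((gamma @ gamma / sigma2 + 2 * d) / 2) / rng.gamma(s / 2 + c)
    return beta, gamma, sigma2, phi2

state = (np.zeros(p), np.zeros(s), 1.0, 1.0)
for _ in range(200):
    state = gibbs_cycle(*state)
```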
In the above conditionals, the SNPs are collectively denoted by the Z matrix, which
includes both the observed and the missing SNPs. For the missing SNPs, we use
a Gibbs sampler to impute the values. The Gibbs sampler for the missing data
simulates samples of Z_i^m according to the distribution of each missing SNP conditional
on the rest of the observed SNPs and the sampled missing SNPs. For a particular SNP
Z_ij^m, the jth missing SNP in the ith individual, the conditional distribution given the
rest of the vector Z_{i(−j)}^m and all other parameters in the model is

P(Z_ij^m = c | Z_{i(−j)}^m) = exp(−K_c′Σ⁻¹K_c / (2σ²)) / Σ_{ℓ=1}^{3} exp(−K_ℓ′Σ⁻¹K_ℓ / (2σ²)),  (2–8)

where

K_c = Y_i − X_iβ − Z_i^o γ_i^o − Z_{i(−j)}^m γ_{i(−j)}^m − c γ_ij^m,

and

K_ℓ = Y_i − X_iβ − Z_i^o γ_i^o − Z_{i(−j)}^m γ_{i(−j)}^m − c_ℓ γ_ij^m.
The value c is the genotype currently being considered for that missing SNP, and c_ℓ
ranges over the possible genotypes for the SNP. Notice there are only 3 terms
in the denominator sum for each SNP; this is the key reason Gibbs sampling remains
feasible in our situation, where the observations have many SNPs.
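Because the conditional in (2–8) is supported on only 3 genotypes, each missing-SNP update reduces to 3 density evaluations and one categorical draw. The sketch below is a simplified scalar version of that idea (names are hypothetical; it assumes a working residual `resid` with the jth SNP's contribution already removed, and uses the additive/dominant codes for the candidate genotypes).

```python
import numpy as np

# Sample one missing SNP from its 3-point conditional distribution.
rng = np.random.default_rng(1)
CODES = np.array([[1, 0], [0, 1], [-1, 0]])   # e.g. AA, AG, GG

def sample_missing_snp(resid, gamma_j, sigma2, prior=np.full(3, 1 / 3)):
    # log-weight of each candidate genotype: prior times normal density
    log_w = np.log(prior) - (resid - CODES @ gamma_j) ** 2 / (2 * sigma2)
    w = np.exp(log_w - log_w.max())           # stabilize before normalizing
    return rng.choice(3, p=w / w.sum())       # index of the imputed genotype

g = sample_missing_snp(resid=0.9, gamma_j=np.array([1.0, 0.2]), sigma2=0.1)
```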
2.2.2 The Model with Ramet Random Effect
Although the ramets within a clone share exactly the same genetic information,
they are still subject to environmental influences and to genotype-by-environment
interaction. The originally recorded phenotypic responses are traits per ramet. We are
interested in the environmental variance and hope to find out how large its influence is. In
this subsection, we propose a model that further describes the random effect of ramets.
The model is
Y = Xβ + Zγ + Wu + ε (2–9)
where

Y_{n×1} = phenotypic trait,
X_{n×p} = design matrix for family structure,
β_{p×1} = coefficients for parents (fixed effects),
Z_{n×s} = design matrix for SNPs (genotypes),
γ_{s×1} = parameters of the SNP effects,
W_{n×r} = design matrix for ramets,
u_{r×1} = random effects of ramets,
ε_{n×1} ∼ N(0, σ²I).

The variance term is specified with the identity matrix for simplicity. The full
specification of model 2–9 includes the following prior distributions:

ε ∼ N(0, σ²I)
γ ∼ N(0, σ²φ²I)
u ∼ N(0, σ²τ²I)  (2–10)
σ² ∼ π(σ²)
φ² ∼ π(φ²)
τ² ∼ π(τ²)
One thing to note is that for the design matrix W, we have no missing values, since the
ramet information is available. The joint distribution of all of the parameters is

π(β, γ, u, σ², φ², τ²) ∝ (σ²)^{−(n+s+r)/2} (φ²)^{−s/2} (τ²)^{−r/2}
× exp(−|Y − Xβ − Zγ − Wu|² / (2σ²))
× exp(−|γ|² / (2σ²φ²))  (2–11)
× exp(−|u|² / (2σ²τ²))
× π(σ²)π(φ²)π(τ²)
According to Hobert and Casella (1996), the following inequalities need to be satisfied to
make sure that the posterior distributions are proper:
2c > −rank(Z) (2–12)
2e > −rank(W )
2a > −n
c < 0
e < 0
rank(Z) > rank(K)− t− 2c
rank(W ) > rank(K)− t− 2e
n + 2e + 2c + 2a− p > 0,
where rank(·) is the rank function, K is the concatenated matrix [Z|W], σ² has the prior
σ² ∼ IG(a, b), φ² has the prior φ² ∼ IG(c, d), and τ² has the prior τ² ∼ IG(e, f). Also
t = rank(P_X K), where P_X = I − X(X′X)⁻¹X′ and p is the number of families in the
data. We take a = −1, c = −1, e = −1, b = 0, d = 0, f = 0, which satisfy the
inequalities 2–12.
Based on Equation 2–11, with the flat priors on σ², φ² and τ² shown above to ensure
the propriety of the posterior distributions, we have the following full conditional
distributions:
β ∼ N((X′X)⁻¹X′(Y − Zγ − Wu), σ²(X′X)⁻¹)

γ ∼ N((Z′Z + I/φ²)⁻¹Z′(Y − Xβ − Wu), σ²(Z′Z + I/φ²)⁻¹)

u ∼ N((W′W + I/τ²)⁻¹W′(Y − Xβ − Zγ), σ²(W′W + I/τ²)⁻¹)

σ² ∝ (σ²)^{−(n+s+r)/2} exp(−[|Y − Xβ − Zγ − Wu|² + |γ|²/φ² + |u|²/τ²] / (2σ²))  (2–13)

τ² ∝ (τ²)^{−r/2} exp(−|u|² / (2σ²τ²))

φ² ∝ (φ²)^{−s/2} exp(−|γ|² / (2σ²φ²))
and we generate Z_i by using

P(Z_ij^m = c | Z_i^o, Z_{i(−j)}^m)  (2–14)
= ω_ij((Z_i^o, Z_{i(−j)}^m, c)) exp(−(Y_i − X_iβ − W_iu − Z_i^o γ_i^o − Z_{i(−j)}^m γ_{i(−j)}^m − c γ_ij^m)² / (2σ²))
  / Σ_{ℓ=1}^{g} ω_iℓ((Z_i^o, Z_{i(−j)}^m, c_ℓ)) exp(−(Y_i − X_iβ − W_iu − Z_i^o γ_i^o − Z_{i(−j)}^m γ_{i(−j)}^m − c_ℓ γ_ij^m)² / (2σ²)),

where ω_iℓ is the prior probability assigned to the missing SNP being genotype c_ℓ. We let
the priors of the missing SNPs be uniform; in other words, all ω_iℓ = 1/3.
Using the above method, we need to sample u in each iteration, which is theoretically
feasible. However, the dimension of u is very large and we do not actually want the
samples of u, so we integrate u out and work with the marginal distribution.
When we do this we get

π(β, γ, σ², φ², τ²) ∝ |A|^{−1/2} (σ²)^{−(n+s)/2} (φ²)^{−s/2} (τ²)^{−r/2}
× exp(−(Y − Xβ − Zγ)′B(Y − Xβ − Zγ) / (2σ²))
× exp(−|γ|² / (2σ²φ²))  (2–15)
× π(σ²)π(φ²)π(τ²),

with

A = I/τ² + W′W
B = I − WA⁻¹W′.

As we showed above, with flat priors on σ², φ² and τ², the posterior distributions from the
joint distribution 2–11 are proper. With the same flat priors, the posterior distributions
from Equation 2–15 will still be proper, because the only difference here is that u has
been integrated out, and this does not change the propriety of the other posterior
distributions.
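The algebra behind A and B can be checked numerically. By the Woodbury identity, B, read in the conforming n × n form I − W A⁻¹ W′, inverts the marginal correlation matrix I + τ²WW′ of Y once u is integrated out; the determinant relation |I + τ²WW′| = τ^{2r}|A| also holds. A small sketch (all values illustrative):

```python
import numpy as np

# Numerical check of the u-marginalization algebra via Woodbury.
rng = np.random.default_rng(2)
n, r, tau2 = 8, 3, 0.7
W = rng.normal(size=(n, r))

A = np.eye(r) / tau2 + W.T @ W          # r x r
B = np.eye(n) - W @ np.linalg.inv(A) @ W.T   # n x n

marginal = np.eye(n) + tau2 * W @ W.T   # marginal correlation of Y given u removed
assert np.allclose(B @ marginal, np.eye(n))
```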
Based on Equation 2–15, we have the following full conditionals

β ∼ N((X′BX)⁻¹X′B(Y − Zγ), σ²(X′BX)⁻¹)

γ ∼ N((Z′BZ + I/φ²)⁻¹Z′B(Y − Xβ), σ²(Z′BZ + I/φ²)⁻¹)  (2–16)

σ² ∝ (σ²)^{−(n+s)/2} exp(−[(Y − Xβ − Zγ)′B(Y − Xβ − Zγ) + |γ|²/φ²] / (2σ²))

φ² ∝ (φ²)^{−s/2} exp(−|γ|² / (2σ²φ²)),

again giving normal distributions for β and γ and inverted gammas for σ² and φ².
We ran some simulations on this data set and found that the computation time is
especially long. The reason is that when we use observations per ramet instead of averages
over ramets, the number of observations is about 4 times larger. Consequently, the
computation time is much longer, in terms of matrix inversions and matrix determinants.
We therefore decided to use the first model for the analysis and not to model the variation
within the ramets. Notice that this does not change the inference on the SNPs.

Table 2-2. The percentages of SNP categories in the generated data sets.

                     SNP1  SNP2  SNP3  SNP4  SNP5
Double homozygous:    13%   30%   91%   77%   39%
Heterozygous:         53%   38%    7%   19%   54%
Mutant homozygous:    33%   32%    2%    4%    7%
2.2.3 Increasing Computation Speed
For a data set containing hundreds of observations and hundreds, or even thousands,
of SNPs, the computation speed of the Gibbs sampler can be a serious issue. Furthermore,
if the number of SNPs increases, the number of missing SNPs to be updated in each
iteration also increases. To speed up the calculation, we show that instead of updating all
the SNPs at each iteration, updating only one column of SNPs per cycle still preserves the
target stationary distribution and ergodicity.
Theorem 2.2.1. For the Gibbs sampler corresponding to (2–7) and (2–8), if instead
of updating all the parameters (β^(t), γ^(t), Z^(t), σ²^(t), φ²^(t)) in each cycle, we update
only (β^(t), γ^(t), Z_j^(t), σ²^(t), φ²^(t)) in each iteration, the Markov chain achieves the
same target stationary distribution and ergodicity also holds. Here Z_j denotes the jth
column of the Z matrix for the SNPs (the jth SNP for all observations), and j changes
according to the iteration index.
The proof is given in Appendix A. The practical meaning of this theorem is that
instead of updating tens or hundreds of SNPs in one cycle, we need only update one
SNP per cycle. This dramatically speeds up the computation, especially when there are
large numbers of SNPs in the data.
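The update schedule in the theorem amounts to visiting one SNP column per cycle. A trivial sketch of the bookkeeping (illustrative only; the helper name is hypothetical):

```python
# Systematic scan over SNP columns: cycle t re-imputes only the missing
# entries of column j = t mod s, so every column is still visited
# infinitely often, which preserves stationarity and ergodicity.
def column_schedule(n_cycles, s):
    """Yield the SNP column index updated at each Gibbs cycle."""
    for t in range(n_cycles):
        yield t % s

updates = list(column_schedule(10, 4))
# updates == [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]
```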
2.3 Results for Simulated Data
Before applying the methodology to a real data set, we apply it to simulated data.
We simulated a data set with 6 families, 20 observations in each family and 5 SNPs per
observation. The 5 SNPs are independent of each other. The six families are also
independent: the parents of the 6 families are not related and individuals across families
are independent. On the other hand, the individuals within each family share the same
parents; this relatedness is captured in the numerator relationship matrix. The genotypes
of the SNPs were generated according to the percentages of the SNP categories for the
first 5 observed SNPs in the loblolly pine data set, so the probabilities of double
homozygous, heterozygous, and mutant homozygous sum to 1 for each SNP. From this
data set, four data sets with different percentages of missing values (5%, 10%, 15%, and
20%) were randomly generated. The family effects, β, used to simulate the data are listed
as the actual values in Table 2-4, and the SNP effects (additive and dominant) used to
simulate the data are listed as the actual values in Table 2-5. We let the variance
parameter σ² be 1. The proposed methodology was applied to the data without missing
values and to the data with each percentage of missing values. We want to see whether it
can identify the significant SNPs, and to check its performance across different percentages
of missing values as well as across different probabilities of the SNP genotype categories.
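The genotype simulation described above can be sketched as follows, using the Table 2-2 percentages; the random masking of entries is an assumed MCAR scheme for illustration.

```python
import numpy as np

# Simulate genotypes per SNP from the Table 2-2 category percentages,
# then mask a given fraction of entries as missing.
rng = np.random.default_rng(3)
probs = np.array([            # double hom., het., mutant hom.
    [0.13, 0.53, 0.33],       # note: sums to 0.99 as printed; renormalized below
    [0.30, 0.38, 0.32],
    [0.91, 0.07, 0.02],
    [0.77, 0.19, 0.04],
    [0.39, 0.54, 0.07],
])
probs = probs / probs.sum(axis=1, keepdims=True)

n_obs = 6 * 20                                    # 6 families x 20 observations
geno = np.column_stack([rng.choice(3, size=n_obs, p=p) for p in probs])

def mask_missing(geno, frac):
    out = geno.astype(float)
    out[rng.random(geno.shape) < frac] = np.nan   # nan marks a missing SNP
    return out

geno_10 = mask_missing(geno, 0.10)                # ~10% missing version
```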
Our ultimate goal is to find the significant SNPs among the candidate SNPs. Since
we view imputation as a tool to obtain better estimates of the parameters, we are
not particularly interested in recovering the actual values of the missing SNPs.
That said, the simulation results in Table 2-3 show that when the probability
of one genotype for a certain SNP is dominantly high, the imputed SNPs are correctly
identified with substantially high probability: SNPs with a high probability for one of
the genotype categories have higher probabilities of correct imputation than those that
do not. If the probabilities of the missing SNP being each of the candidate genotypes are
very close, the imputation tends to place the imputed SNP in any one of the candidate
genotype categories. This result is consistent with the simulated SNPs being independent,
so that most of the information comes from the marginal SNP itself. Table 2-4 and
Table 2-5 list the parameter estimates for the family effects and the SNP effects.

Table 2-3. The percentages of correctly imputed SNPs for different probabilities of SNP
categories, with 10% missing values.

                          SNP1       SNP2       SNP3        SNP4         SNP5
SNP values:            a=-2, d=1  a=1, d=-1  a=3, d=0.5  a=2.5, d=0.1  a=0.3, d=3
Probability of "gg":     0.1309     0.3012     0.8181      0.7719       0.3983
Probability of "gc":     0.5307     0.3875     0.0796      0.1950       0.5425
Probability of "cc":     0.3384     0.3113     0.1023      0.0331       0.0592
Correctly imputed:       0.55004    0.54788    0.63372     0.85075      0.65159
Probability of "gg":     0.0309     0.3012     0.3181      0.3719       0.0983
Probability of "gc":     0.8307     0.3875     0.0796      0.1950       0.8425
Probability of "cc":     0.1384     0.3113     0.6023      0.4331       0.0592
Correctly imputed:       0.7589     0.35052    0.97621     0.64829      0.85301
Table 2-4. The estimated means of the family effects for data sets with different
percentages of missing values. The methodology gives accurate estimates as the
percentage of missing values goes up to 20%.

Estimated means     family 1: β1  family 2: β2  family 3: β3
Actual values:          15            20            25
No missing SNPs:        15.45         20.65         25.48
5% missing SNPs:        15.16         20.74         25.46
10% missing SNPs:       16.18         21.38         25.65
15% missing SNPs:       15.45         19.63         24.59
20% missing SNPs:       14.87         20.18         24.68

Estimated means     family 4: β4  family 5: β5  family 6: β6
Actual values:          30            35            40
No missing SNPs:        29.84         34.76         40.40
5% missing SNPs:        28.29         33.43         38.62
10% missing SNPs:       30.71         35.86         40.81
15% missing SNPs:       30.18         35.38         40.18
20% missing SNPs:       30.08         34.88         40.13
Table 2-5. The estimated means of the SNP effects for the data set without missing values
and for data sets with different percentages of missing values.

Actual SNP value:     SNP1:a  SNP1:d  SNP2:a  SNP2:d  SNP3:a
                      -2.00    1.00    1.00   -1.00    3.00
No missing SNPs:      -2.16    1.00    0.82   -0.75    2.59
5% missing:           -1.86    1.14    1.16   -1.05    3.00
10% missing:          -1.95    0.77    1.18   -1.52    2.74
15% missing:          -1.80    0.78    0.99   -0.96    2.48
20% missing:          -2.08    1.29    1.21   -0.76    3.10

Actual SNP value:     SNP3:d  SNP4:a  SNP4:d  SNP5:a  SNP5:d
                       0.00    2.50    0.10    0.30    3.00
No missing SNPs:       0.30    2.43    0.60   -0.04    2.38
5% missing:            0.05    2.21   -0.20    0.48    2.88
10% missing:           0.18    2.51    0.13    0.00    2.53
15% missing:           0.67    2.43    0.47    0.73    3.20
20% missing:           1.32    1.87   -0.20    0.47    3.30
All the calculations were based on samples obtained after an initial 20000 steps
of burn-in. The results in Table 2-4 and Table 2-5 show that when the percentage
of missing values is not too high (less than 15%), the proposed methodology gives good
estimates of the parameters we are interested in. When the percentage of missing
values is beyond 15%, we need to be very careful when interpreting the results.
Take SNP3 as an example: its dominant effect is actually 0, and the estimate was
1.32 when the percentage of missing values is 20%, although the estimates are accurate
when the percentage of missing values is 10% or less. The reason, we believe, is that one
genotype category for SNP3 has a substantially higher probability and dominates the
other two categories. When the percentage of missing values goes up, a dominated
genotype category has only a small chance of being well represented and thus may have
unreliable estimates. Generally, as most microarray data have less than 10% missing
values, the methodology performs well.
2.4 Results for Loblolly Pine
In a previous loblolly pine project, SNP discovery was done for about 50 genes
involved in disease resistance and water-deficit response. The SNPs for these genes
were genotyped by microarray and are scattered over the genome. Also, as loblolly
pine is a tree species with rapid linkage disequilibrium decay, the genotypes are not
closely linked. This is a totally different situation from human association
genetics, where haplotype clustering models rely heavily on linkage disequilibrium
information. The goal of the research presented in this dissertation is to detect, from a
large number of candidate SNPs, the significant SNPs that strongly influence the
quantitative traits, using valid statistical procedures and without assuming that the SNP
markers are clustered.
For this project, we are specifically interested in detecting the relationship between
lesion length and the genotyped SNPs, as lesion length is one of the most important
quantitative traits of pitch canker disease. We also have phenotypic data on carbon
isotope discrimination values from loblolly pines grown in Paltaka, Florida and Cuthbert,
Georgia. The carbon isotope trait is related to water use efficiency, and thus has an
important role in the growth of loblolly pine and further has substantial economic
value. Genetically speaking, the loblolly pines in Paltaka, Florida are replications of
the loblolly pines in Cuthbert, Georgia, except that they grow in different environments,
which might lead to different genotype-environment interactions. Lesion length, measured
on replications of the same genetic material, is a different phenotypic trait from the
carbon isotope discrimination trait. So we have three sets of phenotypic data: carbon
isotope from Paltaka, carbon isotope from Cuthbert, and lesion length. The three share
one genetic SNP data set.
For the design of the experiment, there are 61 loblolly pine families from a circular
design with some off-diagonal crossings. For example, family 00 is generated from parent
24 and parent 23, while family 01 is from parent 24 and parent 40; family 00 and family
01 are not independent since they share one parent. Originally, this circular design had 70
families from 44 parents, although our data sets contain only part of the experimental
design. More details of the experimental design can be found in Kayihan et al. (2005). There also
is a family pedigree file recording the family pedigree and a parent file recording the
parent pedigree. These family histories provide the information for constructing the
numerator relationship matrix. Each family contains a certain number of clones, ranging
from 10 to 18. Note that the term “clone” is borrowed from the geneticists in the loblolly
pine project; it does not imply that two clones share the exact same DNA sequences.
Instead, two clones from the same family are just like two siblings. We have 46 genotyped
SNPs in the loblolly population for association discovery. About 10% of the values of the
46 SNPs are missing and, on average, each observation has more than one missing SNP
value. If we employed conventional listwise deletion of the missing data, almost all
observations would be deleted. The last 2 of the 46 SNPs have substantially higher
percentages of missing values than the others: 69% and 54%, respectively. We are not
sure of the reason for these high percentages; they might be due to microarray
measurement error, experimental error, or some other underlying genetic reason. At this
stage, we decided not to use the last 2 SNPs in this analysis; by doing so, the overall
percentage of missingness is decreased from 10.07% to 7.74%.
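The filtering step just described can be sketched as follows (a hypothetical helper on a toy matrix; the 50% threshold is one choice that removes the two problem SNPs at 69% and 54% missingness).

```python
import numpy as np

# Drop SNP columns whose fraction of missing entries exceeds a threshold.
def drop_sparse_snps(geno, max_missing=0.5):
    """geno: observations x SNPs, with np.nan marking missing entries."""
    miss_rate = np.isnan(geno).mean(axis=0)
    return geno[:, miss_rate <= max_missing], miss_rate

# Toy matrix: the 3rd SNP is 2/3 missing and gets dropped.
geno = np.array([[0.0,    1.0, np.nan],
                 [np.nan, 2.0, np.nan],
                 [1.0,    0.0, 2.0]])
kept, rates = drop_sparse_snps(geno)
# kept has 2 columns; rates == [1/3, 0, 2/3]
```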
For the 61 families, we have family pedigree and parent pedigree information. Using
the Henderson (1976) method, we calculated the numerator relationship matrix; the
details of the calculation are in Appendix B. We use a uniform noninformative prior for
the family effects and a normal prior for the SNP effects, which are of primary interest.
For the variance parameters, we use inverted gamma distributions and make sure that the
posterior distributions are proper according to Hobert and Casella (1996). Each SNP is
parameterized with an additive and a dominant effect. For example, if a SNP is composed
of “A” and “T” nucleotides, the additive effect is one half of the difference between the
effects of the homozygote “AA” and the mutant homozygote “TT”. If we use (γa, γd)′ to
parameterize the SNP effect, genotype “AA” has effect (1, 0)(γa, γd)′ and genotype “TT”
has effect (−1, 0)(γa, γd)′. The dominant effect is the difference between the heterozygous
effect and the average of the homozygous effects, so genotype “AT” has effect
(0, 1)(γa, γd)′. We updated the missing SNPs by columns as detailed in Section 2.2.3, so
the computation is fast (about 3 hours for 30000 iterations). We used 40000 iterations as
initial burn-in and recorded the samples from a further 2000 iterations for the 3 data
analyses.

Figure 2-1. The first trace plot is for the first 2 family effect parameters for the lesion
length data. The second plot is for one of the SNP parameters for the carbon
isotope data from Paltaka, Florida. The samples are taken after the initial
40000 steps of burn-in.
Two trace plots are shown in Figure 2-1. The first is for one of the family
effect parameters for the carbon isotope discrimination data from Paltaka, Florida. The
second is the trace plot of the variance parameter for the carbon isotope discrimination
data from Cuthbert, Georgia. Both plots suggest that the Gibbs sampler
converges to the stationary distribution after burn-in.
Figures 2-2, 2-3, and 2-4 show the confidence intervals for the 44 SNP effects for the
different data sets. We found that the 22nd SNP for the lesion length data, the 16th SNP
for the Paltaka data, and the 8th, 35th and 36th SNPs for the Cuthbert data are
significant at the 95% level. We constructed the confidence intervals in Figures 2-2,
2-3, and 2-4 from the Gibbs sampling output. We did not employ a standard
multiple testing correction, as there are not many significant SNPs and our goal is to
identify statistically significant SNPs for the biologists to further explore in the biological
pathway. These confidence interval plots point out the specific SNPs for further
investigation, and we would have more SNPs to follow up if we loosened the confidence
limits a little. In Figures 2-2, 2-3 and 2-4, the x axis indexes the SNP parameters; as each
SNP has two parameters, with 44 SNPs the index ranges from 1 to 88. The y axis is the
value of the SNP effect. For each parameter of each SNP, there is a small red line at the
top of a blue line and a small red line at the bottom of the blue line: the red line at the
top represents the upper bound of the 95% confidence interval and the red line at the
bottom represents the lower bound. The small green line in the middle represents the
mean of the estimated parameter. Each SNP has two parameters, additive and dominant,
and correspondingly two lines in each plot.
2.5 Quantifying the Covariance and Variance
In order to impute the missing SNPs more accurately, we want to incorporate the
family structure into the model, as mentioned before. For our loblolly pine project,
there are two data sets containing the family information: the family pedigree and the
parent pedigree. The family pedigree has 70 rows, one for each of the 70 families in the
data, and 3 columns: the first column is the family ID, the second column is the female
parent ID and the third column is the male parent ID. The parent IDs run from 1 to 44
and we do not distinguish between female and male parents. The parent pedigree file has
44 rows, corresponding to the 44 parents in the family pedigree, and 3 columns: the first
column is the parent ID, and the second and third columns are the IDs of the
grandparents, which are the parents of the parents in the family pedigree. One thing
to notice: for parents whose IDs are less than 35, the grandparents are assumed to be
independent of all other grandparents without further specification; on the other hand, if
a parent’s ID is greater than or equal to 35, we know the grandparents’ IDs. So 35 is an
important number, and we need to pay attention to it when filling in the covariance
matrix of the families. Another thing: for parent ID 38, the grandparents were originally
recorded as 13 and 0; for programming convenience, I changed them to 13 and −1 without
changing the nature of the covariance. According to the design of the experiment, the
female and male parents are never the same for a family.

Figure 2-2. 95% confidence intervals for the additive and dominant SNP effects of the 44
SNPs for lesion length. The 22nd SNP has a significant dominant effect at the
95% level, while the other SNPs are not significant. Other SNPs, such as the
2nd and the 23rd, are nearly significant at the 95% level; these are good
candidates for further biological exploration.

When calculating the covariance between individuals from different families, one
approach is to divide the combinations of two families into 11 big categories and further
partition within the categories. In the following I explain each of the 11 categories.
Category 1: All Parents’ IDs Less Than 35, None Equal

Suppose I use “a” and “b” to denote the parent IDs of the first family, and “c” and
“d” to denote the parent IDs of the second family. This category requires a < 35, b < 35,
c < 35, d < 35, with a, b, c, d all distinct; no parents’ IDs are equal. This is actually the
simplest case: when it holds, the two families have zero covariance. In our data set, we
need to compute a total of 2415 (= 1 + 2 + · · · + 69) covariance cells, and it turns out
that 507 cases belong to this category.

Figure 2-3. 95% confidence intervals for the additive and dominant SNP effects of the 44
SNPs for the carbon isotope data from Paltaka, Florida. The 16th SNP has a
significant dominant effect at the 95% level, while the other SNPs are not
significant. Other SNPs, such as the 6th and the 40th, are nearly significant at
the 95% level; these are good candidates for further biological exploration.
Category 2: All IDs Less Than 35, One Equality

In this category, we still have a < 35, b < 35, c < 35 and d < 35; further, exactly one
of the following equalities must hold: a = c, a = d, b = c, or b = d. When this holds, the
two families share one parent and their members are half siblings. The program results
show there are 88 cases in our data set.
Category 3: Three IDs Less Than 35, One Equality
When a parent ID is ≥ 35, we can trace its grandparents' IDs. For example, if a ≥ 35, we can trace its grandparents and use (a1, a2) to denote them.

Figure 2-4. 95% confidence intervals for the additive and dominant SNP effects of the 44 SNPs for the carbon isotope data from Cuthbert, Georgia. The 8th, 35th and 36th SNPs have significant dominant effects with 95% confidence, while the other SNPs are not significant. Other SNPs, such as the 13th and the 28th, are approximately significant with 95% confidence. These are good candidates for further biological exploration.

This category can be partitioned into 8 subcategories as follows:
    Family 1: (a1, a2), b    Family 2: b, d;
    Family 1: (a1, a2), b    Family 2: c, b;
    Family 1: a, (b1, b2)    Family 2: a, d;
    Family 1: a, (b1, b2)    Family 2: c, a;
    Family 1: a, d           Family 2: (c1, c2), d;
    Family 1: d, b           Family 2: (c1, c2), d;
    Family 1: a, c           Family 2: c, (d1, d2);
    Family 1: c, b           Family 2: c, (d1, d2).
For simplicity, I will just use

    (a1, a2), b
    b, d

to denote

    Family 1: (a1, a2), b
    Family 2: b, d.

By default, in all later illustrations, the first row represents the first family and the second row represents the second family.
Now let us take a closer look at

    (a1, a2), b
    b, d.

In this situation we know that the two families share one parent, b, but we need to be a little careful, since some of the following equalities might also hold, and if so we want to know how many:

    a1 = b,  a2 = b,  a1 = d,  a2 = d.
We found that out of the 84 cases in this category for our data set, 83 do not satisfy any of the above equalities; that is, those 83 pairs of families are just half siblings. The last case in this category, the covariance between family 38 and family 70, falls into the following situation:

    a, d
    (a, c2), d.
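The grandparent-tracing check used in this category can be sketched as follows; the pedigree dictionary, function names, and cutoff are illustrative assumptions, not the dissertation's actual program.

```python
# Hypothetical sketch of the ID-tracing check: a parent ID >= 35 is expanded
# to its two grandparent IDs, and we test which equalities hold between the
# expanded ID sets of the two families.
FOUNDER_CUTOFF = 35

def expand(parent_id, grandparents):
    """Return the set of founder-level IDs a parent traces back to."""
    if parent_id < FOUNDER_CUTOFF:
        return {parent_id}
    return set(grandparents[parent_id])   # (a1, a2) for a non-founder parent

def equalities(family1, family2, grandparents):
    """Return the IDs shared between the expanded parents of two families."""
    ids1 = set().union(*(expand(p, grandparents) for p in family1))
    ids2 = set().union(*(expand(p, grandparents) for p in family2))
    return ids1 & ids2

# Toy pedigree: parent 40 is a non-founder with grandparents (3, 7).
gp = {40: (3, 7)}
shared = equalities((40, 5), (5, 9), gp)   # the families share founder 5
```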
Category 4: Three Parents Less Than 35, None Equal
We can further partition this category into four subcategories and denote them as follows:

    (a1, a2), b        a, (b1, b2)        a, b               a, b
    c, d               c, d               (c1, c2), d        c, (d1, d2)
Take the case of

    (a1, a2), b
    c, d

as an example. We know the parents are distinct from each other, but we still need to consider whether some of the following equalities hold:

    a1 = c,  a2 = c,  a1 = d,  a2 = d.
Note that we do not need to consider whether a1 = b or a2 = b, since right now we are only considering the covariance between the two families. The results show that for all 896 cases in this category, none of the above equalities is satisfied, so the covariance for all 896 cases is 0.
Category 5: Two Parents Less Than 35 and From One Family, No Equalities
In this category, the first thing to notice is that when the two parents with IDs ≥ 35 form one family and the two parents with IDs < 35 form the other family, no parent IDs can be equal. This category can be further partitioned into 2 subcategories:

    (a1, a2), (b1, b2)        a, b
    c, d                      (c1, c2), (d1, d2)
Take the subcategory

    (a1, a2), (b1, b2)
    c, d

as an example. We need to verify whether some of the following equalities hold:

    a1 = c,  a2 = c,  b1 = c,  b2 = c,
    a1 = d,  a2 = d,  b1 = d,  b2 = d.
Note that we do not need to worry about whether a1 = b1, a1 = b2, a2 = b1, or a2 = b2 holds: those equalities matter only for the variance of individuals within the first family, while here we are considering the covariance between individuals from the first family and the second family. We found that of the 245 cases in this category, 223 have no equality satisfied and the families are independent. In the remaining 22 cases exactly one equality holds per case, which means one tree is a parent of one family and a grandparent of the other family.
Category 6: Each Family Has One Parent Less Than 35, One Equality
Similar to the above, we first partition this category into 8 subcategories, illustrated as follows:

    (a1, a2), b        (a1, a2), b        (a1, a2), b        (a1, a2), b
    (a1, a2), d        c, (a1, a2)        (c1, c2), b        b, (d1, d2)

    a, (b1, b2)        a, (b1, b2)        a, (b1, b2)        a, (b1, b2)
    a, (d1, d2)        (b1, b2), d        (c1, c2), a        c, (b1, b2)
These 8 subcategories can be represented by 2 situations. The first is

    (a1, a2), b
    (a1, a2), d.

In this situation, we need to verify the following equalities:

    a1 = b,  a1 = d,  a2 = b,  a2 = d.

The second representative is

    (a1, a2), b
    (c1, c2), b,

for which we need to verify the following 2 groups of equalities.

Group 1:
    a1 = b,  a2 = b,  c1 = b,  c2 = b.

Group 2:
    a1 = c1,  a1 = c2,  a2 = c1,  a2 = c2.
Note that the above two groups of equalities have slightly different meanings, and it is reasonable to believe that the covariance is bigger when an equality holds in the first group than when one holds in the second group. The results show that there are 49 cases in this category, and all except one satisfy none of the above equalities and are just half siblings. For the case of family 1 and family 16, one equality from group 2 holds, so the two families not only share one parent but also share a great-grandparent.
Category 7: Each Family Has One Parent Less Than 35, No Equalities
In this category, there are four subcategories, using the same notation as above:

    (a1, a2), b        (a1, a2), b        a, (b1, b2)        a, (b1, b2)
    (c1, c2), d        c, (d1, d2)        (c1, c2), d        c, (d1, d2)

We need to verify some equalities for each of the four subcategories; we take the first subcategory

    (a1, a2), b
    (c1, c2), d

as an example, and check whether the following equalities from the two groups hold:
Group 1:
    c1 = b,  c2 = b,  a1 = d,  a2 = d.

Group 2:
    a1 = c1,  a1 = c2,  a2 = c1,  a2 = c2.
Note that an equality from group 1 is not the same as one from group 2, since they carry different weights in the covariance. The results show there are 329 cases in this category: 5 of them satisfy one equality from group 1, 24 satisfy one equality from group 2, and 3 satisfy one equality from each group. The remaining cases satisfy no equality from either group, and their covariance is 0.
Category 8: One Parent Less Than 35, One Equality
As above, there are 8 subcategories here:

    (a1, a2), (b1, b2)     (a1, a2), (b1, b2)     (a1, a2), (b1, b2)     (a1, a2), (b1, b2)
    (a1, a2), d            (b1, b2), d            c, (a1, a2)            c, (b1, b2)

    (a1, a2), b            (a1, a2), b            a, (b1, b2)            a, (b1, b2)
    (a1, a2), (d1, d2)     (c1, c2), (a1, a2)     (b1, b2), (d1, d2)     (c1, c2), (b1, b2)
We will just take one subcategory to show the equalities which need to be verified. For example, for the subcategory

    (a1, a2), (b1, b2)
    (a1, a2), d,

the following equalities need to be checked:

Group 1:
    a1 = d,  a2 = d,  b1 = d,  b2 = d.

Group 2:
    a1 = b1,  a1 = b2,  a2 = b1,  a2 = b2.
As above, note that the equalities from these 2 groups carry different weights when calculating the covariance. The program results show that in our data set, 27 out of 28 cases in this category have no equality holding in either group; that is, these cases are half siblings. Only for the case of family 65 and family 68 does one equality from group 1 additionally hold.
Category 9: Only One Parent Less Than 35, No Equality
As above, we partition this category into 4 subcategories:

    (a1, a2), (b1, b2)     (a1, a2), (b1, b2)     (a1, a2), b            a, (b1, b2)
    (c1, c2), d            c, (d1, d2)            (c1, c2), (d1, d2)     (c1, c2), (d1, d2)
For these 4 subcategories we need to verify some equalities. Take the subcategory

    (a1, a2), (b1, b2)
    (c1, c2), d

as an example. The following 2 groups need to be checked, where equalities in different groups carry different weights for the covariance:

Group 1:
    a1 = d,  a2 = d,  b1 = d,  b2 = d.

Group 2:
    a1 = c1,  a1 = c2,  a2 = c1,  a2 = c2,
    b1 = c1,  b1 = c2,  b2 = c1,  b2 = c2.
In our data set, there are 168 cases that belong to this category. Of these, 126 have no equality satisfied in either group; that is, they have 0 covariance. Five cases have one equality holding from group 1; that is, one parent of a family is a grandparent of the other family. Thirty-six cases have one equality holding from group 2; that is, one tree is a grandparent for both families. One case has one equality holding from each group.
Category 10: All Parents Greater Than or Equal to 35, One Equality
As usual, to clearly illustrate the line of thought, we partition this category into 4 subcategories:

    (a1, a2), (b1, b2)     (a1, a2), (b1, b2)     (a1, a2), (b1, b2)     (a1, a2), (b1, b2)
    (a1, a2), (d1, d2)     (c1, c2), (a1, a2)     (b1, b2), (d1, d2)     (c1, c2), (b1, b2)
Certainly we need to verify some of the equalities. We take

    (a1, a2), (b1, b2)
    (a1, a2), (d1, d2)

as an example, and check the following equalities:

    a1 = d1,  a1 = d2,  a2 = d1,  a2 = d2,
    a1 = b1,  a1 = b2,  a2 = b1,  a2 = b2,
    b1 = d1,  b1 = d2,  b2 = d1,  b2 = d2.
In our data set, there are 9 cases belonging to this category, and 7 of them have no equality holding; they are just half siblings. The remaining 2 cases not only share one parent but also share one grandparent.
Category 11: All Parents Greater Than or Equal to 35, No Equalities
In this last case, we cannot partition further, and we can trace all the grandparents' IDs. The following is an illustration:

    (a1, a2), (b1, b2)
    (c1, c2), (d1, d2).
For this, we still need to verify the following equalities:

    a1 = c1,  a1 = c2,  a2 = c1,  a2 = c2,
    a1 = d1,  a1 = d2,  a2 = d1,  a2 = d2,
    b1 = c1,  b1 = c2,  b2 = c1,  b2 = c2,
    b1 = d1,  b1 = d2,  b2 = d1,  b2 = d2.
The data set has 12 cases in this category: 4 of them have no equality holding, so they have 0 covariance. The other 8 cases each have one equality holding, which means the two families share one grandparent.
Variance
Now I will explain how the variance for the 70 families is calculated. We know the parents in each family are distinct from each other, but we still need to check whether the parents are related to some degree. The following are the possible situations for the parent IDs:

    (a, b),    ((a1, a2), b),    (a, (b1, b2)),    ((a1, a2), (b1, b2)).
The first bracket means both parents' IDs are less than 35, and thus they are automatically independent. For the second bracket, the first parent's ID is greater than or equal to 35, and thus we are able to trace its grandparents' IDs; with that, we can check whether the other parent is the same as one of the grandparents. We need to check the same relationship for the third bracket. For the last bracket, both parents' IDs are greater than or equal to 35, so we know the grandparents' IDs for both of them; with this information, we check whether they share any grandparents. The program results show that in all 70 families, the parents are independent and do not share any grandparent.
Summary of Variance Covariance Calculation
In the above, for our loblolly pine data, we partitioned all cases into 11 categories and further detailed the relationships inside each category. As a check, the sum of the cases in the 11 categories is

    507 + 88 + 84 + 896 + 245 + 49 + 329 + 28 + 168 + 9 + 12 = 2415,

which equals the number of covariances we need to calculate.
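This bookkeeping can be checked mechanically; a minimal sketch:

```python
# Check that the 11 category counts from the text account for every
# between-family covariance cell among the 70 families.
category_counts = [507, 88, 84, 896, 245, 49, 329, 28, 168, 9, 12]

n_families = 70
n_pairs = n_families * (n_families - 1) // 2   # 1 + 2 + ... + 69 = 2415

total = sum(category_counts)                   # should equal n_pairs
```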
In the above, we found that in every family the two parents are unrelated, so the standard procedure can be used to calculate the variance for each family. In terms of covariance, in category 1 all 507 cases have zero covariance. In category 2, all 88 cases are half siblings. Eighty-three out of 84 cases in category 3 are half siblings, and the covariance between family 38 and family 70 has the following representation:

    a, d
    (a, c2), d.
All 896 cases in category 4 have 0 covariance. In category 5, 223 out of 245 cases have 0 covariance, and the remaining 22 cases have the following representation:

    (a1, a2), (b1, b2)
    a1, d.

In category 6, 48 out of 49 cases are half siblings. The covariance between family 1 and family 16 has the following representation:

    a, (b1, b2)
    a, (d1, b2).
In category 7 there are a total of 329 cases, and 297 of them have 0 covariance. Five of them have the following representation:

    (d, a2), b
    (c1, c2), d,

while 24 cases have the following representation:

    (a1, a2), b
    (a1, c2), d.

The remaining 3 cases have the following illustration:

    (a1, a2), b
    (a2, b), d.
In category 8, 27 out of 28 cases are half siblings; only the covariance of family 65 and family 68 is a little stronger, with the following representation:

    (a1, a2), (b1, b2)
    (a1, a2), a1.
In category 9, 126 out of 168 cases have 0 covariance. In another 5 cases one tree is a parent of one family and also a grandparent of the other family; the representation is:

    (a1, a2), (b1, b2)
    (c1, c2), a1.

Another 36 cases have the following representation:

    (a1, a2), (b1, b2)
    (a1, c2), d.
The last case in this category has the following representation:

    (a1, a2), (b1, b2)
    (a1, c2), b1.
Nine cases belong to category 10. Seven out of nine are half siblings, and the remaining two have the following representation:

    (a1, a2), (b1, b2)
    (b1, c2), (a1, a2).

In the last category, 4 cases have 0 covariance, and the remaining 8 cases share one grandparent between the two families, with the following representation:

    (a1, a2), (b1, b2)
    (c1, c2), (a1, d2).
One final point should be mentioned. The method described above for categorizing the variances and covariances of individuals within a pedigree can easily be adapted to other pedigree files, and our program code is available. As for the numerical values assigned to each category, we received help from a geneticist after we quantified the categories.
2.6 Discussion
The Gibbs sampler simulates samples of Z^m_i according to the distribution of each missing element, conditional on the remaining parameters and the other imputed SNPs. For a particular element Z^m_ij, the conditional distribution given the rest of the vector Z^m_{i(-j)} is

\[
P\left(Z^m_{ij}=c \mid Z^m_{i(-j)}\right)=
\frac{\exp\left(-\frac{1}{2\sigma^2}\left(Y_i-X_i\beta-Z^o_i\gamma^o_i-Z^m_{i(-j)}\gamma^m_{i(-j)}-c\,\gamma^m_{ij}\right)^2\right)}
{\sum_{\ell=1}^{3}\exp\left(-\frac{1}{2\sigma^2}\left(Y_i-X_i\beta-Z^o_i\gamma^o_i-Z^m_{i(-j)}\gamma^m_{i(-j)}-c_\ell\,\gamma^m_{ij}\right)^2\right)},
\qquad (2\text{--}17)
\]

where there are only 3 terms in the denominator sum, which makes the Gibbs sampler computationally feasible.
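A sketch of this update for a single missing entry, simplified to a scalar genotype code (the actual model uses the two-dimensional additive/dominant coding); the function name and argument names are illustrative, not the dissertation's code:

```python
import numpy as np

# Sketch of the Gibbs update in Equation 2-17 for one missing SNP entry: the
# conditional probability of each of the 3 genotype codes is proportional to
# the normal likelihood of the residual with that code plugged in.
def sample_missing_snp(resid_without_j, gamma_j, sigma2,
                       codes=(-1.0, 0.0, 1.0),
                       rng=np.random.default_rng(0)):
    """resid_without_j = Y_i - X_i*beta - (all other SNP terms);
    gamma_j = effect of the SNP being imputed."""
    log_w = np.array([-(resid_without_j - c * gamma_j) ** 2 / (2.0 * sigma2)
                      for c in codes])
    w = np.exp(log_w - log_w.max())       # stabilize before normalizing
    probs = w / w.sum()                   # 3-term denominator of Eq. 2-17
    return rng.choice(codes, p=probs), probs

value, probs = sample_missing_snp(0.0, 1.0, 1.0)
```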
In the area of human genetic association, there are some articles using the EM algorithm to impute missing SNPs, but these methods typically rely on linkage disequilibrium and assume that either haplotypes or alleles are clustered. For our situation, without the assumption of clustering or dependence, an EM algorithm needs to calculate all possible combinations of missing SNPs, and the number of combinations can easily grow beyond the realm of computational feasibility. Using a Monte Carlo EM approach would lead to the Gibbs sampler of 2–8, but rather than the one random variable generation per iteration used here, Monte Carlo EM would need thousands of Z generations per iteration, again precluding computational feasibility.
As for the prior distribution for the SNP genotype categories, one alternative is
to use the available genome information or the observed genotype frequencies as prior
information. For our situation, we do not have a sequenced genome library for the loblolly
pine, and we do not want to use the observed SNP information as the prior, so we use a
uniform prior for the missing SNPs.
In this chapter, the responses are assumed to be continuous, but the method can be adapted to discrete cases. For example, in a case-control study, the response would be case or control status. By employing a probit model, we add a truncated latent variable to the Gibbs sampling cycle, and the latent variable acts as the response did in our previous data set examples. This method can also be modified to handle a multiple-category response.
Let us assume that the response y is a vector whose elements are either 0 or 1 for the case-control study. We employ a continuous latent variable l and assume that l has a multivariate normal distribution. Then the whole model specification is as follows:

\[
\begin{aligned}
Y &= I(l > 0) \qquad (2\text{--}18)\\
l &\sim N(X\beta + Z\gamma,\; \sigma^2 I)\\
\beta &\sim \pi(\beta)\\
\gamma &\sim N(0,\; \sigma^2\phi^2 I)\\
\sigma^2 &\sim \pi(\sigma^2)\\
\phi^2 &\sim \pi(\phi^2)
\end{aligned}
\]
Then we can write the joint likelihood as:

\[
\begin{aligned}
L(l,\beta,\gamma,\sigma^2,\phi^2 \mid y) \;\propto\;&
\prod_{i=1}^{n}\Big[I(l_i>0)\,I(y_i=1)+I(l_i\le 0)\,I(y_i=0)\Big] \qquad (2\text{--}19)\\
&\times \exp\left(-\frac{|l-X\beta-Z\gamma|^2}{2\sigma^2}\right)
\exp\left(-\frac{|\gamma|^2}{2\sigma^2\phi^2}\right)\pi(\beta)\,\pi(\sigma^2)\,\pi(\phi^2).
\end{aligned}
\]
With the above joint likelihood, we can write the conditional distributions as before, except that now we also have the conditional distribution of l given the other parameters and y. The conditional of l_i is

\[
l_i \sim N(X_i\beta + Z_i\gamma,\; \sigma^2) \ \text{truncated at the left by } 0, \ \text{if } y_i = 1,
\]
\[
l_i \sim N(X_i\beta + Z_i\gamma,\; \sigma^2) \ \text{truncated at the right by } 0, \ \text{if } y_i = 0.
\]

With all the conditionals we can run a Gibbs sampler to sample through the parameter space and carry out statistical inference. When y is a multi-category response, we can introduce latent variables and adapt similarly.
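The latent-variable draw can be sketched as follows; rejection sampling is used purely for illustration, and the function name is ours, not the dissertation's:

```python
import numpy as np

# Sketch of the latent-variable update for the probit extension: draw l_i
# from N(mean_i, sigma2) truncated to (0, inf) when y_i = 1 and to (-inf, 0]
# when y_i = 0, via simple rejection sampling (illustrative, not optimized).
def draw_latent(mean_i, sigma2, y_i, rng=np.random.default_rng(1)):
    while True:
        l = rng.normal(mean_i, np.sqrt(sigma2))
        if (y_i == 1 and l > 0) or (y_i == 0 and l <= 0):
            return l

l_case = draw_latent(0.5, 1.0, 1)     # positive draw for y_i = 1
l_control = draw_latent(0.5, 1.0, 0)  # non-positive draw for y_i = 0
```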
For the SNP effect, we use [1, 0][γ_a, γ_d]^T to denote the effect of genotype "AA", [0, 1][γ_a, γ_d]^T to denote the effect of genotype "AC", and [−1, 0][γ_a, γ_d]^T to denote the effect of genotype "CC". The parameter γ_a is the additive SNP effect and γ_d is the dominant SNP effect. When the values of γ_a and γ_d are not close, the algorithm gives different probabilities to the three possible candidates and imputes the missing SNP accordingly. When γ_d is close to γ_a, the algorithm gives close probabilities to two of the candidate genotypes, as it cannot distinguish between "AA" and "AC" in the missing SNP; when γ_d is close to −γ_a, it cannot distinguish between "CC" and "AC". However, our interest is not in recovering the actual genotype; we are more interested in estimating the additive and dominant effects of each SNP. When one of the above situations occurs, we know that either "AA" and "AC" have almost equal effects on the response (full dominance), or "CC" and "AC" have equal effects.
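The genotype coding just described can be written as a small lookup; the names below are our own illustrative choices:

```python
# The genotype-to-design coding from the text: each genotype maps to the row
# vector that multiplies (gamma_a, gamma_d)^T.
CODING = {"AA": (1, 0), "AC": (0, 1), "CC": (-1, 0)}

def snp_effect(genotype, gamma_a, gamma_d):
    a, d = CODING[genotype]
    return a * gamma_a + d * gamma_d

# When gamma_d is close to gamma_a, "AA" and "AC" give nearly equal effects
# (full dominance), so the imputation cannot distinguish them:
eff_AA = snp_effect("AA", 1.0, 1.0)
eff_AC = snp_effect("AC", 1.0, 1.0)
```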
For population association, much research has focused on using haplotypes. However, in plant or other agricultural genome association studies, it is often the case that the whole genome is not fully sequenced and little prior linkage disequilibrium information is available, so block-wise haplotypes or clusters of haplotypes are impossible to construct. The proposed method has wide application in plant and agricultural association studies without constructing haplotypes, and it has reasonable computation speed. Based on our simulations, the proposed method can adequately identify the significant SNPs.
CHAPTER 3
BAYESIAN VARIABLE SELECTION FOR GENOMIC DATA WITH MISSING COVARIATES
3.1 Introduction
In Chapter 2, we employed a Bayesian hierarchical model and discovered some SNPs according to Bayesian credible intervals. Furthermore, it is biologically possible that there are subsets of SNPs which interact with each other and are responsible for the phenotypic traits we are interested in. So in this chapter, we are interested in selecting the "good" subsets, or the "good" models, which have high posterior probabilities given the observed data. Our model choice is restricted to the range of linear mixed models with subsets of SNP variables chosen from the total of s candidate SNP variables.
To be more general, suppose there are s variables to be considered in the data set. When s is finite and the values of the covariates in the experiment are fully observed, much progress has been made; this progress was reviewed in Chapter 1. However, the situation in which the candidate variables have a certain percentage of missing values is a novel research field, and not much work has been done.
For each candidate variable, there are two choices: included in the "good" model or not. So with s candidate variables, there are 2^s models in the model sample space. When s is moderately large, 2^s can be huge; typically, when s is bigger than 15, it is unrealistic to calculate and compare all 2^s posterior model probabilities. In this case, a feasible method is a stochastic search. In theory, a stochastic search algorithm has a chance of searching the entire sample space. The stochastic search chain is expected to stay longer at models with high posterior probability, while occasionally visiting models with low posterior probability. By doing so, it avoids getting stuck in local modes and has a chance of exploring the model sample space globally.
The following is our plan for the model selection:
• Run 2 parallel chains together.
• The first chain is a Gibbs sampler for the full model. This full model includes all the parameters: the family effects, the variance parameters, the SNP effects, and the missing SNPs.
• Simultaneously, run a hybrid Metropolis-Hastings (M-H) chain to search for good subsets of variables (in our case, SNPs) according to the posterior probabilities of the models.
  a) We estimate the Bayes factors using samples from the parallel Gibbs sampler.
  b) The stochastic search chain is driven by the estimated Bayes factors.
• According to the frequency of the model visits, report the good models.
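The plan above can be sketched as a skeleton; the Bayes factor estimator passed in is a stand-in (here a toy scoring function), and all names are our own illustrative choices:

```python
import numpy as np

# Skeleton of the planned stochastic search over models: delta is a 0/1
# inclusion vector over s SNPs; flip one coordinate at a time and accept
# with a Metropolis-Hastings ratio of (estimated) Bayes factors.
def stochastic_search(s, estimate_bf, n_iter=1000,
                      rng=np.random.default_rng(2)):
    delta = np.ones(s, dtype=int)              # start from the full model
    visits = {}
    bf_current = estimate_bf(delta)
    for _ in range(n_iter):
        proposal = delta.copy()
        j = rng.integers(s)
        proposal[j] = 1 - proposal[j]          # flip inclusion of SNP j
        bf_prop = estimate_bf(proposal)
        if rng.random() < min(1.0, bf_prop / bf_current):
            delta, bf_current = proposal, bf_prop
        key = tuple(delta)
        visits[key] = visits.get(key, 0) + 1   # report models by visit count
    return visits

# Toy run: a scoring function that favors models with more included SNPs.
visits = stochastic_search(3, lambda d: float(1 + d.sum()))
```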
Now let us introduce the Bayes factor. For two models M_δ and M_1 with parameters θ_δ and θ_1, the Bayes factor is defined as

\[
BF_{\delta,\,\delta=1}=\frac{m_\delta(Y)}{m_{\delta=1}(Y)}
=\frac{\int p(\theta_\delta \mid M_\delta)\, p(Y \mid \theta_\delta, M_\delta)\, d\theta_\delta}
      {\int p(\theta_1 \mid M_1)\, p(Y \mid \theta_1, M_1)\, d\theta_1},
\qquad (3\text{--}1)
\]

where p(θ_δ | M_δ) is the parameter distribution for model M_δ and p(θ_1 | M_1) is the parameter distribution for model M_1. Likewise, p(Y | θ_δ, M_δ) is the probability distribution of the data under model M_δ and p(Y | θ_1, M_1) is the probability distribution under model M_1. Here θ_δ and θ_1 are written as scalars, but they extend to parameter vectors. The constant m_δ(Y) is the marginal likelihood for model M_δ and the constant m_{δ=1}(Y) is the marginal likelihood for model M_1.
In this chapter, we first discuss how to estimate the Bayes factors, as no closed form exists. Then we propose a hybrid stochastic search algorithm to search the sample space of models, and we establish the ergodic properties of the derived Markov chain. Computation speed is always a concern in variable selection, so finally we scale the procedures up to handle large data sets.
3.2 Bridge Sampling Extension
We first declare our notation. We are doing a Bayesian analysis, and all the models involve prior specifications of their parameters. The full model is:

\[
\begin{aligned}
Y_{n\times 1} &\sim N(X\beta + Z\gamma,\; \sigma^2 I) \qquad (3\text{--}2)\\
\beta_{p\times 1} &\sim \text{noninformative uniform on } (-\infty, \infty)\\
\gamma_{s\times 1} &\sim N(0,\; \sigma^2\phi^2 I)\\
\sigma^2 &\sim IG(a, b)\\
\phi^2 &\sim IG(c, d)\\
Z_{mis} &\sim \text{uniform distribution,}
\end{aligned}
\]

where IG(a, b) represents the inverted gamma distribution with the parameterization

\[
f(\sigma^2)=\frac{b^a}{\Gamma(a)}\,\frac{\exp(-b/\sigma^2)}{(\sigma^2)^{a+1}},
\]

and a, b, c, d are pre-specified constants which ensure that the posterior distributions are proper.
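Under this parameterization, σ² ~ IG(a, b) is equivalent to 1/σ² ~ Gamma(shape a, rate b), which gives an easy way to draw from the prior; a sketch with our own function name:

```python
import numpy as np

# If sigma2 ~ IG(a, b) under the parameterization above, then 1/sigma2 is
# Gamma(shape=a, rate=b); numpy's gamma uses scale = 1/rate.
def draw_inverse_gamma(a, b, size, rng=np.random.default_rng(3)):
    return 1.0 / rng.gamma(shape=a, scale=1.0 / b, size=size)

# Sanity check against the known mean b/(a-1), valid for a > 1:
draws = draw_inverse_gamma(a=3.0, b=2.0, size=200_000)   # mean should be ~1.0
```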
If we use M_1, ..., M_{2^s} to represent all the models in the model sample space, the goal of this project is to select good models from these candidates. For each model in the sample space there is a corresponding indicator vector δ, and each model is defined by a family of distributions with parameters β, γ_δ, σ², φ². Any two different models M_j and M_{j'} share the same parameterization of β, σ², φ²; they differ in the indicator vector δ, and thus also in γ_δ.
Several things should be noticed:
• The matrix Z_{n×s} is the design matrix of the SNPs in the full model, and it contains missing values, as about 10% of SNPs are missing in the loblolly pine data. Corresponding to the δ of different models, the SNP design matrix Z_δ changes as well.
• The parameter γ_δ parameterizes the SNP effects in each model, and we are trying to select subsets of SNPs which have significant effects on the response. In the full model, implicitly, γ has the corresponding vector δ = (1, 1, ..., 1)^T. Since the δ vector is all ones for the full model, we suppress δ there and hope the notation is still clear.
• The vector β parameterizes the family effects in the data set. For all candidate models, the family effect parameter β is always included. X is an incidence matrix which records which family each observation belongs to.
• There are 2^s candidate models in the model sample space, and the sample space is finite.
As we are doing Bayesian variable selection, we use the Bayes factor as the criterion to select models. The difficult part of this project is that we are selecting subsets with different dimensions and there are missing values in the SNP covariates. To overcome this problem, we propose a Bayes factor approximation formula which takes care of both the different dimensions and the missing values. We plan to employ a stochastic search algorithm to search for good subsets according to the Bayes factors. For each candidate subset, we calculate the Bayes factor for the candidate subset versus the full model, so the full model acts as the reference model. Depending on the values of the current and previous Bayes factors, we decide whether or not to take the candidate subset as the current subset. Alternatively, we could use the simplest model, which does not include any variables, as the reference model when calculating the Bayes factors. We will show that using the simplest model gains us nothing in terms of computation; more discussion of this choice is given later.
Before proceeding to the calculation of the Bayes factor, we introduce the bridge sampling method proposed by Meng and Wong (1996).
Let p_i, i = 1, 2, be two probability densities and q_i, i = 1, 2, be the corresponding unnormalized densities with normalizing constants c_i, i = 1, 2, so that, for i = 1, 2,

\[
p_i(\omega)=\frac{q_i(\omega)}{c_i}, \qquad \omega \in \Omega_i \subset \mathbb{R}^d,
\]

where Ω_i is the support of p_i(ω).
Under some general conditions, for any α(ω), Meng and Wong utilized the equation

\[
\frac{\int_{\Omega_2} q_1(\omega)\,\alpha(\omega)\,p_2(\omega)\,d\omega}
     {\int_{\Omega_1} q_2(\omega)\,\alpha(\omega)\,p_1(\omega)\,d\omega}
=\frac{c_1}{c_2}\times
\frac{\int_{\Omega_1\cap\Omega_2}\alpha(\omega)\,p_1(\omega)\,p_2(\omega)\,d\omega}
     {\int_{\Omega_1\cap\Omega_2}\alpha(\omega)\,p_1(\omega)\,p_2(\omega)\,d\omega},
\]

and this yields their key identity

\[
\frac{c_1}{c_2}=\frac{E_2\!\left[q_1(\omega)\,\alpha(\omega)\right]}{E_1\!\left[q_2(\omega)\,\alpha(\omega)\right]},
\]

where E_1 and E_2 are expectations taken with respect to the probability densities p_1 and p_2. To apply this to Bayes factors, recall from Equation 3–1 that

\[
BF_{\delta,\,\delta=1}=\frac{m_\delta(Y)}{m_{\delta=1}(Y)}.
\]
We know that

\[
p_\delta(\theta \mid M_\delta, Y)=\frac{p(Y \mid \theta, M_\delta)\,p(\theta \mid M_\delta)}{m_\delta(Y)},
\]

and

\[
p_{(\delta=1)}(\theta \mid M_{(\delta=1)}, Y)=\frac{p(Y \mid \theta, M_{(\delta=1)})\,p(\theta \mid M_{(\delta=1)})}{m_{\delta=1}(Y)}.
\]

Using Meng and Wong's terminology, c_1 = m_δ(Y), c_2 = m_{δ=1}(Y),

\[
q_1 = p(Y \mid \theta, M_\delta)\,p(\theta \mid M_\delta), \qquad
q_2 = p(Y \mid \theta, M_{(\delta=1)})\,p(\theta \mid M_{(\delta=1)}),
\]

and

\[
p_1 = p_\delta(\theta \mid M_\delta, Y), \qquad
p_2 = p_{(\delta=1)}(\theta \mid M_{(\delta=1)}, Y).
\]

Taking

\[
\alpha = \frac{1}{q_2(\theta)},
\]

we have

\[
BF_{\delta,\,\delta=1}=\frac{m_\delta(Y)}{m_{\delta=1}(Y)}=\frac{c_1}{c_2}
=\frac{\int q_1\,\alpha\,p_2\,d\theta}{\int q_2\,\alpha\,p_1\,d\theta}
=\int \frac{q_1}{q_2}\,p_2(\theta)\,d\theta. \qquad (3\text{--}3)
\]
When we have samples θ^(i), i = 1, ..., n, of the parameter θ from the distribution p_2(θ), we can estimate the Bayes factor by

\[
\frac{1}{n}\sum_{i=1}^{n}\frac{q_1(\theta^{(i)})}{q_2(\theta^{(i)})},
\]

which Meng and Wong call the bridge sampling estimate of the Bayes factor.
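A toy illustration of this estimate, using unnormalized normal densities whose normalizing-constant ratio c1/c2 is known to be 1/2; all names are our own, not the dissertation's code:

```python
import numpy as np

# q1, q2 are unnormalized N(0,1) and N(0,4) densities, so
# c1 = sqrt(2*pi) and c2 = 2*sqrt(2*pi), giving c1/c2 = 0.5.
# Averaging q1/q2 over draws from p2 = N(0,4) recovers that ratio.
rng = np.random.default_rng(4)

q1 = lambda t: np.exp(-t**2 / 2.0)    # unnormalized N(0, 1)
q2 = lambda t: np.exp(-t**2 / 8.0)    # unnormalized N(0, 4)

theta = rng.normal(0.0, 2.0, size=200_000)        # samples from p2
ratio_estimate = np.mean(q1(theta) / q2(theta))   # should approach 0.5
```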
In our situation, we generally consider two models with different dimensions, and we cannot directly apply this formula, as it assumes the same parameterization for both models. We will propose a bridge-sampling-type Bayes factor estimator which accommodates models of different dimensions, with missing values as well. Next we detail the Bayes factor calculation for our situation. First we show that a special function g is needed for the proper calculation of the Bayes factor; then we show what this function can be in our setting.
3.2.1 General Formula
First we give a general theorem which will later be extended and used to estimate the Bayes factor. In this theorem we show why a g function is needed and what condition it has to satisfy.
Suppose θ_a is the parameter for model M_a and (θ_a, θ_b) is the parameter for model M_b. The likelihood for model M_a is f_a(Y | θ_a) and the likelihood for model M_b is f_b(Y | θ_a, θ_b). Let π_a(θ_a) and π_b(θ_a, θ_b) denote the priors for models M_a and M_b. Using Meng and Wong's bridge sampling method with samples (θ_a^(i), θ_b^(i)) from the posterior distribution π_b(θ_a, θ_b | Y), the estimator for BF_{a,b} would be

\[
\frac{1}{n}\sum_i \frac{\pi_a(\theta_a^{(i)})}{\pi_b(\theta_a^{(i)}, \theta_b^{(i)})}.
\]

However, this is not a consistent estimator; a function g is needed to obtain the following consistent estimator of the Bayes factor:

\[
\frac{1}{n}\sum_i \frac{\pi_a(\theta_a^{(i)})\, g(\theta_a^{(i)}, \theta_b^{(i)})}{\pi_b(\theta_a^{(i)}, \theta_b^{(i)})}. \qquad (3\text{--}4)
\]

Theorem 3.2.1. With the above notation, a function g(θ_a, θ_b) is needed for the following equation to hold:

\[
\int \frac{\pi_a(\theta_a)}{\pi_b(\theta_a, \theta_b)}\, g(\theta_a, \theta_b)\, \pi_b(\theta_a, \theta_b \mid Y)\, d\theta_a\, d\theta_b
= \frac{m_a(Y)}{m_b(Y)}, \qquad (3\text{--}5)
\]

where π_b(θ_a, θ_b | Y) is the posterior distribution of (θ_a, θ_b) given Y, m_a(Y) is the marginal likelihood of model M_a, and m_b(Y) is the marginal likelihood of model M_b. Furthermore, the function g(θ_a, θ_b) must satisfy the condition

\[
\int f_b(Y \mid \theta_a, \theta_b)\, g(\theta_a, \theta_b)\, d\theta_b = f_a(Y \mid \theta_a). \qquad (3\text{--}6)
\]
Now let us prove Theorem 3.2.1.
Proof. As mentioned before, with samples (θ_a^(i), θ_b^(i)) from the posterior distribution π_b(θ_a, θ_b | Y), we will be able to use

\[
\frac{1}{n}\sum_i \frac{\pi_a(\theta_a^{(i)})\, g(\theta_a^{(i)}, \theta_b^{(i)})}{\pi_b(\theta_a^{(i)}, \theta_b^{(i)})}
\]

to estimate the Bayes factor consistently. The following derivation shows why this estimator is consistent:

\[
\begin{aligned}
\int\!\!\int \frac{\pi_a(\theta_a)\, g(\theta_a, \theta_b)}{\pi_b(\theta_a, \theta_b)}\, \pi_b(\theta_a, \theta_b \mid Y)\, d\theta_a\, d\theta_b
&= \int\!\!\int \frac{\pi_a(\theta_a)\, g(\theta_a, \theta_b)}{\pi_b(\theta_a, \theta_b)}\,
\frac{f_b(Y \mid \theta_a, \theta_b)\, \pi_b(\theta_a, \theta_b)}{m_b(Y)}\, d\theta_a\, d\theta_b \qquad (3\text{--}7)\\
&= \frac{1}{m_b(Y)} \int\!\!\int \pi_a(\theta_a)\, g(\theta_a, \theta_b)\, f_b(Y \mid \theta_a, \theta_b)\, d\theta_a\, d\theta_b.
\end{aligned}
\]

So if Equation 3–6 holds, then

\[
\frac{1}{m_b(Y)} \int\!\!\int \pi_a(\theta_a)\, g(\theta_a, \theta_b)\, f_b(Y \mid \theta_a, \theta_b)\, d\theta_a\, d\theta_b
= \frac{\int \pi_a(\theta_a)\, f_a(Y \mid \theta_a)\, d\theta_a}{m_b(Y)}
= \frac{m_a(Y)}{m_b(Y)}.
\]
For our situation, we are considering two models with different dimensions. Model M_δ has parameters (β, γ_δ, σ², φ²) and an associated model indicator vector δ, and the SNP design matrix of model M_δ is Z_δ. Let M_1 be the full model containing all the SNPs. Both M_δ and M_1 may have a certain percentage of missing SNPs. With the observed data Y_{n×1}, the likelihood for model M_δ is f_δ(Y | β, γ_δ, σ², Z_{δ,mis}) and the likelihood for model M_1 is f_1(Y | β, γ, σ², Z_mis), where (β, γ_δ, σ², φ², Z_{δ,mis}) are the parameters of model M_δ and (β, γ, σ², φ², Z_mis) are the parameters of model M_1. Here Z_{δ,mis} denotes the missing values in model M_δ and Z_mis denotes the missing values in model M_1. As M_1 includes all SNPs and M_δ includes only a subset of them, Z_{δ,mis} represents part of the missing values and Z_mis represents all the missing values. The marginal likelihood of Y under model M_δ is denoted m_δ(Y), and the constant m_{δ=1}(Y) is the marginal likelihood under model M_1. We use the prior π_δ(β, γ_δ, σ², Z_{δ,mis}) for M_δ and π_1(β, γ, σ², Z_mis) for M_1. Suppose we have samples (β^(i), γ^(i), σ²^(i), φ²^(i)), i = 1, 2, ..., n, from M_1, and no closed form of the Bayes factor for model M_δ against M_1 is available. As in Equation 3–4, a g function is needed, and the modified bridge sampling estimator is

\[
\frac{1}{n}\sum_{i=1}^{n}
\frac{\pi_\delta\!\left(\beta^{(i)}, \sigma^{2(i)}, \gamma_\delta^{(i)}, Z_{\delta,mis}^{(i)}\right)\;
       g\!\left(\beta^{(i)}, \sigma^{2(i)}, \phi^{2(i)}, \gamma_\delta^{(i)}, Z_{\delta,mis}^{(i)}\right)}
     {\pi_1\!\left(\beta^{(i)}, \gamma^{(i)}, \sigma^{2(i)}, Z_{mis}^{(i)}\right)}
\;\longrightarrow\; \frac{m_\delta(Y)}{m_{\delta=1}(Y)} = BF_{\delta,\,\delta=1}.
\qquad (3\text{--}8)
\]
Corollary 3.2.1. Suppose we have models M_δ and M_1 with the priors defined as above, and that no closed form of the Bayes factor for M_δ versus M_1 is available. To obtain a consistent bridge-sampling-type Bayes factor estimator as in Equation 3–8, we must find a g function satisfying

\[
\sum_{Z_{l,mis}} \int f_1(Y \mid \beta, \gamma, \sigma^2, \phi^2, Z_{mis})\;
g(\beta, \gamma, \sigma^2, \phi^2, Z_{mis})\; d\gamma_l
= f_\delta(Y \mid \beta, \gamma_\delta, \sigma^2, Z_{\delta,mis}), \qquad (3\text{--}9)
\]

where the g function contains a factor P(Z_{l,mis} = z_{l,mis}). Here γ is composed of γ_δ and γ_l: the vector γ_δ parameterizes the SNPs included in model M_δ, and γ_l parameterizes the SNPs excluded from model M_δ. The matrix Z is composed of Z_δ and Z_l, where Z_l is the design matrix of the SNPs not included in model M_δ and Z_{l,mis} represents the missing SNPs within Z_l. The distribution P(Z_{l,mis} = z_{l,mis}), for the discrete random vector Z_{l,mis}, could be any legitimate discrete distribution; we take it to be the uniform distribution, the same as the prior for Z_{l,mis} in model M_1.
The parameter φ² is a variance parameter used to specify the prior of the SNP effects γ. It does not appear in the likelihood f_1(Y | β, γ, σ², Z_mis), but from the Bayesian perspective it appears in the joint likelihood of all the parameters. As we are doing a Bayesian analysis, we let the g function include the parameter φ².
Next we prove Corollary 3.2.1. Applying Theorem 3.2.1, θ_a is (β, γ_δ, σ², φ², Z_{δ,mis}) and θ_b is (γ_l, Z_{l,mis}). The likelihood for (θ_a, θ_b) is

\[
f_b(\theta_a, \theta_b) = f_1(Y \mid \beta, \gamma, \sigma^2, Z_{mis}),
\]

where f_1(Y | β, γ, σ², Z_mis) is the continuous density function of Y given all the parameters, including the missing values; in our case, it is the normal density. We can consider f_1 to be constant with respect to φ² here. Notice that Z_mis is composed of (Z_{l,mis}, Z_{δ,mis}), and both Z_{l,mis} and Z_{δ,mis} are vectors of missing values. Similarly, the likelihood f_a(θ_a) is

\[
f_a = f_\delta(Y \mid \beta, \gamma_\delta, \sigma^2, Z_{\delta,mis}),
\]

where f_δ(Y | β, γ_δ, σ², Z_{δ,mis}) is the continuous normal density of Y given the parameters (β, γ_δ, σ², Z_{δ,mis}) in our case. As the g function contains a factor P(Z_{l,mis} = z_{l,mis}),

\[
\int f_b(Y \mid \theta_a, \theta_b)\, g(\theta_a, \theta_b)\, d\theta_b
= \sum_{Z_{l,mis}} \int f_1(Y \mid \beta, \gamma, \sigma^2, \phi^2, Z_{mis})\,
g(\beta, \gamma, \sigma^2, \phi^2, Z_{mis})\, d\gamma_l.
\]

Since θ_b = (γ_l, Z_{l,mis}) and Z_{l,mis} is a vector of missing values, Theorem 3.2.1 requires that

\[
\sum_{Z_{l,mis}} \int f_1(Y \mid \beta, \gamma, \sigma^2, \phi^2, Z_{mis})\;
g(\beta, \gamma, \sigma^2, \phi^2, Z_{mis})\; d\gamma_l
= f_\delta(Y \mid \beta, \gamma_\delta, \sigma^2, \phi^2, Z_{\delta,mis}).
\]
So with samples $(\beta^{(i)}, \gamma^{(i)}, \sigma^{2(i)}, \phi^{2(i)}, Z_{mis}^{(i)})$, $i = 1, 2, \ldots, n$, from $M_1$, as in (3–8),
$$
\frac{1}{n} \sum_{i=1}^{n} \frac{\pi_\delta\big(\beta^{(i)}, \sigma^{2(i)}, \gamma_\delta^{(i)}, Z_{\delta\,mis}^{(i)}\big)\, g\big(\beta^{(i)}, \sigma^{2(i)}, \phi^{2(i)}, \gamma_\delta^{(i)}, Z_{\delta\,mis}^{(i)}\big)}{\pi_1\big(\beta^{(i)}, \gamma^{(i)}, \sigma^{2(i)}, Z_{mis}^{(i)}\big)} \tag{3–10}
$$
is a consistent estimator.
Theorem 3.2.2. When models $M_\delta$ and $M_1$ are defined as above, the following $g$ function satisfies the condition of Equation 3–9, and thus we have a consistent Bayes factor estimator. The $g$ is
$$
g(\beta, \gamma, \sigma^2, \phi^2, Z_{mis}) = (2\pi\sigma^2)^{-s_l/2}\, |Z_l' Z_l|^{1/2} \exp\!\left(-\frac{(Y - X\beta - Z_\delta\gamma_\delta)' P_l (Y - X\beta - Z_\delta\gamma_\delta)}{2\sigma^2}\right) \times P(Z_{l\,mis} = z_{l\,mis}), \tag{3–11}
$$
where
$$
P_l = Z_l (Z_l' Z_l)^{-1} Z_l'.
$$
With that, the Bayes factor estimator is
$$
\frac{1}{n} \sum_{i=1}^{n} \frac{(\phi^{2(i)})^{s_l/2}\, \big|Z_l^{(i)\prime} Z_l^{(i)}\big|^{1/2} \exp\!\left(-\dfrac{(Y - X\beta^{(i)} - Z_\delta^{(i)}\gamma_\delta^{(i)})' P_l^{(i)} (Y - X\beta^{(i)} - Z_\delta^{(i)}\gamma_\delta^{(i)})}{2\sigma^{2(i)}}\right)}{\exp\!\left(-\dfrac{|\gamma_l^{(i)}|^2}{2\sigma^{2(i)}\phi^{2(i)}}\right)}. \tag{3–12}
$$
Notice that $P(Z_{l\,mis} = z_{l\,mis})$ could be any legitimate distribution; we take it to be the uniform prior on its sample space, which is the same as the prior for $Z_{l\,mis}$ in $\pi_1(\beta, \gamma, \sigma^2, \phi^2, Z_{\delta\,mis}, Z_{l\,mis})$. During the calculation, the factor $P(Z_{l\,mis} = z_{l\,mis})$ in the $g$ function therefore cancels with the prior for $Z_{l\,mis}$ in $\pi_1$.
Now we want to show why Equation 3–12 gives a consistent estimator. We have two models, $M_\delta$ and $M_1$; the following are the details of a Bayesian specification for these two models.
For $M_\delta$:
$$
\begin{aligned}
Y_{n\times 1} &\sim N(X\beta + Z_\delta\gamma_\delta,\, \sigma^2 I) \\
\beta_{p\times 1} &\sim \text{noninformative uniform on } (-\infty, \infty) \\
\gamma_{\delta\,(s_\delta\times 1)} &\sim N(0,\, \sigma^2\phi^2 I) \\
\sigma^2 &\sim IG(a, b) \\
\phi^2 &\sim IG(c, d) \\
Z_{\delta\,mis} &\sim \text{uniform distribution},
\end{aligned} \tag{3–13}
$$
where $Z_\delta$ has dimension $n \times s_\delta$, $s_\delta = \operatorname{sum}(\delta)$, and $s_l$ denotes $s - \operatorname{sum}(\delta)$. We use $\gamma_\delta$, of dimension $s_\delta \times 1$, to denote the parameters of the SNPs in model $M_\delta$, and $\gamma_l$ to denote the parameters of the SNPs excluded from $M_\delta$. $Z_{\delta\,mis}$ is the set of missing values in $Z_\delta$.
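The hierarchical specification in (3–13) can be made concrete with a small simulation. The following is a hedged sketch, not a calculation from the dissertation: all dimensions ($n$, $p$, $s_\delta$), the 0/1/2 SNP coding, and the hyperparameter values are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: draw one data set from model M_delta in (3-13).
# n, p, s_delta and the IG hyperparameters are illustrative choices only.
rng = np.random.default_rng(0)
n, p, s_delta = 50, 2, 3

X = np.column_stack([np.ones(n), rng.normal(size=n)])          # fixed-effect design
Z_delta = rng.integers(0, 3, size=(n, s_delta)).astype(float)  # toy SNP codes 0/1/2

sigma2 = 1.0 / rng.gamma(shape=3.0, scale=1.0)   # sigma^2 ~ IG(a=3, b=1)
phi2 = 1.0 / rng.gamma(shape=3.0, scale=1.0)     # phi^2   ~ IG(c=3, d=1)
beta = rng.normal(size=p)                        # flat prior: any finite value
gamma_delta = rng.normal(scale=np.sqrt(sigma2 * phi2), size=s_delta)

Y = X @ beta + Z_delta @ gamma_delta + rng.normal(scale=np.sqrt(sigma2), size=n)
print(Y.shape)  # (50,)
```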
For $M_1$, the model specification is given in 3–2 and is almost the same as 3–13, except that the design matrix for the SNPs changes from $Z_\delta$ to $Z$ as the model indicator vector changes from $\delta$ to $\delta = 1$; correspondingly, $\gamma_\delta$ changes to $\gamma$.

For both models, the design matrices for the SNPs, $Z_\delta$ and $Z$, contain missing SNPs, although we do not make that explicit in the notation right now. $Z$ is the matrix formed by concatenating the columns of $Z_\delta$ and $Z_l$; the formula $Z = Z_\delta + Z_l$ does not hold, one obvious reason being that the number of columns of $Z$ equals the sum of the numbers of columns of $Z_\delta$ and $Z_l$. The vectors $\gamma$, $\gamma_\delta$, and $\gamma_l$ have a similar relationship.

As before, we use $\pi_\delta(\beta, \gamma_\delta, \phi^2, \sigma^2, Z_{\delta\,mis})$ as the prior for model $M_\delta$ and $\pi_1(\beta, \gamma, \sigma^2, \phi^2, Z_{mis})$ as the prior for model $M_1$.
Proof. We now show how to find the $g$ function. Applying Equation 3–9 in Corollary 3.2.1, we have
$$
\begin{aligned}
&\sum_{Z_{l\,mis}} \int g(\beta, \gamma, \sigma^2, \phi^2, Z_{mis})\, f_1(Y \mid \beta, \gamma, \sigma^2, Z_{mis})\, d\gamma_l \\
&\quad= \sum_{Z_{l\,mis}} \int g(\beta, \gamma, \sigma^2, \phi^2, Z_{mis})\, \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\!\left(-\frac{|Y - X\beta - Z_\delta\gamma_\delta - Z_l\gamma_l|^2}{2\sigma^2}\right) d\gamma_l \\
&\quad= \sum_{Z_{l\,mis}} \int g(\beta, \gamma, \sigma^2, \phi^2, Z_{mis})\, \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\!\left(-\frac{|Y - X\beta - Z_\delta\gamma_\delta|^2}{2\sigma^2}\right) \\
&\qquad\qquad \times \exp\!\left(-\frac{\gamma_l' Z_l' Z_l \gamma_l}{2\sigma^2} + \frac{(Y - X\beta - Z_\delta\gamma_\delta)' Z_l \gamma_l}{\sigma^2}\right) d\gamma_l. 
\end{aligned} \tag{3–14}
$$
Now, if we restrict $g(\beta, \gamma, \sigma^2, \phi^2, Z_{mis})$ to be constant with respect to $\phi^2$, the expression above is unchanged.
Further, if we restrict $g(\beta, \gamma, \sigma^2, \phi^2, Z_{mis})$ to be constant with respect to $\gamma_l$ and integrate $\gamma_l$ out of Equation 3–14, we will have
$$
\begin{aligned}
&\sum_{Z_{l\,mis}} \int g(\beta, \gamma, \sigma^2, \phi^2, Z_{mis})\, f_1(Y \mid \beta, \gamma, \sigma^2, Z_{mis})\, d\gamma_l \\
&\quad= \sum_{Z_{l\,mis}} \frac{\exp\!\left(-\frac{|Y - X\beta - Z_\delta\gamma_\delta|^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{n/2}}\, g(\beta, \gamma, \sigma^2, \phi^2, Z_{mis})\, (2\pi\sigma^2)^{s_l/2}\, |Z_l' Z_l|^{-1/2} \\
&\qquad\qquad \times \exp\!\left(\frac{(Y - X\beta - Z_\delta\gamma_\delta)' Z_l (Z_l' Z_l)^{-1} Z_l' (Y - X\beta - Z_\delta\gamma_\delta)}{2\sigma^2}\right),
\end{aligned} \tag{3–15}
$$
where $s_l = \operatorname{sum}(1) - \operatorname{sum}(\delta)$.
So if we take $g$ to be Equation 3–11, that is,
$$
g(\beta, \gamma, \sigma^2, \phi^2, Z_{mis}) = (2\pi\sigma^2)^{-s_l/2}\, |Z_l' Z_l|^{1/2} \exp\!\left(-\frac{(Y - X\beta - Z_\delta\gamma_\delta)' P_l (Y - X\beta - Z_\delta\gamma_\delta)}{2\sigma^2}\right) P(Z_{l\,mis} = z_{l\,mis}),
$$
with $P_l = Z_l (Z_l' Z_l)^{-1} Z_l'$, then we will have
$$
\begin{aligned}
\sum_{Z_{l\,mis}} \int g(\beta, \gamma, \sigma^2, \phi^2, Z_{mis})\, f_1(Y \mid \beta, \gamma, \sigma^2, Z_{mis})\, d\gamma_l
&= \sum_{Z_{l\,mis}} \frac{\exp\!\left(-\frac{|Y - X\beta - Z_\delta\gamma_\delta|^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{n/2}}\, P(Z_{l\,mis} = z_{l\,mis}) \\
&= \frac{\exp\!\left(-\frac{|Y - X\beta - Z_\delta\gamma_\delta|^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{n/2}} \sum_{Z_{l\,mis}} P(Z_{l\,mis} = z_{l\,mis}) \\
&= \frac{\exp\!\left(-\frac{|Y - X\beta - Z_\delta\gamma_\delta|^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{n/2}} = f_\delta(Y \mid \beta, \gamma_\delta, \sigma^2, Z_{\delta\,mis}).
\end{aligned} \tag{3–16}
$$
Above we used the fact that
$$
\sum_{Z_{l\,mis}} P(Z_{l\,mis} = z_{l\,mis}) = 1,
$$
which holds whenever $Z_{l\,mis}$ is a legitimate discrete random vector. In our later calculations, we take $Z_{l\,mis}$ to be uniformly distributed.
So we have shown that, with the chosen $g$ function,
$$
\sum_{Z_{l\,mis}} \int g(\beta, \gamma, \sigma^2, \phi^2, Z_{mis})\, f_1(Y \mid \beta, \gamma, \sigma^2, Z_{mis})\, d\gamma_l = \frac{\exp\!\left(-\frac{|Y - X\beta - Z_\delta\gamma_\delta|^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{n/2}} = f_\delta(Y \mid \beta, \gamma_\delta, \sigma^2, Z_{\delta\,mis}),
$$
and the condition of Equation 3–9 is satisfied.
We take the prior for $M_\delta$ to be
$$
\pi_\delta(\beta, \gamma_\delta, \sigma^2, \phi^2, Z_{\delta\,mis}) = 1 \times \frac{\exp\!\left(-\frac{|\gamma_\delta|^2}{2\sigma^2\phi^2}\right)}{(2\pi\sigma^2\phi^2)^{s_\delta/2}} \times \frac{b^a}{\Gamma(a)} \frac{\exp\!\left(-\frac{b}{\sigma^2}\right)}{(\sigma^2)^{a+1}} \times \frac{d^c}{\Gamma(c)} \frac{\exp\!\left(-\frac{d}{\phi^2}\right)}{(\phi^2)^{c+1}} \times P(Z_{\delta\,mis} = z_{\delta\,mis}),
$$
where the prior for $\beta$ is taken to be 1 and $P(Z_{\delta\,mis} = z_{\delta\,mis})$ is the uniform distribution on the sample space of $Z_{\delta\,mis}$.
The prior for $M_1$ is taken to be
$$
\pi_1(\beta, \gamma, \sigma^2, \phi^2, Z_{\delta\,mis}, Z_{l\,mis}) = 1 \times \frac{\exp\!\left(-\frac{|\gamma|^2}{2\sigma^2\phi^2}\right)}{(2\pi\sigma^2\phi^2)^{s/2}} \times \frac{b^a}{\Gamma(a)} \frac{\exp\!\left(-\frac{b}{\sigma^2}\right)}{(\sigma^2)^{a+1}} \times \frac{d^c}{\Gamma(c)} \frac{\exp\!\left(-\frac{d}{\phi^2}\right)}{(\phi^2)^{c+1}} \times P(Z_{\delta\,mis} = z_{\delta\,mis}) \times P(Z_{l\,mis} = z_{l\,mis}),
$$
where the prior for $\beta$ is taken to be 1, and both $P(Z_{\delta\,mis} = z_{\delta\,mis})$ and $P(Z_{l\,mis} = z_{l\,mis})$ are discrete uniform distributions on their sample spaces.
With the above specification, if we have samples $(\beta^{(i)}, \gamma^{(i)}, \sigma^{2(i)}, \phi^{2(i)}, Z_{mis}^{(i)})$ from model $M_1$ and follow Equation 3–10, we can consistently estimate the Bayes factor as
$$
\frac{1}{n} \sum_{i=1}^{n} \frac{(2\pi\sigma^{2(i)})^{-s_l/2}\, \big|Z_l^{(i)\prime} Z_l^{(i)}\big|^{1/2} \exp\!\left(-\dfrac{(Y - X\beta^{(i)} - Z_\delta^{(i)}\gamma_\delta^{(i)})' P_l^{(i)} (Y - X\beta^{(i)} - Z_\delta^{(i)}\gamma_\delta^{(i)})}{2\sigma^{2(i)}}\right)}{(2\pi\sigma^{2(i)}\phi^{2(i)})^{-s_l/2} \exp\!\left(-\dfrac{|\gamma_l^{(i)}|^2}{2\sigma^{2(i)}\phi^{2(i)}}\right)} \longrightarrow BF_{\delta,\,1},
$$
where the factor $P(Z_{l\,mis}^{(i)} = z_{l\,mis}^{(i)})$ in the $g$ function cancels with the prior for $Z_{l\,mis}$ in $\pi_1(\beta^{(i)}, \gamma^{(i)}, \sigma^{2(i)}, \phi^{2(i)}, Z_{\delta\,mis}^{(i)}, Z_{l\,mis}^{(i)})$, and the priors for $Z_{\delta\,mis}$ in models $M_1$ and $M_\delta$ cancel as well.
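The estimator just derived can be sketched in code. The following is a minimal, hedged sketch: the function name, the argument layout (one list of Gibbs draws per parameter), and the assumption that $Z$ is fully observed are all illustrative; with missing SNPs, $Z$ would instead be the $i$-th imputed design matrix from the Gibbs chain. The summand reduces to $(\phi^2)^{s_l/2}|Z_l'Z_l|^{1/2}\exp(-r'P_l r/2\sigma^2)\big/\exp(-|\gamma_l|^2/2\sigma^2\phi^2)$ with $r = Y - X\beta - Z_\delta\gamma_\delta$, after the $(2\pi\sigma^2)$ powers cancel.

```python
import numpy as np

def bf_hat_vs_full(Y, X, Z, beta_s, gamma_s, sigma2_s, phi2_s, delta):
    """Hedged sketch of the bridge-type estimator: average over Gibbs draws
    from the full model M_1.  Z is assumed fully observed (hypothetical
    simplification); each *_s argument is a list of posterior draws."""
    keep = delta.astype(bool)
    Zd, Zl = Z[:, keep], Z[:, ~keep]
    s_l = Zl.shape[1]
    G = Zl.T @ Zl                       # Z_l' Z_l (fixed because Z is fixed here)
    vals = []
    for b, g, s2, p2 in zip(beta_s, gamma_s, sigma2_s, phi2_s):
        r = Y - X @ b - Zd @ g[keep]    # residual using the submodel's columns
        quad = r @ Zl @ np.linalg.solve(G, Zl.T @ r)     # r' P_l r
        log_num = (0.5 * s_l * np.log(p2)
                   + 0.5 * np.linalg.slogdet(G)[1]
                   - quad / (2.0 * s2))
        log_den = -np.sum(g[~keep] ** 2) / (2.0 * s2 * p2)
        vals.append(np.exp(log_num - log_den))
    return float(np.mean(vals))
```

Working on the log scale before exponentiating each summand guards against overflow when $s_l$ is large.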
3.2.2 How to Choose $g(\beta, \gamma, Z_{mis}, \sigma^2, \phi^2)$?

Above we showed that, as long as we find a $g$ function satisfying Equation 3–9, we can use
$$
\frac{1}{n} \sum_{i=1}^{n} \frac{\pi_\delta\big(\beta^{(i)}, \gamma_\delta^{(i)}, \sigma^{2(i)}, \phi^{2(i)}, Z_{\delta\,mis}^{(i)}\big)\, g\big(\beta^{(i)}, \gamma^{(i)}, \sigma^{2(i)}, \phi^{2(i)}, Z_{mis}^{(i)}\big)}{\pi_1\big(\beta^{(i)}, \gamma^{(i)}, \sigma^{2(i)}, \phi^{2(i)}, Z_{mis}^{(i)}\big)}
$$
to consistently approximate the Bayes factor $BF_{\delta,\,\delta=1}$ with samples $(\beta^{(i)}, \gamma^{(i)}, \sigma^{2(i)}, \phi^{2(i)}, Z_{mis}^{(i)})$ from the posterior distribution $\pi_1(\beta, \gamma, \sigma^2, \phi^2, Z_{mis} \mid Y)$. So the $g$ function is not unique. In the following we give another $g$ function and also some direction on how to find a $g$.
We will show that the following $g(\beta, \gamma, Z_{mis}, \sigma^2, \phi^2)$ also satisfies Equation 3–9:
$$
g(\beta, \gamma, Z_{mis}, \sigma^2, \phi^2) = |R|^{1/2} (2\pi\sigma^2)^{-s_l/2} \exp\!\left(-\frac{|\gamma_l|^2}{2\sigma^2\phi^2}\right) \exp\!\left(-\frac{(Y - X\beta - Z_\delta\gamma_\delta)' Z_l R^{-1} Z_l' (Y - X\beta - Z_\delta\gamma_\delta)}{2\sigma^2}\right) \times P(Z_{l\,mis} = z_{l\,mis}), \tag{3–17}
$$
where
$$
R = \frac{I}{\phi^2} + Z_l' Z_l.
$$
Applying Corollary 3.2.1, we need to show that Equation 3–9 is satisfied. The left side of Equation 3–9 equals
$$
\begin{aligned}
&\sum_{Z_{l\,mis}} \int g(\beta, \gamma, Z_{mis}, \sigma^2, \phi^2)\, f_1(Y \mid \beta, \gamma, Z_{mis}, \sigma^2)\, d\gamma_l \\
&\quad= \sum_{Z_{l\,mis}} \int |R|^{1/2} (2\pi\sigma^2)^{-s_l/2} \exp\!\left(-\frac{|\gamma_l|^2}{2\sigma^2\phi^2}\right) \exp\!\left(-\frac{(Y - X\beta - Z_\delta\gamma_\delta)' Z_l R^{-1} Z_l' (Y - X\beta - Z_\delta\gamma_\delta)}{2\sigma^2}\right) \\
&\qquad\qquad \times \frac{\exp\!\left(-\frac{|Y - X\beta - Z\gamma|^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{n/2}}\, d\gamma_l \; P(Z_{l\,mis} = z_{l\,mis}).
\end{aligned} \tag{3–18}
$$
In the above calculation, we simply substituted the $g$ function into the integral. Next we integrate $\gamma_l$ out of Equation 3–18:
$$
\begin{aligned}
&\sum_{Z_{l\,mis}} \int g(\beta, \gamma, Z_{mis}, \sigma^2, \phi^2)\, f_1(Y \mid \beta, \gamma, Z_{mis}, \sigma^2)\, d\gamma_l \\
&\quad= \sum_{Z_{l\,mis}} (2\pi\sigma^2)^{-s_l/2} |R|^{1/2} \exp\!\left(-\frac{(Y - X\beta - Z_\delta\gamma_\delta)' Z_l R^{-1} Z_l' (Y - X\beta - Z_\delta\gamma_\delta)}{2\sigma^2}\right) \\
&\qquad\qquad \times (2\pi\sigma^2)^{s_l/2} |R|^{-1/2} \exp\!\left(\frac{(Y - X\beta - Z_\delta\gamma_\delta)' Z_l R^{-1} Z_l' (Y - X\beta - Z_\delta\gamma_\delta)}{2\sigma^2}\right) \\
&\qquad\qquad \times \frac{\exp\!\left(-\frac{|Y - X\beta - Z_\delta\gamma_\delta|^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{n/2}}\, P(Z_{l\,mis} = z_{l\,mis}).
\end{aligned} \tag{3–19}
$$
After the cancellation, we have
$$
\begin{aligned}
\sum_{Z_{l\,mis}} \int g(\beta, \gamma, Z_{mis}, \sigma^2, \phi^2)\, f_1(Y \mid \beta, \gamma, Z_{mis}, \sigma^2)\, d\gamma_l
&= \sum_{Z_{l\,mis}} \frac{\exp\!\left(-\frac{|Y - X\beta - Z_\delta\gamma_\delta|^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{n/2}}\, P(Z_{l\,mis} = z_{l\,mis}) \\
&= \frac{\exp\!\left(-\frac{|Y - X\beta - Z_\delta\gamma_\delta|^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{n/2}} \sum_{Z_{l\,mis}} P(Z_{l\,mis} = z_{l\,mis}) \\
&= f_\delta(Y \mid \beta, \gamma_\delta, \sigma^2, Z_{\delta\,mis}).
\end{aligned} \tag{3–20}
$$
So Equation 3–17 and Equation 3–11 are two different $g$ functions, and both satisfy Equation 3–9; the $g$ function is not unique. In general, we can find a $g$ function by the following steps:

• Take any candidate function $g_{cand}(\beta, \gamma, Z_{mis}, \sigma^2, \phi^2)$; this $g_{cand}$ may be chosen for computational convenience or any other reason.

• Calculate
$$
h(\beta, \gamma_\delta, \sigma^2, \phi^2, Z_{mis}) = \frac{\int f_1(Y \mid \beta, \gamma, Z_{mis}, \sigma^2)\, g_{cand}(\beta, \gamma, Z_{mis}, \sigma^2, \phi^2)\, d\gamma_l}{f_\delta(Y \mid \beta, \gamma_\delta, Z_{\delta\,mis}, \sigma^2)},
$$
and then
$$
g = \frac{g_{cand}(\beta, \gamma, Z_{mis}, \sigma^2, \phi^2)}{h(\beta, \gamma_\delta, \sigma^2, \phi^2, Z_{mis})} \times P(Z_{l\,mis} = z_{l\,mis})
$$
satisfies the condition of Equation 3–9, since summing $P(Z_{l\,mis} = z_{l\,mis})\, f_\delta$ over $Z_{l\,mis}$ returns $f_\delta$.
Looking back, for the $g$ function in Equation 3–17 the corresponding candidate is $g_{cand} = \exp\!\left(-\frac{|\gamma_l|^2}{2\sigma^2\phi^2}\right)$. Using this $g_{cand}$, we can calculate the $h$ function and recover the same $g$ function as in Equation 3–17. As for Equation 3–11, we simply let $g_{cand} = 1$ and follow the same procedure: calculate the $h$ function, in this case
$$
h = \frac{\int f_1(Y \mid \beta, \gamma, Z_{mis}, \sigma^2)\, d\gamma_l}{f_\delta(Y \mid \beta, \gamma_\delta, Z_{\delta\,mis}, \sigma^2)},
$$
and then obtain
$$
g = \frac{f_\delta(Y \mid \beta, \gamma_\delta, Z_{\delta\,mis}, \sigma^2)}{\int f_1(Y \mid \beta, \gamma, Z_{mis}, \sigma^2)\, d\gamma_l} \times P(Z_{l\,mis} = z_{l\,mis}),
$$
which is the same function as Equation 3–11.
3.2.3 Comparison with the Simplest Model
In Section 3.1, we stated our plan to run a hybrid M-H chain to search for good subsets of variables according to the posterior probabilities of models. Later we will show that this is equivalent to searching for good subsets of variables according to the Bayes factors. In the subsections above, we discussed Bayes factor estimation for a submodel versus the full model. So when any two submodels are compared, they are implicitly compared with each other through the ratio of their Bayes factors against the full model. For example, for two submodels with SNP indicators $\delta_1$ and $\delta_2$, the ratio of Bayes factors
$$
\frac{BF_{\delta_1,\,\delta=1}}{BF_{\delta_2,\,\delta=1}}
$$
acts as the criterion for deciding whether the stochastic search chain stays at the current submodel or moves to the candidate submodel. An alternative approach is to consider the ratio
$$
\frac{BF_{\delta_1,\,\delta=0}}{BF_{\delta_2,\,\delta=0}},
$$
where $\delta = 0$ is the indicator vector for the simplest model, which contains no SNPs. Intuitively, when two subsets are compared it does not matter which model is used as the reference; either $\delta = 0$ or $\delta = 1$ can be used.

Casella, Girón, Martínez, and Moreno (Casella et al.) discussed two ways to do Bayesian model selection when using intrinsic priors: encompassing the models from above, which means comparing all models to the full model, and encompassing the models from below, that is, comparing all models to the simplest model. They showed that when the number of subsets is finite, the two methods give essentially equivalent results.
In the following, we give the details of the Bayes factor approximation when the simplest model is used as the reference, and explain why we choose the full model as the reference. For the Bayes factor of a submodel versus the simplest model, we are interested in calculating $BF_{\delta,\,\delta=0}$, where $\delta = 0$ denotes the model with no variables, i.e., the simplest model. Using similar approximation methods as above, we can find the Bayes factor approximation for a small model versus a big model, and since
$$
BF_{\delta,\,\delta=0} = \frac{1}{BF_{\delta=0,\,\delta}},
$$
we discuss the calculation of $BF_{\delta=0,\,\delta}$ instead of $BF_{\delta,\,\delta=0}$.

Now model $M_0$ has likelihood $f_0(Y \mid \beta, \sigma^2)$ and model $M_\delta$ has likelihood $f_\delta(Y \mid \beta, \gamma_\delta, \sigma^2, Z_{\delta\,mis})$.

Next we show how to find the $g$ function in this situation. Applying Theorem 3.2.1, $\theta_a$ is $(\beta, \sigma^2, \phi^2)$ and $\theta_b$ is $(\gamma_\delta, Z_{\delta\,mis})$. For Equation 3–6 to be satisfied we need
$$
\int f_b(Y \mid \theta_a, \theta_b)\, g(\theta_a, \theta_b)\, d\theta_b = f_a(Y \mid \theta_a).
$$
Then the following equation should hold:
$$
\sum_{Z_{\delta\,mis}} \int \frac{\exp\!\left(-\frac{|Y - X\beta - Z_\delta\gamma_\delta|^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{n/2}}\, g(\beta, \sigma^2, \gamma_\delta, \phi^2, Z_{\delta\,mis})\, d\gamma_\delta = (2\pi\sigma^2)^{-n/2} \exp\!\left(-\frac{|Y - X\beta|^2}{2\sigma^2}\right). \tag{3–21}
$$
We find the $g$ function from Equation 3–21. Expanding the quadratic form,
$$
\begin{aligned}
&\sum_{Z_{\delta\,mis}} \int \frac{\exp\!\left(-\frac{|Y - X\beta - Z_\delta\gamma_\delta|^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{n/2}}\, g(\beta, \sigma^2, \gamma_\delta, \phi^2, Z_{\delta\,mis})\, d\gamma_\delta \\
&\quad= \sum_{Z_{\delta\,mis}} \int \frac{\exp\!\left(-\frac{|Y - X\beta|^2}{2\sigma^2} - \frac{\gamma_\delta' Z_\delta' Z_\delta \gamma_\delta}{2\sigma^2} + \frac{(Y - X\beta)' Z_\delta \gamma_\delta}{\sigma^2}\right)}{(2\pi\sigma^2)^{n/2}}\, g(\beta, \sigma^2, \gamma_\delta, \phi^2, Z_{\delta\,mis})\, d\gamma_\delta.
\end{aligned}
$$
If we restrict $g(\beta, \gamma_\delta, \sigma^2, \phi^2, Z_{\delta\,mis})$ to be constant with respect to $\gamma_\delta$ and integrate $\gamma_\delta$ out of the expression above, we have
$$
\begin{aligned}
&\sum_{Z_{\delta\,mis}} \int \frac{\exp\!\left(-\frac{|Y - X\beta - Z_\delta\gamma_\delta|^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{n/2}}\, g(\beta, \sigma^2, \gamma_\delta, \phi^2, Z_{\delta\,mis})\, d\gamma_\delta \\
&\quad= \sum_{Z_{\delta\,mis}} (2\pi\sigma^2)^{-(n - s_\delta)/2}\, |Z_\delta' Z_\delta|^{-1/2} \exp\!\left(\frac{(Y - X\beta)' Z_\delta (Z_\delta' Z_\delta)^{-1} Z_\delta' (Y - X\beta)}{2\sigma^2}\right) \\
&\qquad\qquad \times g(\beta, \sigma^2, \gamma_\delta, \phi^2, Z_{\delta\,mis}) \exp\!\left(-\frac{|Y - X\beta|^2}{2\sigma^2}\right).
\end{aligned}
$$
So if we take
$$
g(\beta, \gamma_\delta, \sigma^2, \phi^2, Z_{\delta\,mis}) = (2\pi\sigma^2)^{-s_\delta/2}\, |Z_\delta' Z_\delta|^{1/2} \exp\!\left(-\frac{(Y - X\beta)' Z_\delta (Z_\delta' Z_\delta)^{-1} Z_\delta' (Y - X\beta)}{2\sigma^2}\right) \times P(Z_{\delta\,mis} = z_{\delta\,mis}), \tag{3–22}
$$
then we will have
$$
\begin{aligned}
\sum_{Z_{\delta\,mis}} \int \frac{\exp\!\left(-\frac{|Y - X\beta - Z_\delta\gamma_\delta|^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{n/2}}\, g(\beta, \sigma^2, \gamma_\delta, \phi^2, Z_{\delta\,mis})\, d\gamma_\delta
&= \sum_{Z_{\delta\,mis}} (2\pi\sigma^2)^{-n/2} \exp\!\left(-\frac{|Y - X\beta|^2}{2\sigma^2}\right) P(Z_{\delta\,mis} = z_{\delta\,mis}) \\
&= (2\pi\sigma^2)^{-n/2} \exp\!\left(-\frac{|Y - X\beta|^2}{2\sigma^2}\right) \sum_{Z_{\delta\,mis}} P(Z_{\delta\,mis} = z_{\delta\,mis}) \\
&= (2\pi\sigma^2)^{-n/2} \exp\!\left(-\frac{|Y - X\beta|^2}{2\sigma^2}\right) = f_0(Y \mid \beta, \sigma^2).
\end{aligned}
$$
We have shown that Equation 3–21 holds with the $g$ function of Equation 3–22. To make this method work, we need samples $(\beta^{(i)}, \gamma_\delta^{(i)}, \sigma^{2(i)}, \phi^{2(i)}, Z_{\delta\,mis}^{(i)})$ from the full model $M_1$, which has likelihood $f_1(Y \mid \beta, \gamma, \sigma^2, Z_{mis})$. As we plan to run a Gibbs sampler for the full model, which contains all the SNPs, we obtain from it samples of $(\beta^{(i)}, \gamma_\delta^{(i)}, \sigma^{2(i)}, \phi^{2(i)}, Z_{\delta\,mis}^{(i)})$ with the correct marginalized joint distribution, according to Theorem 10.6 of Robert and Casella (2004).

So we can approximate the Bayes factor $BF_{\delta=0,\,\delta}$ by
$$
\frac{1}{n} \sum_{i=1}^{n} \frac{\pi_0\big(\beta^{(i)}, \sigma^{2(i)}\big)\, g\big(\beta^{(i)}, \sigma^{2(i)}, \gamma_\delta^{(i)}, \phi^{2(i)}, Z_{\delta\,mis}^{(i)}\big)}{\pi_1\big(\beta^{(i)}, \sigma^{2(i)}, \gamma_\delta^{(i)}, \phi^{2(i)}, Z_{\delta\,mis}^{(i)}\big)} \longrightarrow BF_{\delta=0,\,\delta};
$$
following a calculation similar to that of the last subsection, we can estimate $BF_{\delta=0,\,\delta}$ by
$$
\frac{1}{n} \sum_{i=1}^{n} \frac{(\phi^{2(i)})^{s_\delta/2}\, \big|Z_\delta^{(i)\prime} Z_\delta^{(i)}\big|^{1/2} \exp\!\left(-\dfrac{(Y - X\beta^{(i)})' P_\delta^{(i)} (Y - X\beta^{(i)})}{2\sigma^{2(i)}}\right)}{\dfrac{d^c}{\Gamma(c)} \dfrac{\exp\!\left(-\frac{d}{\phi^{2(i)}}\right)}{(\phi^{2(i)})^{c+1}} \exp\!\left(-\dfrac{|\gamma_\delta^{(i)}|^2}{2\sigma^{2(i)}\phi^{2(i)}}\right)}, \tag{3–23}
$$
where $n$ is the number of samples from the Gibbs sampler, the $IG(c, d)$ prior for $\phi^2$ remains in the denominator because $M_0$ contains no $\phi^2$, and
$$
P_\delta^{(i)} = Z_\delta^{(i)} \big(Z_\delta^{(i)\prime} Z_\delta^{(i)}\big)^{-1} Z_\delta^{(i)\prime}.
$$
In terms of computation, for Equation 3–23, which uses the simplest model as the reference, and Equation 3–12, which uses the full model as the reference, most of the time is spent calculating $P_\delta^{(i)}$ or $P_l^{(i)}$. In a simulation study we found no obvious advantage for either method. From now on, we use only the full model as the reference model.
3.2.4 Marginal Likelihood $m_\delta(Y)$

According to the definition of the Bayes factor in Equation 3–1, for two models $M_\delta$ and $M_1$ with parameters $\theta_\delta$ and $\theta_1$, the Bayes factor is
$$
BF_{\delta,\,\delta=1} = \frac{m_\delta(Y)}{m_1(Y)} = \frac{\int p(\theta_\delta \mid M_\delta)\, p(Y \mid \theta_\delta, M_\delta)\, d\theta_\delta}{\int p(\theta_1 \mid M_1)\, p(Y \mid \theta_1, M_1)\, d\theta_1}.
$$
So at first we were interested in calculating the Bayes factors directly for our situation. In this subsection we detail the calculation of the marginal likelihood and show why there is no closed form.
Let $M_\delta$ denote the model
$$
Y = X\beta + Z_\delta\gamma_\delta + \varepsilon,
$$
where $\gamma_\delta$ is a vector of length $\operatorname{sum}(\delta)$, composed of the elements of $\gamma$ whose corresponding elements of the indicator vector $\delta$ equal 1, and $Z_\delta$ is the matrix whose columns are taken from $Z$ according to the indicator vector $\delta$. In the above model specification, $Z_\delta = (Z_{\delta(obs)}, Z_{\delta(mis)})$, and for simplicity we just write $Z_\delta$. We let $s_\delta$ be the total number of elements of $\delta$ equal to 1. The joint likelihood under the model specification of $M_\delta$ is
$$
L_\delta(Y) \propto (2\pi\sigma^2)^{-n/2} (2\pi\sigma^2\phi^2)^{-s_\delta/2} \exp\!\left(-\frac{|Y - X\beta - Z_\delta\gamma_\delta|^2}{2\sigma^2}\right) \exp\!\left(-\frac{|\gamma_\delta|^2}{2\sigma^2\phi^2}\right) \times \pi(\sigma^2) \times \pi(\phi^2). \tag{3–24}
$$
With this joint distribution, the marginal likelihood $m_\delta(Y)$, with the parameters $\beta$, $\gamma_\delta$, $\sigma^2$, $\phi^2$, and $Z_{\delta\,mis}$ integrated out of $L_\delta(Y)$, is
$$
m_\delta(Y) = \int \cdots \int (2\pi\sigma^2)^{-n/2} (2\pi\sigma^2\phi^2)^{-s_\delta/2} \exp\!\left(-\frac{|Y - X\beta - Z_\delta\gamma_\delta|^2}{2\sigma^2}\right) \exp\!\left(-\frac{|\gamma_\delta|^2}{2\sigma^2\phi^2}\right) \pi(\sigma^2)\, \pi(\phi^2)\, d\beta\, d\gamma_\delta\, d\sigma^2\, d\phi^2\, dZ_{\delta\,mis}. \tag{3–25}
$$
The first step is to integrate $\beta$ out of the above formula. Looking only at the exponential term involving $\beta$, we can write
$$
\exp\!\left(-\frac{|Y - X\beta - Z_\delta\gamma_\delta|^2}{2\sigma^2}\right) = \exp\!\left(-\frac{|X\beta - (Y - Z_\delta\gamma_\delta)|^2}{2\sigma^2}\right) = \exp\!\left(-\frac{\beta' X' X \beta}{2\sigma^2}\right) \exp\!\left(\frac{(Y - Z_\delta\gamma_\delta)' X \beta}{\sigma^2}\right) \exp\!\left(-\frac{|Y - Z_\delta\gamma_\delta|^2}{2\sigma^2}\right). \tag{3–26}
$$
Using this organization, we can integrate $\beta$ out of Equation 3–25 and get
$$
m_\delta(Y) = \int \cdots \int (2\pi\sigma^2)^{-n/2} (2\pi\sigma^2\phi^2)^{-s_\delta/2} (2\pi\sigma^2)^{p/2} |X'X|^{-1/2} \exp\!\left(-\frac{(Y - Z_\delta\gamma_\delta)' P (Y - Z_\delta\gamma_\delta)}{2\sigma^2}\right) \exp\!\left(-\frac{|\gamma_\delta|^2}{2\sigma^2\phi^2}\right) \pi(\sigma^2)\, \pi(\phi^2)\, d\gamma_\delta\, d\sigma^2\, d\phi^2\, dZ_{\delta\,mis}, \tag{3–27}
$$
where $P = I - X(X'X)^{-1}X'$ and $p$ is the dimension of $\beta$. The next step is to integrate out $\gamma_\delta$. We can write the exponential terms in $\gamma_\delta$ as
$$
\exp\!\left(-\frac{(Y - Z_\delta\gamma_\delta)' P (Y - Z_\delta\gamma_\delta)}{2\sigma^2}\right) \exp\!\left(-\frac{|\gamma_\delta|^2}{2\sigma^2\phi^2}\right) = \exp\!\left(-\frac{\gamma_\delta' \big(Z_\delta' P Z_\delta + \frac{I}{\phi^2}\big) \gamma_\delta}{2\sigma^2} + \frac{Y' P Z_\delta \gamma_\delta}{\sigma^2}\right) \exp\!\left(-\frac{Y' P Y}{2\sigma^2}\right). \tag{3–28}
$$
Now we can integrate $\gamma_\delta$ out of Equation 3–27 and get
$$
m_\delta(Y) = \int \cdots \int (2\pi\sigma^2)^{-n/2} (2\pi\sigma^2\phi^2)^{-s_\delta/2} (2\pi\sigma^2)^{p/2} |X'X|^{-1/2} (2\pi\sigma^2)^{s_\delta/2} \left|Z_\delta' P Z_\delta + \frac{I}{\phi^2}\right|^{-1/2} \exp\!\left(\frac{Y' P Z_\delta \big(Z_\delta' P Z_\delta + \frac{I}{\phi^2}\big)^{-1} Z_\delta' P Y}{2\sigma^2}\right) \exp\!\left(-\frac{Y' P Y}{2\sigma^2}\right) \pi(\sigma^2)\, \pi(\phi^2)\, d\sigma^2\, d\phi^2\, dZ_{\delta\,mis}. \tag{3–29}
$$
Reorganizing Equation 3–29 a little, we have
$$
m_\delta(Y) = \int \cdots \int (2\pi)^{-(n-p)/2} (\sigma^2)^{-(n-p)/2} (\phi^2)^{-s_\delta/2} |X'X|^{-1/2} |P_{\delta,\phi^2}|^{-1/2} \exp\!\left(\frac{Y' P Z_\delta P_{\delta,\phi^2}^{-1} Z_\delta' P Y}{2\sigma^2}\right) \exp\!\left(-\frac{Y' P Y}{2\sigma^2}\right) \pi(\sigma^2)\, \pi(\phi^2)\, d\sigma^2\, d\phi^2\, dZ_{\delta\,mis}, \tag{3–30}
$$
where $P_{\delta,\phi^2} = Z_\delta' P Z_\delta + \frac{I}{\phi^2}$.
The next step is to integrate out $\sigma^2$, whose prior is
$$
\pi(\sigma^2) = \frac{b^a}{\Gamma(a)} \frac{\exp\!\left(-\frac{b}{\sigma^2}\right)}{(\sigma^2)^{a+1}}.
$$
The integration yields
$$
m_\delta(Y) = \int \int (2\pi)^{-(n-p)/2} (\phi^2)^{-s_\delta/2} |X'X|^{-1/2} |P_{\delta,\phi^2}|^{-1/2} \times \frac{b^a\, \Gamma\!\left(\frac{n-p}{2} + a\right)}{\Gamma(a)\left(\frac{Y'PY}{2} - \frac{Y' P Z_\delta P_{\delta,\phi^2}^{-1} Z_\delta' P Y}{2} + b\right)^{\frac{n-p}{2}+a}} \pi(\phi^2)\, d\phi^2\, dZ_{\delta\,mis}. \tag{3–31}
$$
The next step would be to integrate the missing data out of the above marginal expression; however, the number of missing-SNP combinations is huge, and there is no explicit form for that sum. As for the integration over $\phi^2$, we cannot integrate it out directly, so we leave the integral as it stands. This derivation shows that we cannot directly calculate the marginal likelihood, and therefore we do not have a closed form for the Bayes factor.
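Although the remaining sum and $\phi^2$ integral have no closed form, the factor inside (3–31) is computable for a fixed $\phi^2$ and a fixed completion of the missing SNPs. The following is a hedged sketch of that log-scale evaluation; the function name and hyperparameter interface are assumptions for illustration.

```python
import numpy as np
from math import lgamma

def log_inner_3_31(Y, X, Z_delta, phi2, a, b):
    """Hedged sketch: log of the integrand of (3-31) for fixed phi^2 and a
    fixed completion of Z_delta, i.e. with beta, gamma_delta, and sigma^2
    already integrated out (the phi^2 prior and the sum over the missing
    SNP configurations are what remain intractable)."""
    n, p = X.shape
    s_delta = Z_delta.shape[1]
    P = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)        # I - X(X'X)^{-1}X'
    Pdp = Z_delta.T @ P @ Z_delta + np.eye(s_delta) / phi2   # P_{delta, phi^2}
    u = Z_delta.T @ (P @ Y)                                  # Z_delta' P Y
    quad = Y @ P @ Y - u @ np.linalg.solve(Pdp, u)           # Y'PY - Y'PZ Pdp^{-1} Z'PY
    return (-0.5 * (n - p) * np.log(2 * np.pi)
            - 0.5 * s_delta * np.log(phi2)
            - 0.5 * np.linalg.slogdet(X.T @ X)[1]
            - 0.5 * np.linalg.slogdet(Pdp)[1]
            + a * np.log(b) + lgamma(0.5 * (n - p) + a) - lgamma(a)
            - (0.5 * (n - p) + a) * np.log(0.5 * quad + b))
```

Evaluating this once per missing-SNP configuration makes the combinatorial explosion tangible: with $m$ missing SNP genotypes there are $3^m$ (or $2^m$, depending on the coding) completions to sum over.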
3.3 Markov Chain Monte Carlo Property
Now we want to apply a variable selection method to detect good subsets of SNPs. Suppose each individual observation in the data set has $s$ SNPs. With each subset of variables we associate an index vector $\delta$, each element of which is either 1 or 0: if it is 1, the corresponding variable is included in the model; otherwise it is not. An example of a $\delta$ vector is
$$
\delta = \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 1 \end{pmatrix}_{s \times 1},
$$
in which the first variable is included in the model, the second is not, and the last is included.

Our plan is to set up a Markov chain driven by Bayes factors such that the stationary distribution of the chain is proportional to the distribution of $\delta$, in other words, the distribution of models over the model sample space.
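The indicator-vector bookkeeping described above amounts to column selection on the design matrix. A minimal illustration with a toy matrix (all values are hypothetical):

```python
import numpy as np

# An indicator vector delta selects the columns of Z that enter the model
# (Z_delta); its complement gives the excluded columns (Z_l).
s = 5
Z = np.arange(20.0).reshape(4, s)      # toy design matrix: 4 individuals, 5 SNPs
delta = np.array([1, 0, 1, 0, 1])      # SNPs 1, 3, and 5 included

Z_delta = Z[:, delta == 1]
Z_l = Z[:, delta == 0]
print(Z_delta.shape, Z_l.shape)  # (4, 3) (4, 2)
```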
There are $s$ variables in the data, in our case $s$ SNPs, so there are $2^s$ models in the model space. Each of the $2^s$ models has an associated index vector $\delta_i$, $i = 1, \ldots, 2^s$. The model associated with $\delta_i$ has marginal likelihood
$$
m_{\delta_i}(Y) = \int f(Y \mid \theta_{\delta_i})\, \pi(\theta_{\delta_i})\, d\theta_{\delta_i},
$$
where $\theta_{\delta_i}$ is the parameter vector of model $\delta_i$. Let $B(\delta_i)$ denote the probability of model $\delta_i$ in the model sample space; then
$$
B(\delta_i) = \frac{m_{\delta_i}(Y)}{\sum_{j=1}^{2^s} m_{\delta_j}(Y)} = \frac{m_{\delta_i}(Y)/m_1(Y)}{\sum_{j=1}^{2^s} m_{\delta_j}(Y)/m_1(Y)} = \frac{BF_{\delta_i,\,\delta=1}}{1 + \sum_{j:\, \delta_j \neq 1} BF_{\delta_j,\,\delta=1}}.
$$
So $B(\delta_i)$ is the target distribution, and it has a finite sample space with the above probability distribution.
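The normalization above is a one-liner once the Bayes factors against the full model are available. A hedged sketch (the function name is an assumption; the full model contributes its own Bayes factor of 1):

```python
import numpy as np

def model_probs(bf):
    """Hedged sketch: turn Bayes factors BF_{delta_j, delta=1} into model
    probabilities B(delta_j).  The list must include the full model itself,
    whose Bayes factor against itself is 1, so the denominator is
    1 + sum over the other models."""
    bf = np.asarray(bf, dtype=float)
    return bf / bf.sum()
```

For example, `model_probs([1.0, 5.0, 0.5])` divides each entry by 6.5, so the middle (best) model carries most of the probability.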
The following is the plan for sampling the target distribution.

• Run two parallel Markov chains. One chain is a Gibbs sampler for the full model; it samples all the parameters in the full model, including all the missing SNPs.

• Concurrently, run a hybrid M-H chain to select good models, in other words, good subsets of SNPs. When running this chain, we employ samples from the parallel Gibbs chain to estimate the Bayes factors. The M-H steps are:

1. Suppose the current state is δ1. Use a mixture distribution to generate the candidate δ2: with probability p, draw a sample using a random walk algorithm; with probability (1 − p), draw a sample from the uniform distribution on the sample space. The draws from the uniform part are i.i.d., and the tuning parameter p is set to 0.75 to ensure a random walk most of the time while retaining the ability to jump out of local modes.

2. Using the samples from the Gibbs sampler, calculate the approximate Bayes factor for the candidate state and compute the acceptance probability ρ(δ1, δ2).

3. Draw u from the uniform distribution U(0, 1); if u < ρ(δ1, δ2), take δ2 as the next state; otherwise stay at δ1.

4. Repeat these steps until the chain reaches its stationary distribution.

This is the chain we designed, but in actual computation the steps are a little different: we run the Gibbs sampler first, and then, with the samples from the Gibbs chain, we run the M-H chain to find good subsets.

We plan to use the following candidate random walk distribution. Suppose the chain is currently at state δ1. Randomly choose one number r from the discrete uniform distribution on (1, 2, . . . , s) and flip the rth element of δ1 from 1 to 0 or from 0 to 1, keeping all other elements of δ1 untouched. Denote the candidate sample from the random walk by δ′2. Let δ′′2 denote the candidate state from the other part of the mixture distribution, the uniform distribution on the sample space; each element of δ′′2 is independently and randomly taken to be either 0 or 1.
3.3.1 Candidate Distribution
The conditional distribution can be written as
$$
q(\delta_2 \mid \delta_1) = p \sum_{j=1}^{s} I_{\delta_1}(\delta_2 = \delta'_{2j})\, \frac{1}{s} + (1-p) \sum_{k=1}^{2^s} I(\delta_2 = \delta''_{2k})\, \frac{1}{2^s}.
$$
Also we have
$$
q(\delta_1 \mid \delta_2) = p \sum_{j=1}^{s} I_{\delta_2}(\delta_1 = \delta'_{1j})\, \frac{1}{s} + (1-p) \sum_{k=1}^{2^s} I(\delta_1 = \delta''_{1k})\, \frac{1}{2^s}.
$$
Obviously, $\delta''_{2k}$ and $\delta''_{1k}$ have the same uniform i.i.d. distribution, which does not depend on the previous states. So for any states $\delta_1$ and $\delta_2$,
$$
(1-p) \sum_{k=1}^{2^s} I(\delta_2 = \delta''_{2k})\, \frac{1}{2^s} = (1-p) \sum_{k=1}^{2^s} I(\delta_1 = \delta''_{1k})\, \frac{1}{2^s} = \frac{1-p}{2^s}.
$$
Now consider any states $\delta_1$ and $\delta_2$. If $\delta_1$ and $\delta_2$ differ in more than one element, or are exactly the same, they cannot transfer directly to each other through the random walk distribution. On the other hand, if $\delta_1$ and $\delta_2$ differ in exactly one element, the random walk part of the probability is $p/s$. This can be summarized as
$$
p \sum_{j=1}^{s} I_{\delta_1}(\delta_2 = \delta'_{2j})\, \frac{1}{s} = p \sum_{j=1}^{s} I_{\delta_2}(\delta_1 = \delta'_{1j})\, \frac{1}{s} =
\begin{cases}
0 & \text{more than one element different, or identical} \\
\dfrac{p}{s} & \text{exactly one element different.}
\end{cases}
$$
So the above argument says that for any states $\delta_1$ and $\delta_2$,
$$
q(\delta_2 \mid \delta_1) = q(\delta_1 \mid \delta_2).
$$
Then, with the exact Bayes factors, the acceptance probability is
$$
\rho(\delta_1, \delta_2) = \min\!\left\{\frac{BF_{\delta_2,\,\delta=1}}{BF_{\delta_1,\,\delta=1}}\, \frac{q(\delta_1 \mid \delta_2)}{q(\delta_2 \mid \delta_1)},\; 1\right\} = \min\!\left\{\frac{BF_{\delta_2,\,\delta=1}}{BF_{\delta_1,\,\delta=1}},\; 1\right\},
$$
where $\delta = 1$ is the indicator vector for the full model.
Using samples from the Gibbs sampler and applying Theorem 3.2.1, we obtain $BF^{(n)}_{\delta_2,\,\delta=1}$, a consistent Bayes factor estimate for $BF_{\delta_2,\,\delta=1}$, where $n$ is the number of samples from the Gibbs sampler. Denoting the ratio of estimated Bayes factors by $BF^{(n)}_{\delta_2,\,\delta=1} / BF^{(n)}_{\delta_1,\,\delta=1}$, the estimated acceptance probability is
$$
\rho_n(\delta_1, \delta_2) = \min\!\left\{\frac{BF^{(n)}_{\delta_2,\,\delta=1}}{BF^{(n)}_{\delta_1,\,\delta=1}}\, \frac{q(\delta_1 \mid \delta_2)}{q(\delta_2 \mid \delta_1)},\; 1\right\} = \min\!\left\{\frac{BF^{(n)}_{\delta_2,\,\delta=1}}{BF^{(n)}_{\delta_1,\,\delta=1}},\; 1\right\}. \tag{3–32}
$$
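One step of this chain, mixture proposal plus acceptance test, can be sketched as follows. This is a hedged illustration: the function `mh_step` and the user-supplied `log_bf_hat` callback (which would wrap the Gibbs-sample estimator) are assumptions, not code from the dissertation. Because the proposal is symmetric, the $q$-ratio in (3–32) cancels.

```python
import numpy as np

def mh_step(delta, log_bf_hat, p=0.75, rng=None):
    """Hedged sketch of one step of the hybrid M-H chain of Section 3.3.
    With prob. p, flip one random coordinate of delta (random walk part);
    with prob. 1-p, draw uniformly over {0,1}^s.  Accept the candidate with
    probability min(BF_hat(cand)/BF_hat(curr), 1), computed on the log scale.
    log_bf_hat(delta) -> log of the estimated Bayes factor vs the full model."""
    rng = np.random.default_rng() if rng is None else rng
    s = delta.size
    cand = delta.copy()
    if rng.random() < p:                      # random-walk part: flip one entry
        j = rng.integers(s)
        cand[j] = 1 - cand[j]
    else:                                     # independent uniform part
        cand = rng.integers(0, 2, size=s)
    log_rho = min(log_bf_hat(cand) - log_bf_hat(delta), 0.0)
    return cand if np.log(rng.random()) < log_rho else delta
```

In practice, the same Gibbs draws are reused for both the current and candidate Bayes factor estimates, so only the candidate's summand needs to be recomputed at each step.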
3.3.2 Convergence of Bayes Factors
To run the Metropolis-Hastings Markov chain, we need to calculate $BF_{\delta,\,\delta=1}$, which normally involves part of the total set of SNPs. From the previous sections, we know that the Bayes factors are intractable and we plan to approximate them. In the previous chapter we devised a Gibbs sampler for the full model, which has all SNPs and all parameters in the model. With that, we can use samples from the full-model Gibbs sampler to approximate the Bayes factor. The approximation is justified by Theorem 10.6 of Robert and Casella (2004), which states:

For the Gibbs sampler with algorithm [A.40] of Robert and Casella (2004), if $(Y^{(t)})$ is ergodic, then the distribution $g$ is a stationary distribution for the chain $(Y^{(t)})$ and $f$ is the limiting distribution of the subchain $(X^{(t)})$. Here $g$ is the full joint distribution of $X$.

[A.40] of Robert and Casella (2004) is the following algorithm:
Given $(y_1^{(t)}, \ldots, y_p^{(t)})$,
$$
\begin{aligned}
Y_1^{(t+1)} &\sim g_1(y_1 \mid y_2^{(t)}, \ldots, y_p^{(t)}) \\
Y_2^{(t+1)} &\sim g_2(y_2 \mid y_1^{(t+1)}, y_3^{(t)}, \ldots, y_p^{(t)}) \\
&\;\;\vdots \\
Y_p^{(t+1)} &\sim g_p(y_p \mid y_1^{(t+1)}, \ldots, y_{p-1}^{(t+1)}),
\end{aligned}
$$
where $g_1(y_1 \mid y_2^{(t)}, \ldots, y_p^{(t)}), \ldots, g_p(y_p \mid y_1^{(t+1)}, \ldots, y_{p-1}^{(t+1)})$ are the full conditional distributions.
Our subchain from the Gibbs sampler satisfies the regularity conditions and is Harris recurrent and aperiodic, so it is ergodic. Furthermore, by the ergodic theorem, we have
$$
\frac{BF^{(n)}_{\delta_2,\,\delta=1}}{BF^{(n)}_{\delta_1,\,\delta=1}} \longrightarrow \frac{BF_{\delta_2,\,\delta=1}}{BF_{\delta_1,\,\delta=1}}, \quad \text{as } n \to \infty,
$$
where $BF^{(n)}_{\delta_2,\,\delta=1}$ is the estimator of $BF_{\delta_2,\,\delta=1}$ from the Gibbs sampler with sample size $n$.
3.3.3 Ergodicity Property of This M-H Chain
Ideally, given the exact Bayes factors, we would run an M-H chain on the model sample space to search for good subsets of variables. But with no closed form for the Bayes factors, we approximate them using samples from the Gibbs sampler. We plan to run the M-H chain on the model sample space and the Gibbs sampler for the full model concurrently. Because we use estimated Bayes factors in the M-H chain, we call it an empirical chain. In practice, we use $n$ samples from the Gibbs sampler to estimate the Bayes factors and run the M-H chain for $t$ steps, with $n$ and $t$ both large. In this section, our goal is to prove the ergodicity of the empirical chain as both $n$ and $t$ go to infinity. The proof has two steps: first we consider the situation where $n$ is a fixed large number, and then the situation where both $n$ and $t$ go to infinity.
3.3.3.1 Fixed $n$: uniformly ergodic convergence to the distribution $B_n$

With $n$ fixed, the empirical M-H algorithm with estimated Bayes factors is a regular M-H chain with a finite sample space. We will show that it satisfies the detailed balance condition with stationary distribution
$$
B_n(\delta_i) = \frac{BF^{(n)}_{\delta_i,\,\delta=1}}{1 + \sum_{j:\, \delta_j \neq 1} BF^{(n)}_{\delta_j,\,\delta=1}}, \quad n \text{ fixed}.
$$
Also, since the sample space is finite, this M-H chain is uniformly ergodic.
Now let us show that it satisfies the detailed balance condition. The transition kernel associated with this Markov chain is
$$
K(\delta_1, \delta_2) = \rho(\delta_1, \delta_2)\, q(\delta_2 \mid \delta_1)\, I(\delta_2 \neq \delta_1) + (1 - r(\delta_1))\, I(\delta_1 = \delta_2),
$$
so we need to verify
$$
\rho(\delta_1, \delta_2)\, q(\delta_2 \mid \delta_1)\, I(\delta_2 \neq \delta_1)\, P(\delta = \delta_1) = \rho(\delta_2, \delta_1)\, q(\delta_1 \mid \delta_2)\, I(\delta_2 \neq \delta_1)\, P(\delta = \delta_2)
$$
and
$$
(1 - r(\delta_1))\, I(\delta_1 = \delta_2)\, P(\delta = \delta_1) = (1 - r(\delta_2))\, I(\delta_1 = \delta_2)\, P(\delta = \delta_2).
$$
In these equations, $\rho(\delta_1, \delta_2)$ is the acceptance probability, $q(\delta_2 \mid \delta_1)$ is the candidate distribution given the current state $\delta_1$, and $P(\delta)$ stands for the target distribution of the M-H chain with the above acceptance probability and candidate distribution. In our situation, with $n$ fixed, the target distribution is $B_n(\delta_i)$.
The second equation is obvious: when $\delta_1 = \delta_2$ the left side equals the right side, and when $\delta_1 \neq \delta_2$ both sides equal 0. Since we showed that $q(\delta_2 \mid \delta_1) = q(\delta_1 \mid \delta_2)$, for the first equation, applying Equation 3–32, we need to verify that
$$
\min\!\left\{\frac{BF^{(n)}_{\delta_2,\,\delta=1}}{BF^{(n)}_{\delta_1,\,\delta=1}},\; 1\right\} I(\delta_2 \neq \delta_1)\, B_n(\delta_1) = \min\!\left\{\frac{BF^{(n)}_{\delta_1,\,\delta=1}}{BF^{(n)}_{\delta_2,\,\delta=1}},\; 1\right\} I(\delta_2 \neq \delta_1)\, B_n(\delta_2).
$$
Here we assume $\delta_1 \neq \delta_2$; otherwise the equation holds automatically. That is, we need to verify
$$
\min\!\left\{\frac{BF^{(n)}_{\delta_2,\,\delta=1}}{BF^{(n)}_{\delta_1,\,\delta=1}},\; 1\right\} \frac{BF^{(n)}_{\delta_1,\,\delta=1}}{1 + \sum_{j:\, \delta_j \neq 1} BF^{(n)}_{\delta_j,\,\delta=1}} = \min\!\left\{\frac{BF^{(n)}_{\delta_1,\,\delta=1}}{BF^{(n)}_{\delta_2,\,\delta=1}},\; 1\right\} \frac{BF^{(n)}_{\delta_2,\,\delta=1}}{1 + \sum_{j:\, \delta_j \neq 1} BF^{(n)}_{\delta_j,\,\delta=1}}. \tag{3–33}
$$
If $BF^{(n)}_{\delta_2,\,\delta=1} < BF^{(n)}_{\delta_1,\,\delta=1}$, the left side of Equation 3–33 equals
$$
\frac{BF^{(n)}_{\delta_2,\,\delta=1}}{1 + \sum_{j:\, \delta_j \neq 1} BF^{(n)}_{\delta_j,\,\delta=1}}.
$$
In this situation,
$$
\min\!\left\{\frac{BF^{(n)}_{\delta_1,\,\delta=1}}{BF^{(n)}_{\delta_2,\,\delta=1}},\; 1\right\} = 1,
$$
so the right side of Equation 3–33 also equals
$$
\frac{BF^{(n)}_{\delta_2,\,\delta=1}}{1 + \sum_{j:\, \delta_j \neq 1} BF^{(n)}_{\delta_j,\,\delta=1}},
$$
and the two sides agree. By the same rationale, Equation 3–33 holds when $BF^{(n)}_{\delta_2,\,\delta=1} > BF^{(n)}_{\delta_1,\,\delta=1}$. So Equation 3–33 is satisfied, and by Theorem 6.46 of Robert and Casella (2004), $B_n$ is the stationary distribution of this chain.
We have shown that the empirical M-H Markov chain converges uniformly to the distribution $B_n$ when $n$ is fixed. However, our ultimate goal is to show that this chain converges to the distribution $B(\delta_i)$, that is,
$$
B(\delta_i) = \frac{m_{\delta_i}(Y)}{\sum_{j=1}^{2^s} m_{\delta_j}(Y)} = \frac{BF_{\delta_i,\,\delta=1}}{1 + \sum_{j:\, \delta_j \neq 1} BF_{\delta_j,\,\delta=1}}. \tag{3–34}
$$
3.3.3.2 Ergodic convergence to B
We use $K_{B_n}$ to denote the transition kernel of our empirical M-H chain with the estimated Bayes factors, $B$ to denote the distribution of models with exact Bayes factors, and $B_n$ to denote the distribution of models with estimated Bayes factors.

Theorem 3.3.1. Consider the two parallel chains, a Gibbs sampler on the full model with all parameters and an M-H chain on the model space, as defined in Section 3.3. The empirical chain converges ergodically to the target stationary distribution:
$$
\| K^{(t)}_{B_n}(\delta_0, \cdot) - B \|_{TV} \longrightarrow 0, \tag{3–35}
$$
where $\| \cdot \|_{TV}$ denotes the total variation norm, $\delta_0$ is the initial state of the chain, and $B$ is defined in Equation 3–34.
We now prove the above claim.

Proof.
$$
\begin{aligned}
\| K^{(t)}_{B_n}(\delta_0, \cdot) - B \|_{TV} &= \| K^{(t)}_{B_n}(\delta_0, \cdot) - B_n + B_n - B \|_{TV} \\
&\leq \| K^{(t)}_{B_n}(\delta_0, \cdot) - B_n \|_{TV} + \| B_n - B \|_{TV} \\
&= \sup_{\Delta} \big| K^{(t)}_{B_n}(\delta_0, \Delta) - B_n(\Delta) \big| + \sup_{\Delta} \big| B_n(\Delta) - B(\Delta) \big| \\
&= \frac{1}{2} \sum_{\delta} \big| K^{(t)}_{B_n}(\delta_0, \delta) - B_n(\delta) \big| + \frac{1}{2} \sum_{\delta} \big| B_n(\delta) - B(\delta) \big|.
\end{aligned} \tag{3–36}
$$
In the above, the inequality is the triangle inequality for the total variation norm, the second equality is the definition of the total variation norm, and the third equality is Scheffé's lemma. By the ergodic property of the Gibbs sampler, we know that
$$
\frac{1}{2} \sum_{\delta} \big| B_n(\delta) - B(\delta) \big| \longrightarrow 0, \quad \text{as } n \to \infty. \tag{3–37}
$$
So to establish Equation 3–35 we must show
$$
\frac{1}{2} \sum_{\delta} \big| K^{(t)}_{B_n}(\delta_0, \delta) - B_n(\delta) \big| \longrightarrow 0, \quad \text{as } n, t \to \infty.
$$
For any $\varepsilon > 0$, we can always find a pair $(n, t_n)$ such that
$$
\big| K^{(t_n)}_{B_n}(\delta_0, \delta) - B_n(\delta) \big| < \xi,
$$
where $\xi = 2\varepsilon/2^s > 0$ and $s$ is the number of variables in the data set. This is possible because, from the previous proof, for any $n$,
$$
\big| K^{(t)}_{B_n}(\delta_0, \delta) - B_n(\delta) \big| \longrightarrow 0, \quad \text{as } t \to \infty,
$$
so we can always find a $t_n$ satisfying $\big| K^{(t_n)}_{B_n}(\delta_0, \delta) - B_n(\delta) \big| < \xi$. By Theorem 13.3.2 of Meyn and Tweedie (2008), with $n$ fixed we have
$$
\big| K^{(t)}_{B_n}(\delta_0, \delta) - B_n(\delta) \big| \geq \big| K^{(t+1)}_{B_n}(\delta_0, \delta) - B_n(\delta) \big|.
$$
So for any pair $(n, t)$ with $t > t_n$ and any $\delta$, we always have
$$
\big| K^{(t)}_{B_n}(\delta_0, \delta) - B_n(\delta) \big| < \xi.
$$
Furthermore, from the definition of $\xi$, for any $n$ we can always find $(n, t)$ such that, when $t > t_n$,
$$
\frac{1}{2} \sum_{\delta} \big| K^{(t)}_{B_n}(\delta_0, \delta) - B_n(\delta) \big| < \varepsilon.
$$
So when $n \to \infty$ and $t_n < t \to \infty$,
$$
\frac{1}{2} \sum_{\delta} \big| K^{(t)}_{B_n}(\delta_0, \delta) - B_n(\delta) \big| \longrightarrow 0. \tag{3–38}
$$
Combining Equation 3–37 and Equation 3–38, we have
$$
\| K^{(t)}_{B_n}(\delta_0, \cdot) - B \|_{TV} \longrightarrow 0, \quad \text{as } n, t \to \infty. \tag{3–39}
$$
This proof shows that our M-H chain converges to the target stationary distribution with the ergodic property.
3.4 Computation Speed
Computation speed is always a problem in variable selection. For our problem, in
every step of Markov chain, for one candidate model with index δ, we need to calculate an
estimated Bayes factor for the model δ verses the reference model.
From the above sections, we know that if we have samples $(\beta^{(i)}, \gamma^{(i)}, \sigma^{2(i)}, \phi^{2(i)}, Z^{(i)}_{mis})$ generated by Gibbs sampling from model $M_1$, the reference model (also the full model), we can approximate the Bayes factor by
\[
\frac{1}{n}\sum_{i=1}^{n}
\frac{(2\pi\sigma^{2(i)})^{-s_l/2}\,\bigl|Z^{(i)\prime}_l Z^{(i)}_l\bigr|^{1/2}
\exp\!\Bigl(-\frac{(Y - X\beta^{(i)} - Z^{(i)}_{\delta}\gamma^{(i)}_{\delta})' P^{(i)}_l (Y - X\beta^{(i)} - Z^{(i)}_{\delta}\gamma^{(i)}_{\delta})}{2\sigma^{2(i)}}\Bigr)}
{(2\pi\sigma^{2(i)}\phi^{2(i)})^{-s_l/2}
\exp\!\Bigl(-\frac{|\gamma^{(i)}_l|^2}{2\sigma^{2(i)}\phi^{2(i)}}\Bigr)}
\longrightarrow BF_{\delta,\,\delta=1},
\]
where $Z^{(i)}_{\delta}$ is the matrix whose columns are the columns of the updated matrix $Z^{(i)}$ for which the corresponding elements of $\delta$ equal 1. On the other hand, $Z^{(i)}_l$ is composed of the remaining columns of $Z^{(i)}$, those that do not appear in $Z^{(i)}_{\delta}$. Also $P_l = Z_l (Z_l' Z_l)^{-1} Z_l'$.
In the above formula, for one Bayes factor estimate we need to compute $\bigl|Z^{(i)\prime}_l Z^{(i)}_l\bigr|$ and $(Z^{(i)\prime}_l Z^{(i)}_l)^{-1}$ $n$ times each. As we know, the determinant calculation is closely related to the inverse calculation, and if the inverse is available, the determinant is much easier to obtain. So we want to take advantage of this property without directly calculating the determinants.
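To make the computation concrete, one summand of the Bayes factor approximation above can be sketched in code. This is a sketch only: the function name `bf_term` and the synthetic inputs are illustrative, not part of the thesis software, and it assumes $Z_l$ has full column rank.

```python
import numpy as np

def bf_term(Y, X, Z, delta, beta, gamma, sigma2, phi2):
    # One summand of the Monte Carlo Bayes factor estimate; delta is a 0/1
    # vector picking the columns of the candidate model.
    Zd = Z[:, delta == 1]                 # columns included in the model
    Zl = Z[:, delta == 0]                 # columns left out
    gl = gamma[delta == 0]
    sl = Zl.shape[1]
    ZtZ = Zl.T @ Zl
    Pl = Zl @ np.linalg.solve(ZtZ, Zl.T)  # P_l = Z_l (Z_l'Z_l)^{-1} Z_l'
    resid = Y - X @ beta - Zd @ gamma[delta == 1]
    # Work on the log scale: log of |Z_l'Z_l|^{1/2} via slogdet.
    log_num = (-0.5 * sl * np.log(2 * np.pi * sigma2)
               + 0.5 * np.linalg.slogdet(ZtZ)[1]
               - resid @ Pl @ resid / (2 * sigma2))
    log_den = (-0.5 * sl * np.log(2 * np.pi * sigma2 * phi2)
               - gl @ gl / (2 * sigma2 * phi2))
    return np.exp(log_num - log_den)
```

Averaging `bf_term` over the $n$ Gibbs draws gives the estimate; working on the log scale avoids underflow in the exponentials.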
In this section, we discuss two methods to speed up the computation. First, we introduce a method to update the inverse from the inverse calculated at the previous step, without recomputing it from scratch. Second, we discuss the direct replacement of $Z^{(i)}$ with $\bar{Z}$, and the justification for doing so.
3.4.1 Matrix Inversion
3.4.1.1 Two columns of parameters for one column of SNPs
Suppose we have samples $i = 1, \ldots, n$ from the Gibbs sampler. In the Gibbs sampling program, we originally used two parameters to code one SNP, so in the design matrix of the SNPs, two columns code one column of SNPs. Previously, in Chapter 2, we showed that updating one column of SNPs in each Gibbs sampler cycle still preserves the target stationary distribution while dramatically speeding up the computation. So in the Gibbs sampler, at the $i$th iteration we update 2 columns of the matrix $Z^{(i-1)}$ from the previous iteration. Note that the update applies only to the missing SNPs; the parts of the design matrix with observed SNPs are never touched. In each iteration of the Gibbs sampler, we stored only the updated SNPs, to save storage space. For the Bayesian variable selection problem, at each step of the Markov chain we must therefore first recover 2 columns of the design matrix $Z^{(i)}$, since the Gibbs sampler saved only 2 columns of imputed SNPs and not the whole matrix $Z^{(i)}$. This is an expensive step, because for each of the $n$ samples we need to first find the index of the updated SNPs, then find the missing positions of the indexed SNPs, and finally update the corresponding missing SNPs. Considering the size of the problem, this is the only feasible approach; the time spent on this step is unavoidable and we do not have a better solution.
Another place where most of the calculation time is spent is in finding $\bigl|Z^{(i)\prime}_l Z^{(i)}_l\bigr|$ and $(Z^{(i)\prime}_l Z^{(i)}_l)^{-1}$. Right now we are dealing with a data set of 44 SNPs; if $\sum(1-\delta) = 30$, so that 30 SNPs are not included in the submodel, then to calculate one Bayes factor we need to perform $n$ matrix inversions of dimension $60 \times 60$, as each column of SNPs uses 2 columns for coding.
Now we will discuss the difference between the just-updated matrix $Z^{(i)}$ and the previous design matrix $Z^{(i-1)}$. Then we give a method for calculating the inverse at step $i$ based on the inverse from step $i-1$. This discussion gives a flavor of the final method we employ to calculate the inverse matrix.
Suppose at the last step we have the matrix $Z_{(0)} = [Z_1, \ldots, Z_j, Z_{j+1}, \ldots, Z_{2s}]$, and at the current step the $j$th and $(j+1)$th columns of $Z_{(0)}$ are updated to $\dot{Z}_j$ and $\dot{Z}_{j+1}$, since in each iteration only one column of SNPs is updated and we use 2 columns of the design matrix to record one column of SNPs. That is, the current updated SNP matrix is $Z_{(1)} = [Z_1, \ldots, \dot{Z}_j, \dot{Z}_{j+1}, \ldots, Z_{2s}]$. As shown above, we are interested in calculating $(Z^{(i)\prime}_l Z^{(i)}_l)^{-1}$ and $\bigl|Z^{(i)\prime}_l Z^{(i)}_l\bigr|$. Although $Z_l$ is not exactly the same as $Z$, here we will just discuss how to obtain $(Z_{(1)}' Z_{(1)})^{-1}$, making the necessary adjustments when programming. Let us first write out $Z_{(1)}' Z_{(1)}$.
We can write
\[
Z_{(1)}' Z_{(1)} =
\begin{pmatrix} Z_1' \\ \vdots \\ \dot{Z}_j' \\ \dot{Z}_{j+1}' \\ \vdots \\ Z_{2s}' \end{pmatrix}
\begin{pmatrix} Z_1 & \cdots & \dot{Z}_j & \dot{Z}_{j+1} & \cdots & Z_{2s} \end{pmatrix}, \tag{3–40}
\]
and doing the multiplication we have
\[
Z_{(1)}' Z_{(1)} =
\begin{pmatrix}
Z_1'Z_1 & \cdots & Z_1'\dot{Z}_j & Z_1'\dot{Z}_{j+1} & \cdots & Z_1'Z_{2s} \\
\vdots & & \vdots & \vdots & & \vdots \\
\dot{Z}_j'Z_1 & \cdots & \dot{Z}_j'\dot{Z}_j & \dot{Z}_j'\dot{Z}_{j+1} & \cdots & \dot{Z}_j'Z_{2s} \\
\dot{Z}_{j+1}'Z_1 & \cdots & \dot{Z}_{j+1}'\dot{Z}_j & \dot{Z}_{j+1}'\dot{Z}_{j+1} & \cdots & \dot{Z}_{j+1}'Z_{2s} \\
\vdots & & \vdots & \vdots & & \vdots \\
Z_{2s}'Z_1 & \cdots & Z_{2s}'\dot{Z}_j & Z_{2s}'\dot{Z}_{j+1} & \cdots & Z_{2s}'Z_{2s}
\end{pmatrix}.
\]
Next we will use $A$ to denote $Z_{(0)}' Z_{(0)}$ and $A_1$ to denote $Z_{(1)}' Z_{(1)}$. Further, we can write
\[
A_1 = A + H_a + H_b,
\]
where
\[
(H_a)_{2s\times 2s} =
\begin{pmatrix}
0 & \cdots & 0 & (Z_1'\dot{Z}_j - Z_1'Z_j) & (Z_1'\dot{Z}_{j+1} - Z_1'Z_{j+1}) & 0 & \cdots & 0 \\
\vdots & & \vdots & \vdots & \vdots & \vdots & & \vdots \\
0 & \cdots & 0 & (\dot{Z}_j'\dot{Z}_j - Z_j'Z_j) & (\dot{Z}_j'\dot{Z}_{j+1} - Z_j'Z_{j+1}) & 0 & \cdots & 0 \\
0 & \cdots & 0 & (\dot{Z}_{j+1}'\dot{Z}_j - Z_{j+1}'Z_j) & (\dot{Z}_{j+1}'\dot{Z}_{j+1} - Z_{j+1}'Z_{j+1}) & 0 & \cdots & 0 \\
\vdots & & \vdots & \vdots & \vdots & \vdots & & \vdots \\
0 & \cdots & 0 & (Z_{2s}'\dot{Z}_j - Z_{2s}'Z_j) & (Z_{2s}'\dot{Z}_{j+1} - Z_{2s}'Z_{j+1}) & 0 & \cdots & 0
\end{pmatrix},
\]
with its nonzero entries in columns $j$ and $j+1$, and
\[
(H_b)_{2s\times 2s} =
\begin{pmatrix}
0 & \cdots & 0 & 0 & \cdots & 0 \\
\vdots & & \vdots & \vdots & & \vdots \\
(\dot{Z}_j'Z_1 - Z_j'Z_1) & \cdots & 0 & 0 & \cdots & (\dot{Z}_j'Z_{2s} - Z_j'Z_{2s}) \\
(\dot{Z}_{j+1}'Z_1 - Z_{j+1}'Z_1) & \cdots & 0 & 0 & \cdots & (\dot{Z}_{j+1}'Z_{2s} - Z_{j+1}'Z_{2s}) \\
\vdots & & \vdots & \vdots & & \vdots \\
0 & \cdots & 0 & 0 & \cdots & 0
\end{pmatrix},
\]
with its nonzero entries in rows $j$ and $j+1$ but columns $j$ and $j+1$ set to zero, so that the $2\times 2$ block at the intersection is counted only once.
The above equations show that the difference between the previous matrix $A$ and the updated matrix $A_1$ lies in just 2 rows and 2 columns. This is a special structure, and we want to take advantage of it.
We already have $(Z_{(0)}' Z_{(0)})^{-1}$, that is, $A^{-1}$ from the last step, and notice that $\mathrm{rank}(H_a) = 2$ and $\mathrm{rank}(H_b) = 2$. We can write
\[
A_1 = A + H_a + H_b = AB, \tag{3–41}
\]
where
\[
B = I + A^{-1}H_a + A^{-1}H_b. \tag{3–42}
\]
According to Miller (1981), if $H$ is a rank 2 matrix, we have the following formula:
\[
(I + H)^{-1} = I - \frac{1}{a + b}\,(aH - H^2), \tag{3–43}
\]
where
\[
a = 1 + \mathrm{tr}(H) \quad \text{and} \quad 2b = \mathrm{tr}(H)^2 - \mathrm{tr}(H^2).
\]
For our situation, if we are able to calculate $B^{-1}$, then $A_1^{-1} = B^{-1}A^{-1}$. We can write
\[
B = I + A^{-1}H_a + A^{-1}H_b = (I + A^{-1}H_a)\bigl(I + (I + A^{-1}H_a)^{-1}A^{-1}H_b\bigr). \tag{3–44}
\]
Then the problem becomes finding $(I + A^{-1}H_a)^{-1}$ and $\bigl(I + (I + A^{-1}H_a)^{-1}A^{-1}H_b\bigr)^{-1}$. Suppose we are able to calculate
\[
C = (I + A^{-1}H_a)^{-1};
\]
then the problem is to find $(I + CA^{-1}H_b)^{-1}$, and we will have
\[
B^{-1} = (I + CA^{-1}H_b)^{-1}C. \tag{3–45}
\]
The question boils down to calculating $C$ and $(I + CA^{-1}H_b)^{-1}$. For both, we can apply Equation 3–43.
To calculate $A_1^{-1}$, we use the following sub-steps:
1. Calculate $H_a$ and $H_b$.
2. Calculate $a_c = 1 + \mathrm{tr}(A^{-1}H_a)$ and $b_c = \bigl[\mathrm{tr}(A^{-1}H_a)^2 - \mathrm{tr}\bigl((A^{-1}H_a)^2\bigr)\bigr]/2$, and set $C = I - \frac{1}{a_c + b_c}\bigl(a_c (A^{-1}H_a) - (A^{-1}H_a)^2\bigr)$.
3. Calculate $D = CA^{-1}H_b$, $a_d = 1 + \mathrm{tr}(D)$ and $b_d = \bigl[\mathrm{tr}(D)^2 - \mathrm{tr}(D^2)\bigr]/2$, and set $(I + D)^{-1} = I - \frac{1}{a_d + b_d}(a_d D - D^2)$.
4. Calculate $B^{-1} = (I + D)^{-1}C$.
5. Finally, $A_1^{-1} = B^{-1}A^{-1}$.
These steps yield one matrix inversion $A_1^{-1}$, given that we have $A^{-1}$ from the last step. To get one Bayes factor estimate, we need to repeat these steps $n$ times. We ran a simulation of the above method, and it shows that when two columns of the matrix are updated, the actual calculation time is not shortened compared with standard software such as R and Matlab. Table 3-1 records the time spent on matrix inversion with the different methods.

Table 3-1. Comparison of time spent on inverse calculation using standard software and Miller's method with 2 columns and 2 rows updated.

Matrix dimension:            100×100   500×500   1000×1000   2000×2000
Matlab's direct calculation   0.000s    0.235s     1.609s      11.985s
Miller's method               0.016s    1.454s    11.937s      99.703s
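The five sub-steps above can be checked numerically. The sketch below applies Miller's formula (3–43) twice, exactly as in steps 2–5; here $H_a$ and $H_b$ are generic random rank-2 matrices rather than the SNP-update differences, so this is an illustration of the algebra, not the thesis code.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 8
A = rng.normal(size=(n, n)) + n * np.eye(n)              # well-conditioned base matrix
Ha = rng.normal(size=(n, 2)) @ rng.normal(size=(2, n))   # rank-2 update
Hb = rng.normal(size=(n, 2)) @ rng.normal(size=(2, n))   # rank-2 update
Ainv = np.linalg.inv(A)

def rank2_inv(H):
    # Miller's formula (3-43): (I + H)^{-1} for a rank-2 matrix H.
    a = 1 + np.trace(H)
    b = (np.trace(H) ** 2 - np.trace(H @ H)) / 2
    return np.eye(H.shape[0]) - (a * H - H @ H) / (a + b)

C = rank2_inv(Ainv @ Ha)        # step 2
D = C @ Ainv @ Hb               # step 3 (also rank 2, since Hb is)
Binv = rank2_inv(D) @ C         # step 4: B^{-1} = (I + D)^{-1} C
A1inv = Binv @ Ainv             # step 5: A_1^{-1} = B^{-1} A^{-1}

print(np.allclose(A1inv, np.linalg.inv(A + Ha + Hb)))
```

Only traces and a handful of matrix products are needed, which is what the sub-steps above exploit.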
To speed up the computation, we change the coding of SNPs from 2 parameters per SNP to 1 parameter per SNP. We now assume that the effect of a SNP is linear in the number of copies of a certain allele within that SNP, instead of parameterizing additive and dominant effects. By changing the coding, we cut the work per Bayes factor in half, since only one column of $Z^{(i)}$ is updated instead of 2.

As mentioned above, we decided to code the SNPs another way: one column of parameters for one column of SNPs. The design matrix for the SNPs then records the number of copies of the allele in that SNP. For example, let $\gamma$ represent the effect of allele "C". Then for an individual with SNP genotype "CC", the SNP effect would be $2\gamma$; $1\gamma$ for genotype "CG"; and $0\gamma$ for genotype "GG". With this coding, we went back and re-ran the Gibbs sampler for the full model. We used "1", "0" and "$-1$" instead of "2", "1" and "0", which is equivalent.
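The one-parameter coding can be written as a tiny helper: count copies of the chosen allele and center the count, so that the genotypes map to 1, 0, −1 as described. The function name and the choice of "C" as reference allele are illustrative only.

```python
def code_snp(genotype, ref="C"):
    # Copies of the reference allele, centered: CC -> 1, CG -> 0, GG -> -1.
    return genotype.count(ref) - 1

print([code_snp(g) for g in ["CC", "CG", "GG"]])  # [1, 0, -1]
```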
The following method is directly due to Sadighi and Kalra (1988) and it is an
application of the Sherman-Morrison-Woodbury formula; see Woodbury (1950) and
Bartlett (1951). A review paper about the Sherman-Morrison-Woodbury formula is
written by Hager (1989).
Suppose we have the inverse $A^{-1}$ of a matrix $A$, and we update the $j$th column of $A$ to be $v$. Let $(A)_j$ denote the $j$th column of $A$, let $e_j$ be the vector of all zeros except for a 1 in the $j$th element, and let $A_1$ be the updated matrix. To find $A_1^{-1} = (A + (v - (A)_j)e_j^T)^{-1}$, the following steps are needed:
• calculate $b = A^{-1}(v - (A)_j)$;
• calculate $b' = -\frac{1}{1 + (b)_j}\,b$, where $(b)_j$ is the $j$th element of the vector $b$;
• then $A_1^{-1} = (I + b'e_j^T)A^{-1} = A^{-1} + b'e_j^T A^{-1}$.
Obviously $b'e_j^T$ is just the product of a column vector and a row vector, the calculation of $b$ is a matrix-vector product, and the calculation of $b'$ is a vector times a scalar. None of these calculations is a matrix-matrix product, so they reduce the computation considerably.
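The three steps above are easy to verify numerically. The following sketch (illustrative variable names, random well-conditioned data) replaces column $j$ of $A$ by $v$ and checks that the updated inverse matches a direct inversion.

```python
import numpy as np

rng = np.random.default_rng(1)
n, j = 5, 2
A = rng.normal(size=(n, n)) + n * np.eye(n)   # well conditioned
v = rng.normal(size=n)
Ainv = np.linalg.inv(A)

b = Ainv @ (v - A[:, j])                      # b = A^{-1}(v - (A)_j)
bprime = -b / (1 + b[j])                      # b' = -b / (1 + (b)_j)
# b' e_j^T A^{-1} is the outer product of b' with row j of A^{-1}.
A1inv = Ainv + np.outer(bprime, Ainv[j, :])

A1 = A.copy()
A1[:, j] = v
print(np.allclose(A1inv, np.linalg.inv(A1)))
```

Only a matrix-vector product and an outer product are needed, which is the point of the method.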
Next we will show why the above method works. First write
\[
A_1 = AB, \quad \text{so that} \quad A_1^{-1} = B^{-1}A^{-1}.
\]
As we already have $A^{-1}$, we need to find $B$ and $B^{-1}$ such that $A_1 = AB$. If we let $B = I + be_j^T$, then
\[
A + (v - (A)_j)e_j^T = A(I + be_j^T) \tag{3–46}
\]
\[
\Rightarrow (v - (A)_j)e_j^T = Abe_j^T
\Rightarrow v - (A)_j = Ab
\Rightarrow b = A^{-1}(v - (A)_j).
\]
Furthermore,
\[
B^{-1} = (I + be_j^T)^{-1} =
\begin{pmatrix}
1 & \cdots & 0 & -\frac{b_1}{1+b_j} & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & \vdots & \vdots & & \vdots \\
0 & \cdots & 1 & -\frac{b_{j-1}}{1+b_j} & 0 & \cdots & 0 \\
0 & \cdots & 0 & \frac{1}{1+b_j} & 0 & \cdots & 0 \\
0 & \cdots & 0 & -\frac{b_{j+1}}{1+b_j} & 1 & \cdots & 0 \\
\vdots & & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & -\frac{b_n}{1+b_j} & 0 & \cdots & 1
\end{pmatrix}
= I - \frac{1}{1 + (b)_j}\, b\, e_j^T. \tag{3–47}
\]
Similarly, if the $j$th row of the matrix $A_1$, $A_1^{(j)}$, is updated to be $\mu^T$, and $A_2$ is the updated matrix, then with
\[
c^T = (\mu^T - A_1^{(j)})A_1^{-1} \quad \text{and} \quad C = I + e_j c^T,
\]
we have $A_2 = CA_1$,
\[
\Rightarrow C^{-1} = I - \frac{1}{1 + (c)_j}\, e_j c^T,
\]
where $(c)_j$ is the $j$th element of the vector $c$, and
\[
\Rightarrow A_2^{-1} = A_1^{-1}C^{-1} = A_1^{-1}\Bigl(I - \frac{1}{1 + (c)_j}\, e_j c^T\Bigr) = A_1^{-1} - \frac{1}{1 + (c)_j}\, A_1^{-1} e_j c^T. \tag{3–48}
\]
Table 3-2. Comparison of time spent in calculation of matrix inversion for different methods.

Matrix dimension:                      500×500   1000×1000   1500×1500   2000×2000
Matlab's direct calculation             0.266s     1.547s      5.016s      11.703s
Miller's method                         0.031s     0.188s      0.320s       0.719s
Sadighi & Kalra's direct calculation    0.375s     2.735s      9.453s      21.657s
Sadighi & Kalra's simplification        0.032s     0.110s      0.235s       0.531s
For our situation, when we use one parameter per SNP, each Gibbs sampler iteration updates just the $j$th column and $j$th row of the matrix $Z'Z$ and leaves the other parts untouched. Table 3-2 compares the time needed to calculate the matrix inversion for different dimensions. For Sadighi & Kalra's direct calculation, I simply follow their formula and let Matlab carry out the matrix multiplications directly; Sadighi & Kalra's simplification means specifying the cells and vectors directly, without matrix multiplication. Obviously we take the last method, which is roughly 20 times faster than Matlab's default calculation. We ran the same test in R and the results are similar.
3.4.1.2 Determinant calculation

The next question is how to calculate the determinant of the matrix $A$ when the $j$th column and $j$th row of $A$ are updated to $v$ and $u^T$, respectively. We assume that we have $A^{-1}$ and $|A|$ at hand. Let us first get the determinant of the matrix with one column updated; after that, we can apply the same method to get the determinant with both the row and the column updated.

We write
\[
A_1 = A + (v - (A)_j)e_j^T, \tag{3–49}
\]
where $e_j$ is the vector with the $j$th element equal to 1 and all others 0. As shown before, we can write $A_1 = AB$, with $B = I + be_j^T$ and $b = A^{-1}(v - (A)_j)$, where $(A)_j$ is the $j$th column of $A$. According to the theory of determinants of products of matrices, for two square matrices $G$ and $H$ of the same dimension,
\[
|GH| = |G| \times |H|. \tag{3–50}
\]
So for our situation,
\[
|A_1| = |A| \times |B| = |A| \times |I + be_j^T|.
\]
Next we will show
\[
|I + be_j^T| = 1 + b_j,
\]
where $b_j$ is the $j$th element of the vector $b$. First we write
\[
|I + be_j^T| =
\begin{vmatrix}
1 & \cdots & 0 & b_1 & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & \vdots & \vdots & & \vdots \\
0 & \cdots & 1 & b_{j-1} & 0 & \cdots & 0 \\
0 & \cdots & 0 & 1 + b_j & 0 & \cdots & 0 \\
0 & \cdots & 0 & b_{j+1} & 1 & \cdots & 0 \\
\vdots & & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & b_n & 0 & \cdots & 1
\end{vmatrix}. \tag{3–51}
\]
If we multiply the $j$th row by $-\frac{b_n}{1+b_j}$ and add it to the $n$th row of the above matrix, the entry $b_n$ is eliminated while the determinant is unchanged:
\[
|I + be_j^T| =
\begin{vmatrix}
1 & \cdots & 0 & b_1 & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & \vdots & \vdots & & \vdots \\
0 & \cdots & 1 & b_{j-1} & 0 & \cdots & 0 \\
0 & \cdots & 0 & 1 + b_j & 0 & \cdots & 0 \\
0 & \cdots & 0 & b_{j+1} & 1 & \cdots & 0 \\
\vdots & & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & 0 & 0 & \cdots & 1
\end{vmatrix}. \tag{3–52}
\]
Similarly, for every other row $i \neq j$, we can multiply the $j$th row by $-\frac{b_i}{1+b_j}$ and add it to the $i$th row, which eliminates the entry $b_i$ without changing the determinant.
So
\[
|I + be_j^T| = \cdots =
\begin{vmatrix}
1 & \cdots & 0 & 0 & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & \vdots & \vdots & & \vdots \\
0 & \cdots & 1 & 0 & 0 & \cdots & 0 \\
0 & \cdots & 0 & 1 + b_j & 0 & \cdots & 0 \\
0 & \cdots & 0 & 0 & 1 & \cdots & 0 \\
\vdots & & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & 0 & 0 & \cdots & 1
\end{vmatrix}
= 1 + b_j. \tag{3–53}
\]
So we have
\[
|A_1| = |A|\,|I + be_j^T| = |A|(1 + b_j), \tag{3–54}
\]
where $b_j = (A^{-1})^{(j)}(v - (A)_j)$ and $(A^{-1})^{(j)}$ is the $j$th row of $A^{-1}$.
When we update the $k$th row of the matrix $A_1$ to be $u^T$, we write
\[
A_2 = A_1 + e_k\,(u^T - A_1^{(k)}) = CA_1, \tag{3–55}
\]
with
\[
C = I + e_k c^T \quad \text{and} \quad c^T = (u^T - A_1^{(k)})A_1^{-1}.
\]
Doing the same calculation as above, we see that
\[
|A_2| = (1 + c_k)\,|A_1|,
\]
where
\[
c_k = (u^T - A_1^{(k)})(A_1^{-1})_k,
\]
with $A_1^{(k)}$ the $k$th row of $A_1$ and $(A_1^{-1})_k$ the $k$th column of $A_1^{-1}$. Finally,
\[
|A_2| = |A|(1 + b_j)(1 + c_k), \tag{3–56}
\]
with $c_k$ and $b_j$ defined as above.
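Equations 3–54 and 3–56 can be verified with a short numerical check; the data here are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, j, k = 5, 1, 3
A = rng.normal(size=(n, n)) + n * np.eye(n)
v = rng.normal(size=n)     # new jth column
u = rng.normal(size=n)     # new kth row
Ainv = np.linalg.inv(A)

bj = Ainv[j, :] @ (v - A[:, j])        # b_j = (A^{-1})^{(j)} (v - (A)_j)
A1 = A.copy(); A1[:, j] = v
ok1 = np.isclose(np.linalg.det(A1), np.linalg.det(A) * (1 + bj))

A1inv = np.linalg.inv(A1)
ck = (u - A1[k, :]) @ A1inv[:, k]      # c_k = (u^T - A_1^{(k)}) (A_1^{-1})_k
A2 = A1.copy(); A2[k, :] = u
ok2 = np.isclose(np.linalg.det(A2), np.linalg.det(A) * (1 + bj) * (1 + ck))

print(ok1, ok2)
```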
3.4.2 Replace Z with an Average
The above matrix inversion method did speed up the computation; however, we would like some further improvement. In this subsection, we try to avoid all $n$ matrix inversions and determinant calculations by substituting $\bar{Z}$ for $Z^{(i)}$.
In the previous work we showed that the following formula can be used for the Bayes factor approximation:
\[
\frac{1}{n}\sum_{i=1}^{n}
\frac{(2\pi\sigma^{2(i)})^{-s_l/2}\,\bigl|Z^{(i)\prime}_l Z^{(i)}_l\bigr|^{1/2}
\exp\!\Bigl(-\frac{(Y - X\beta^{(i)} - Z^{(i)}_{\delta}\gamma^{(i)}_{\delta})' P^{(i)}_l (Y - X\beta^{(i)} - Z^{(i)}_{\delta}\gamma^{(i)}_{\delta})}{2\sigma^{2(i)}}\Bigr)}
{(2\pi\sigma^{2(i)}\phi^{2(i)})^{-s_l/2}
\exp\!\Bigl(-\frac{|\gamma^{(i)}_l|^2}{2\sigma^{2(i)}\phi^{2(i)}}\Bigr)}
\longrightarrow BF_{\delta,\,\delta=1}, \tag{3–57}
\]
where all the samples are from the posterior distribution $\pi_1(\beta, \gamma, \sigma^2, \phi^2, Z_{mis} \mid Y)$.
Theoretically the above method is applicable; however, when it is applied to a large data set with many SNPs, the computation is slow. Suppose we want to calculate a Bayes factor $BF_{\delta,\,\delta=1}$ with $n$ samples of $(\beta^{(i)}, \gamma^{(i)}, \sigma^{2(i)}, \phi^{2(i)}, Z^{(i)}_{mis})$. We then need to calculate the determinant $\bigl|Z^{(i)\prime}_l Z^{(i)}_l\bigr|$ and the matrix inverse $(Z^{(i)\prime}_l Z^{(i)}_l)^{-1}$ $n$ times each. In the above notation, $Z_l$ and $Z_{\delta}$ contain both observed and missing SNPs. When we use the notation $Z^{(i)}_l$, we implicitly mean the $i$th sample of missing SNPs in the matrix $Z_l$; the observed SNPs in $Z_l$ are automatically known. The same is implied in the notation $Z_{\delta}$, and we hope the meanings are clear in the different situations. Although in the last subsection we showed methods that avoid direct calculation of $(Z^{(i)\prime}_l Z^{(i)}_l)^{-1}$ by building on the previous samples, the computation time is still a concern, and in this subsection we explore another method to improve it.
If the matrix $Z^{(i)} = (Z^{(i)}_{\delta}, Z^{(i)}_l)$ did not change with $i$, then for each Bayes factor estimate we would need only one determinant and one matrix inversion, instead of $n$ of each. The method proposed here is to replace all the samples $(Z^{(i)}_{\delta}, Z^{(i)}_l)$ with the expectation $(EZ_{\delta}, EZ_l) = EZ$. In the following we justify this replacement.
Originally we want to calculate the Bayes factor $BF_{\delta,\,\delta=1}$, and we showed that it is equal to
\[
\int\!\cdots\!\int
\frac{(2\pi\sigma^{2})^{-s_l/2}\,|Z_l'Z_l|^{1/2}
\exp\!\Bigl(-\frac{(Y - X\beta - Z_{\delta}\gamma_{\delta})' P_l (Y - X\beta - Z_{\delta}\gamma_{\delta})}{2\sigma^{2}}\Bigr)}
{(2\pi\sigma^{2}\phi^{2})^{-s_l/2}
\exp\!\Bigl(-\frac{|\gamma_l|^2}{2\sigma^{2}\phi^{2}}\Bigr)}
\;\pi_1(\beta, \gamma, \sigma^2, \phi^2, Z_{mis} \mid Y)\; d\beta\, d\gamma\, d\sigma^2\, d\phi^2\, dZ_{mis}. \tag{3–58}
\]
Now simplify the notation. We will use $\theta$ to represent the parameters $(\beta, \gamma, \sigma^2, \phi^2)$ and let $h(\theta, Z)$ denote the function
\[
h(\theta, Z) =
\frac{(2\pi\sigma^{2})^{-s_l/2}\,|Z_l'Z_l|^{1/2}
\exp\!\Bigl(-\frac{(Y - X\beta - Z_{\delta}\gamma_{\delta})' P_l (Y - X\beta - Z_{\delta}\gamma_{\delta})}{2\sigma^{2}}\Bigr)}
{(2\pi\sigma^{2}\phi^{2})^{-s_l/2}
\exp\!\Bigl(-\frac{|\gamma_l|^2}{2\sigma^{2}\phi^{2}}\Bigr)}. \tag{3–59}
\]
Also use $f(\theta, Z)$ to denote the posterior distribution $\pi_1(\beta, \gamma, \sigma^2, \phi^2, Z_{mis} \mid Y)$. So our target integral becomes
\[
\int\!\!\int h(\theta, Z) f(\theta, Z)\, d\theta\, dZ, \tag{3–60}
\]
where the integral is implicitly with respect to the missing data in $Z$. As mentioned above, we want to approximate Equation 3–60 by
\[
\int\!\!\int h(\theta, EZ) f(\theta, Z)\, d\theta\, dZ. \tag{3–61}
\]
Rewrite the above integrals as
\[
\int\!\!\int h(\theta, Z) f(\theta, Z)\, d\theta\, dZ = \int \Bigl[\int h(\theta, Z) f(Z \mid \theta)\, dZ\Bigr] f(\theta)\, d\theta
\]
and
\[
\int\!\!\int h(\theta, EZ) f(\theta, Z)\, d\theta\, dZ = \int h(\theta, EZ)\, f(\theta)\, d\theta.
\]
To fully justify replacing $Z$ with $EZ$ in Equation 3–61, we need to check the following:
\[
\int\!\!\int h(\theta, Z) f(\theta, Z)\, d\theta\, dZ - \int\!\!\int h(\theta, EZ) f(\theta, Z)\, d\theta\, dZ
= \int \Bigl[\int h(\theta, Z) f(Z \mid \theta)\, dZ - \int h(\theta, EZ) f(Z \mid \theta)\, dZ\Bigr] f(\theta)\, d\theta. \tag{3–62}
\]
Further, if we can show that $\int h(\theta, Z) f(Z \mid \theta)\, dZ - \int h(\theta, EZ) f(Z \mid \theta)\, dZ$ is close to zero, it is sufficient to say that Equation 3–62 is close to zero under very general conditions. Now expand $h(\theta, Z) - h(\theta, EZ)$ in a Taylor series about $EZ$:
\[
h(\theta, Z) - h(\theta, EZ)
\approx (Z - EZ)'\Bigl(\frac{\partial h(\theta, Z)}{\partial Z}\Big|_{Z=EZ}\Bigr)
+ \frac{1}{2}(Z - EZ)'\,\frac{\partial^2 h(\theta, Z)}{\partial Z^2}\Big|_{Z=EZ}\,(Z - EZ). \tag{3–63}
\]
So we can write
\[
\int h(\theta, Z) f(Z \mid \theta)\, dZ - \int h(\theta, EZ) f(Z \mid \theta)\, dZ
= \Bigl(\frac{\partial h(\theta, Z)}{\partial Z}\Big|_{Z=EZ}\Bigr)' \int (Z - EZ) f(Z \mid \theta)\, dZ
+ \int \frac{1}{2}(Z - EZ)'\,\frac{\partial^2 h(\theta, Z)}{\partial Z^2}\Big|_{Z=EZ}\,(Z - EZ)\, f(Z \mid \theta)\, dZ. \tag{3–64}
\]
If we can argue that, under certain conditions, either
\[
EZ \approx E_{Z\mid\theta}(Z) \iff \int (Z - EZ) f(Z \mid \theta)\, dZ \approx 0
\]
or
\[
\frac{\partial h(\theta, Z)}{\partial Z}\Big|_{Z=EZ} \approx 0,
\]
then
\[
\Bigl(\frac{\partial h(\theta, Z)}{\partial Z}\Big|_{Z=EZ}\Bigr)' \int (Z - EZ) f(Z \mid \theta)\, dZ \approx 0.
\]
Further, if we neglect the second term in Equation 3–64, then we can say
\[
\int h(\theta, Z) f(Z \mid \theta)\, dZ - \int h(\theta, EZ) f(Z \mid \theta)\, dZ \approx 0.
\]
One step further,
\[
\int \Bigl[\int h(\theta, Z) f(Z \mid \theta)\, dZ - \int h(\theta, EZ) f(Z \mid \theta)\, dZ\Bigr] f(\theta)\, d\theta \approx 0.
\]
As we know, the $Z$ matrix is the design matrix for the SNPs, and about 90% of the SNPs are observed; that is, the entries of $Z$ for the observed SNPs are always fixed at the actual observed values. The expectation calculation actually matters only for the 10% of SNPs that are missing. So we may say that $EZ$ is very close to $E_{Z\mid\theta}(Z)$, and we estimate $EZ$ by the average $\bar{Z}$ of the imputed matrices.
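A toy numerical illustration of this substitution, with synthetic data (not the thesis model): when only about 10% of the entries of Z vary across imputations, a smooth functional such as log|Z'Z| evaluated at the averaged matrix stays close to its Monte Carlo average.

```python
import numpy as np

rng = np.random.default_rng(3)
nobs, p, n_samples = 200, 5, 100
Z_obs = rng.choice([-1.0, 0.0, 1.0], size=(nobs, p))
miss = rng.random((nobs, p)) < 0.10       # ~10% of entries are "missing"

draws, Zs = [], []
for _ in range(n_samples):
    Z = Z_obs.copy()
    Z[miss] = rng.choice([-1.0, 0.0, 1.0], size=miss.sum())  # one imputation
    Zs.append(Z)
    draws.append(np.linalg.slogdet(Z.T @ Z)[1])   # log |Z'Z| for this draw

Zbar = np.mean(Zs, axis=0)                 # average imputed design matrix
mc_avg = np.mean(draws)
at_zbar = np.linalg.slogdet(Zbar.T @ Zbar)[1]
print(abs(mc_avg - at_zbar) / abs(mc_avg))  # small relative difference
```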
We did some numerical simulations to compare the Bayes factor calculations. The simulated data set has 15 SNPs, 20 families, and 800 observations.
Table 3-3 records the different model indicators used for the simulation studies in Tables 3-4 and 3-5. We let the δs vary so that they represent the different model indicators that might occur.
Table 3-3. Records of subset indicators, with the actual values of γ, for Table 3-4 and Table 3-5.

Actual values of γ:  2  0  1  −1  −3  0.1  0  0  0  0  −1  3  0  0  −3
δ1:   1 0 1 1 1 0 0 0 0 0 1 1 0 0 1
δ2:   1 0 1 1 1 0 0 1 0 0 0 0 0 0 0
δ3:   1 0 1 1 0 1 0 0 0 0 1 1 0 0 0
δ4:   1 0 1 1 1 0 0 0 0 0 0 1 0 0 0
δ5:   1 0 0 0 0 1 0 0 0 0 1 1 0 0 1
δ6:   1 0 1 1 1 0 0 0 0 0 0 1 0 0 1
δ7:   1 0 1 1 1 1 0 0 0 0 1 1 0 0 1
δ8:   1 0 1 1 1 1 0 1 1 0 1 1 0 0 1
δ9:   0 0 0 0 1 0 0 0 0 0 0 1 0 0 0
δ10:  1 1 1 1 1 1 0 0 1 0 1 1 0 0 1
δ11:  1 0 1 1 1 1 0 1 0 1 1 1 1 0 1
δ12:  1 1 1 1 1 1 0 1 0 0 1 1 0 0 1
δ13:  1 0 1 1 1 1 1 0 0 1 1 1 0 0 1
δ14:  1 0 1 1 1 1 0 1 0 0 1 1 1 0 1
δ15:  1 0 1 1 1 1 1 1 0 1 1 1 0 1 1
Table 3-4 shows the Bayes factors calculated using different approximation formulas. The upper part of the table uses samples from the Gibbs sampler in which only one column of missing SNPs is updated per cycle, while the bottom part uses samples from the Gibbs sampler in which all the missing SNPs are updated in each cycle. The first column of Table 3-4 is the indicator vector; the actual
Table 3-4. Bayes factor approximation. The first 20000 iterations are used as burn-in, and the following 400 samples are used for the calculation.

Sample one column of SNPs per iteration
δ     formula (3–57)   Zbar           original Z
δ1    4.0057e+009      7.6668e+009    5.6653e+009
δ2    6.055e−036       2.1169e−052    4.0221e−062
δ3    1.7917e−042      2.1869e−057    2.3773e−065
δ4    8.2841e−014      2.558e−019     2.4601e−021
δ5    6.2531e−016      1.9721e−021    1.5186e−031
δ6    6.0943e+006      1.6279e+007    9.3191e+006
δ7    2.3314e+008      2.8145e+008    3.8463e+008
δ8    7.4575e+005      1.1799e+006    1.1003e+006
δ9    4.8807e−032      2.0811e−043    1.3484e−052
δ10   7.6769e+005      1.2058e+006    1.3248e+006
δ11   2.9626e+004      2.5816e+004    3.8642e+004
δ12   8.2784e+005      1.1381e+006    9.546e+005
δ13   8.2753e+005      6.0808e+005    9.0081e+005
δ14   5.8732e+005      5.8733e+005    6.8705e+005
δ15   4.2755e+003      4.3482e+003    3.1384e+003

Sample all SNPs together in one cycle
δ     formula (3–57)   Zbar           original Z
δ1    2.2281e+008      1.6784e+008    3.253e+008
δ2    2.6563e−047      2.5197e−059    2.6204e−066
δ3    6.7539e−047      1.8281e−063    1.2993e−068
δ4    1.6489e−017      1.0475e−024    2.7792e−028
δ5    5.4427e−024      1.6637e−033    2.0014e−042
δ6    3.411e+009       3.5043e+008    1.9252e+007
δ7    4.8621e+006      4.6095e+006    1.2051e+007
δ8    2.7196e+004      1.2130e+004    5.5845e+004
δ9    9.0771e−027      1.3545e−043    2.1829e−049
δ10   1.1878e+005      5.2192e+004    1.2118e+005
δ11   6.8893e+003      4.6568e+003    1.4376e+004
δ12   6.6379e+004      3.1702e+004    7.7065e+004
δ13   2.3339e+004      1.7211e+004    5.5039e+004
δ14   1.028e+005       1.0241e+005    3.2299e+005
δ15   228.4            175.74         562.88
values of δ are from Table 3-3. The second column gives the Bayes factor estimates using the formula we originally planned to use, which ensures the estimates are consistent. The third column gives the Bayes factor estimates in which the missing SNPs are replaced with the averages $\bar{Z}$ of the imputed values. The fourth column gives the Bayes factor estimates using the true genotypes for the missing SNPs. In a real data analysis it is impossible to know the actual genotypes of the missing SNPs; in a simulation study, we can take advantage of the simulated data set and compare the influence of the different methods. We see that the estimates using $\bar{Z}$ are accurate when the Bayes factors are big. When the Bayes factors are very small, the differences among the estimates are due to random simulation error; as long as they are all very small, they have essentially the same effect when compared with big Bayes factors, and it hardly matters how small the actual values are.
Table 3-5 again gives Bayes factor estimates under different treatments of the missing values. The second column assumes there are no missing values in the data set; that is, I did not generate the missing data for this analysis. In the third and fourth columns, I assign the missing values either all 1 or all −1, to test the effect of extreme handling of the missing values. The upper part of the table uses samples from the Gibbs sampler where the missing SNPs were imputed one column per cycle, while the bottom part gives the estimates where all the missing SNPs were imputed in each cycle. It shows that these extreme treatments of the missing SNPs do not give proper estimates of the Bayes factors. Combining Table 3-4 and Table 3-5, we conclude that using $\bar{Z}$ gives proper estimates and yet provides fast calculation.
3.5 Simulation and Real Data Analysis
3.5.1 Simulation
In the above sections, we discussed the Bayes factor approximation, the stochastic variable search algorithm, the ergodic property of the search algorithm, and computational issues. In this section, we show some simulation results from applying these techniques.

In these simulation studies, we have 15 SNPs, and the values of the SNP effects are listed in Table 3-6. 10% missing SNPs were generated at random. γ is the parameter for the SNP effect: γ1 is for SNP1, γ2 is for SNP2, etc.
The rows of Table 3-6 represent the different models that the stochastic chain visited; any model visited less than 1% of the time is not listed. From the first column of Table 3-6, we can see that two rows have higher frequencies of visits than all the others. The stochastic chain spends 25% of its time on a model which captures all the significant variables, and about one half of its time on a model which captures all the significant variables plus SNP7, SNP8, and SNP9. The effects of SNP7, SNP8, and SNP9 are not as big as those of the other variables in the true model, but their effects are not insignificant according to the Bayes factor criterion. One thing worth mentioning is that the stochastic chain spent only about 4% of its time on the model which includes all the variables. This is a desirable property, since our goal is to discover simple subsets which represent the essential combinations of variables, rather than settling for complicated models such as the full model.
3.5.2 Real Data Analysis
After some simulation studies, we applied the methodology to the loblolly pine data. The response here is lesion length. We use the average of the imputed missing data to calculate the Bayes factor estimates. The first column of Table 3-7 gives the frequencies of visits the chain spent on different models. As the model sample space is huge, $2^{44}$, it is very likely that once the chain leaves a visited model, it never comes back. The table lists several models on which the chain spent more time than on the other, unlisted, models. Although none of the visit frequencies is substantially higher than the others, the analysis provides some subsets of variables to be investigated further. One thing to notice is that when we average the visits per SNP, several SNPs have more than 50% visit frequency, and these SNPs also appear often in the selected models. If we are interested in investigating the SNPs one by one, these SNPs are definitely good candidates.
Table 3-5. Bayes factor calculation comparisons. The first 40000 iterations are used as burn-in, and the following 400 samples are used for the calculation.

Sample one column of SNPs per iteration
δ     without missing SNP   all missing set to 1   all missing set to −1
δ1    8.5596e+006           8.3541e+006            5.9273e+006
δ2    0                     4.1587e+009            5.7762e+009
δ3    0                     8.0566e+007            8.3811e+007
δ4    0                     9.5483e+008            7.0391e+008
δ5    0                     9.2959e+008            1.073e+009
δ6    3.1286e−090           1.0311e+008            8.463e+007
δ7    1.0003e+008           1.2594e+006            9.3706e+005
δ8    4.1181e+005           46386                  30518
δ9    0                     2.0545e+012            1.7827e+012
δ10   2.4772e+005           20359                  13880
δ11   22953                 3.6538e+003            4.0996e+003
δ12   3.2352e+005           2.9033e+004            1.6675e+004
δ13   1.0803e+006           1.2841e+004            1.1535e+004
δ14   3.8579e+005           4.2515e+004            3.8305e+004
δ15   8331.8                3.1579e+002            1.6113e+002

Sample all SNPs together in one cycle
δ     without missing SNP   all missing set to 1   all missing set to −1
δ1    4.2922e+008           1.7333e+006            1.49e+006
δ2    0                     2.9384e+008            3.7949e+008
δ3    0                     4.2674e+006            5.7925e+006
δ4    0                     1.0262e+008            1.4015e+008
δ5    0                     5.8757e+006            1.2619e+007
δ6    2.5633e−102           1.483e+007             1.5062e+007
δ7    6.8181e+007           1.6452e+005            1.2833e+005
δ8    9.6396e+005           4.3144e+003            4.392e+003
δ9    0                     1.4913e+010            1.7881e+010
δ10   2.3864e+006           1.5517e+003            1.8624e+003
δ11   11910                 1.6863e+003            1.0575e+003
δ12   3.2788e+005           4.467e+004             5.9156e+003
δ13   1.8485e+005           2.1826e+004            1.6741e+004
δ14   2.6493e+005           11770                  6394.3
δ15   371.79                326.27                 396.23
Table 3-6. Simulation results of Bayesian variable selection for 15 SNPs and 450 observations, with 10% random missing values, using the Bayes factor estimation formula. Actual values of the γ: γ1 = 2, γ2 = 0, γ3 = 1, γ4 = −1, γ5 = −3, γ6 = 0, γ7 = 0.5, γ8 = 0.1, γ9 = 0.3, γ10 = 0, γ11 = −1, γ12 = 3, γ13 = 0, γ14 = 0, γ15 = −3.

posterior probability   model
0.0036 no variables0.0001 γ15
0.001 γ11, γ15
0.0001 γ7, γ11, γ15
0.0008 γ1, γ7, γ11, γ15
0.002 γ1, γ4, γ7, γ11, γ15
0.0001 γ1, γ4, γ5, γ7, γ11, γ15
0.0001 γ1, γ4, γ5, γ7, γ11, γ12, γ15
0.2533 γ1, γ3, γ4, γ5, γ7, γ11, γ12, γ15
0.0608 γ1, γ3, γ4, γ5, γ7, γ8, γ11, γ12, γ15
0.5496 γ1, γ3, γ4, γ5, γ7, γ8, γ9, γ11, γ12, γ15
0.0045 γ1, γ3, γ4, γ5, γ7, γ8, γ9, γ11, γ12, γ13, γ15
0.0506 γ1, γ3, γ4, γ5, γ6, γ7, γ8, γ9, γ11, γ12, γ13, γ15
0.0116 γ1, γ2, γ3, γ4, γ5, γ6, γ7, γ8, γ9, γ11, γ12, γ13, γ15
0.0428 γ1, γ2, γ3, γ4, γ5, γ6, γ7, γ8, γ9, γ11, γ12, γ13, γ15
0.0428 γ1, γ2, γ3, γ4, γ5, γ6, γ7, γ8, γ9, γ10, γ11, γ12, γ13, γ14, γ15
Table 3-7. Bayesian variable selection for the lesion length data set, using the average of the imputed missing SNPs as if observed. 20000 steps of burn-in and another 20000 steps as samples. Only models with posterior probability ≥ 0.29% are shown.

posterior probability   subset of variables
0.40%   γ9, γ10, γ12, γ14, γ15, γ16, γ22, γ30, γ31, γ40, γ43
0.39%   γ8, γ10, γ12, γ15, γ19, γ21, γ22, γ23, γ24, γ27, γ28, γ29, γ31, γ34, γ36, γ39, γ42
0.35%   γ3, γ9, γ12, γ13, γ14, γ15, γ21, γ27, γ31, γ34, γ38, γ40, γ42, γ43
0.32%   γ12, γ15, γ16, γ18, γ22, γ29, γ31, γ32, γ41, γ42, γ43
0.31%   γ1, γ9, γ13, γ18, γ19, γ22, γ23, γ27, γ28, γ31, γ34, γ36, γ38, γ39, γ42, γ43
0.30%   γ1, γ8, γ12, γ13, γ18, γ22, γ23, γ27, γ28, γ31, γ32, γ33, γ34, γ36, γ39, γ42
0.29%   γ1, γ8, γ12, γ13, γ18, γ21, γ22, γ23, γ27, γ28, γ31, γ32, γ33, γ34, γ36, γ39, γ42
0.29%   γ15, γ19, γ22, γ23, γ24, γ31, γ39, γ42

Variables selected more than 50% of the time on average over all models:
γ12, γ13, γ15, γ18, γ22, γ23, γ27, γ28, γ31, γ34, γ39, γ42
CHAPTER 4
SUMMARY AND FUTURE WORK
4.1 Summary
Studies of the association between candidate SNPs and phenotypic traits have been a rapidly developing research area. Most research has focused on human genetic association, taking advantage of the sequenced human genome. For genomes that are not fully sequenced, not much progress has been made, and suitable methods are still being developed. In this dissertation, we proposed methods to test SNP effects and to select the groups of SNPs which are jointly responsible for the phenotypic traits, while properly handling the missing values. This method improves on others in that it is a simultaneous solution: it handles missing values, provides unbiased estimation, and has the flexibility of multiple imputation.
In Chapter 2, we proposed a Bayesian hierarchical model for the data structure. We took advantage of the population structure by calculating a numerator relationship matrix, which quantifies the covariance between two individual loblolly pines in this population. For missing data, we imputed the possible values according to the posterior distribution of the missing values. That is, after conditioning on the observed values, we used the posterior distribution of the missing values instead of a single mean-value substitution or an average over multiple values.
One novel contribution is that we proposed a nonstandard Gibbs sampler procedure and proved that it preserves the target stationary distribution. By employing this procedure, the computation speed is dramatically increased, since the number of imputed columns in each updating cycle is decreased. The power of this procedure will be even more obvious when there are more observed SNPs in the data set, which is a desirable property since we will have many more SNPs toward the end of the ADEPT 2 project.
In Chapter 3, we proposed a Bayesian variable selection method to select good
subsets of variables (SNPs). The Bayes factor was used as the model comparison criterion,
and we employed a hybrid stochastic search algorithm to search for good subsets in the
model space. We made a novel contribution by proposing a consistent Bayes factor
approximation for models of different dimensions while properly handling the missing
values. We also proved that the proposed Metropolis-Hastings search algorithm with
approximated Bayes factors retains the ergodic property.
In terms of computation, we took advantage of the Bayes factor approximation
formula and the special Gibbs sampler updating procedure by applying the Sherman-
Morrison-Woodbury formula to our situation. The method decreases the cost of matrix
inversion 20-fold for a matrix of dimension 2000 × 2000. At the end of the dissertation,
we also proposed using the average of the imputed values, Z̄, in place of the imputed
values for the calculation of Bayes factors. We showed that this gives accurate estimates
while fundamentally scaling up the procedure.
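To make the Sherman-Morrison-Woodbury speedup concrete, the following is a minimal
NumPy sketch (not code from the dissertation) of the rank-one special case: updating
an already-computed inverse in O(n²) rather than re-inverting in O(n³). The matrix and
vectors here are illustrative, and `sherman_morrison_update` is a name of my choosing.

```python
import numpy as np

def sherman_morrison_update(A_inv, u, v):
    """Rank-one inverse update:
    (A + u v^T)^{-1} = A^{-1} - (A^{-1} u v^T A^{-1}) / (1 + v^T A^{-1} u)."""
    Au = A_inv @ u
    vA = v @ A_inv
    denom = 1.0 + v @ Au          # must be nonzero for A + u v^T to be invertible
    return A_inv - np.outer(Au, vA) / denom

n = 50
A = np.diag(np.linspace(1.0, 2.0, n))   # a simple well-conditioned matrix
u = 0.1 * np.ones(n)
v = 0.1 * np.ones(n)

A_inv = np.linalg.inv(A)
fast = sherman_morrison_update(A_inv, u, v)   # O(n^2) update of the old inverse
direct = np.linalg.inv(A + np.outer(u, v))    # O(n^3) re-inversion from scratch
print(np.allclose(fast, direct))              # True
```

The full Woodbury identity generalizes this to rank-k updates, which is the form relevant
when several imputed SNP columns change at once.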
4.2 Future Work
We tested the significance of the SNP effects in Chapter 2. As we had only 44
candidate SNPs in the data set and were hoping to discover some SNPs to investigate
further, we did not adjust for the multiple tests. If we wanted, we could apply a standard
adjustment, such as Bonferroni's. We anticipate that our method will be applied to
thousands of SNPs when such data sets become available, so one area for further
investigation is the adjustment for multiple tests; attention must also be paid to the
fact that these tests are not necessarily independent. The multiple-test adjustment
should balance power and the false discovery rate. One possible direction might be
permutation studies.
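The Bonferroni adjustment mentioned above can be sketched in a few lines: with m
simultaneous tests, each p-value is compared to α/m (equivalently, each p-value is
multiplied by m and capped at 1). The p-values below are illustrative only; the
dissertation's 44 SNP tests would supply m = 44.

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni multiple-testing adjustment.

    Returns the adjusted p-values (min(1, m * p)) and a reject/accept flag
    for each test at family-wise level alpha."""
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    rejected = [p <= alpha / m for p in p_values]
    return adjusted, rejected

p_vals = [0.0004, 0.03, 0.20]        # hypothetical per-SNP p-values
adj, rej = bonferroni(p_vals)
print(rej)   # [True, False, False]: only the first survives alpha/m = 0.05/3
print(adj)
```

Bonferroni controls the family-wise error rate even under dependence, which is why it is
conservative for correlated SNP tests; the permutation studies suggested above are one
way to recover power.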
In Chapter 3, we used a stochastic search algorithm to search for good subsets,
and we spent much time speeding up the computation. The computational bottleneck is the
Bayes factor estimates, especially the many inversions and determinant calculations that
involve the missing SNPs. We proposed replacing the imputed SNPs with the average of the
imputed SNPs and gave some justification for doing so. Other improvements are possible
here, and this would certainly be an interesting topic, as computation is always a
problem for variable selection.
Another future work topic is to investigate other prior specifications for our models
in Chapters 2 and 3. We used conjugate priors, which ensure that the conditionals are
standard distributions. We would like to see how the theoretical results and the real-data
analyses differ when different priors, such as g-priors or intrinsic priors, are applied.
APPENDIX A
ERGODICITY OF GIBBS SAMPLING WHEN UPDATING Z MATRIX BY COLUMNS
Suppose we are running a Gibbs sampler in which one cycle consists of updating
$(X, Y, Z)$, where the $Z_{n \times p}$ matrix can be decomposed as $(Z_1, Z_2, \ldots, Z_p)$.
Standard Gibbs sampling updates all of $Z_{n \times p}$ in one cycle; in contrast, we can
systematically update just one column of $Z$ in each cycle. By updating one column of $Z$
per cycle, the computation time can decrease dramatically, especially when the number of
columns of the $Z$ matrix is on the order of hundreds or thousands.
First let us write down our updating scheme. Suppose we have starting value
$(X^{(0)}, Y^{(0)}, Z_1^{(0)}, \ldots, Z_p^{(0)})$.

In the first cycle, first, conditional on $(Y^{(0)}, Z_1^{(0)}, \ldots, Z_p^{(0)})$,
update $X^{(0)}$ and get $(X^{(1)}, Y^{(0)}, Z_1^{(0)}, \ldots, Z_p^{(0)})$.

Then, conditional on $(X^{(1)}, Z_1^{(0)}, \ldots, Z_p^{(0)})$, update $Y^{(0)}$ and get
$(X^{(1)}, Y^{(1)}, Z_1^{(0)}, \ldots, Z_p^{(0)})$.

Finally, conditional on $(X^{(1)}, Y^{(1)}, Z_2^{(0)}, \ldots, Z_p^{(0)})$, update
$Z_1^{(0)}$ and get $(X^{(1)}, Y^{(1)}, Z_1^{(1)}, Z_2^{(0)}, \ldots, Z_p^{(0)})$.

The above is one cycle of the Gibbs sampler. Following the same fashion, we will update
$(X^{(1)}, Y^{(1)}, Z_2^{(0)})$ in the second cycle. Continuing in this way, after the
$p$th cycle we will get $(X^{(p)}, Y^{(p)}, Z_1^{(1)}, Z_2^{(1)}, \ldots, Z_p^{(1)})$.
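The cycle structure above can be sketched as a short scheduling loop: each cycle draws
$X$, then $Y$, then exactly one column of $Z$ in turn, so one full sweep over $Z$ takes
$p$ cycles. The `draw_*` functions below are stand-ins that only count updates, not the
dissertation's actual full conditionals.

```python
def run_column_gibbs(p, n_cycles, draw_X, draw_Y, draw_Z_col, state):
    """One cycle = draw X | Y, Z, then Y | X, Z, then a single column Z_j | X, Y, Z_{-j}."""
    for t in range(n_cycles):
        state["X"] = draw_X(state)            # X | Y, Z_1, ..., Z_p
        state["Y"] = draw_Y(state)            # Y | X, Z_1, ..., Z_p
        j = t % p                             # the one column refreshed this cycle
        state["Z"][j] = draw_Z_col(state, j)  # Z_j | X, Y, Z_{-j}
    return state

p = 5
updates = {"X": 0, "Y": 0, "Z": [0] * p}   # tallies of how often each piece is drawn

def draw_X(state):
    updates["X"] += 1
    return updates["X"]

def draw_Y(state):
    updates["Y"] += 1
    return updates["Y"]

def draw_Z_col(state, j):
    updates["Z"][j] += 1
    return updates["Z"][j]

state = {"X": 0, "Y": 0, "Z": [0] * p}
run_column_gibbs(p, n_cycles=2 * p, draw_X=draw_X, draw_Y=draw_Y,
                 draw_Z_col=draw_Z_col, state=state)
print(updates["X"], updates["Y"])   # 10 10: X and Y are refreshed every cycle
print(updates["Z"])                 # [2, 2, 2, 2, 2]: each column swept exactly twice
```

The appendix's proof is exactly the statement that this thinned-out schedule for $Z$
still leaves the target joint distribution stationary.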
To prove that the above updating scheme preserves ergodicity, we will first show
that the kernel of the Gibbs sampler has the target distribution as its stationary
distribution, and then show that from any initial condition the chain converges to the
target stationary distribution.
To prove that the Gibbs sampler has the stationary distribution, we need to show
that the following equation holds:
\begin{align*}
& f\left(X^{(p)}, Y^{(p)}, Z_1^{(1)}, \ldots, Z_p^{(1)}\right) \tag{A--1}\\
&= \int \cdots \int f\left(X^{(1)} \mid Y^{(0)}, Z_1^{(0)}, \ldots, Z_p^{(0)}\right) f\left(Y^{(1)} \mid X^{(1)}, Z_1^{(0)}, \ldots, Z_p^{(0)}\right)\\
&\quad \times f\left(Z_1^{(1)} \mid X^{(1)}, Y^{(1)}, Z_2^{(0)}, \ldots, Z_p^{(0)}\right) f\left(X^{(2)} \mid Y^{(1)}, Z_1^{(1)}, \ldots, Z_p^{(0)}\right)\\
&\quad \times \cdots \times f\left(Y^{(p)} \mid X^{(p)}, Z_1^{(1)}, \ldots, Z_{p-1}^{(1)}, Z_p^{(0)}\right) f\left(Z_p^{(1)} \mid X^{(p)}, Y^{(p)}, Z_1^{(1)}, \ldots, Z_{p-1}^{(1)}\right)\\
&\quad \times f\left(X^{(0)}, Y^{(0)}, Z_1^{(0)}, \ldots, Z_p^{(0)}\right)\, dX^{(0)}\, dY^{(0)}\, dZ_1^{(0)} \cdots dZ_p^{(0)}\, dX^{(1)}\, dY^{(1)} \cdots dX^{(p-1)}\, dY^{(p-1)}.
\end{align*}
First we integrate out $X^{(0)}$ and $Y^{(0)}$; the right-hand side of (A--1) becomes:
\begin{align*}
& f\left(X^{(p)}, Y^{(p)}, Z_1^{(1)}, \ldots, Z_p^{(1)}\right) \tag{A--2}\\
&= \int \cdots \int f\left(X^{(1)}, Z_1^{(0)}, \ldots, Z_p^{(0)}\right) f\left(Y^{(1)} \mid X^{(1)}, Z_1^{(0)}, \ldots, Z_p^{(0)}\right)\\
&\quad \times f\left(Z_1^{(1)} \mid X^{(1)}, Y^{(1)}, Z_2^{(0)}, \ldots, Z_p^{(0)}\right) f\left(X^{(2)} \mid Y^{(1)}, Z_1^{(1)}, \ldots, Z_p^{(0)}\right)\\
&\quad \times \cdots \times f\left(Y^{(p)} \mid X^{(p)}, Z_1^{(1)}, \ldots, Z_{p-1}^{(1)}, Z_p^{(0)}\right) f\left(Z_p^{(1)} \mid X^{(p)}, Y^{(p)}, Z_1^{(1)}, \ldots, Z_{p-1}^{(1)}\right)\\
&\quad dZ_1^{(0)} \cdots dZ_p^{(0)}\, dX^{(1)}\, dY^{(1)} \cdots dX^{(p-1)}\, dY^{(p-1)}.
\end{align*}
136
Next we integrate out $Z_1^{(0)}$ and we have:
\begin{align*}
& f\left(X^{(p)}, Y^{(p)}, Z_1^{(1)}, \ldots, Z_p^{(1)}\right) \tag{A--3}\\
&= \int \cdots \int f\left(X^{(1)}, Y^{(1)}, Z_2^{(0)}, \ldots, Z_p^{(0)}\right) f\left(Z_1^{(1)} \mid X^{(1)}, Y^{(1)}, Z_2^{(0)}, \ldots, Z_p^{(0)}\right)\\
&\quad \times f\left(X^{(2)} \mid Y^{(1)}, Z_1^{(1)}, \ldots, Z_p^{(0)}\right) \cdots f\left(Y^{(p)} \mid X^{(p)}, Z_1^{(1)}, \ldots, Z_{p-1}^{(1)}, Z_p^{(0)}\right)\\
&\quad \times f\left(Z_p^{(1)} \mid X^{(p)}, Y^{(p)}, Z_1^{(1)}, \ldots, Z_{p-1}^{(1)}\right)\, dZ_2^{(0)} \cdots dZ_p^{(0)}\, dX^{(1)}\, dY^{(1)} \cdots dX^{(p-1)}\, dY^{(p-1)}.
\end{align*}
Continuing by integrating out $X^{(1)}$ and $Y^{(1)}$, we have:
\begin{align*}
& f\left(X^{(p)}, Y^{(p)}, Z_1^{(1)}, \ldots, Z_p^{(1)}\right) \tag{A--4}\\
&= \int \cdots \int f\left(X^{(2)}, Z_1^{(1)}, Z_2^{(0)}, \ldots, Z_p^{(0)}\right) \cdots f\left(Y^{(p)} \mid X^{(p)}, Z_1^{(1)}, \ldots, Z_{p-1}^{(1)}, Z_p^{(0)}\right)\\
&\quad \times f\left(Z_p^{(1)} \mid X^{(p)}, Y^{(p)}, Z_1^{(1)}, \ldots, Z_{p-1}^{(1)}\right)\, dZ_2^{(0)} \cdots dZ_p^{(0)}\, dX^{(2)}\, dY^{(2)} \cdots dX^{(p-1)}\, dY^{(p-1)}.
\end{align*}
Performing similar integrations, we further get:
\begin{align*}
& f\left(X^{(p)}, Y^{(p)}, Z_1^{(1)}, \ldots, Z_p^{(1)}\right) \tag{A--5}\\
&= \int\!\!\int\!\!\int f\left(X^{(p-1)}, Y^{(p-1)}, Z_1^{(1)}, \ldots, Z_{p-1}^{(1)}, Z_p^{(0)}\right) f\left(X^{(p)} \mid Y^{(p-1)}, Z_1^{(1)}, \ldots, Z_{p-1}^{(1)}, Z_p^{(0)}\right)\\
&\quad \times f\left(Y^{(p)} \mid X^{(p)}, Z_1^{(1)}, \ldots, Z_{p-1}^{(1)}, Z_p^{(0)}\right) f\left(Z_p^{(1)} \mid X^{(p)}, Y^{(p)}, Z_1^{(1)}, \ldots, Z_{p-1}^{(1)}\right)\\
&\quad dX^{(p-1)}\, dY^{(p-1)}\, dZ_p^{(0)}.
\end{align*}
Now performing the final integrations, we have
\[
f\left(X^{(p)}, Y^{(p)}, Z_1^{(1)}, \ldots, Z_p^{(1)}\right) = f\left(X^{(p)}, Y^{(p)}, Z_1^{(1)}, Z_2^{(1)}, \ldots, Z_p^{(1)}\right).
\]
The above proof shows that the chain
\[
\left(X^{(pt)}, Y^{(pt)}, Z_1^{(t)}, Z_2^{(t)}, \ldots, Z_p^{(t)}\right), \quad t = 1, 2, 3, \ldots
\]
has the target distribution as its stationary distribution.
Above, we showed that the Gibbs sampler chain updating one column per cycle still
preserves the target stationary distribution. To prove that the chain is ergodic, we
can cite Theorem 10.10 in Robert and Casella (2004). The theorem states:

If the transition kernel of the Gibbs chain $\left(X^{(pt)}, Y^{(pt)}, Z_1^{(t)}, \ldots, Z_p^{(t)}\right)$
is absolutely continuous with respect to the measure $\mu$, and in addition the chain is
aperiodic, then for every initial distribution the Gibbs chain converges to the target
stationary distribution.

Obviously, in our situation the dominating measure $\mu$ is the product of Lebesgue
measure with counting measures, and all the conditional distributions are absolutely
continuous with respect to it. All our conditionals are aperiodic. So the chain
$\left(X^{(pt)}, Y^{(pt)}, Z_1^{(t)}, \ldots, Z_p^{(t)}\right)$ is ergodic.
Now consider the marginal distribution of the subvector $(X^{(t)}, Y^{(t)})$. We will
show that this marginal distribution satisfies the stationary condition; that is, we want
to show that the following equation holds:
\begin{align*}
f\left(X^{(1)}, Y^{(1)}\right) &= \int \cdots \int f\left(X^{(0)}, Y^{(0)}, Z_1^{(0)}, \ldots, Z_p^{(0)}\right) f\left(X^{(1)} \mid Y^{(0)}, Z_1^{(0)}, \ldots, Z_p^{(0)}\right) \tag{A--6}\\
&\quad \times f\left(Y^{(1)} \mid X^{(1)}, Z_1^{(0)}, \ldots, Z_p^{(0)}\right) f\left(Z_1^{(1)} \mid X^{(1)}, Y^{(1)}, Z_2^{(0)}, \ldots, Z_p^{(0)}\right)\\
&\quad dZ_1^{(0)} \cdots dZ_p^{(0)}\, dZ_1^{(1)}\, dX^{(0)}\, dY^{(0)}.
\end{align*}
To show this, we calculate the right-hand side of (A--6) and verify that it equals the
left-hand side. The calculations are as follows:
\begin{align*}
& \int \cdots \int f\left(X^{(0)}, Y^{(0)}, Z_1^{(0)}, \ldots, Z_p^{(0)}\right) f\left(X^{(1)} \mid Y^{(0)}, Z_1^{(0)}, \ldots, Z_p^{(0)}\right) \tag{A--7}\\
&\qquad \times f\left(Y^{(1)} \mid X^{(1)}, Z_1^{(0)}, \ldots, Z_p^{(0)}\right) f\left(Z_1^{(1)} \mid X^{(1)}, Y^{(1)}, Z_2^{(0)}, \ldots, Z_p^{(0)}\right)\\
&\qquad dZ_1^{(0)} \cdots dZ_p^{(0)}\, dZ_1^{(1)}\, dX^{(0)}\, dY^{(0)}\\
&= \int \cdots \int f\left(Y^{(0)}, Z_1^{(0)}, \ldots, Z_p^{(0)}\right) f\left(X^{(1)} \mid Y^{(0)}, Z_1^{(0)}, \ldots, Z_p^{(0)}\right)\\
&\qquad \times f\left(Y^{(1)} \mid X^{(1)}, Z_1^{(0)}, \ldots, Z_p^{(0)}\right) f\left(Z_1^{(1)} \mid X^{(1)}, Y^{(1)}, Z_2^{(0)}, \ldots, Z_p^{(0)}\right)\\
&\qquad dZ_1^{(0)} \cdots dZ_p^{(0)}\, dZ_1^{(1)}\, dY^{(0)}\\
&= \int \cdots \int f\left(X^{(1)}, Z_1^{(0)}, \ldots, Z_p^{(0)}\right) f\left(Y^{(1)} \mid X^{(1)}, Z_1^{(0)}, \ldots, Z_p^{(0)}\right)\\
&\qquad \times f\left(Z_1^{(1)} \mid X^{(1)}, Y^{(1)}, Z_2^{(0)}, \ldots, Z_p^{(0)}\right)\, dZ_1^{(0)} \cdots dZ_p^{(0)}\, dZ_1^{(1)}\\
&= \int \cdots \int f\left(X^{(1)}, Y^{(1)}, Z_2^{(0)}, \ldots, Z_p^{(0)}\right) f\left(Z_1^{(1)} \mid X^{(1)}, Y^{(1)}, Z_2^{(0)}, \ldots, Z_p^{(0)}\right)\\
&\qquad dZ_2^{(0)} \cdots dZ_p^{(0)}\, dZ_1^{(1)}\\
&= f\left(X^{(1)}, Y^{(1)}\right).
\end{align*}
The above derivation shows that the chain
\[
(X^{(t)}, Y^{(t)}), \quad t = 1, 2, \ldots
\]
has the marginal distribution $f(X, Y)$ as its stationary distribution. So if we are
interested in estimates of $X$ and $Y$ only, it is legitimate to use the samples
\[
(X^{(t)}, Y^{(t)}), \quad t = 1, 2, \ldots,
\]
instead of
\[
(X^{(pt)}, Y^{(pt)}), \quad t = 1, 2, \ldots
\]
In all, compared with the standard Gibbs updating algorithm, we are able to update the
missing SNPs, $Z$, less frequently and still obtain the desired properties.
APPENDIX B
AN ALGORITHM FOR CALCULATING THE NUMERATOR RELATIONSHIP MATRIX R
The calculation algorithm is due to Henderson (1976) and Quaas (1976). The
individuals within the 61 families and the parents of the 61 families are ordered
together such that the first $1, \ldots, a$ subjects are unrelated and are used as a
"base" population. Let the total number of subjects within families and parents of the
61 families be $n$; we then obtain a numerator relationship matrix of dimension
$n \times n$. As the first $a$ subjects (part of the parents of the 61 families) are
unrelated, the upper-left $a \times a$ submatrix of the numerator relationship matrix is
the identity matrix $I$. This identity submatrix is expanded iteratively until the matrix
reaches dimension $n \times n$.

Since the sub-numerator relationship matrix for the first $a$ unrelated subjects is
the identity, we next give the details of how to calculate the remaining cells of the
numerator relationship matrix for the related subjects. Consider the $j$th and the $i$th
subjects from the ordered list above.
1. If both parents of the $j$th individual are known, say $g$ and $h$, then
\[
R_{ji} = R_{ij} = 0.5(R_{ig} + R_{ih}), \quad i = 1, \ldots, j - 1; \qquad
R_{jj} = 1 + 0.5 R_{gh},
\]
where $R_{ji}$ is the cell of the numerator relationship matrix in the $j$th row and
$i$th column.

2. If only one parent of the $j$th subject is known, say $g$, then
\[
R_{ji} = R_{ij} = 0.5 R_{ig}, \quad i = 1, \ldots, j - 1; \qquad
R_{jj} = 1.
\]

3. If neither parent of the $j$th subject is known, then
\[
R_{ji} = R_{ij} = 0, \quad i = 1, \ldots, j - 1; \qquad
R_{jj} = 1.
\]
For the loblolly pine data, we have 44 pines acting as grandparents, and they produce
61 pine families. The 61 families contain 888 individual pine trees altogether, also
called clones. The phenotypic responses are taken from the individual clones, so our
interest is in calculating the relationship matrix for the 888 clones, which has
dimension 888 × 888. Following Henderson's method, we ordered the 44 grandparent pines
and the 888 individual pines together such that the first $a$ pines are unrelated.
Starting from the $(a + 1)$th pine, we applied the above iterative algorithm and in the
end obtained a relationship matrix of dimension 932 × 932 for all the grandparent pines
and all individual clones. We then took the bottom-right 888 × 888 submatrix of this
matrix, and it is the numerator relationship matrix we used in the loblolly pine data
analysis.
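The three-case recursion above translates directly into code. The following is a minimal
sketch of Henderson's tabular method, assuming individuals are ordered so that parents
precede offspring; `pedigree[j]` holds the parent indices of individual j, with `None`
marking an unknown parent. The tiny pedigree below is hypothetical, not the loblolly
pine data, and the function name is my own.

```python
def numerator_relationship_matrix(pedigree):
    """Build the numerator relationship matrix R by Henderson's recursion.

    pedigree: list of (parent_g, parent_h) index pairs, None = unknown,
    ordered so every parent index is smaller than its offspring's index."""
    n = len(pedigree)
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        g, h = pedigree[j]
        if g is not None and h is not None:        # case 1: both parents known
            for i in range(j):
                R[j][i] = R[i][j] = 0.5 * (R[i][g] + R[i][h])
            R[j][j] = 1.0 + 0.5 * R[g][h]
        elif g is not None or h is not None:       # case 2: one parent known
            k = g if g is not None else h
            for i in range(j):
                R[j][i] = R[i][j] = 0.5 * R[i][k]
            R[j][j] = 1.0
        else:                                      # case 3: founder, no parents known
            R[j][j] = 1.0
    return R

# Two unrelated founders (0, 1) and their two full-sib offspring (2, 3):
ped = [(None, None), (None, None), (0, 1), (0, 1)]
R = numerator_relationship_matrix(ped)
print(R[2][0])  # 0.5  parent-offspring relationship
print(R[2][3])  # 0.5  full sibs
print(R[2][2])  # 1.0  noninbred diagonal
```

The 932 × 932 matrix for the pine data would be built the same way, with the 44
grandparents as the base population followed by the 888 clones.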
REFERENCES
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov and F. Csaki (Eds.), Second International Symposium on Information Theory, pp. 267–281. Budapest: Akademiai Kiado.
Allison, P. D. (2002). Missing Data. Thousand Oaks, CA: Sage.
Balding, D. (2006). A tutorial on statistical methods for population association studies. Nature Reviews Genetics 7, 781.
Barnard, J. and D. B. Rubin (1999). Small-sample degrees of freedom with multiple imputation. Biometrika, 949–955.
Bartlett, M. S. (1951). An inverse matrix adjustment arising in discriminant analysis. Annals of Mathematical Statistics 22, 107–111.
Berger, J. and L. Pericchi (2001). Objective Bayesian methods for model selection: introduction and comparison (with discussion). Model Selection, 135–207.
Boyles, A. L., W. Scott, E. Martin, S. Schmidt, Y. J. Li, A. Ashley-Koch, M. P. Bass, M. Schmidt, M. A. Pericak-Vance, M. C. Speer, and E. R. Hauser (2005). Linkage disequilibrium inflates type I error rates in multipoint linkage analysis when parental genotypes are missing. Human Heredity 59, 220–227.
Brown, P. J., M. Vannucci, and T. Fearn (1998). Multivariate Bayesian variable selection and prediction. Journal of the Royal Statistical Society B, 627–641.
Carlin, B. P. and S. Chib (1995). Bayesian model choice via Markov chain Monte Carlo methods. Journal of the Royal Statistical Society B, 473–484.
Casella, G., F. J. Giron, M. L. Martinez, and E. Moreno. Consistency of Bayesian procedures for variable selection. The Annals of Statistics (to appear).
Casella, G. and E. Moreno (2006). Objective Bayesian variable selection. Journal of the American Statistical Association, 157–167.
Chen, W. M. and G. R. Abecasis (2007). Family-based association tests for genomewide association scans. The American Journal of Human Genetics 81, 913–926.
Chib, S. (1995). Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 1313–1321.
Cui, W. and E. I. George (2008). Empirical Bayes versus fully Bayes variable selection. Journal of Statistical Planning and Inference, 888–900.
Dai, J. Y., I. Ruczinski, M. LeBlanc, and C. Kooperberg (2006). Imputation methods to improve inference in SNP association studies. Genetic Epidemiology 30, 690–702.
Dellaportas, P., J. J. Forster, and I. Ntzoufras (2002). Bayesian model choice via Markov chain Monte Carlo methods. Statistics and Computing, 27–36.
Dempster, A. P., N. Laird, and D. B. Rubin (1977). Maximum likelihood estimation from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society Series B 39, 1–38.
Efron, B. (1994). Missing data, imputation, and the bootstrap. Journal of the American Statistical Association, 463–474.
George, E. I. and D. Foster (1994). The risk inflation criterion for multiple regression. Annals of Statistics, 1947–1975.
George, E. I. and R. E. McCulloch (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association, 881–889.
George, E. I. and R. E. McCulloch (1995). Stochastic search variable selection. Markov Chain Monte Carlo in Practice, 203–214.
George, E. I. and R. E. McCulloch (1997). Approaches to Bayesian variable selection. Statistica Sinica, 339–379.
Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 711–732.
Hager, W. W. (1989). Updating the inverse of a matrix. SIAM Review 31, 221–239.
Henderson, C. R. (1976). A simple method for computing the inverse of a numerator relationship matrix used in prediction of breeding values. Biometrics 32, 69–83.
Hobert, J. and G. Casella (1996). The effect of improper priors on Gibbs sampling in hierarchical linear mixed models. Journal of the American Statistical Association 91, 1461–1473.
Kayihan, G. C., D. A. Huber, A. M. Morse, T. L. White, and J. M. Davis (2005). Genetic dissection of fusiform rust and pitch canker disease traits in loblolly pine. Theoretical and Applied Genetics 110, 948–958.
Li, K. H., T. E. Raghunathan, and D. B. Rubin (1991). Large-sample significance levels from multiply imputed data using moment-based statistics and an F reference distribution. Journal of the American Statistical Association, 1065–1073.
Little, R. J. A. and D. B. Rubin (1987). Statistical Analysis with Missing Data. New York: Wiley & Sons.
Mallows, C. (1973). Some comments on Cp. Technometrics 15, 661–675.
Marchini, J., B. Howie, S. Myers, G. McVean, and P. Donnelly (2007). A new multipoint method for genome-wide association studies by imputation of genotypes. Nature Genetics 39, doi:10.1038/ng2088.
Martin, E. R., M. P. Bass, E. Hauser, and N. L. Kaplan (2003). Accounting for linkage in family-based tests of association with missing parental genotypes. American Journal of Human Genetics 73, 1016–1026.
McKeever, D. B. and J. L. Howard (1996). Value of timber and agricultural products in the United States, 1991. Forest Products Journal 46, 45–50.
Meng, X. L. and D. B. Rubin (1992). Performing likelihood ratio tests with multiply-imputed data sets. Biometrika, 103–111.
Meng, X. L. and W. H. Wong (1996). Simulating ratios of normalizing constants via a simple identity: a theoretical exploration. Statistica Sinica 6, 831–860.
Meyn, S. P. and R. Tweedie (2008). Markov Chains and Stochastic Stability. New York: Springer-Verlag.
Miller, K. S. (1981). On the inverse of the sum of matrices. Mathematics Magazine 54, 67–72.
Mitchell, T. J. and J. J. Beauchamp (1988). Bayesian variable selection in linear regression. Journal of the American Statistical Association, 1023–1032.
Quaas, R. L. (1976). Computing the diagonal elements and inverse of a large numerator relationship matrix. Biometrics 32, 949–953.
Reilly, M. (1993). Data analysis using hot deck multiple imputation. The Statistician, 307–313.
Robert, C. P. and G. Casella (2004). Monte Carlo Statistical Methods. New York: Springer.
Roberts, A., L. McMillan, W. Wang, J. Parker, I. Rusyn, and D. Threadgill (2007). Inferring missing genotypes in large SNP panels using fast nearest-neighbor searches over sliding windows. Bioinformatics 23, i401–i407.
Rubin, D. B. (1978). Multiple imputations in sample surveys: a phenomenological Bayesian approach to nonresponse. Journal of the American Statistical Association, 20–34.
Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: Wiley & Sons.
Sadighi, I. and P. K. Kalra (1988). Approaches for updating the matrix inverse for control system problems with special reference to row or column perturbation. Electric Power Systems Research 14, 137–147.
Schafer, J. L. (1997). Analysis of Incomplete Multivariate Data. London: Chapman & Hall.
Scheet, P. and M. Stephens (2006). A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. The American Journal of Human Genetics 78, 629–644.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics 6, 461–464.
Servin, B. and M. Stephens (2007). Imputation-based analysis of association studies: candidate regions and quantitative traits. PLoS Genetics 7, e114.
Smith, A. M. F. and D. J. Spiegelhalter (1980). Bayes factors and choice criteria for linear models. Journal of the Royal Statistical Society, Series B, 213–220.
Stephens, M., N. J. Smith, and P. Donnelly (2001). A new statistical method for haplotype reconstruction from population data. The American Journal of Human Genetics 68, 978–989.
Sun, Y. V. and S. L. Kardia (2008). Imputing missing genotypic data of single-nucleotide polymorphisms using neural networks. European Journal of Human Genetics 16, 487–495.
Tanner, M. A. and W. H. Wong (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 528–540.
Wear, D. N. and J. G. Greis (2002). Southern forest resource assessment: summary of findings. Journal of Forestry 100, 6–14.
Weinberg, C. R. (1999). Allowing for missing parents in genetic studies of case-parent triads. American Journal of Human Genetics 64, 1186–1193.
Woodbury, M. (1950). Inverting modified matrices. Memorandum Rept. 42, Statistical Research Group, Princeton University.
Wu, C. F. J. (1983). On the convergence properties of the EM algorithm. The Annals of Statistics, 95–103.
Xie, F. and M. Paik (1997). Multiple imputation methods for the missing covariates in generalized estimating equation. Biometrics, 1538–1546.
Yu, J. M., G. Pressoir, W. H. Briggs, I. V. Bi, M. Yamasaki, J. Doebley, M. D. McMullen, B. S. Gaut, D. M. Nielsen, J. B. Holland, S. Kresovich, and E. S. Buckler (2005). A unified mixed model method for association mapping that accounts for multiple levels of relatedness. Nature Genetics 38, e114.
BIOGRAPHICAL SKETCH
Zhen Li was born in Jiangsu, China. She was the second of two daughters of Ping Li
and Xuemei Tang. She received a bachelor's degree in mechanical engineering from Nanjing
University of Aeronautics and Astronautics in 2001. After that, she enrolled in a
master's program at Shanghai Jiaotong University. In 2004, she was admitted to the
Department of Statistics at the University of Florida. She received a master's degree in
statistics in 2006 and expects to receive a Ph.D. in statistics in 2008.