Z. Gompert and C. A. Buerkle 1
A hierarchical Bayesian model for next-generation population genomics
Zachariah Gompert and C. Alex Buerkle
Department of Botany and Program in Ecology, University of Wyoming, Laramie,
Wyoming 82071
Genetics: Published Articles Ahead of Print, published on January 12, 2011 as 10.1534/genetics.110.124693
Copyright 2011.
Running title: Bayesian population genomics
Keywords: Population genomics, selection, outlier analysis, Bayesian modeling, next-generation sequencing
Corresponding author: Zachariah Gompert
Department of Botany, 3165
1000 E. University Ave.
University of Wyoming
Laramie, WY 82071
Phone: (307) 760-3858
Fax: (307) 766-2851
Email: [email protected]
ABSTRACT
The demography of populations and natural selection shape genetic variation across the
genome, and understanding the genomic consequences of these evolutionary processes is a
fundamental aim of population genetics. We have developed a hierarchical Bayesian model
to quantify genome-wide population structure and identify candidate genetic regions affected
by selection. This model improves on existing methods by accounting for stochastic sampling
of sequences inherent in next-generation sequencing (with pooled or indexed individual
samples) and by incorporating genetic distances among haplotypes in measures of genetic
differentiation. Using simulations we demonstrate that this model has a low false positive
rate for classifying neutral genetic regions as selected genes (i.e., φST outliers), but can
detect recent selective sweeps, particularly when genetic regions in multiple populations are
affected by selection. Nonetheless, selection affecting just a single population was difficult
to detect and resulted in a high false negative rate under certain conditions. We applied the
Bayesian model to two large sets of human population genetic data. We found evidence of
widespread positive and balancing selection among worldwide human populations, including
many genetic regions previously thought to be under selection. Additionally, we identified
novel candidate genes for selection, several of which have been linked to human diseases.
This model will facilitate the population genetic analysis of a wide range of organisms based
on next-generation sequence data.
The distribution of genetic variants among populations is a fundamental attribute of evolutionary
lineages. Population genetic diversity shapes contemporary functional diversity and
future evolutionary dynamics, and provides a record of past evolutionary and demographic
processes. Methods to quantify genetic diversity among populations have a long history
(Wright 1951; Holsinger and Weir 2009) and provide a basis to distinguish neutral
and adaptive evolutionary histories, to reconstruct migration histories, and to identify genes
underlying diseases and other significant traits (Bamshad and Wooding 2003; Tishkoff
and Verrelli 2003; Voight et al. 2006; Barreiro et al. 2008; Lohmueller et al. 2008;
Novembre et al. 2008; Tishkoff et al. 2009; Hohenlohe et al. 2010). For example,
population genetic analyses in humans have resolved a history of natural selection and independent
origins of lactase persistence in adults in Europe and east Africa (Tishkoff et al.
2007). Similarly, allelic diversity at the Duffy blood group locus is consistent with the action
of natural selection, and one allele that has gone to fixation in sub-Saharan populations
confers resistance to malaria (Hamblin and Di Rienzo 2000; Hamblin et al. 2002). These
studies of natural selection, and many others involving a diversity of organisms, utilize contrasts
between genetic differentiation at putatively selected loci and the remainder of the
genome. Genomic diversity within and among populations is determined primarily by mutation
and neutral demographic factors, such as effective population size and rates of migration
among populations (Wright 1951; Slatkin 1987). Specifically, these demographic factors
determine the rates of genetic drift and population differentiation across the genome. In
contrast, selection affects variation in specific regions of the genome, including the direct
targets of selection and to a lesser extent genetic regions in linkage disequilibrium with these
targets (Maynard-Smith and Haigh 1974; Slatkin and Wiehe 1998; Gillespie 2000;
Stajich and Hahn 2005). Thus, the genomic consequences of selection are superimposed
on the genomic outcomes of neutral genetic differentiation and the two must be disentangled
to identify regions of the genome affected by selection.
A variety of models and methods have been proposed to identify genetic regions that have
been affected by selection (Nielsen 2005), and these population genetic methods can be
divided into within-population and among-population analyses. The former include widely
used neutrality tests based on the site frequency spectrum for a single locus, such as Tajima’s
D test (Tajima 1989), as well as recently developed tests based on the presence of extended
blocks of linkage disequilibrium or reduced haplotype diversity (Andolfatto et al. 1999;
Sabeti et al. 2002; Voight et al. 2006). Among-population tests for selection require
multi-locus data sets and identify non-neutral or outlier loci by contrasting patterns of population
divergence among genetic regions (Lewontin and Krakauer 1973; Beaumont
and Nichols 1996; Akey et al. 2002; Beaumont and Balding 2004; Foll and Gaggiotti
2008; Guo et al. 2009; Chen et al. 2010). The most commonly employed of these
methods is the FST outlier analysis developed by Beaumont and Nichols (1996). This
test contrasts FST for individual loci with an expected null distribution of FST based on a
neutral, infinite island, coalescent model (Beaumont and Nichols 1996). Loci with very
high levels of among-population differentiation (i.e., high FST) are considered candidates
for positive or divergent selection, whereas loci with exceptionally low FST are regarded as
candidates for balancing selection. However, many FST outlier analyses can be biased by
violations of the assumed demographic history (Flint et al. 1999; Excoffier et al. 2009).
Alternative Bayesian approaches to obtain a null distribution of population genetic differentiation
assume that FST for individual loci represent independent draws from a common,
underlying distribution that characterizes the genome and that can be estimated directly
from multi-locus data (Beaumont and Balding 2004; Foll and Gaggiotti 2008; Guo
et al. 2009). These approaches are more robust to different demographic histories, but might
have reduced power to detect selection relative to methods that model the appropriate demographic
history when it is known (Guo et al. 2009).
Herein we propose a hierarchical Bayesian model for estimating genomic population differentiation
and detecting selection. This method improves on current methods for detecting
selection in several important ways. First, unlike previous Bayesian outlier detection models
(e.g., Beaumont and Balding 2004; Foll and Gaggiotti 2008; Guo et al. 2009), the
likelihood component of the model captures the stochastic sampling processes inherent in
next-generation sequence (NGS) data (Mardis 2008). Specifically, NGS technologies result
in uneven coverage among individuals, genetic regions and homologous gene copies, including
missing data for many individuals and loci, and thus, increased uncertainty in the diploid
sequence of individuals relative to traditional Sanger sequencing (Lynch 2009; Gompert
et al. 2010; Hohenlohe et al. 2010). This issue is most pronounced when population-level
indexing is used (e.g., Gompert et al. 2010), but should persist even with individual-level
indexing and high coverage data (i.e., even with high mean coverage the sequence coverage
for certain combinations of individuals and genetic regions will be low). Moreover, appropriately
modeling and accounting for this uncertainty is important and vastly preferable to
simply ignoring it or discarding large amounts of sequence data (e.g., Hohenlohe et al.
2010). Second, previous methods to detect selection based on genetic differentiation among
populations have focused solely on allele frequency differences, as captured by FST (Nielsen 2005).
Instead, the model we propose measures population genetic differentiation using Excoffier et
al.’s φ statistics, which are DNA sequence-based measures of the proportion of molecular
variation partitioned among groups or populations (Excoffier et al. 1992). Quantifying
genetic differentiation using φ statistics allows us to include the genetic distances among
sequences in the measure of differentiation, and thus take mutation rate into account, which
should increase our ability to accurately identify targets of selection (Kronholm et al.
2010). Finally, the model we propose provides a novel criterion for designating outlier loci.
Although we do not believe this criterion is inherently superior to alternatives (e.g., Beaumont
and Nichols 1996; Foll and Gaggiotti 2008; Guo et al. 2009), we believe it
accords well with the concept of statistical outliers and is well suited for genome scans for
divergent selection. Specifically, the model assumes that the φ statistics for each genetic
region are drawn from a common, genome-level distribution, and identifies outlier loci, or
loci potentially affected by selection, based on the probability of their locus-specific φ statistic
given the genome-level probability distribution of φ. This genome-level distribution is
equivalent to the conditional probability for the locus-specific φ statistics in the model.
We begin this paper by fully describing the model, both verbally and mathematically.
We then use coalescent simulations to generate data with a known history and investigate
the performance of the model for identifying genetic regions affected by selection. To illustrate
its capacity to identify exceptional genes that are likely to have experienced selection,
we use the proposed model to analyze empirical data from two studies of human population
genetic variation. The first of these data sets includes 316 completely sequenced genes
from 24 individuals with African ancestry and 23 individuals with European ancestry
(SeattleSNPs data set). The second data set was published by Jakobsson et al. (2008) and
includes genotype data from 525,901 SNPs from 33 human populations distributed worldwide.
We analyzed these human data with the known haplotypes model (see Bayesian models
for molecular variance) and found evidence of selection affecting a total of 569 genes, including
genes previously identified as targets of selection in humans (e.g., CYP3A5, SLC24A5,
and PKDREJ) and novel genes not previously thought to be under selection (e.g., FOXA2
and SPATA5L1). Whereas these remarkably large-scale studies do not involve NGS data,
the analyses of empirical data identify a large set of genes for additional study and illustrate
our analytical approach, which can be applied to a diversity of large-scale genomic data
sets, including those that will result from the anticipated widespread adoption of NGS for
population genomics.
METHODS
Bayesian models for molecular variance
Our goal is to use molecular divergence among populations and groups to identify genetic
regions that might have been affected by natural selection. Patterns of divergence at individual
loci arise from differences in haplotype frequencies and the genetic distance among
haplotypes. We assume the distances are fixed and known and we attempt to estimate the
haplotype frequencies from sequence data using a hierarchical Bayesian model. The model
includes a first-level likelihood for the probability of the observed haplotype counts given
population haplotype frequencies, a conditional prior for the haplotype frequencies for each
locus in each population given genome-level parameters and the genetic distance matrix, and
uninformative priors for the genome-level parameters. This conditional prior defines the
distribution of φ statistics across the genome, whereas locus-specific φ statistics are derived
from our estimate of haplotype frequencies and the genetic distance among haplotypes. From
this model we are able to estimate the probability of each locus-specific φ statistic given the
estimated genome-level distribution, which serves as a metric for identifying outlier loci that
depart from the typical level of differentiation and therefore might have been affected by
selection.
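As a rough illustration of how a locus-specific φST combines haplotype composition with genetic distances, the sketch below follows the standard analysis of molecular variance decomposition of Excoffier et al. (1992) for a single locus. It is not the bamova implementation: it works from sampled haplotype counts rather than the estimated frequencies used in the model, it assumes at least two populations and some molecular variation, and the function name and input layout are hypothetical.

```python
def phi_st(counts, dist):
    """AMOVA-style phi_ST for one locus (illustrative sketch).

    counts[j][k]: number of copies of haplotype k sampled in population j.
    dist[k][m]:   distance between haplotypes k and m, e.g. the number of
                  sites at which they differ (treated as a squared distance).
    """
    n_pops = len(counts)
    n_haps = len(dist)
    pop_sizes = [sum(c) for c in counts]
    total = sum(pop_sizes)

    def ssd(c, n):
        # Sum of squared deviations: distances summed over all ordered
        # pairs of gene copies, divided by twice the sample size.
        s = 0.0
        for k in range(n_haps):
            for m in range(n_haps):
                s += c[k] * c[m] * dist[k][m]
        return s / (2.0 * n)

    pooled = [sum(c[k] for c in counts) for k in range(n_haps)]
    ssd_total = ssd(pooled, total)
    ssd_within = sum(ssd(c, n) for c, n in zip(counts, pop_sizes))
    ssd_among = ssd_total - ssd_within

    msd_within = ssd_within / (total - n_pops)
    msd_among = ssd_among / (n_pops - 1)
    # Average sample size correction for unequal population samples.
    n0 = (total - sum(n * n for n in pop_sizes) / total) / (n_pops - 1)

    var_within = msd_within
    var_among = (msd_among - msd_within) / n0
    return var_among / (var_among + var_within)
```

Because the among-population variance component is estimated by subtraction, it can be negative, which is consistent with the model treating φ statistics as bounded by -1 and 1 rather than 0 and 1.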
First-level likelihood models. We developed three models for the first-level likelihood of
the observed haplotype counts (x) given the population haplotype frequencies (p). Each
model is applicable to a different category of DNA sequence data that was generated via
a different process. These models are presented in order of increasing uncertainty in the
true genotype of individuals, and thus in the population allele frequencies. The first model
(known haplotype model) is applicable if the two haplotypes of a diploid individual are known
without error, as is generally assumed for phased Sanger sequence data. Under this model
the probability of the observed haplotype counts for each locus and population follows a
multinomial distribution, such that the complete likelihood is a product of multinomial
distributions (for all loci and populations):
$$
P(x \mid p) = \prod_i \prod_j \frac{n_{ij}!}{x_{ij1}! \cdots x_{ijk}!}\, p_{ij1}^{x_{ij1}} \cdots p_{ijk}^{x_{ijk}}, \tag{1}
$$
where $n_{ij}$ is the total number of observed sequences at locus i for population j, and $x_{ijk}$ and $p_{ijk}$
are the observed count and population frequency of the kth haplotype in the jth population
at the ith locus. The multinomial likelihood model assumes Hardy-Weinberg equilibrium for
each locus and linkage equilibrium among loci.
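The multinomial likelihood of equation (1) can be evaluated on the log scale to avoid overflow in the factorials. The following is a minimal sketch, assuming haplotype counts and frequencies are stored as nested lists indexed by locus, population, and haplotype; the function names and data layout are illustrative and are not taken from the bamova software.

```python
from math import lgamma, log


def multinomial_loglik(x, p):
    """Log of one factor of equation (1): one locus in one population.

    x: observed haplotype counts; p: population haplotype frequencies.
    """
    n = sum(x)
    ll = lgamma(n + 1)              # log n_ij!
    for xk, pk in zip(x, p):
        ll -= lgamma(xk + 1)        # minus log x_ijk!
        if xk > 0:                  # skip unobserved haplotypes (0 * log p)
            ll += xk * log(pk)
    return ll


def total_loglik(counts, freqs):
    """Sum the log-likelihood over all loci i and populations j."""
    return sum(
        multinomial_loglik(x, p)
        for locus_x, locus_p in zip(counts, freqs)
        for x, p in zip(locus_x, locus_p)
    )
```

Summing log-likelihoods over loci corresponds to the product over loci in equation (1), which relies on the stated assumption of linkage equilibrium among loci.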
Because additional sampling error occurs when sequence reads are sampled stochastically
from DNA templates, NGS sequence data require alternative first-level likelihood models,
with the specific model depending on the information associated with each sequence. NGS
can incorporate individual-level indexing or barcoding (Craig et al. 2008; Hohenlohe
et al. 2010; Meyer and Kircher 2010). With individual-level indexes a sequence can be
assigned to a particular individual and locus, but there is uncertainty regarding whether
diploid individuals with a single observed haplotype are heterozygous or homozygous. If we
assume sequencing errors are negligible or have already been corrected, any individual with
two distinct haplotypes sampled is known to be a heterozygote, whereas individuals with
one sampled haplotype might be heterozygous for the observed haplotype and an alternative
haplotype or homozygous for the observed haplotype. Under these circumstances we propose
the following likelihood (NGS-individual model) for the observed haplotype counts given the
population haplotype frequencies:
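$$
P(x \mid p) = \prod_i \prod_j \prod_l
\begin{cases}
\left[2\, p_{ijk_a} p_{ijk_b}\right] \left[\dfrac{n_{ijl}!}{x_{ijk_a l}!\, x_{ijk_b l}!}\, 0.5^{x_{ijk_a l}}\, 0.5^{x_{ijk_b l}}\right] & \text{if } h \text{ is two} \\[2ex]
p_{ijk_a}^{2} + \sum_{k \neq k_a} \left[2\, p_{ijk_a} p_{ijk}\right] \left[0.5^{x_{ijk_a l}}\right] & \text{if } h \text{ is one}
\end{cases}, \tag{2}
$$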
where the product is across all individuals (l) as well as populations and loci, h is the number
of distinct haplotypes observed for an individual at a locus, $x_{ijk_a l}$ and $x_{ijk_b l}$ are the observed
counts of the one or two haplotype sequences for individual l, and $p_{ijk_a}$ and $p_{ijk_b}$ are the
frequencies of these haplotypes in population j. The complementary set of haplotypes that
exist, but that were not observed for an individual, have counts and frequencies simply
denoted by $x_{ijkl} = 0$ and $p_{ijk}$, respectively.
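The two cases of equation (2) can be made concrete for a single individual at a single locus. The sketch below is illustrative only: the function name and the representation of reads as a list of per-haplotype counts are assumptions, and individuals with no reads are rejected rather than handled as missing data.

```python
from math import comb


def individual_likelihood(reads, p):
    """One factor of equation (2) for one diploid individual at one locus.

    reads[k]: number of sequences of haplotype k observed for the individual.
    p[k]:     frequency of haplotype k in the individual's population.
    """
    observed = [k for k, x in enumerate(reads) if x > 0]
    if len(observed) == 2:
        # h = 2: the individual is a known heterozygote; each read samples
        # either gene copy with probability 0.5.
        ka, kb = observed
        xa, xb = reads[ka], reads[kb]
        genotype = 2 * p[ka] * p[kb]
        sampling = comb(xa + xb, xa) * 0.5 ** xa * 0.5 ** xb
        return genotype * sampling
    if len(observed) == 1:
        # h = 1: either a homozygote, or a heterozygote whose second
        # haplotype was never sampled (probability 0.5 per read).
        ka = observed[0]
        xa = reads[ka]
        homozygote = p[ka] ** 2
        heterozygote = sum(
            2 * p[ka] * p[k] for k in range(len(p)) if k != ka
        ) * 0.5 ** xa
        return homozygote + heterozygote
    raise ValueError("expected one or two distinct observed haplotypes")
```

For two haplotypes at equal frequency and two reads, the three possible outcomes (two reads of the first haplotype, two of the second, or one of each) have likelihoods 0.375, 0.375, and 0.25, which sum to one as expected.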
Alternatively, if NGS sequence data consist of indexed populations of pooled individuals
(Gompert et al. 2010), rather than indexed individuals, information will only be associated
with sequences at the population level. We propose a third likelihood model (NGS-population
model) for this situation, in which the probability of the observed haplotype frequencies given
the population frequencies is described by a multivariate Polya distribution (Gompert et al.
2010):
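$$
P(x \mid p, \nu) = \prod_i \prod_j \frac{n_{ij}!}{\prod_k (x_{ijk}!)}\,
\frac{\Gamma\left(\sum_k \nu_j p_{ijk} + 1\right)}{\Gamma\left(n_{ij} + \sum_k \nu_j p_{ijk} + 1\right)}
\prod_k \frac{\Gamma\left(x_{ijk} + \nu_j p_{ijk} + 1\right)}{\Gamma\left(\nu_j p_{ijk} + 1\right)}, \tag{3}
$$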
where Γ is the gamma function, $\nu_j$ is the number of gene copies (i.e., twice the number of
diploid individuals) sampled from population j, and other parameters are as described above.
This likelihood model has been used previously by Gompert et al. (2010) for population-level
NGS data. This likelihood function assumes that the frequency of each haplotype in
the sample of individuals from a population can take on any value between zero and one,
and thus might not be appropriate when very low numbers of individuals are sampled from
each population.
Model priors. The Bayesian model includes a conditional prior for the population haplotype
frequencies p assuming the distance matrix d is fixed and known without error. The
conditional prior with its estimated parameters provides a null distribution for identifying
outliers and inferring selection (see Designating outlier loci). The conditional prior we
choose depends on whether we are interested in genetic structure among populations, or
genetic structure among groups of populations (e.g., geographic regions) and among populations
within groups. For genetic structure among populations, we assign the following prior
to p:
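$$
P(p \mid \alpha_{ST}, \beta_{ST}, d) = \prod_i \frac{1}{B(\alpha_{ST}, \beta_{ST})}\,
\frac{(\phi_{ST_i} + 1)^{\alpha_{ST} - 1}\,(1 - \phi_{ST_i})^{\beta_{ST} - 1}}{2^{\alpha_{ST} + \beta_{ST} - 1}}, \tag{4}
$$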
where B is the beta function and $\phi_{ST_i}$ denotes φST for locus i calculated based on $p_i$ and
$d_i$ following Excoffier et al. (1992). This specification of the conditional prior for p does
not correspond to a standard probability distribution with respect to p, but is equivalent to
assuming that the locus-specific φST are distributed Beta(α = αST, β = βST, a = −1, b = 1)
with d fixed, where αST and βST are the shape parameters of a Beta distribution and a and
b define the lower and upper bounds of the distribution (i.e., we use a rescaled Beta distribution).
Thus, this conditional prior describes the distribution of φST across the genome,
with a mean and standard deviation given by:
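$$
\mu_{ST} = \frac{2\alpha_{ST}}{\alpha_{ST} + \beta_{ST}} - 1 \tag{5}
$$
and
$$
\sigma_{ST} = 2\sqrt{\frac{\alpha_{ST}\,\beta_{ST}}{(\alpha_{ST} + \beta_{ST})^{2}\,(\alpha_{ST} + \beta_{ST} + 1)}}, \tag{6}
$$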
respectively. We complete this model by assigning uninformative, uniform priors to αST ∼
U(0, uα) and βST ∼ U(0, uβ). For all analyses presented in this manuscript we set uα =
uβ = 10^6, yielding priors with uniform density over all parameter values with non-negligible
support in the posterior; however, alternative values might be necessary for specific data
sets. The above specification results in the following hierarchical Bayesian model:
$$
P(p, \alpha_{ST}, \beta_{ST} \mid x, d, \nu) \propto P(x \mid p, \nu)\, P(p \mid \alpha_{ST}, \beta_{ST}, d)\, P(\alpha_{ST})\, P(\beta_{ST}). \tag{7}
$$
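The rescaled Beta distribution that serves as the genome-level distribution in equations (4) to (6) can be written down directly. The sketch below evaluates the log density of one locus-specific term of equation (4) and the genome-level mean and standard deviation; the function names are illustrative assumptions rather than part of the authors' software.

```python
from math import lgamma, log, sqrt


def rescaled_beta_logpdf(phi, alpha, beta):
    """Log density of a Beta(alpha, beta) rescaled to the interval [-1, 1].

    This is the log of a single locus term in equation (4).
    """
    if not -1.0 < phi < 1.0:
        return float("-inf")
    log_beta_fn = lgamma(alpha) + lgamma(beta) - lgamma(alpha + beta)
    return (
        (alpha - 1) * log(phi + 1)
        + (beta - 1) * log(1 - phi)
        - (alpha + beta - 1) * log(2)
        - log_beta_fn
    )


def genome_mean(alpha, beta):
    """Equation (5): mean of the genome-level phi distribution."""
    return 2 * alpha / (alpha + beta) - 1


def genome_sd(alpha, beta):
    """Equation (6): standard deviation of the genome-level distribution."""
    return 2 * sqrt(alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1)))
```

As a sanity check, setting both shape parameters to 1 gives a uniform density of 0.5 on [-1, 1], with mean 0 and standard deviation 1/√3.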
We provide an alternative conditional prior for the haplotype frequencies when genetic structure
among groups and populations is of interest. Specifically we assume:
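$$
\begin{aligned}
P(p \mid \alpha, \beta, d) = \prod_i\;
& \frac{1}{B(\alpha_{ST}, \beta_{ST})}\,
\frac{(\phi_{ST_i} + 1)^{\alpha_{ST} - 1}\,(1 - \phi_{ST_i})^{\beta_{ST} - 1}}{2^{\alpha_{ST} + \beta_{ST} - 1}} \\
\times\; & \frac{1}{B(\alpha_{SC}, \beta_{SC})}\,
\frac{(\phi_{SC_i} + 1)^{\alpha_{SC} - 1}\,(1 - \phi_{SC_i})^{\beta_{SC} - 1}}{2^{\alpha_{SC} + \beta_{SC} - 1}} \\
\times\; & \frac{1}{B(\alpha_{CT}, \beta_{CT})}\,
\frac{(\phi_{CT_i} + 1)^{\alpha_{CT} - 1}\,(1 - \phi_{CT_i})^{\beta_{CT} - 1}}{2^{\alpha_{CT} + \beta_{CT} - 1}},
\end{aligned} \tag{8}
$$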
where $\phi_{SC_i}$ and $\phi_{CT_i}$ denote φSC (molecular variation among populations within groups of
populations) and φCT (molecular variation among groups of populations relative to total haplotypic
diversity) for locus i calculated based on $p_i$ and $d_i$ following Excoffier et al. (1992),
and other parameters are as described above. This specification estimates genome-level distributions
for each φ statistic independently. Because locus-specific φST are fully specified
given locus-specific φSC and φCT, an alternative and equally valid modeling approach would
be to treat the mean genome-level φST as a derived parameter and specify a conditional
prior based only on φSC and φCT. This alternative approach would account for the lack of
independence among φSC, φCT, and φST and would be closer to existing decompositions of
φ statistics. However, this alternative model prior would not provide posterior estimates of
αST or βST, which are necessary to specify the posterior distribution for the genome-level
distribution of φST, as opposed to the posterior distribution of the mean genome-level φST.
The former is necessary to identify outlier loci with respect to φST, which is central to our
model. Similar to the model for genetic differentiation among populations only, equation (8)
does not correspond to a standard probability distribution with respect to p, but is equivalent
to assuming that each of the locus-specific φ statistics follows its own Beta distribution
with d fixed. As with the model for population structure, it is possible to derive the mean
and standard deviation of the genome-level Beta distribution using equations (5) and (6) and
substituting the appropriate α and β parameters. Finally, as before we assign uninformative,
uniform priors to all α and β parameters. This specification results in the following Bayesian
model:
$$
P(p, \alpha, \beta \mid x, d, \nu) \propto P(x \mid p, \nu)\, P(p \mid \alpha, \beta, d)\, P(\alpha_{ST})\, P(\beta_{ST})\, P(\alpha_{SC})\, P(\beta_{SC})\, P(\alpha_{CT})\, P(\beta_{CT}). \tag{9}
$$
Designating outlier loci. This Bayesian model gives rise to a framework for identifying
outlier loci, or genetic regions with an unusual proportion of molecular variation partitioned
among populations or groups. Such genetic regions might have experienced selection directly,
or indirectly through linkage. Genetic regions with unusual patterns of molecular variation
can be identified by contrasting φ statistics for each locus with the genome-level distribution
for each φ statistic. Specifically, we identify outlier loci by estimating the posterior
probability distribution for the quantile of each locus-specific φ statistic in the genome-level
distribution. Formally, we define $a_i$ as the ath quantile of the posterior distribution for the
locus-specific φ statistic for locus i and let $q_n$ be the interval with endpoints defined as the
(n/2)th and (1 − n/2)th quantiles of the genome-level φ statistic distribution. We then consider locus
i an outlier at the ath quantile with respect to a given φ statistic with probability n if the
interval $q_n$ does not contain $a_i$. Values of n and a used for outlier designation will determine
the stringency of the analysis, with higher values of a and lower values of n resulting in
fewer loci classified as outliers. In this paper we set n = 0.05 ($q_n$ = [0.025, 0.975]) and use
two different values of a (0.5 and 0.95 quantiles). These values for a denote the median and
95th quantile of the posterior probability distribution for the quantile of each locus-specific φ
statistic. When only differentiation among populations is being considered, a genetic region
can only be designated an outlier with respect to φST, whereas a genetic region could be an
outlier with respect to φST, φSC, or φCT when group and population structure are considered.
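One way to read this criterion in terms of MCMC output is sketched below. It assumes that each retained MCMC iteration has already been converted into the quantile (a value between 0 and 1) that the locus-specific φ statistic occupies in that iteration's genome-level distribution; how those values are produced, the function names, and the nearest-rank quantile rule are all assumptions made for illustration and may differ from the bamova implementation.

```python
def posterior_quantile(samples, a):
    """Empirical a-th quantile of a list of posterior samples (nearest rank)."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, max(0, int(a * len(ordered))))
    return ordered[idx]


def is_outlier(quantile_samples, a=0.5, n=0.05):
    """Flag a locus as an outlier under the interval criterion.

    quantile_samples: posterior samples of the locus's quantile in the
                      genome-level distribution (hypothetical input).
    a: which quantile of that posterior to compare (0.5 or 0.95 in the text).
    n: the interval q_n runs from n/2 to 1 - n/2.
    """
    a_i = posterior_quantile(quantile_samples, a)
    lower, upper = n / 2, 1 - n / 2
    return a_i < lower or a_i > upper
```

With n = 0.05 the interval is [0.025, 0.975], so a locus whose posterior quantile summary sits at 0.99 would be flagged as a high outlier and one at 0.001 as a low outlier, whereas a locus centred near 0.5 would not be flagged.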
Outlier loci can be classified as having very low φST, which could be indicative of balancing
or purifying selection, or very high φST, which could be indicative of positive selection
within populations or divergent selection among populations (Beaumont and Balding
2004; Beaumont 2005). Patterns of selection giving rise to low or high φCT and φSC might
be more complex. For example, high φCT outliers would be expected if divergent selection
occurred among groups of populations with the same alleles favored within each group,
and low φSC outliers might be expected with balancing selection within groups that favored
different subsets of alleles among groups.
Several approaches have been described for identifying outlier loci using genome scans
for differentiation (Beaumont and Balding 2004; Foll and Gaggiotti 2008; Guo et al.
2009). Our approach is most similar to that proposed by Guo et al. (2009), but also differs
from this approach in several important ways. Guo et al. (2009) contrast an approximation
of the posterior probability distribution for each locus-specific φST (θi’s in their model) to the
hyperdistribution describing among-locus variation in φST (equivalent to our genome-level
distribution). Approximation of the posterior probability distribution for φST is achieved by
defining a Beta distribution with first and second moments equal to those of the posterior
probability distribution (Guo et al. 2009). This approximation allows Guo et al. (2009)
to measure the divergence between the posterior probability distribution for each locus-specific
φST and the genome-level distribution (also a Beta distribution, which is derived
from point estimates of genome-level parameters) using Kullback-Leibler divergence
(Kullback and Leibler 1951). Following calibration to determine a cutoff value for significance,
the Kullback-Leibler divergence measure is used to designate outlier loci. The primary distinction
between the outlier detection method proposed by Guo et al. (2009) and the outlier
detection method we have implemented is that their method tests for a difference between
the posterior probability distribution of each locus-specific φST (this distribution measures
uncertainty in the parameter estimate) and a point estimate of the genome-level φST distribution
(this distribution measures expected variation among loci in φST). In contrast, we
test whether locus-specific φ statistics are unlikely given the genome-level distribution (i.e.,
not that our estimates of these locus-specific φ statistics simply differ significantly from the
genome-level distribution). Additionally, our method differs by accounting for uncertainty
in the genome-level distribution by taking the marginal distribution of the quantile of each
locus-specific φ statistic in the genome-level distribution, rather than using a point estimate
of the genome-level distribution, which is necessary for the method of Guo et al. (2009).
We do not believe that either of these methods is necessarily superior, but rather that they
provide alternative criteria and definitions of outlier loci, with our method perhaps more
closely reflecting common conceptions of outlier loci among evolutionary biologists and their
method utilizing more information from the posterior distribution of locus-specific φST.
Analysis of simulated data sets
We conducted a series of simulations to determine what proportion of genetic regions would
be classified as outliers in the absence of selection. These data sets were simulated for
analysis with the population structure model and for each of the three likelihood models
(i.e., known haplotype model, NGS-individual model, and NGS-population model). We
simulated sequence data under an infinite-sites coalescent model with R. Hudson’s software
ms (Hudson 2002). Data sets were simulated with 25 or 500 genetic regions. The
simulations assumed five populations split from a common ancestor at time τ in the
past, where τ is measured in units of 4Ne generations and was set to 0.25, 0.5, or 1.0. We
conducted simulations with migration among the five populations (Nem) set to 0 or 2. The ancestral population
and all five descendant populations were assigned population mutation rates θ = 4Neµ of
0.5, where µ is the per-locus mutation rate. Forty gene copies were sampled from each
of the five populations. For the known haplotype model analyses we treated the simulated
sequences directly as the sampled data. For NGS-individual model and NGS-population
model analyses we re-sampled the simulated sequence data sets such that coverage for each
sequence was Poisson distributed (λ = 2). For the NGS-individual model analyses we retained
information on which individual each sequence came from, whereas we only retained
population identification for NGS-population model analyses. We generated 10 replicate
simulated data sets for each combination of likelihood model, number of genetic regions
(25 or 500), τ, and migration rate. Each data set was analyzed using a Markov chain
Monte Carlo (MCMC) implementation of the proposed model using the bamova software we
have developed (see Supplemental Material MCMC algorithm; available from the authors
at http://www.uwyo.edu/buerkle/software/ as stand-alone software). We calculated the
distance matrix for each locus using the number of sites by which each pair of sequences
differed. Estimation of the posterior probability distribution for all parameters for each data
set was based on a single MCMC algorithm run that included a 25,000 iteration burn-in
followed by 50,000 samples from the posterior. Sample history plots were monitored to ensure
appropriate chain mixing and convergence on the stationary distribution. We classified
genetic regions as outliers based on the posterior probability distribution for the quantile
of each locus-specific φST in the genome-level φST distribution, as previously described (see
Designating outlier loci). We then calculated the mean proportion of loci classified as outliers
across the 10 replicates for each combination of parameters.
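Two steps of this procedure lend themselves to a short sketch: re-sampling simulated gene copies so that coverage per sequence is Poisson distributed, and building a distance matrix from the number of differing sites. The code below is an illustration under stated assumptions, not the authors' pipeline: sequences are assumed to be equal-length strings, the Poisson sampler uses Knuth's multiplication method from the standard library's uniform generator, and all function names are hypothetical.

```python
import math
import random


def poisson(lam, rng):
    """Draw one Poisson(lam) variate (Knuth's method; fine for small lam)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def resample_reads(sequences, lam=2.0, seed=None):
    """Return (gene_copy_index, sequence) reads with Poisson coverage.

    Keeping the gene copy index allows reads to be traced back to an
    individual (NGS-individual analyses); dropping it leaves only
    population-level information (NGS-population analyses).
    """
    rng = random.Random(seed)
    reads = []
    for copy_id, seq in enumerate(sequences):
        reads.extend((copy_id, seq) for _ in range(poisson(lam, rng)))
    return reads


def pairwise_differences(seqs):
    """Distance matrix: number of sites at which each pair of sequences differs."""
    return [[sum(a != b for a, b in zip(s, t)) for t in seqs] for s in seqs]
```

With λ = 2, roughly 13.5% of gene copies (e^-2) receive no reads at all, which illustrates why missing data and unobserved second haplotypes are common even at moderate mean coverage.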
We conducted an additional series of simulations to assess the capacity of our analytical
model to detect selected loci among a larger set of neutral regions. We began with a set
of simulations of population structure, as above. Simulations were conducted under all
combinations of conditions described previously for simulations without selection. However,
we simulated selective sweeps affecting 2 (8%; for 25 locus simulations) or 25 (5%; for 500
locus simulations) of the simulated loci. We allowed selective sweeps to occur in 1, 3, or 5
populations (one set of simulations for each). Selective sweeps were simulated by selecting one
haplotype from each affected population at each affected locus and increasing its frequency
to 1. This was meant to simulate recent and strong selective sweeps where affected loci
were in arbitrary linkage disequilibrium with the gene subject to selection. Thus, it was
possible for the same or different haplotypes to be driven to fixation across populations. We
calculated the mean proportion of neutral and non-neutral loci classified as outliers across
the 10 replicates for each combination of parameters.
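The sweep manipulation described here amounts to replacing all gene copies in an affected population with a single haplotype. The sketch below assumes the haplotype is chosen uniformly at random from the sampled gene copies and that the first populations in the list are the affected ones; neither detail is stated in the text, and the function names are hypothetical.

```python
import random


def apply_sweep(haplotypes, rng):
    """Fix one sampled haplotype in a population at one locus (frequency 1)."""
    swept = rng.choice(haplotypes)
    return [swept] * len(haplotypes)


def sweep_populations(locus, n_swept, seed=None):
    """Apply sweeps to the first n_swept populations at one locus.

    locus: list of populations, each a list of haplotype sequences.
    Each swept population draws its own haplotype, so the same or
    different haplotypes can become fixed in different populations.
    """
    rng = random.Random(seed)
    return [
        apply_sweep(pop, rng) if j < n_swept else list(pop)
        for j, pop in enumerate(locus)
    ]
```

Because each population draws independently, sweeps in several populations may fix the same haplotype (lowering differentiation at the locus) or different haplotypes (raising it), matching the point that the direction of the effect on φST is not fixed in advance.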
Finally, we conducted a series of simulations to determine the extent to which our group
structure model could correctly identify genetic regions affected by selection. For these
simulations we concentrated on the NGS–individual model and a single demographic history. We
simulated two groups of five populations, with the divergence time of the populations within
each group equal to 0.25 and the divergence of the two groups equal to 0.75. We allowed
no migration between groups, but set the within-group migration rate to 4. We simulated
data sets of 25 and 500 loci, and as with the population structure simulations, we simulated
selective sweeps affecting 2 (25 locus simulations) or 25 (500 locus simulations) genetic
regions. Selective sweeps occurred in all five populations within one group (sg = 1), or all five
populations in both groups (sg = 2). We simulated 10 replicate data sets for each
combination of simulation parameters. MCMC settings were as previously described. We identified
outlier loci based on φST (the partitioning of molecular variation among populations), φCT
(the partitioning of molecular variation between the two groups), and φSC (the partitioning of
molecular variation among the populations within each group). We determined the mean
proportion of neutral and non-neutral loci classified as outliers based on each φ statistic
across the 10 replicates for each combination of parameters.
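For orientation, the three φ statistics named above can be written in the classical analysis of molecular variance notation of Excoffier et al. (1992). The symbols below are that paper's conventions and are an assumption here; they may differ from the notation used in the model description of this article. Let σ²_a be the variance component among groups, σ²_b the component among populations within groups, and σ²_c the component within populations.

```latex
\sigma^2 = \sigma^2_a + \sigma^2_b + \sigma^2_c, \qquad
\phi_{CT} = \frac{\sigma^2_a}{\sigma^2}, \qquad
\phi_{SC} = \frac{\sigma^2_b}{\sigma^2_b + \sigma^2_c}, \qquad
\phi_{ST} = \frac{\sigma^2_a + \sigma^2_b}{\sigma^2}
```

Under these definitions the statistics are linked by (1 − φST) = (1 − φSC)(1 − φCT), so a locus can be an outlier for one statistic without being an outlier for the others.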
Analysis of human SeattleSNP data
To illustrate the application of the proposed model to real genetic data we analyzed 316
fully sequenced genes (exons and introns) from the SeattleSNPs data set
(http://pga.gs.washington.edu; downloaded May 2010). Each gene was sequenced in 24 individuals with
African ancestry (either the AA panel or the HapMap YRI population) and 23 individuals
with European ancestry (either the CEPH population or the HapMap CEU population).
The phase of polymorphisms in these sequences has been estimated statistically using PHASE
v2.0 (Stephens et al. 2001) to produce haplotypes. Similar to NGS data sets, these data are
DNA sequences, but these data lack the sampling uncertainty associated with NGS data and
might represent fewer (but much longer) genetic regions than will be typical for current
NGS studies. We used these data and the bamova software to identify loci with exceptional
φST estimates that were consistent with divergent selection between or balancing selection
within these European and African ancestry populations. The mean number of SNPs per
gene was 62.98 (sd = 50.73), resulting in a large number of haplotypes per gene. This large
number of SNPs per gene, which resulted from these data being complete gene sequences,
was considerably greater than typical for the short reads generated by NGS (e.g., Gompert
et al. 2010; Hohenlohe et al. 2010), and resulted in poor MCMC mixing. Therefore we
based our analysis on the first five SNPs in each gene (see Supplemental Material Human
SeattleSNP data: alternative data subsets for analyses using other SNPs). All
insertion-deletion polymorphisms were ignored. We used the known haplotypes model with population
structure. We ran a 25,000 iteration burnin followed by 50,000 iterations to estimate posterior
probabilities. We identified outlier loci using the criteria described for the simulated data
sets. We then examined variation in genetic differentiation among SNPs within each outlier
locus, as well as several loci that had typical, “neutral” φST estimates, by calculating point
estimates of FST at each SNP.
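The text does not state which estimator was used for the per-SNP point estimates of FST. As a rough illustration only, the sketch below uses a simple heterozygosity-based form, FST = (HT − HS) / HT, for one biallelic SNP in two equally weighted populations, with no sample-size correction. The function name `fst_biallelic` and these simplifications are assumptions made for illustration and should not be read as the authors' method.

```python
def fst_biallelic(p1, p2):
    """Heterozygosity-based FST for one biallelic SNP in two populations.

    p1 and p2 are the frequencies of the same allele in each population.
    Populations are weighted equally and no sample-size correction is
    applied (simplifying assumptions for illustration).
    """
    p_bar = (p1 + p2) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)  # expected heterozygosity, pooled
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    if h_total == 0.0:
        return 0.0  # SNP is monomorphic overall, so there is no differentiation
    return (h_total - h_within) / h_total


# Hypothetical allele frequencies at three SNPs within one gene.
per_snp_fst = [fst_biallelic(p1, p2) for p1, p2 in [(0.5, 0.5), (0.2, 0.8), (1.0, 0.0)]]
```

Computing the statistic SNP by SNP in this way makes it possible to see whether differentiation at an outlier locus is concentrated at a few sites or spread across the gene.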
Analysis of worldwide SNP data
We obtained data for genomic diversity across 33 widely distributed human populations
(including Africa, Eurasia, East Asia, Oceania, and America) from Jakobsson et al. (2008;
version 1.3; http://neurogenetics.nia.nih.gov/paperdata/public/). These data
include statistically phased haplotypes for 597 individuals, based on 525,901 SNPs. These
data differ from NGS data as they are not DNA sequences, but the large number of genetic
regions in this data set might be similar to what will be acquired in NGS studies. For each
SNP in the HumanHap550 genotyping panel we obtained annotations and information for
location relative to genes from the manufacturer of the genotyping technology (Illumina, San
Diego, CA). To focus our analysis on haplotypic variation within transcribed, genic regions,
we retained only those SNPs that were annotated as being within coding, intron, 5’UTR,
3’UTR, or UTR regions. If for a particular gene this included more than 4 SNPs (minimally,
for inclusion we required 2 SNPs per locus), we retained SNPs that were annotated as within
coding or intron sequence plus any neighboring SNPs to bring the total to 4 SNPs per
locus. Finally, if there were more than 4 SNPs within coding or intron sequence at a locus,
we retained the first and last SNP and randomly chose 2 intervening SNPs in coding or
intron sequence. We utilized a maximum of 4 SNPs per locus because this was similar to the
number of variable sites expected in short NGS data (Gompert et al. 2010; Hohenlohe
et al. 2010) and led to a sufficiently small number of haplotypes across populations, relative
to the numbers of individuals per population, to yield informative haplotype frequencies for
populations. Some isolated SNPs were annotated to genes elsewhere in the genome, in
conflict with neighboring SNPs, and were excluded. However, we did allow these SNPs
to break genes into separate genic regions for analysis, so that we utilized haplotypic data
from 12649 regions and 11866 distinct genes. We excluded the limited amount of data from
the mitochondrion, the Y chromosome, and the pseudoautosomal region of the X and Y
chromosomes, and focused on the autosomes and the X chromosome.
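The final thinning step described above (keep the first and last SNP and two randomly chosen intervening SNPs when a locus has more than four coding or intron SNPs) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `thin_snps`, the list-of-identifiers input, and the uniform random draw of interior SNPs are hypothetical, and the sketch omits the earlier steps (annotation filtering, padding with neighboring SNPs, and the minimum of 2 SNPs per locus).

```python
import random


def thin_snps(snp_ids, rng, max_snps=4):
    """Reduce an ordered list of SNPs at one locus to at most max_snps.

    snp_ids must be ordered by chromosomal position. When the locus has
    more than max_snps SNPs, the first and last are kept and the remaining
    slots are filled by interior SNPs drawn uniformly without replacement.
    """
    if len(snp_ids) <= max_snps:
        return list(snp_ids)
    interior = sorted(rng.sample(range(1, len(snp_ids) - 1), max_snps - 2))
    return [snp_ids[0]] + [snp_ids[i] for i in interior] + [snp_ids[-1]]


# Hypothetical locus with seven coding or intron SNPs, in position order.
rng = random.Random(7)
locus_snps = ["snp1", "snp2", "snp3", "snp4", "snp5", "snp6", "snp7"]
kept = thin_snps(locus_snps, rng)
```

Keeping the flanking SNPs preserves the physical span of the locus, while the cap of four SNPs limits the number of distinct haplotypes that must be modeled.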
We conducted a separate analysis for each chromosome to test for evidence of selection
based on levels of molecular differentiation among these human populations using the bamova
software. This allowed us to contrast patterns of genetic variation among chromosomes and
identify outlier loci relative to levels of genetic differentiation for the chromosome on which
they are found. We used the known haplotypes likelihood model with population structure
and estimated posterior probabilities based on 50,000 MCMC iterations following a 25,000
iteration burnin. We classified outlier loci for each data set using the criteria described for the
simulated data sets.
RESULTS
Results from simulated data sets When data were simulated in the absence of selection,
few genetic regions were classified as outliers, and thus, as being associated with targets of
selection (Table 1). This result suggests that the method has a low false positive rate (mean
proportion = 0.012, sd = 0.030). The proportion of genetic regions classified as outliers was
similar across all likelihood models, but tended to be slightly lower when migration among
populations was simulated (particularly for high φST outliers). With a = 0.5 the mean
proportion of genetic regions classified as outliers across 10 replicate simulations varied from
0.003 (known haplotype model, no migration, 500 sequence loci, high φST outliers) to 0.034
(NGS–population model, no migration, 500 sequence loci, low φST outliers). Using a = 0.95
the mean proportion of outliers detected across 10 replicate simulations varied from 0 (many
simulation conditions) to 0.004 (NGS–individual model, no migration, 500 sequence loci, high
φST outliers).
Simulations that included selection typically resulted in an increased number of genetic
regions being classified as outliers. As expected, we found that the ability to correctly classify
non-neutral genetic regions as outlier loci was dependent upon the extent of selection. Few
outliers were detected when only one population was affected (e.g., high φST, a = 0.5,
NGS–individual model mean = 0.057, sd = 0.040), but most selected loci were correctly classified
as outliers when all five populations were affected (e.g., high φST, a = 0.5, NGS–individual
model mean = 0.612, sd = 0.310; Tables S1, S2, S3). Selection was easiest to detect when
data were simulated and analyzed in accordance with the known haplotype model (in this
case, when all five populations were affected by selection, all selected loci were identified
as outliers even with a = 0.95; Table S1). Selection was generally easier to detect when
migration was simulated. For example, under the known haplotype model the proportion of
selected loci correctly classified as high outliers increased from 0.429 to 0.492 when simulated
populations experienced the homogenizing effects of migration (means across all simulation
conditions). For the complete set of simulations, the proportion of selected genetic regions
that were correctly classified as outlier loci was greater than the proportion of neutral genetic
regions incorrectly classified as outlier loci (Tables S1, S2, S3).
The ability of the method to correctly identify genetic regions affected by selective sweeps
in the presence of group structure (i.e., when populations are organized into groups of
populations) was highly dependent on the extent of the selective sweeps. When selective sweeps
were confined to a single group, outliers were most often detected as genetic regions with high
differentiation among populations within a group (i.e., high φSC; Table S4). The proportion
of swept genetic regions correctly classified as φSC outliers was generally small and did not
exceed 0.150, but was much greater than the proportion of neutral genetic regions classified
as φSC outliers. In contrast, when both groups of populations were affected by selective
sweeps, nearly all swept loci were classified as high φST and φSC outliers (92.5–100% using
a = 0.5) and a smaller proportion (0.100–0.225) were classified as low or high φCT outliers.
Thus, the selective sweeps were most readily detected based on high differentiation among all
populations (φST) or among populations within each group (φSC), but could also be detected
based on low or high differentiation between the two groups (φCT). This variety of outcomes
arises from the stochastic procedure used to determine the specific haplotypes that were swept
to fixation. The false positive rate for these group analyses was very low (between 0 and
0.032), similar to the population structure analyses (Tables S2, S4).
Divergent selection among African and European populations Mean genome-level
φST between African ancestry and European ancestry populations based on the SeattleSNPs
data set was 0.080 (95% equal tail probability interval (ETPI) 0.065–0.097). This means
that approximately 8% of DNA sequence variation was partitioned between the African and
European ancestry populations.
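A posterior mean and 95% equal tail probability interval such as the one reported above are typically summarized from MCMC output by discarding the burn-in and taking the 2.5% and 97.5% sample quantiles of the retained draws. The sketch below shows that calculation in plain Python. It is an illustration, not the bamova implementation: the function names, the linear-interpolation quantile rule, and the toy chain are assumptions.

```python
def quantile(sorted_vals, p):
    """Sample quantile by linear interpolation between order statistics."""
    pos = p * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def summarize_chain(chain, burnin, level=0.95):
    """Return (posterior mean, lower bound, upper bound) after burn-in.

    The interval is equal-tailed: (1 - level) / 2 of the retained samples
    lie below the lower bound and the same fraction above the upper bound.
    """
    kept = sorted(chain[burnin:])
    tail = (1.0 - level) / 2.0
    mean = sum(kept) / len(kept)
    return mean, quantile(kept, tail), quantile(kept, 1.0 - tail)


# Toy chain: 10 burn-in draws far from the target, then 101 evenly spaced draws.
toy_chain = [9.0] * 10 + [i / 100.0 for i in range(101)]
post_mean, etpi_lower, etpi_upper = summarize_chain(toy_chain, burnin=10)
```

With the settings described in the text, `burnin` would correspond to the 25,000 discarded iterations and the retained portion to the following 50,000 iterations.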
We classified three genes as high φST outliers (using a = 0.5) based on the first five SNPs
data subset (Fig. 1). One of these genes was HSD11B2. Approximately 32% of molecular
variation at this gene was partitioned between African and European ancestry populations
(φST = 0.317, 95% ETPI 0.159–0.482; Fig. 1). Allelic variants of this gene produce an
inherited form of hypertension and an end-stage renal disease (Quinkler and Stewart
2003). A weak association has also been detected between an intronic microsatellite in this
gene and type 1 diabetes mellitus and diabetic nephropathy (Lavery et al. 2002). FOXA2
was also identified as an outlier, with φST equal to 0.321 (95% ETPI 0.124–0.513; Fig. 1).
FOXA2 regulates insulin sensitivity and controls hepatic lipid metabolism in fasting and
type 2 diabetic mice (Wolfrum et al. 2004; Puigserver and Rodgers 2006). A genome-wide
association study detected a SNP near FOXA2 (rs1209523) that was associated with
fasting glucose levels in European and African Americans (Xing et al. 2010). This outlier
is particularly interesting as type 2 diabetes is 1.2–2.3 times more common in African
Americans than European Americans (Harris 2001). The third outlier locus was POLG2,
with an estimated φST of 0.327 (95% ETPI 0.175–0.477; Fig. 1). This gene was also classified
as a target of selection in humans in a recent study (Barreiro et al. 2008).
Selection among worldwide human populations For the worldwide human SNP data,
estimates of mean chromosome-level φST were similar for the autosomal chromosomes and
ranged from 0.083 (95% ETPI 0.0752–0.091; chromosome 22) to 0.113 (95% ETPI
0.104–0.120; chromosome 16; Table 2, Fig. 3). The estimate of φST for the X chromosome was
considerably higher (0.139, 95% ETPI 0.128–0.149).
We detected remarkable variation in φST along each of the chromosomes (Fig. 4). We
detected 569 unique φST outlier loci, which we designate as candidate genes for selection
among worldwide human populations. Of these 569 genes, 518 were high φST outlier loci
(including 222 with a = 0.95) and 51 were low φST outlier loci (including six with a =
0.95). Outlier loci were detected on all chromosomes, with the greatest number identified
on chromosome 1 (67 outlier loci) and the fewest on chromosome 13 (9 outlier loci; Table 2,
Fig. 4). Some of the genes that we classified as outliers have previously been implicated as
genes experiencing selection in human populations. These include CYP3A4 and CYP3A5,
which are cytochrome P450 genes found near one another on chromosome 7 (a = 0.95;
CYP3A4 φST = 0.293, 95% ETPI 0.245–0.319, and CYP3A5 φST = 0.359, 95% ETPI
0.315–0.401). These genes are important for detoxification of plant secondary compounds and are
involved in metabolism of some prescribed drugs. Additional evidence that these genes have
experienced positive selection in human populations exists from previous studies based on
distortions of the site frequency spectrum in African, European, and Chinese populations,
population differentiation between the CEU and YRI populations, and extended haplotype
homozygosity in HapMap populations (Carlson et al. 2005; Voight et al. 2006; Nielsen
et al. 2007; Chen et al. 2010). Another example is the PKDREJ gene on chromosome
22, which is a candidate sperm receptor gene of mammalian egg-coat proteins and had
one of the highest estimates of φST (0.455, 95% ETPI 0.396–0.504). Variation in this gene
is consistent with positive selection among primate lineages, although evidence suggests
that balancing selection might act to maintain diversity at this gene in human populations
(Hamm et al. 2007). We also found evidence of divergent selection acting on SLC24A5,
which has been associated with differences in skin pigmentation among humans and was
classified as a candidate for positive selection in several studies (Barreiro et al. 2008).
Finally, three of the high φST outliers at a = 0.95 (RTTN, MSX2, and CDAN1) were among
seven human skeletal genes identified by Wu and Zhang as genes with elevated FST at
non-synonymous SNPs between African and non-African (European and East Asian) populations
and candidates for recent positive selection in Europeans and East Asians (Wu and Zhang
2010).
We detected additional genes with very high φST that have not, to the best of our
knowledge, been previously implicated as experiencing selection in human populations but
have been found to be associated with disease traits. For example, the estimate of φST for
SPATA5L1 (spermatogenesis associated 5-like 1) on chromosome 15 was 0.347 (95% ETPI
0.246–0.430). This gene was classified as a high φST outlier in the analysis and has been
linked to renal function and kidney disease, but has not been identified in previous tests for
selection (Kottgen et al. 2009).
We also identified novel candidate targets of balancing selection in worldwide human
populations. Interestingly, none of the genes that we classified as low φST outlier loci were
implicated as targets of balancing selection in European and African American populations in
a recent study by Andres et al. (2009). Several of these genes have been linked to diseases.
For example, RIF1 was classified as a low φST outlier at the a = 0.95 level (φST = -0.065,
95% ETPI -0.105 to -0.023). RIF1 is an anti-apoptotic factor involved in DNA repair that is
necessary for S-phase progression and is heavily expressed in breast cancer tumors (Wang
et al. 2009). Another example is the AKT3 gene found on chromosome 1 (φST = -0.046, 95%
ETPI -0.102 to 0.020; outlier at a = 0.5). This gene is involved in cell-cycle regulation and
is highly expressed in malignant melanoma, but is also important for attainment of normal
organ size, including brain size in mice (Stahl et al. 2004; Easton et al. 2005).
DISCUSSION
We have presented a novel model to quantify genome-wide population genetic structure and
identify genetic regions that are likely to have experienced natural selection. Unlike previous
methods for quantifying structure and detecting genetic signatures of selection, the proposed
method accurately models the stochastic sampling of sequences that is inherent in current
NGS instruments and incorporates genetic distances among sequences in estimates of genetic
differentiation. Although few population genomics studies based on NGS of individuals have
been published to date (hence the analysis of Sanger sequence and SNP human data sets
instead of NGS data sets), various large-scale projects are currently underway to obtain these
data in large samples of humans (e.g., 1000 Genomes Project, http://www.1000genomes.org)
and a few recent studies suggest that these data will soon be available for many non-human
(including non-model) species (Gompert et al. 2010; Hohenlohe et al. 2010). However,
beyond data acquisition, substantial biological insights will only be possible if accompanied
by models and methods designed to take full advantage of these data, while accurately
modeling sources of error. Along with a few other recent reports of models for NGS (Guo
et al. 2009; Lynch 2009; Gompert et al. 2010; Futschik and Schlotterer 2010), we
believe our analytical methods help to address this need.
Model properties The analyses of simulated and human genetic data sets suggest that
the model provides statistically sound estimates of population differentiation for large sets
of loci (see Supplemental Material Simulations: estimation of φ statistics). For example, the
estimates of genome-level φST for the SeattleSNPs human sequence data and
chromosome-level φST for the worldwide human SNP data (0.080–0.139) were similar to mean levels of
genetic differentiation among human populations based on FST (e.g., FST = 0.09–0.14 for
Yoruba, European, Han Chinese and Japanese populations; Weir et al. 2005; Barreiro
et al. 2008).
The simulation results indicate that the model only rarely identifies neutral loci
incorrectly as outliers (i.e., it has a low false positive rate). Using a = 0.5, the false positive rate
for high or low φST outliers was never greater than 0.04, and using a = 0.95 this rate was
never greater than 0.01. Low false positive rates are particularly important for genome-wide
scans in which a large number of genes could otherwise show spurious evidence of selection.
In contrast, false positive rates as high as 0.343 have been reported for FST outlier analyses
of selection that use coalescent simulations under an incorrect demographic history to derive
a neutral distribution (Excoffier et al. 2009). Moreover, the low false positive rate for
the method held across all simulated demographic scenarios (i.e., different population
divergence times, variation in migration rates and the presence or absence of group structure).
This is expected as the null, genome-level distribution we estimate is based on the observed
data rather than an assumed demographic history and most demographic histories should be
appropriately captured by this genome-level distribution (Beaumont and Balding 2004;
Guo et al. 2009). Similarly low false positive rates were reported by Guo et al. (2009)
using a hierarchical Bayesian model based on FST. Nonetheless, it is important to note that
these hierarchical models generally identify outlier loci as those that are inconsistent with
the genome as a whole, and thus will not work well if most of the genome has recently
experienced selection of a similar magnitude.
Simulation results further indicate that the method has the ability to detect genetic
regions under selection, and to a greater extent when selective sweeps affect multiple
populations or migration occurs. When a simulated genetic region was affected by selection in at
least three populations the selected locus was generally at least 10 times more likely to be
classified as an outlier than a randomly chosen neutral locus. This suggests a high true
positive rate (at least under favorable conditions) and that candidate selected genes identified
by the method will be enriched substantially for genes actually experiencing selection. This
is particularly true when genes are classified as outliers using the more stringent criterion of
a = 0.95 (although this will also decrease the total number of selected genes identified
relative to a = 0.5). One benefit of quantifying genetic divergence based on haplotypes (using
φST) instead of SNP, microsatellite or AFLP alleles (using FST) is that multiple selective
sweeps should result in increased genetic distances among haplotypes in different populations,
making them easier to detect than genes that experienced a single selective sweep, as was
used for the simulations.
Despite these generally promising results from the simulations, many selected genes were
not classified as outliers using the method and constitute false negatives. There are several
synergistic reasons that genes affected by selection might not be classified as outliers, which
include inherent limitations in outlier-based tests for selection and difficulties detecting
selection (which does not always leave a clear signal), as well as specific details of the simulations
(Nielsen 2005; Kelley et al. 2006). First, as pointed out previously, outlier analyses will
only detect genes that stand out from the genome-wide distribution and thus might not
detect genes experiencing weak selection or be applicable when much of the genome is under
selection (Michel et al. 2010). However, this particular issue is more likely to affect
empirical data than the simulated data sets. The data we simulated experienced selective sweeps
with arbitrary linkage between the affected genetic region and the gene under selection. This
means that selection could have favored the same or different haplotypes in each population.
Thus, a clear molecular signature was not always left by the simulated selective sweeps. We
simulated selection in this manner for computational efficiency and because it accurately
reflects the effect of a recent and complete selective sweep. Specifically, we believe that this
is a realistic model for selection as researchers will often detect selection based on genetic
regions linked to the gene under selection rather than the actual gene under selection and
patterns of linkage might vary among populations. Nonetheless, when evidence of selection
comes directly from the target of selection or there is tight linkage with consistent patterns of
linkage disequilibrium between the selected gene and a sequenced genetic region, a stronger
signal of selection should be evident and selection should be easier to detect.
Evidence of selection in human populations A diverse array of studies (Akey et al.
2002; Nielsen et al. 2007; Barreiro et al. 2008; Nielsen et al. 2009) has found that
natural selection has played an important role in shaping functional genetic variation in
humans. Analyses based on our model for quantifying the distribution of genetic variation
among populations similarly find substantial evidence for the action of natural selection.
We classified approximately 10 times more genes as candidates for positive or divergent
selection (high φST outlier loci) than as candidates for balancing selection (low φST outlier
loci). However, this does not mean that divergent and positive selection more commonly
affect human genetic variation than balancing selection. Evidence from other studies suggests
that weak negative selection is prevalent in the human genome and balancing selection is also
fairly common (Bustamante et al. 2005; Andres et al. 2009). Instead this difference in the
prevalence of different forms of selection might indicate that there is more variation in the
strength of positive selection, resulting in a greater number of extreme genetic regions, than
there is in the strength of negative or balancing selection, with many genetic regions weakly
affected by these factors. However, this cannot be explicitly addressed by our study. Finally,
previous simulation studies suggest that it is difficult to detect balancing selection using
outlier analyses (Beaumont and Balding 2004; Guo et al. 2009), which might reflect the
different molecular signals left by divergent and balancing selection or the bounded nature
of FST and φST.
Specific genes identified as outliers by our analysis include several genes that have been
repeatedly implicated as targets of positive selection in humans, such as POLG2, CYP3A5,
and SLC24A5. Nonetheless, we also detected novel candidate genes for selection in humans
(e.g., FOXA2) and failed to detect selection on genes expected to have experienced strong
selection based on previous studies, such as the LCT (lactase) gene (Bersaglieri et al.
2004; Nielsen et al. 2007; Tishkoff et al. 2007). A lack of complete concordance with
earlier studies is not surprising as a general lack of concordance among studies of selection
in humans has been noted (Nielsen et al. 2007). This lack of concordance likely reflects
the sensitivity of different methods to different signatures of selection, which affects whether
methods are more likely to detect recent, ongoing, or more ancient selective sweeps, as well
as the specific nucleotides and populations analyzed. For example, the lack of evidence from
the worldwide human data set for selection on LCT appears to reflect the SNPs included in
this study. Previous studies of human genetic variation have detected high values of FST at
specific SNPs in the LCT gene (e.g., 0.53 for SNPs rs4988235 and rs182549), but have also
shown that FST varies across this gene (Bersaglieri et al. 2004). Previous estimates of
FST for the two genic SNPs in LCT included in our analysis that were also investigated by
Bersaglieri et al. (2004) were 0 (rs2874874) and 0.17 (rs2322659), which do not stand out
markedly from background levels of differentiation.
Conclusions Technological advances in DNA sequencing are ushering in a new era of
population genomics. Whereas researchers previously were constrained to various, relatively
low throughput molecular markers for genotyping, it is now possible to rapidly generate very
large volumes of DNA sequence data for any group of organisms (Mardis 2008; Gilad et al.
2009). This new capability will allow researchers to address long-standing and fundamental
questions in evolutionary biology, which were once considered nearly intractable because
of limited genetic data (Mardis 2008). However, most published NGS studies have been
primarily descriptive (e.g., transcriptome characterization; Vera et al. 2008; Parchman
et al. 2010) or have been forced to discard valuable data because of analytical limitations
(Hohenlohe et al. 2010) and have not taken full advantage of the potential of the sequence
data. To do so, researchers need robust and accessible methods and models that can be
applied to NGS data to test evolutionary hypotheses. The model presented here helps fill this
gap, as it allows us to quantify heterogeneous genomic divergence among populations and
identify genetic regions affected by selection. The Bayesian analysis of molecular variance
illustrates the potential of combining appropriate models and NGS to address important
questions in evolutionary biology and genetics, and is a critical step toward utilizing the
growing abundance of sequence data for population genomics.
ACKNOWLEDGEMENTS
We would like to thank J. Fordyce, M. Forister, C. Nice, P. Nosil, T. Parchman, and R.
Safran for discussion and comments on previous drafts of this manuscript. Comments from L.
Excoffier and two anonymous reviewers helped to improve the manuscript. We are indebted
to R. Williamson for his contributions to the development of the bamova software. This
work was supported by NSF DBI award 0701757 to CAB.
LITERATURE CITED
Akey, J. M., G. Zhang, K. Zhang, L. Jin, and M. D. Shriver, 2002 Interrogating
a high-density SNP map for signatures of natural selection. Genome Research 12:
1805–1814.
Andolfatto, P., J. D. Wall, and M. Kreitman, 1999 Unusual haplotype structure
at the proximal breakpoint of in(2L)t in a natural population of Drosophila melanogaster.
Genetics 153: 1297–1311.
Andres, A. M., M. J. Hubisz, A. Indap, D. G. Torgerson, J. D. Degenhardt,
A. R. Boyko, R. N. Gutenkunst, T. J. White, E. D. Green, C. D. Bustamante,
A. G. Clark, and R. Nielsen, 2009 Targets of balancing selection in the human
genome. Molecular Biology and Evolution 26: 2755–2764.
Bamshad, M. and S. P. Wooding, 2003 Signatures of natural selection in the human
genome. Nature Reviews Genetics 4: 99–111A.
Barreiro, L. B., G. Laval, H. Quach, E. Patin, and L. Quintana-Murci, 2008
Natural selection has driven population differentiation in modern humans. Nature Genetics
40: 340–345.
Beaumont, M. A., 2005 Adaptation and speciation: what can FST tell us? Trends in
Ecology & Evolution 20: 435–440.
Beaumont, M. A. and D. J. Balding, 2004 Identifying adaptive genetic divergence
among populations from genome scans. Molecular Ecology 13: 969–980.
Beaumont, M. A. and R. A. Nichols, 1996 Evaluating loci for use in the genetic
analysis of population structure. Proceedings of the Royal Society B-Biological Sciences
263: 1619–1626.
Bersaglieri, T., P. C. Sabeti, N. Patterson, T. Vanderploeg, S. F. Schaffner,
J. Drake, M. Rhodes, D. E. Reich, and J. N. Hirschhorn, 2004 Genetic signatures
of strong recent positive selection at the lactase gene. American Journal of Human Genetics
74: 1111–1120.
Bustamante, C. D., A. Fledel-Alon, S. Williamson, R. Nielsen, M. T. Hubisz,
S. Glanowski, D. Tanenbaum, T. White, J. Sninsky, R. Hernandez,
D. Civello, M. Adams, M. Cargill, and A. Clark, 2005 Natural selection on
protein-coding genes in the human genome. Nature 437: 1153–1157.
Carlson, C. S., D. J. Thomas, M. A. Eberle, J. E. Swanson, R. J. Livingston,
M. Rieder, and D. Nickerson, 2005 Genomic regions exhibiting positive selection
identified from dense genotype data. Genome Research 15: 1553–1565.
Chen, H., N. Patterson, and D. Reich, 2010 Population differentiation as a test for
selective sweeps. Genome Research 20: 393–402.
Craig, D. W., J. V. Pearson, S. Szelinger, A. Sekar, M. Redman, J. J. Corneveaux,
T. L. Pawlowski, T. Laub, G. Nunn, D. A. Stephan, N. Homer, and
M. J. Huentelman, 2008 Identification of genetic variants using bar-coded multiplexed
sequencing. Nature Methods 5: 887–893.
Easton, R. M., H. Cho, K. Roovers, D. W. Shineman, M. Mizrahi, M. Forman,
V. Lee, M. Szabolcs, R. de Jong, T. Oltersdorf, T. Ludwig, A. Efstratiadis,
and M. Birnbaum, 2005 Role for Akt3/Protein kinase B gamma in attainment of normal
brain size. Molecular and Cellular Biology 25: 1869–1878.
Excoffier, L., T. Hofer, and M. Foll, 2009 Detecting loci under selection in a
hierarchically structured population. Heredity 103: 285–298.
Excoffier, L., P. E. Smouse, and J. M. Quattro, 1992 Analysis of molecular variance
inferred from metric distances among DNA haplotypes: application to human
mitochondrial DNA restriction data. Genetics 131: 479–491.
Flint, J., J. Bond, D. C. Rees, A. J. Boyce, J. M. Roberts-Thomson, L. Excoffier,
J. Clegg, M. Beaumont, R. Nichols, and R. Harding, 1999 Minisatellite
mutational processes reduce FST estimates. Human Genetics 105: 567–576.
Foll, M. and O. Gaggiotti, 2008 A genome-scan method to identify selected loci
appropriate for both dominant and codominant markers: A Bayesian perspective. Genetics
180: 977–993.
Futschik, A. and C. Schlotterer, 2010 Massively parallel sequencing of pooled DNA samples–the next generation of molecular markers. Genetics p. genetics.110.114397.
Gilad, Y., J. K. Pritchard, and K. Thornton, 2009 Characterizing natural variation using next-generation sequencing technologies. Trends in Genetics 25: 463–471.
Gillespie, J. H., 2000 Genetic drift in an infinite population: The pseudohitchhiking model. Genetics 155: 909–919.
Gompert, Z., M. L. Forister, J. A. Fordyce, C. C. Nice, R. Williamson, and C. A. Buerkle, 2010 Bayesian analysis of molecular variance in pyrosequences quantifies population genetic structure across the genome of Lycaeides butterflies. Molecular Ecology 19: 2455–2473.
Guo, F., D. K. Dey, and K. E. Holsinger, 2009 A Bayesian hierarchical model for analysis of single-nucleotide polymorphisms diversity in multilocus, multipopulation samples. Journal of the American Statistical Association 104: 142–154.
Hamblin, M. T. and A. Di Rienzo, 2000 Detection of the signature of natural selection in humans: Evidence from the Duffy blood group locus. The American Journal of Human Genetics 66: 1669–1679.
Hamblin, M. T., E. E. Thompson, and A. Di Rienzo, 2002 Complex signatures of natural selection at the Duffy blood group locus. American Journal of Human Genetics 70: 369–383.
Hamm, D., B. S. Mautz, M. F. Wolfner, C. F. Aquadro, and W. J. Swanson, 2007 Evidence of amino acid diversity-enhancing selection within humans and among primates at the candidate sperm-receptor gene PKDREJ. American Journal of Human Genetics 81: 44–52.
Harris, M. I., 2001 Racial and ethnic differences in health care access and health outcomes for adults with type 2 diabetes. Diabetes Care 24: 454–459.
Hohenlohe, P. A., S. Bassham, P. D. Etter, N. Stiffler, E. A. Johnson, and W. A. Cresko, 2010 Population genomics of parallel adaptation in threespine stickleback using sequenced RAD tags. PLoS Genetics 6: e1000862.
Holsinger, K. E. and B. S. Weir, 2009 Fundamental concepts in genetics: genetics in geographically structured populations: defining, estimating and interpreting FST. Nature Reviews Genetics 10: 639–650.
Hudson, R. R., 2002 Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18: 337–338.
Jakobsson, M., S. W. Scholz, P. Scheet, J. R. Gibbs, J. M. VanLiere, H.-C. Fung, Z. A. Szpiech, J. H. Degnan, K. Wang, R. Guerreiro, J. M. Bras, J. C. Schymick, D. G. Hernandez, B. J. Traynor, J. Simon-Sanchez, M. Matarin, A. Britton, J. van de Leemput, I. Rafferty, M. Bucan, H. M. Cann, J. A. Hardy, N. A. Rosenberg, and A. B. Singleton, 2008 Genotype, haplotype and copy-number variation in worldwide human populations. Nature 451: 998–1003.
Kelley, J. L., J. Madeoy, J. C. Calhoun, W. Swanson, and J. M. Akey, 2006 Genomic signatures of positive selection in humans and the limits of outlier approaches. Genome Research 16: 980–989.
Kottgen, A., N. L. Glazer, A. Dehghan, S.-J. Hwang, R. Katz, M. Li, Q. Yang, V. Gudnason, L. J. Launer, T. B. Harris, A. V. Smith, D. E. Arking, B. C. Astor, E. Boerwinkle, G. B. Ehret, I. Ruczinski, R. B. Scharpf, Y.-D. I. Chen, I. H. de Boer, T. Haritunians, T. Lumley, M. Sarnak, D. Siscovick, E. J. Benjamin, D. Levy, A. Upadhyay, Y. S. Aulchenko, A. Hofman, F. Rivadeneira, A. G. Uitterlinden, C. M. van Duijn, D. I. Chasman, G. Pare, P. M. Ridker, W. H. L. Kao, J. C. Witteman, J. Coresh, M. G. Shlipak, and C. S. Fox, 2009 Multiple loci associated with indices of renal function and chronic kidney disease. Nature Genetics 41: 712–717.
Kronholm, I., O. Loudet, and J. de Meaux, 2010 Influence of mutation rate on estimators of genetic differentiation - lessons from Arabidopsis thaliana. BMC Genetics 11.
Kullback, S. and R. A. Leibler, 1951 On information and sufficiency. Annals of Mathematical Statistics 22: 79–86.
Lavery, G. G., C. L. McTernan, S. C. Bain, T. A. Chowdhury, M. Hewison, and P. M. Stewart, 2002 Association studies between the HSD11B2 gene (encoding human 11 beta-hydroxysteroid dehydrogenase type 2), type 1 diabetes mellitus and diabetic nephropathy. European Journal of Endocrinology 146: 553–558.
Lewontin, R. C. and J. Krakauer, 1973 Distribution of gene frequency as a test of theory of selective neutrality of polymorphisms. Genetics 74: 175–195.
Lohmueller, K. E., A. R. Indap, S. Schmidt, A. R. Boyko, R. D. Hernandez, M. J. Hubisz, J. J. Sninsky, T. J. White, S. R. Sunyaev, R. Nielsen, A. G. Clark, and C. D. Bustamante, 2008 Proportionally more deleterious genetic variation in European than in African populations. Nature 451: 994–997.
Lynch, M., 2009 Estimation of allele frequencies from high-coverage genome-sequencing projects. Genetics 182: 295–301.
Mardis, E. R., 2008 The impact of next-generation sequencing technology on genetics. Trends in Genetics 24: 133–141.
Mardis, E. R., 2008 Next-generation DNA sequencing methods. Annual Review of Genomics and Human Genetics 9: 387–402.
Maynard-Smith, J. and J. Haigh, 1974 Hitch-hiking effect of a favorable gene. Genetical Research 23: 23–35.
Meyer, M. and M. Kircher, 2010 Illumina sequencing library preparation for highly multiplexed target capture and sequencing. Cold Spring Harbor Protocols 2010: pdb.prot5448–.
Michel, A. P., S. Sim, T. H. Q. Powell, M. S. Taylor, P. Nosil, and J. L. Feder, 2010 Widespread genomic divergence during sympatric speciation. Proceedings of National Academy of Sciences 107.
Nielsen, R., 2005 Molecular signatures of natural selection. Annual Review of Genetics 39: 197–218.
Nielsen, R., I. Hellmann, M. Hubisz, C. Bustamante, and A. G. Clark, 2007 Recent and ongoing selection in the human genome. Nature Reviews Genetics 8: 857–868.
Nielsen, R., M. J. Hubisz, I. Hellmann, D. Torgerson, A. M. Andres, A. Albrechtsen, R. Gutenkunst, M. D. Adams, M. Cargill, A. Boyko, A. Indap, C. D. Bustamante, and A. G. Clark, 2009 Darwinian and demographic forces affecting human protein coding genes. Genome Research 19: 838–849.
Novembre, J., T. Johnson, K. Bryc, Z. Kutalik, A. R. Boyko, A. Auton, A. Indap, K. S. King, S. Bergmann, M. R. Nelson, M. Stephens, and C. D. Bustamante, 2008 Genes mirror geography within Europe. Nature 456: 98–101.
Parchman, T. L., K. S. Geist, J. A. Grahnen, C. W. Benkman, and C. A. Buerkle, 2010 Transcriptome sequencing in an ecologically important tree species: assembly, annotation, and marker discovery. BMC Genomics 11: 180.
Puigserver, P. and J. T. Rodgers, 2006 FOXA2, a novel transcriptional regulator of insulin sensitivity. Nature Medicine 12: 38–39.
Quinkler, M. and P. M. Stewart, 2003 Hypertension and the cortisol–cortisone shuttle. Journal of Clinical Endocrinology and Metabolism 88: 2384–2392.
Sabeti, P. C., D. E. Reich, J. M. Higgins, H. Levine, D. J. Richter, S. F. Schaffner, S. B. Gabriel, J. V. Platko, N. J. Patterson, G. J. McDonald, H. Ackerman, S. Campbell, D. Altshuler, R. Cooper, D. Kwiatkowski, R. Ward, and E. Lander, 2002 Detecting recent positive selection in the human genome from haplotype structure. Nature 419: 832–837.
Slatkin, M., 1987 Gene flow and the geographic structure of natural-populations. Science 236: 787–792.
Slatkin, M. and T. Wiehe, 1998 Genetic hitch-hiking in a subdivided population. Genetical Research 71: 155–160.
Stahl, J. M., A. Sharma, M. Cheung, M. Zimmerman, J. Q. Cheng, M. Bosenberg, M. Kester, L. Sandirasegarane, and G. Robertson, 2004 Deregulated Akt3 activity promotes development of malignant melanoma. Cancer Research 64: 7002–7010.
Stajich, J. E. and M. H. Hahn, 2005 Disentangling the effects of demography and selection in human history. Molecular Biology and Evolution 22: 63–73.
Stephens, M., N. J. Smith, and P. Donnelly, 2001 A new statistical method for haplotype reconstruction from population data. American Journal of Human Genetics 68: 978–989.
Tajima, F., 1989 Statistical-method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123: 585–595.
Tishkoff, S. A., F. A. Reed, F. R. Friedlaender, C. Ehret, A. Ranciaro, A. Froment, J. B. Hirbo, A. A. Awomoyi, J.-M. Bodo, O. Doumbo, M. Ibrahim, A. T. Juma, M. J. Kotze, G. Lema, J. H. Moore, H. Mortensen, T. B. Nyambo, S. A. Omar, K. Powell, G. S. Pretorius, M. W. Smith, M. A. Thera, C. Wambebe, J. L. Weber, and S. M. Williams, 2009 The genetic structure and history of Africans and African Americans. Science 324: 1035–1044.
Tishkoff, S. A., F. A. Reed, A. Ranciaro, B. F. Voight, C. C. Babbitt, J. S. Silverman, K. Powell, H. M. Mortensen, J. B. Hirbo, M. Osman, M. Ibrahim, S. A. Omar, G. Lema, T. B. Nyambo, J. Ghori, S. Bumpstead, J. K. Pritchard, G. A. Wray, and P. Deloukas, 2007 Convergent adaptation of human lactase persistence in Africa and Europe. Nature Genetics 39: 31–40.
Tishkoff, S. A. and B. C. Verrelli, 2003 Patterns of human genetic diversity: implications for human evolutionary history and disease. Annual Review of Genomics and Human Genetics 4: 293–340.
Vera, J. C., C. W. Wheat, H. W. Fescemyer, M. J. Frilander, D. L. Crawford, I. Hanski, and J. H. Marden, 2008 Rapid transcriptome characterization for a nonmodel organism using 454 pyrosequencing. Molecular Ecology 17: 1636–1647.
Voight, B. F., S. Kudaravalli, X. Q. Wen, and J. K. Pritchard, 2006 A map of recent positive selection in the human genome. PLoS Biology 4: 446–458.
Wang, H., A. Zhao, L. Chen, X. Zhong, J. Liao, M. Gao, M. Cai, D.-H. Lee, J. Li, D. Chowdhury, Y.-g. Yang, G. P. Pfeifer, Y. Yen, and X. Xu, 2009 Human RIF1 encodes an anti-apoptotic factor required for DNA repair. Carcinogenesis 30: 1314–1319.
Weir, B. S., L. R. Cardon, A. D. Anderson, D. Nielsen, and W. Hill, 2005 Measures of human population structure show heterogeneity among genomic regions. Genome Research 15: 1468–1476.
Wolfrum, C., E. Asilmaz, E. Luca, J. Friedman, and M. Stoffel, 2004 FOXA2 regulates lipid metabolism and ketogenesis in the liver during fasting and in diabetes. Nature 432: 1027–1032.
Wright, S., 1951 The genetical structure of populations. Annals of Eugenics 15: 323–354.
Wu, D.-D. and Y.-P. Zhang, 2010 Positive selection drives population differentiation in the skeletal genes in modern humans. Human Molecular Genetics 19.
Xing, C., J. C. Cohen, and E. Boerwinkle, 2010 A weighted false discovery rate control procedure reveals alleles at FOXA2 that influence fasting glucose levels. American Journal of Human Genetics 86: 440–446.
[Figure 1 plot: x-axis: φST; y-axis: Density; legend: HSD11B2, FOXA2, POLG2, neutral, genome]
Figure 1.—Locus-specific φST estimates for Africans and Europeans (SeattleSNPs data set). A point estimate of the genome-level φST distribution (based on the median from the posterior probability distributions of αST and βST) is denoted with a solid black line. The posterior probability distributions for the three outlier loci (colored lines) and 50 additional, randomly chosen genetic regions (gray lines) are also shown. These results are based on the first five SNPs in each gene; additional results are shown in Figure S3.
[Figure 2 plot: eight panels (HSD11B2, FOXA2, POLG2, FUT2, CPSF4, IL1F6, IKBKB, neutral); x-axis: Physical location in gene (bp); y-axis: FST]
Figure 2.—Point estimates of FST along specific genes (SeattleSNPs data set). Point estimates of FST are shown at each SNP for seven genes identified as outliers in the comparison of human populations with African and European ancestry as well as three randomly chosen neutral genes.
[Figure 3 plot: two panels, A. Chromosome ΦST distribution and B. Mean chromosome ΦST; x-axis: ΦST; y-axis: Density]
Figure 3.—Chromosome-level estimates of φST for the large sample of human genetic diversity in 33 populations (data from Jakobsson et al. 2008). Point estimates of each chromosome-level φST distribution (based on the median from the posterior probability distributions of αST and βST) are denoted with solid black lines (autosomes) or a dashed orange line (X-chromosome; A). Posterior probability distributions for the mean chromosome-level φST for each chromosome (B).
[Figure 4 plot: 23 panels, one per chromosome (1–22 and X); x-axis: Chromosomal position (Mb); y-axis: ΦST]
Figure 4.—Estimates of φST across the genome of worldwide human populations (data from Jakobsson et al. 2008). Each panel depicts a different human chromosome and is labeled accordingly. The solid black line denotes the point estimate (median of the posterior distribution) of the mean chromosome-level φST for a chromosome. Estimates of φST for individual genes are shown in grey for non-outlier genes and light (at a = 0.5) or dark blue (at a = 0.95) for outlier genes. For each gene the closed circle gives the median from the posterior distribution of φST for that locus and the bars denote the 95% ETPI.
TABLE 1
Proportion of outlier loci in simulated neutral data

                                         m = 0                           m = 2
                                         a = 0.5         a = 0.95        a = 0.5         a = 0.95
Model                    No. loci  τ     low     high    low     high    low     high    low     high
Known haplotypes model   25        0.25  0.012   0.028   0.000   0.000   0.016   0.020   0.000   0.000
                                   0.50  0.028   0.028   0.000   0.000   0.016   0.012   0.000   0.000
                                   1.00  0.024   0.004   0.004   0.000   0.012   0.016   0.000   0.000
                         500       0.25  0.0132  0.0212  0.0004  0.0036  0.0108  0.0170  0.0000  0.0012
                                   0.50  0.0240  0.0164  0.0028  0.0004  0.0130  0.0158  0.0000  0.0012
                                   1.00  0.0320  0.0032  0.0080  0.0000  0.0136  0.0188  0.0002  0.0024
NGS–individual model     25        0.25  0.012   0.012   0.000   0.000   0.020   0.016   0.000   0.000
                                   0.50  0.004   0.024   0.000   0.000   0.020   0.012   0.000   0.000
                                   1.00  0.020   0.024   0.000   0.000   0.040   0.008   0.000   0.000
                         500       0.25  0.0136  0.0242  0.0005  0.0042  0.0098  0.0156  0.0004  0.0020
                                   0.50  0.0220  0.0202  0.0022  0.0042  0.0112  0.0182  0.0004  0.0020
                                   1.00  0.0280  0.0074  0.0068  0.0006  0.0140  0.0176  0.0008  0.0036
NGS–population model     25        0.25  0.016   0.020   0.000   0.000   0.008   0.012   0.000   0.000
                                   0.50  0.024   0.040   0.000   0.000   0.008   0.012   0.000   0.000
                                   1.00  0.032   0.004   0.000   0.000   0.004   0.008   0.000   0.004
                         500       0.25  0.0112  0.0202  0.0002  0.0024  0.0054  0.0166  0.0000  0.0018
                                   0.50  0.0190  0.0164  0.0006  0.0012  0.0100  0.0174  0.0000  0.0012
                                   1.00  0.0340  0.0064  0.0050  0.0000  0.0088  0.0180  0.0000  0.0022

Mean proportion of loci over 10 replicates identified as outliers in the absence of selection for each of the three different likelihoods (known haplotypes model, NGS–individual model, and NGS–population model). Three times since divergence (τ) and two levels of migration (m) were simulated. Outlier loci were identified as low or high outliers relative to the genome-wide distribution and based on two different quantiles (a = 0.5 or 0.95) of their posterior distribution.
TABLE 2
Summary of worldwide human HapMap data and results

                                             High φST            Low φST
Chromosome  Chrom. φST             No. loci  a = 0.5   a = 0.95  a = 0.5   a = 0.95
1           0.098 (0.095–0.103)    1402      59        28        8         1
2           0.101 (0.097–0.106)    956       32        11        5         1
3           0.099 (0.095–0.104)    723       29        13        2         0
4           0.100 (0.094–0.106)    563       22        12        1         0
5           0.095 (0.089–0.100)    565       20        10        3         2
6           0.091 (0.087–0.095)    710       23        8         6         0
7           0.098 (0.092–0.103)    679       25        14        2         0
8           0.092 (0.087–0.098)    438       24        6         2         0
9           0.095 (0.089–0.099)    569       22        9         2         0
10          0.092 (0.087–0.098)    526       20        6         3         2
11          0.096 (0.092–0.101)    686       24        7         1         0
12          0.097 (0.092–0.102)    683       36        14        2         0
13          0.090 (0.082–0.099)    243       8         7         1         0
14          0.100 (0.093–0.107)    388       16        8         2         0
15          0.109 (0.101–0.117)    380       20        10        0         0
16          0.113 (0.104–0.120)    426       21        9         1         0
17          0.107 (0.101–0.114)    603       27        12        1         0
18          0.092 (0.084–0.100)    225       8         4         3         0
19          0.099 (0.094–0.104)    729       34        10        4         0
20          0.101 (0.093–0.109)    367       15        8         0         0
21          0.083 (0.075–0.091)    158       6         3         1         0
22          0.102 (0.093–0.111)    283       11        6         1         0
X           0.139 (0.128–0.149)    347       16        7         0         0

Summary of data and results for each chromosome, including chromosome-level φST (median and 95% ETPI), the number of loci, and the number of loci classified as outliers (data from Jakobsson et al. 2008).
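
The table and figure summaries above report each φST estimate as a posterior median with a 95% ETPI. As a minimal sketch of how such a summary could be computed from posterior draws, the Python function below takes a list of sampled values and returns the median together with the 2.5% and 97.5% quantiles. It assumes that ETPI denotes an equal-tail probability interval; the function name `median_and_etpi`, the linear-interpolation quantile rule, and the simulated Beta draws are illustrative assumptions and are not taken from the authors' software.

```python
import random
import statistics


def median_and_etpi(samples, level=0.95):
    """Return (median, lower, upper) for an equal-tail interval.

    Hypothetical helper: `samples` stands in for MCMC draws of phi_ST.
    """
    if not 0.0 < level < 1.0:
        raise ValueError("level must lie strictly between 0 and 1")
    ordered = sorted(samples)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two samples")

    def quantile(p):
        # Linear interpolation between the two nearest order statistics.
        pos = p * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        return ordered[lo] * (1.0 - frac) + ordered[hi] * frac

    # Equal tails: the same probability mass is cut from each side.
    tail = (1.0 - level) / 2.0
    return statistics.median(ordered), quantile(tail), quantile(1.0 - tail)


if __name__ == "__main__":
    # Simulated draws used only as a stand-in for real posterior samples.
    rng = random.Random(1)
    draws = [rng.betavariate(2.0, 18.0) for _ in range(5000)]
    med, lower, upper = median_and_etpi(draws)
    print(f"median = {med:.3f}, 95% ETPI = ({lower:.3f}, {upper:.3f})")
```

An equal-tail interval is simple to compute from sorted draws, but for a strongly skewed posterior it need not coincide with the narrowest interval that contains the same probability.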