44
Z. Gompert and C. A. Buerkle 1 A hierarchical Bayesian model for next-generation population genomics Zachariah Gompert and C. Alex Buerkle Department of Botany and Program in Ecology, University of Wyoming, Laramie, Wyoming 82071 Genetics: Published Articles Ahead of Print, published on January 12, 2011 as 10.1534/genetics.110.124693 Copyright 2011.

A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 1

A hierarchical Bayesian model for next-generationpopulation genomics

Zachariah Gompert and C. Alex Buerkle

Department of Botany and Program in Ecology, University of Wyoming, Laramie,

Wyoming 82071

Genetics: Published Articles Ahead of Print, published on January 12, 2011 as 10.1534/genetics.110.124693

Copyright 2011.

Page 2: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 2

Running title: Bayesian population genomics

Keywords: Population genomics, selection, outlier analysis, Bayesian modeling, next-generation sequencing

Corresponding author: Zachariah GompertDepartment of Botany, 31651000 E. University Ave.University of WyomingLaramie, WY 82071Phone: (307) 760-3858Fax: (307) 766-2851Email: [email protected]

Page 3: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 3

ABSTRACT1

The demography of populations and natural selection shape genetic variation across the2

genome and understanding the genomic consequences of these evolutionary processes is a3

fundamental aim of population genetics. We have developed a hierarchical Bayesian model4

to quantify genome-wide population structure and identify candidate genetic regions affected5

by selection. This model improves on existing methods by accounting for stochastic sam-6

pling of sequences inherent in next-generation sequencing (with pooled or indexed individual7

samples) and by incorporating genetic distances among haplotypes in measures of genetic8

differentiation. Using simulations we demonstrate that this model has a low false positive9

rate for classifying neutral genetic regions as selected genes (i.e., φST outliers), but can de-10

tect recent selective sweeps, particularly when genetic regions in multiple populations are11

affected by selection. Nonetheless, selection affecting just a single population was difficult12

to detect and resulted in a high false negative rate under certain conditions. We applied the13

Bayesian model to two large sets of human population genetic data. We found evidence of14

widespread positive and balancing selection among worldwide human populations, including15

many genetic regions previously thought to be under selection. Additionally, we identified16

novel candidate genes for selection, several of which have been linked to human diseases.17

This model will facilitate the population genetic analysis of a wide range of organisms based18

on next-generation sequence data.19

Page 4: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 4

The distribution of genetic variants among populations is a fundamental attribute of evo-20

lutionary lineages. Population genetic diversity shapes contemporary functional diversity and21

future evolutionary dynamics, and provides a record of past evolutionary and demographic22

processes. Methods to quantify genetic diversity among populations have a long history23

(Wright 1951; Holsinger and Weir 2009) and provide a basis to distinguish neutral24

and adaptive evolutionary histories, to reconstruct migration histories, and to identify genes25

underlying diseases and other significant traits (Bamshad and Wooding 2003; Tishkoff26

and Verrelli 2003; Voight et al. 2006; Barreiro et al. 2008; Lohmueller et al. 2008;27

Novembre et al. 2008; Tishkoff et al. 2009; Hohenlohe et al. 2010). For example,28

population genetic analyses in humans have resolved a history of natural selection and inde-29

pendent origins of lactase persistence in adults in Europe and east Africa (Tishkoff et al.30

2007). Similarly, allelic diversity at the Duffy Blood group locus is consistent with the ac-31

tion of natural selection, and one allele that has gone to fixation in sub-Saharan populations32

confers resistance to malaria (Hamblin and Di Rienzo 2000; Hamblin et al. 2002). These33

studies of natural selection, and many others involving a diversity of organisms, utilize con-34

trasts between genetic differentiation at putatively selected loci and the remainder of the35

genome. Genomic diversity within and among populations is determined primarily by muta-36

tion and neutral demographic factors, such as effective population size and rates of migration37

among populations (Wright 1951; Slatkin 1987). Specifically, these demographic factors38

determine the rates of genetic drift and population differentiation across the genome. In39

contrast, selection affects variation in specific regions of the genome, including the direct40

targets of selection and to a lesser extent genetic regions in linkage disequilibrium with these41

targets (Maynard-Smith and Haigh 1974; Slatkin and Wiehe 1998; Gillespie 2000;42

Stajich and Hahn 2005). Thus, the genomic consequences of selection are superimposed43

on the genomic outcomes of neutral genetic differentiation and the two must be disentangled44

to identify regions of the genome affected by selection.45

A variety of models and methods have been proposed to identify genetic regions that have46

Page 5: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 5

been affected by selection (Nielsen 2005), and these population genetic methods can be47

divided into within-population and among-population analyses. The former include widely48

used neutrality tests based on the site frequency spectrum for a single locus, such as Tajima’s49

D test (Tajima 1989), as well as recently developed tests based on the presence of extended50

blocks of linkage disequilibrium or reduced haplotype diversity (Andolfatto et al. 1999;51

Sabeti et al. 2002; Voight et al. 2006). Among-population tests for selection require52

multi-locus data sets and identify non-neutral or outlier loci by contrasting patterns of pop-53

ulation divergence among genetic regions (Lewontin and Krakauer 1973; Beaumont54

and Nichols 1996; Akey et al. 2002; Beaumont and Balding 2004; Foll and Gag-55

giotti 2008; Guo et al. 2009; Chen et al. 2010). The most commonly employed of these56

methods is the FST outlier analysis developed by Beaumont and Nichols (1996). This57

test contrasts FST for individual loci with an expected null distribution of FST based on a58

neutral, infinite island, coalescent model (Beaumont and Nichols 1996). Loci with very59

high levels of among population differentiation (i.e., high FST) are considered candidates60

for positive or divergent selection, whereas loci with exceptionally low FST are regarded as61

candidates for balancing selection. However, many FST outlier analyses can be biased by62

violations of the assumed demographic history (Flint et al. 1999; Excoffier et al. 2009).63

Alternative Bayesian approaches to obtain a null distribution of population genetic differ-64

entiation assume that FST for individual loci represent independent draws from a common,65

underlying distribution that characterizes the genome and that can be estimated directly66

from multi-locus data (Beaumont and Balding 2004; Foll and Gaggiotti 2008; Guo67

et al. 2009). These approaches are more robust to different demographic histories, but might68

have reduced power to detect selection relative to methods that model the appropriate de-69

mographic history when it is known (Guo et al. 2009).70

Herein we propose a hierarchical Bayesian model for estimating genomic population dif-71

ferentiation and detecting selection. This method improves on current methods for detecting72

selection in several important ways. First, unlike previous Bayesian outlier detection models73

Page 6: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 6

(e.g., Beaumont and Balding 2004; Foll and Gaggiotti 2008; Guo et al. 2009), the74

likelihood component of the model captures the stochastic sampling processes inherent in75

next-generation sequence (NGS) data (Mardis 2008). Specifically, NGS technologies result76

in uneven coverage among individuals, genetic regions and homologous gene copies, including77

missing data for many individuals and loci, and thus, increased uncertainty in the diploid78

sequence of individuals relative to traditional Sanger sequencing (Lynch 2009; Gompert79

et al. 2010; Hohenlohe et al. 2010). This issue is most pronounced when population-level80

indexing is used (e.g., Gompert et al. 2010), but should persist even with individual-level81

indexing and high coverage data (i.e., even with high mean coverage the sequence coverage82

for certain combinations of individuals and genetic regions will be low). Moreover, appro-83

priately modeling and accounting for this uncertainty is important and vastly preferable to84

simply ignoring it or discarding large amounts of sequence data (e.g., Hohenlohe et al.85

2010). Previous methods to detect selection based on genetic differentiation among popula-86

tions have focused solely on allele frequency differences, as captured by FST (Nielsen 2005).87

Instead the model we propose measures population genetic differentiation using Excoffier et88

al.’s φ statistics, which are DNA sequence-based measures of the proportion of molecular89

variation partitioned among groups or populations (Excoffier et al. 1992). Quantifying90

genetic differentiation using φ statistics allows us to include the genetic distances among91

sequences in the measure of differentiation, and thus take mutation rate into account, which92

should increase our ability to accurately identify targets of selection (Kronholm et al.93

2010). Finally, the model we propose provides a novel criterion for designating outlier loci.94

Although we do not believe this criterion is inherently superior to alternatives (e.g., Beau-95

mont and Nichols 1996; Foll and Gaggiotti 2008; Guo et al. 2009), we believe it96

accords well with the concept of statistical outliers and is well suited for genome scans for97

divergent selection. Specifically, the model assumes that the φ statistics for each genetic98

region are drawn from a common, genome-level distribution, and identifies outlier loci, or99

loci potentially affected by selection, based on the probability of their locus-specific φ statis-100

Page 7: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 7

tic given the genome-level probability distribution of φ. This genome-level distribution is101

equivalent to the conditional probability for the locus-specific φ statistics in the model.102

We begin this paper by fully describing the model, both verbally and mathematically.103

We then use coalescent simulations to generate data with a known history and investigate104

the performance of the model for identifying genetic regions affected by selection. To illus-105

trate its capacity to identify exceptional genes that are likely to have experienced selection,106

we use the proposed model to analyze empirical data from two studies of human popula-107

tion genetic variation. The first of these data sets includes 316 completely sequenced genes108

from 24 individuals with African ancestry and 23 individuals with European ancestry (Seat-109

tleSNPs data set). The second data set was published by Jakobsson et al. (2008) and110

includes genotype data from 525,901 SNPs from 33 human populations distributed world-111

wide. We analyzed these human data with the known haplotypes model (see Bayesian models112

for molecular variance) and found evidence of selection affecting a total of 569 genes includ-113

ing genes previously identified as targets of selection in humans (e.g., CYP3A5, SLC24A5,114

and PKDREJ) and novel genes not previously thought to be under selection (e.g., FOXA2115

and SPATA5L1). Whereas these remarkably large-scale studies do not involve NGS data,116

the analyses of empirical data identify a large set of genes for additional study and illustrate117

our analytical approach, which can be applied to a diversity of large-scale genomic data118

sets, including those that will result from the anticipated widespread adoption of NGS for119

population genomics.120

METHODS121

Bayesian models for molecular variance122

Our goal is to use molecular divergence among populations and groups to identify genetic123

regions that might have been affected by natural selection. Patterns of divergence at indi-124

vidual loci arise from differences in haplotype frequencies and the genetic distance among125

haplotypes. We assume the distances are fixed and known and we attempt to estimate the126

Page 8: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 8

haplotype frequencies from sequence data using a hierarchical Bayesian model. The model127

includes a first-level likelihood for the probability of the observed haplotype counts given128

population haplotype frequencies, a conditional prior for the haplotype frequencies for each129

locus in each population given genome-level parameters and genetic distance matrix, and130

uninformative priors for the genome-level parameters. This conditional prior defines the131

distribution of φ statistics across the genome, whereas locus-specific φ statistics are derived132

from our estimate of haplotype frequencies and the genetic distance among haplotypes. From133

this model we are able to estimate the probability of each locus-specific φ statistic given the134

estimated genome-level distribution, which serves as a metric for identifying outlier loci that135

depart from the typical level of differentiation and therefore might have been affected by136

selection.137

First-level likelihood models We developed three models for the first-level likelihood of138

the observed haplotype counts (x) given the population haplotype frequencies (p). Each139

model is applicable to a different category of DNA sequence data that was generated via140

a different process. These models are presented in order of increasing uncertainty in the141

true genotype of individuals, and thus in the population allele frequencies. The first model142

(known haplotype model) is applicable if the two haplotypes of a diploid individual are known143

without error, as is generally assumed for phased Sanger sequence data. Under this model144

the probability of the observed haplotype counts for each locus and population follows a145

multinomial distribution, such that the complete likelihood is a product of multinomial146

distributions (for all loci and populations):147

P(x|p) =∏

i

j

nij !

xij1! · · ·xijk!p

xij1

ij1 · · · pxijk

ijk , (1)

where nij is the total number of observed sequences at locus i for population j, xijk and pijk148

are the observed count and population frequency of the kth haplotype in the jth population149

Page 9: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 9

at the ith locus. The multinomial likelihood model assumes Hardy-Weinberg equilibrium for150

each locus and linkage equilibrium among loci.151

Because additional sampling error occurs when sequence reads are sampled stochastically152

from DNA templates, NGS sequence data require alternative first level-likelihood models153

with the specific model depending on the information associated with each sequence. NGS154

can incorporate individual-level indexing or barcoding (Craig et al. 2008; Hohenlohe155

et al. 2010; Meyer and Kircher 2010). With individual-level indexes a sequence can be156

assigned to a particular individual and locus, but there is uncertainty regarding whether157

diploid individuals with a single observed haplotype are heterozygous or homozygous. If we158

assume sequencing errors are negligible or have already been corrected, any individual with159

two distinct haplotypes sampled is known to be a heterozygote, whereas individuals with160

one sampled haplotype might be heterozygous for the observed haplotype and an alternative161

haplotype or homozygous for the observed haplotype. Under these circumstances we propose162

the following likelihood (NGS-individual model) for the observed haplotype counts given the163

population haplotype frequencies:164

P(x|p) =∏

i

j

l

[2pijkapijkb

][nijl!

xijkal!xijkbl!0.5xijkal0.5xijkbl] if h is two

p2ijka

+∑

k 6=ka[2pijka

pijk][0.5xijkal ] if h is one

, (2)

where the product is across all individuals (l) as well as populations and loci, h is the number165

of distinct haplotypes observed for an individual at a locus, xijkal and xijkbl are the observed166

counts of the one or two haplotype sequences for individual l, and pijkaand pijkb

are the167

frequencies of these haplotypes in population j. The complementary set of haplotypes that168

exist, but that were not observed for an individual, have counts and frequencies simply169

denoted by xijkl = 0 and pijk, respectively.170

Alternatively, if NGS sequence data consist of indexed populations of pooled individuals171

(Gompert et al. 2010), rather than indexed individuals, information will only be associated172

Page 10: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 10

with sequences at the population level. We propose a third likelihood model (NGS-population173

model) for this situation, in which the probability of the observed haplotype frequencies given174

the population frequencies is described by a multivariate Polya distribution (Gompert et al.175

2010):176

P(x|p, ν) =∏

i

j

nij !∏

k(xijk!)

Γ(∑

k νjpijk + 1)

Γ(nij +∑

k νjpijk + 1)

k

Γ(xijk + νjpijk + 1)

Γ(νjpijk + 1), (3)

where Γ is the gamma function, νj is the number of gene copies (i.e., twice the number of177

diploid individuals) sampled from population j, and other parameters are as described above.178

This likelihood model has been used previously by Gompert et al. (2010) for population-179

level NGS data. This likelihood function assumes that the frequency of each haplotype in180

the sample of individuals from a population can take on any value between zero and one,181

and thus might not be appropriate when very low numbers of individuals are sampled from182

each population.183

Model priors The Bayesian model includes a conditional prior for the population haplo-184

type frequencies p assuming the distance matrix d is fixed and known without error. The185

conditional prior with its estimated parameters provides a null distribution for identify-186

ing outliers and inferring selection (see Designating outlier loci). The conditional prior we187

choose depends on whether we are interested in genetic structure among populations, or188

genetic structure among groups of populations (e.g., geographic regions) and among popula-189

tions within groups. For genetic structure among populations, we assign the following prior190

to p:191

P(p|αST , βST ,d) =∏

i

1

B(αST , βST )

(φSTi+ 1)αST−1(1 − φSTi

)βST−1

2αST +βST−1, (4)

where B is the beta function and φSTidenotes φST for locus i calculated based on pi and192

di following Excoffier et al. (1992). This specification of the conditional prior for p does193

Page 11: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 11

not correspond to a standard probability distribution with respect to p, but is equivalent to194

assuming that the locus-specific φST are distributed Beta(α = αST , β = βST , a = −1, b = 1)195

with d fixed, where αST and βST are the shape parameters of a Beta distribution and a and196

b define the lower and upper bounds of the distribution (i.e., we use a rescaled Beta distri-197

bution). Thus, this conditional prior describes the distribution of φST across the genome,198

with a mean and standard deviation given by:199

µST =2αST

αST + βST

− 1 (5)

and200

σST = 2

αSTβST

(αST + βST )2(αST + βST + 1), (6)

respectively. We complete this model by assigning uninformative, uniform priors to αST ∼201

U(0, uα) and βST ∼ U(0, uβ). For all analyses presented in this manuscript we set uα =202

uβ = 106, yielding priors with uniform density over all parameter values with non-negligible203

support in the posterior; however, alternative values might be necessary for specific data204

sets. The above specification results in the following hierarchical Bayesian model:205

P(p, αST , βST |x,d, ν) ∝ P(x|p, ν)P(p|αST , βST ,d)P(αST )P(βST ). (7)

We provide an alternative conditional prior for the haplotype frequencies when genetic struc-206

ture among groups and populations is of interest. Specifically we assume:207

Page 12: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 12

P(p|α, β,d) =∏

i

1

B(αST , βST )

(φSTi+ 1)αST−1(1 − φSTi

)βST−1

2αST +βST−1×

1

B(αSC , βSC)

(φSCi+ 1)αSC−1(1 − φSCi

)βSC−1

2αSC+βSC−1×

1

B(αCT , βCT )

(φCTi+ 1)αCT −1(1 − φCTi

)βCT−1

2αCT +βCT−1,

(8)

where φSCiand φCTi

denote φSC (molecular variation among populations within groups of208

populations) and φCT (molecular variation among groups of populations relative to total hap-209

lotypic diversity) for locus i calculated based on pi and di following Excoffier et al. (1992),210

and other parameters are as described above. This specification estimates genome-level dis-211

tributions for each φ statistic independently. Because locus-specific φST are fully specified212

given locus-specific φSC and φCT , an alternative and equally valid modeling approach would213

be to treat the mean genome-level φST as a derived parameter and specify a conditional214

prior based only on φSC and φCT . This alternative approach would account for the lack of215

independence among φSC , φCT , and φST and would be closer to existing decompositions of216

φ statistics. However, this alternative model prior would not provide posterior estimates of217

αST or βST , which are necessary to specify the posterior distribution for the genome-level218

distribution of φST , as opposed to the posterior distribution of the mean genome-level φST .219

The former is necessary to identity outlier loci with respect to φST , which is central to our220

model. Similar to the model for genetic differentiation among populations only, equation (8)221

does not correspond to a standard probability distribution with respect to p, but is equiva-222

lent to assuming that each of the locus-specific φ statistics follows its own Beta distribution223

with d fixed. As with the model for population structure, it is possible to derive the mean224

and standard deviation of the genome-level beta distribution using equations (5) and (6) and225

substituting the appropriate α and β parameters. Finally, as before we assign uninformative,226

uniform priors to all α and β parameters. This specification results in the following Bayesian227

model:228

Page 13: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 13

P(p, α, β|x,d, ν) ∝ P(x|p, ν)P(p|α, β,d)P(αST )P(βST )P(αSC)P(βSC)P(αCT )P(βCT ). (9)

Designating outlier loci This Bayesian model gives rise to a framework for identifying229

outlier loci, or genetic regions with an unusual proportion of molecular variation partitioned230

among populations or groups. Such genetic regions might have experienced selection directly,231

or indirectly through linkage. Genetic regions with unusual patterns of molecular variation232

can be identified by contrasting φ statistics for each locus with the genome-level distribu-233

tion for each φ statistic. Specifically, we identify outlier loci by estimating the posterior234

probability distribution for the quantile of each locus-specific φ statistic in the genome-level235

distribution. Formally, we define ai as the ath quantile of the posterior distribution for the236

locus-specific φ statistic for locus i and let qn be the interval with endpoints defined as the237

n2

th and 1− n2

th quantiles of the genome-level φ statistic distribution. We then consider locus238

i an outlier at the ath quantile with respect to a given φ statistic with probability n if the239

interval qn does not contain ai. Values of n and a used for outlier designation will determine240

the stringency of the analysis, with higher values of a and lower values of n resulting in241

fewer loci classified as outliers. In this paper we set n = 0.05 (qn = [0.025, 0.975]) and use242

two different values of a (0.5 and 0.95 quantiles). These values for a denote the median and243

95th quantile of the posterior probability distribution for the quantile of each locus-specific φ244

statistic. When only differentiation among populations is being considered, a genetic region245

can only be designated an outlier with respect to φST , whereas a genetic region could be an246

outlier with respect to φST , φSC, or φCT when group and population structure are considered.247

Outlier loci can be classified as having very low φST , which could be indicative of balanc-248

ing or purifying selection, or very high φST , which could be indicative of positive selection249

within populations or divergent selection among populations (Beaumont and Balding250

Page 14: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 14

2004; Beaumont 2005). Patterns of selection giving rise low or high φCT and φSC might251

be a bit more complex. For example, high φCT outliers would be expected if divergent selec-252

tion occurred among groups of populations with the same alleles favored within each group253

and low φSC outliers might be expected with balancing selection within groups that favored254

different subsets of alleles among groups.255

Several approaches have been described for identifying outlier loci using genome-scans256

for differentiation (Beaumont and Balding 2004; Foll and Gaggiotti 2008; Guo et al.257

2009). Our approach is most similar to that proposed by Guo et al. (2009), but also differs258

from this approach in several important ways. Guo et al. (2009) contrast an approximation259

of the posterior probability distribution for each locus-specific φST (θi’s in their model) to the260

hyperdistribution describing among locus variation in φST (equivalent to our genome-level261

distribution). Approximation of the posterior probability distribution for φST is achieved by262

defining a Beta distribution with first and second moments equal to those of the posterior263

probability distribution (Guo et al. 2009). This approximation allows Guo et al. (2009)264

to measure the divergence between the posterior probability distribution for each locus-265

specific φST and the genome-level distribution (also a Beta distribution, which is derived266

from point estimates of genome-level parameters) using Kullback-Leibler divergence (Kull-267

back and Leibler 1951). Following calibration to determine a cutoff value for significance,268

the Kullback-Leibler divergence measure is used to designate outlier loci. The primary dis-269

tinction between the outlier detection method proposed by Guo et al. (2009) and the outlier270

detection method we have implemented is that their method tests for a difference between271

the posterior probability distribution of each locus-specific φST (this distribution measures272

uncertainty in the parameter estimate) and a point estimate of the genome-level φST distri-273

bution (this distribution measures expected variation among loci in φST ). In contrast, we274

test whether locus-specific φ statistics are unlikely given the genome-level distribution (i.e.,275

not that our estimates of these locus-specific φ statistics simply differ significantly from the276

genome-level distribution). Additionally, our method differs by accounting for uncertainty277

Page 15: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 15

in the genome-level distribution by taking the marginal distribution of the quantile of each278

locus-specific φ statistic in the genome-level distribution, rather than using a point estimate279

of the genome-level distribution, which is necessary for the method of Guo et al. (2009).280

We do not believe that either of these methods is necessarily superior, but rather that they281

provide alternative criteria and definitions of outlier loci, with our method perhaps more282

closely reflecting common conceptions of outlier loci among evolutionary biologist and their283

method utilizing more information from the posterior distribution of locus-specific φST .284

Analysis of simulated data sets285

We conducted a series of simulations to determine what proportion of genetic regions would286

be classified as outliers in the absence of selection. These data sets were simulated for287

analysis with the population structure model and for each of the three likelihood mod-288

els (i.e., known haplotype model, NGS–individual model, and NGS-population model). We289

simulated sequence data using an infinite sites coalescent model using R. Hudson’s soft-290

ware ms (Hudson 2002). Data sets were simulated with 25 or 500 genetic regions. The291

simulations assumed five populations split from a common ancestor τ generations in the292

past, where τ has units of 4Ne and was set to 0.25, 0.5, or 1.0. We conducted simulations293

with migration among the five populations (Nem) set to 0 or 2. The ancestral population294

and all five descendant populations were assigned population mutation rates θ = 4Neµ of295

0.5, where µ is the per locus mutation rate. Forty gene copies were sampled from each296

of the five populations. For the known haplotype model analyses we treated the simulated297

sequences directly as the sampled data. For NGS–individual model and NGS-population298

model analyses we re-sampled the simulated sequence data sets such that coverage for each299

sequence was Poisson distributed (λ = 2). For the NGS–individual model analyses we re-300

tained information on which individual each sequence came from, whereas we only retained301

population identification for NGS-population model analyses. We generated 10 replicate302

simulated data sets for each combination of likelihood model, number of genetic regions303

Page 16: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 16

(25 or 500), τ , and migration rate. Each data set was analyzed using an Markov chain304

Monte Carlo (MCMC) implementation of the proposed model using the bamova software we305

have developed (see Supplemental Material MCMC algorithm; available from the authors306

at http://www.uwyo.edu/buerkle/software/ as stand-alone software). We calculated the307

distance matrix for each locus using the number of sites by which each pair of sequences308

differed. Estimation of the posterior probability distribution for all parameters for each data309

set was based on a single MCMC algorithm run that included a 25,000 iteration burn-in310

followed by 50,000 samples from the posterior. Sample history plots were monitored to en-311

sure appropriate chain mixing and convergence on the stationary distribution. We classified312

genetic regions as outliers based on the posterior probability distribution for the quantile313

of each locus-specific φST in the genome-level φST distribution, as previously described (see314

Designating outlier loci). We then calculated the mean proportion of loci classified as outliers315

across the 10 replicates for each combination of parameters.316

We conducted an additional series of simulations to assess the capacity of our analytical317

model to detect selected loci among a larger set of neutral regions. We began with a set318

of simulations of population structure, as above. Simulations were conducted under all319

combinations of conditions described previously for simulations without selection. However,320

we simulated selective sweeps affecting 2 (8%; for 25 locus simulations) or 25 (5%; for 500321

locus simulations) of the simulated loci. We allowed selective sweeps to occur in 1, 3, or 5322

populations (one set of simulations for each). Selective sweeps were simulated by selecting one323

haplotype from each affected population at each affected locus and increasing its frequency324

to 1. This was meant to simulate recent and strong selective sweeps where affected loci325

were in arbitrary linkage disequilibrium with the gene subject to selection. Thus, it was326

possible for the same or different haplotypes to be driven to fixation across populations. We327

calculated the mean proportion of neutral and non-neutral loci classified as outliers across328

the 10 replicates for each combination of parameters.329

Finally, we conducted a series of simulations to determine the extent to which our group330

Page 17: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 17

structure model could correctly identify genetic regions affected by selection. For these simu-331

lations we concentrated on the NGS–individual model and a single demographic history. We332

simulated two groups of five populations, with the divergence time of the populations within333

each group equal to 0.25 and the divergence of the two groups equal to 0.75. We allowed334

no migration between groups, but set the within group migration rate to 4. We simulated335

data sets of 25 and 500 loci, and as with the population structure simulations, we simulated336

selective sweeps affecting 2 (25 locus simulations) or 25 (500 locus simulations) genetic re-337

gions. Selective sweeps occurred in all five populations within one group (sg = 1), or all five338

populations in both groups (sg = 2). We simulated 10 replicate data sets for each combina-339

tion of simulation parameters. MCMC settings were as previously described. We identified340

outlier loci based on φST (the partitioning of molecular variation among populations), φCT341

(the partitioning of molecular variation between the two groups), φSC (the partitioning of342

molecular variation among the populations within each group). We determined the mean343

proportion of neutral and non-neutral loci classified as outliers based on each φ statistic344

across the 10 replicates for each combination of parameters.345

Analysis of human SeattleSNP data346

To illustrate the application of the proposed model to real genetic data we analyzed 316347

fully sequenced genes (exons and introns) from the SeattleSNPs data set (http://pga.gs.348

washington.edu; downloaded May 2010). Each gene was sequenced in 24 individuals with349

African ancestry (either the AA panel or the HapMap YRI population) and 23 individuals350

with European Ancestry (either the CEPH population or the HapMap CEU population).351

The phase of polymorphisms in these sequences has been estimated statistically using PHASE352

v2.0 (Stephens et al. 2001) to produce haplotypes. Similar to NGS data sets, these data are353

DNA sequences, but these data lack the sampling uncertainty associated with NGS data and354

might represent fewer (but much longer) genetic regions than would will be typical for current355

NGS studies. We used these data and the bamova software to identify loci with exceptional356

Page 18: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 18

φST estimates that were consistent with divergent selection between or balancing selection357

within these European and African ancestry populations. The mean number of SNPs per358

gene was 62.98 (sd = 50.73) resulting in a large number of haplotypes per gene. This large359

number of SNPs per gene, which resulted from these data being complete gene sequences,360

was considerably greater than typical for the short reads generated by NGS (e.g, Gompert361

et al. 2010; Hohenlohe et al. 2010), and resulted in poor MCMC mixing. Therefore we362

based our analysis on the first five SNPs in each gene (see Supplemental Material Human363

SeattleSNP data: alternative data subsets for analyses using other SNPs). All insertion-364

deletion polymorphisms were ignored. We used the known haplotypes model with population365

structure. We ran a 25,000 iteration burnin followed by 50,000 iterations to estimate posterior366

probabilities. We identified outlier loci using the criteria described for the simulated data367

sets. We then examined variation in genetic differentiation among SNPs within each outlier368

locus, as well as several loci that had typical, “neutral” φST estimates, by calculating point369

estimates of FST at each SNP.370

Analysis of worldwide SNP data371

We obtained data for genomic diversity across 33 widely distributed human populations (in-372

cluding Africa, Eurasia, East Asia, Oceania, and America) from Jakobsson et al. (2008,373

(version 1.3; http://neurogenetics.nia.nih.gov/paperdata/public/). These data in-374

clude statistically phased haplotypes for 597 individuals, based on 525,901 SNPs. These375

data differ from NGS data as they are not DNA sequences, but the large number of genetic376

regions in this data set might be similar to what will be acquired in NGS studies. For each377

SNP in the HumanHap550 genotyping panel we obtained annotations and information for378

location relative to genes from the manufacturer of the genotyping technology (Illumina, San379

Diego, CA). To focus our analysis on haplotypic variation within transcribed, genic regions,380

we retained only those SNPs that were annotated as being within coding, intron, 5’UTR,381

3’UTR, or UTR regions. If for a particular gene this included more that 4 SNPs (minimally,382

Page 19: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 19

for inclusion we required 2 SNPs per locus), we retained SNPs that were annotated as within383

coding or intron sequence plus any neighboring SNPs to bring the total to 4 SNPs per lo-384

cus. Finally, if there were more than 4 SNPs within coding or intron sequence at a locus,385

we retained the first and last SNP and randomly chose 2 intervening SNPs in coding or386

intron sequence. We utilized a maximum of 4 SNPs per locus because this was similar to the387

number of variable sites expected in short NGS data (Gompert et al. 2010; Hohenlohe388

et al. 2010) and led to a sufficiently small number of haplotypes across populations, relative389

to the numbers of individuals per population, to yield informative haplotype frequencies for390

populations. Some isolated SNPs had annotations to genes elsewhere in the genome and391

conflicted with neighboring SNPs and were excluded. However, we did allow these SNPs392

to break genes into separate genic regions for analysis, so that we utilized haplotypic data393

from 12649 regions and 11866 distinct genes. We excluded the limited amount of data from394

the mitochondrion, the Y chromosome, and the pseudoautosomal region of the X and Y395

chromosomes, and focused on the autosomes and the X chromosome.396

We conducted a separate analysis for each chromosome to test for evidence of selection397

based on levels of molecular differentiation among these human populations using the bamova398

software. This allowed us to contrast pattern of genetic variation among chromosomes and399

identify outlier loci relative to levels of genetic differentiation for the chromosome on which400

they are found. We used the known haplotypes likelihood model with population structure401

and estimated posterior probabilities based 50,000 MCMC iterations following a 25,000 iter-402

ation burnin. We classified outlier loci for each data set using the criteria described for the403

simulated data sets.404

RESULTS405

Results from simulated data sets When data were simulated in the absence of selection,406

few genetic regions were classified as outliers, and thus, as being associated with targets of407

selection (Table 1). This result suggests that the method has a low false positive rate (mean408

Page 20: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 20

proportion = 0.012, sd = 0.030). The proportion of genetic regions classified as outliers was409

similar across all likelihood models, but tended to be slightly lower when migration among410

populations was simulated (particularly for high φST outliers). With a = 0.5 the mean411

proportion of genetic regions classified as outliers across 10 replicate simulations varied from412

0.003 (known haplotype model, no migration, 500 sequence loci, high φST outliers ) to 0.034413

(NGS–population model, no migration, 500 sequence loci, low φST outliers). Using a = 0.95414

the mean proportion of outliers detected across 10 replicate simulations varied from 0 (many415

simulation conditions) to 0.004 (NGS-individual model, no migration, 500 sequence loci, high416

φST outliers).417

Simulations that included selection typically resulted in an increased number of genetic418

regions being classified as outliers. As expected, we found that the ability to correctly classify419

non-neutral genetic regions as outlier loci was dependent upon the extent of selection. Few420

outliers were detected when only one population was affected (e.g., high φST , a = 0.5, NGS-421

individual model mean = 0.057, sd = 0.040), but most selected loci were correctly classified422

as outliers when all five populations were affected (e.g., high φST , a = 0.5, NGS-individual423

model mean = 0.612, sd = 0.310; Tables S1, S2, S3). Selection was easiest to detect when424

data were simulated and analyzed in accordance with the known haplotype model (in this425

case, when all five populations were affected by selection, all selected loci were identified426

as outliers even with a = 0.95; Table S1). Selection was generally easier to detect when427

migration was simulated. For example, under the known haplotype model the proportion of428

selected loci correctly classified as high outliers increased from 0.429 to 0.492 when simulated429

populations experienced the homogenizing effects of migration (means across all simulation430

conditions). For the complete set of simulations, the proportion of selected genetic regions431

that were correctly classified as outlier loci was greater than the proportion of neutral genetic432

regions incorrectly classified as outlier loci (Tables S1, S2, S3).433

The ability of the method to correctly identify genetic regions affected by selective sweeps434

in the presence of group structure (i.e., when populations are organized into groups of pop-435

Page 21: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 21

ulations) was highly dependent on the extent of the selective sweeps. When selective sweeps436

were confined to a single group, outliers were most often detected as genetic regions with high437

differentiation among populations within a group (i.e., high φSC; Table S4). The proportion438

of swept genetic regions correctly classified as φSC outliers was generally small and did not439

exceed 0.150, but was much greater than the proportion of neutral genetic regions classified440

as φSC outliers. In contrast, when both groups of populations were affected by selective441

sweeps, nearly all swept loci were classified as high φST and φSC outliers (92.5–100% using442

a = 0.5) and a smaller proportion (0.100–0.225) were classified as low or high φCT outliers.443

Thus, the selective sweeps were most readily detected based on high differentiation among all444

populations (φST ) or among populations within each group (φSC), but could also be detected445

based on low or high differentiation between the two groups (φCT ). This variety of outcomes446

arises from the stochastic nature used to determine the specific haplotypes that were swept447

to fixation. The false positive rate for these group analyses was very low (between 0 and448

0.032), similar to the population structure analyses (Tables S2, S4).449

Divergent selection among African and European populations Mean genome-level450

φST between African ancestry and European ancestry populations based on the SeattleSNPs451

data set was 0.080 (95% equal tail probability interval (ETPI) 0.065–0.097). This means452

that approximately 8% of DNA sequence variation was partitioned between the African and453

European ancestry populations.454

We classified three genes as high φST outliers (using a = 0.5) based on the first five SNPs455

data subset (Fig. 1). One of these genes was HSD11B2. Approximately 32% of molecular456

variation at this gene was partitioned between African and European ancestry populations457

(φST = 0.317 95% ETPI 0.159–0.482, Figure 1). Allelic variants of this gene produce an458

inherited form of hypertension and an end-stage renal disease (Quinkler and Stewart459

2003). A weak association has also been detected between an intronic microsatellite in this460

gene and type 1 diabetes mellitus and diabetic nephropathy (Lavery et al. 2002). FOXA2461

Page 22: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 22

was also identified as an outlier, with φST equal to 0.321 (95% ETPI 0.124–0.513, Figure462

1). FOXA2 regulates insulin sensitivity and controls hepatic lipid metabolism in fasting and463

type 2 diabetes mice (Wolfrum et al. 2004; Puigserver and Rodgers 2006). A genome-464

wide association study detected a SNP near FOXA2 (rs1209523) that was associated with465

fasting glucose levels in European and African Americans (Xing et al. 2010). This outlier466

is particularly interesting as Type 2 diabetes is 1.2–2.3 times more common in African467

Americans than European Americans (Harris 2001). The third outlier locus was POLG2,468

with an estimated φST of 0.327 (95% ETPI 0.175–0.477, Fig. 1). This gene was also classified469

as a target of selection in humans in a recent study (Barreiro et al. 2008).470

Selection among worldwide human populations For the worldwide human SNP data,471

estimates of mean chromosome-level φST were similar for the autosomal chromosomes and472

ranged from 0.083 (95% ETPI 0.0752–0.091; chromosome 22) to 0.113 (95% ETPI 0.104–473

0.120; chromosome 16; Table 2, Fig. 3). The estimate of φST for the X chromosome was474

considerably higher (0.139, 95% ETPI 0.128–0.149).475

We detected remarkable variation in φST along each of the chromosomes (Figure 4). We476

detected 569 unique φST outlier loci, which we designate as candidate genes for selection477

among worldwide human populations. Of these 569 genes, 518 were high φST outlier loci478

(including 222 with a = 0.95) and 51 were low φST outlier loci (including six with a =479

0.95). Outlier loci were detected on all chromosomes, with the greatest number identified480

on chromosome 1 (67 outlier loci) and the fewest on chromosome 13 (9 outlier loci; Table 2,481

Fig. 4). Some of the genes that we classified as outliers have previously been implicated as482

genes experiencing selection in human populations. These include CYP3A4 and CYP3A5,483

which are Cytochrome P450 genes found near one another on chromosome 7 (a = 0.95;484

CYP3A4 φST =0.293 95% ETPI [0.245–0.319] and CYP3A5 φST =0.359 95% ETPI [0.315–485

0.401]). These genes are important for detoxification of plant secondary compounds and are486

involved in metabolism of some prescribed drugs. Additional evidence that these genes have487

Page 23: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 23

experienced positive selection in human populations exists from previous studies based on488

distortions of the site frequency spectrum in African, European, and Chinese populations,489

population differentiation between the CEU and YRI populations, and extended haplotype490

homozygosity in HapMap populations (Carlson et al. 2005; Voight et al. 2006; Nielsen491

et al. 2007; Chen et al. 2010). Another example is the PKDREJ gene on chromosome492

22, which is a candidate sperm receptor gene of mammalian egg-coat proteins and had493

one of the highest estimates of φST (0.455 95% ETPI 0.396–0.504). Variation in this gene494

is consistent with positive selection among primate lineages, although evidence suggests495

that balancing selection might act to maintain diversity at this gene in human populations496

(Hamm et al. 2007). We also found evidence of divergent selection acting on SLC24A5,497

which has been associated with differences in skin pigmentation among humans and was498

classified as a candidate for positive selection in several studies (Barreiro et al. 2008).499

Finally, three of the high φST outliers at a = 0.95 (RTTN, MSX2, and CDAN1) were among500

seven human skeletal genes identified by Wu and Zhang as genes with elevated FST at non-501

synonymous SNPs between African and non-African (European and East Asian) populations502

and candidates for recent positive selection in Europeans and East Asians (Wu and Zhang503

2010).504

We detected additional genes with very high φST that have not, to the best of our505

knowledge, been previously implicated as experiencing selection in human populations but506

have been found to be associated with disease traits. For example, the estimate of φST for507

SPATA5L1 (spermatogenesis associated 5-like 1) on chromosome 15 was 0.347 (95% ETPI508

0.246–0.430). This gene was classified as a high φST outlier in the analysis and has been509

linked to renal function and kidney disease, but has not been identified in previous tests for510

selection (Kottgen et al. 2009).511

We also identified novel candidate targets of balancing selection in worldwide human512

populations. Interestingly, none of the genes that we classified as low φST outlier loci were513

implicated as targets of balancing selection in European and African American populations in514

Page 24: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 24

a recent study by Andres et al. (2009). Several of these genes have been linked to diseases.515

For example, RIF1 was classified as low φST outlier at the a = 0.95 level (φST =-0.065,516

95% ETPI -0.105– -0.023). RIF1 is an anti-apoptotic factor involved in DNA repair that is517

necessary for S-phase progression and is heavily expressed in breast cancer tumors (Wang518

et al. 2009). Another example is the AKT3 gene found on chromosome 1 (φST =-0.046, 95%519

ETPI -0.102–0.020; outlier at a = 0.5). This gene is involved in cell-cycle regulation and520

is highly expressed in malignant melanoma, but is also important for attainment of normal521

organ size, including brain size in mice (Stahl et al. 2004; Easton et al. 2005).522

DISCUSSION523

We have presented a novel model to quantify genome-wide population genetic structure and524

identify genetic regions that are likely to have experienced natural selection. Unlike previous525

methods for quantifying structure and detecting genetic signatures of selection, the proposed526

method accurately models the stochastic sampling of sequences that is inherent in current527

NGS instruments and incorporates genetic distances among sequences in estimates of genetic528

differentiation. Although few population genomics studies based on NGS of individuals have529

been published to date (hence the analysis of Sanger sequence and SNP human data sets530

instead of NGS data sets), various large-scale projects are currently underway to obtain these531

data in large samples of humans (e.g., 1000 genomes project, http://www.1000genomes.org)532

and a few recent studies suggest that these data will soon be available for many non-human533

(including non-model) species (Gompert et al. 2010; Hohenlohe et al. 2010). However,534

beyond data acquisition, substantial biological insights will only be possible if accompanied535

by models and methods designed to take full advantage of these data, while accurately536

modeling sources of error. Along with a few other recent reports of models for NGS (Guo537

et al. 2009; Lynch 2009; Gompert et al. 2010; Futschik and Schlotterer 2010), we538

believe our analytical methods help to address this need.539

Page 25: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 25

Model properties The analyses of simulated and human genetic data sets suggest that540

the model provides statistically sound estimates of population differentiation for large sets541

of loci (see Supplemental Material Simulations: estimation of φ statistics). For example, the542

estimates of genome-level φST for the SeattleSNPs human sequence data and chromosome-543

level φST for the worldwide human SNP data (0.080–0.139) were similar to mean levels of544

genetic differentiation among human populations based on FST (e.g., FST = 0.09–0.14 for545

Yoruba, European, Han Chinese and Japanese populations; Weir et al. 2005; Barreiro546

et al. 2008).547

The simulation results indicate that the model only rarely identifies neutral loci incor-548

rectly as outliers (i.e., it has a low false positive rate). Using a = 0.5, the false positive rate549

for high or low φST outliers was never greater than 0.04, and using a = 0.95 this rate was550

never greater than 0.01. Low false positive rates are particularly important for genome-wide551

scans in which a large number of genes could otherwise show spurious evidence of selection.552

In contrast, false positive rates as high as 0.343 have been reported for FST outlier analyses553

of selection that use coalescent simulations under an incorrect demographic history to derive554

a neutral distribution (Excoffier et al. 2009). Moreover, the low false positive rate for555

the method held across all simulated demographic scenarios (i.e., different population diver-556

gence times, variation in migration rates and the presence or absence of group structure).557

This is expected as the null, genome-level distribution we estimate is based on the observed558

data rather than an assumed demographic history and most demographic histories should be559

appropriately captured by this genome-level distribution (Beaumont and Balding 2004;560

Guo et al. 2009). Similarly low false positive rates were reported by Guo et al. (2009)561

using a hierarchical Bayesian model based on FST. Nonetheless, it is important to note that562

these hierarchical models generally identify outlier loci as those that are inconsistent with563

the genome as a whole, and thus will not work well if most of the genome has recently564

experienced selection of a similar magnitude.565

Simulation results further indicate that the method has the ability to detect genetic566

Page 26: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 26

regions under selection, and to a greater extent when selective sweeps affect multiple popu-567

lations or migration occurs. When a simulated genetic region was affected by selection in at568

least three populations the selected locus was generally at least 10 times more likely to be569

classified as an outlier than a randomly chosen neutral locus. This suggests a high true pos-570

itive rate (at least under favorable conditions) and that candidate selected genes identified571

by the method will be enriched substantially for genes actually experiencing selection. This572

is particularly true when genes are classified as outliers using the more stringent criterion of573

a = 0.95 (although this will also decrease the total number of selected genes identified rela-574

tive to a = 0.5). One benefit of quantifying genetic divergence based on haplotypes (using575

φST ) instead of SNP, microsatellite or AFLP alleles (using FST) is that multiple selective576

sweeps should result in increased genetic distances among haplotypes in different populations577

making them easier to detect than genes that experienced a single selective sweep, as was578

used for the simulations.579

Despite these generally promising results from the simulations, many selected genes were580

not classified as outliers using the method and constitute false negatives. There are several581

synergistic reasons that genes affected by selection might not be classified as outliers, which582

include inherent limitations in outlier-based tests for selection and difficulties detecting selec-583

tion (which does not always leave a clear signal), as well as specific details of the simulations584

(Nielsen 2005; Kelley et al. 2006). First, as pointed out previously, outlier analyses will585

only detect genes that stand out from the genome-wide distribution and thus might not586

detect genes experiencing weak selection or be applicable when much of the genome is under587

selection (Michel et al. 2010). However, this particular issue is more likely to affect empir-588

ical data than the simulated data sets. The data we simulated experienced selective sweeps589

with arbitrary linkage between the affected genetic region and the gene under selection. This590

means that selection could have favored the same or different haplotypes in each population.591

Thus, a clear molecular signature was not always left by the simulated selective sweeps. We592

simulated selection in this manner for computational efficiency and because it accurately593

Page 27: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 27

reflects the effect of a recent and complete selective sweep. Specifically, we believe that this594

is a realistic model for selection as researchers will often detect selection based on genetic595

regions linked to the gene under selection rather than the actual gene under selection and596

patterns of linkage might vary among populations. Nonetheless, when evidence of selection597

comes directly from the target of selection or there is tight linkage with consistent patterns of598

linkage disequilibrium between the selected gene and a sequenced genetic region, a stronger599

signal of selection should be evident and selection should be easier to detect.600

Evidence of selection in human populations A diverse array of studies (Akey et al.601

2002; Nielsen et al. 2007; Barreiro et al. 2008; Nielsen et al. 2009) has found that602

natural selection has played an important role in shaping functional genetic variation in603

humans. Analyses based on our model for quantifying the distribution of genetic variation604

among populations similarly find substantial evidence for the action of natural selection.605

We classified approximately 10 times as many genes as candidates for positive or divergent606

selection (high φST outlier loci) than as candidates for balancing selection (low φST outlier607

loci). However, this does not mean that divergent and positive selection more commonly608

affect human genetic variation than balancing selection. Evidence from other studies suggests609

that weak negative selection is prevalent in the human genome and balancing selection is also610

fairly common (Bustamante et al. 2005; Andres et al. 2009). Instead this difference in the611

prevalence of different forms of selection might indicate that there is more variation in the612

strength of positive selection, resulting in a greater number of extreme genetic regions, than613

there is in the strength of negative or balancing selection with many genetic regions weakly614

affected by these factors. However, this cannot be explicitly addressed by our study. Finally,615

previous simulation studies suggest that it is difficult to detect balancing selection using616

outlier analyses (Beaumont and Balding 2004; Guo et al. 2009), which might reflect the617

different molecular signals left by divergent and balancing selection or the bounded nature618

of FST and φST .619

Page 28: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 28

Specific genes identified as outliers by our analysis include several genes that have been620

repeatedly implicated as targets of positive selection in humans, such as POLG2, CYP3A5,621

and SLC24A5. Nonetheless, we also detected novel candidate genes for selection in humans622

(e.g., FOXA2) and failed to detect selection on genes expected to have experienced strong623

selection based on previous studies, such as the LCT (lactase) gene (Bersaglieri et al.624

2004; Nielsen et al. 2007; Tishkoff et al. 2007). A lack of complete concordance with625

earlier studies is not surprising as a general lack of concordance among studies of selection626

in humans has been noted (Nielsen et al. 2007). This lack of concordance likely reflects627

the sensitivity of different methods to different signatures of selection, which affects whether628

methods are more likely to detect recent, ongoing, or more ancient selective sweeps, as well629

as the specific nucleotides and populations analyzed. For example, the lack of evidence from630

the worldwide human data set for selection on LCT appears to reflect the SNPs included in631

this study. Previous studies of human genetic variation have detected high values of FST at632

specific SNPs in the LCT gene (e.g., 0.53 for SNPs rs4988235 and rs182549), but have also633

shown that FST varies across this gene (Bersaglieri et al. 2004). Previous estimates of634

FST for the two genic SNPs in LCT included in our analysis that were also investigated by635

Bersaglieri et al. (2004) were 0 (rs2874874) and 0.17 (rs2322659), which do not stand out636

markedly from background levels of differentiation.637

Conclusions Technological advances in DNA sequencing are ushering in a new era of pop-638

ulation genomics. Whereas researchers previously were constrained to various, relatively639

low throughput molecular markers for genotyping, it is now possible to rapidly generate very640

large volumes of DNA sequence data for any group of organisms (Mardis 2008; Gilad et al.641

2009). This new capability will allow researchers to address long-standing and fundamental642

questions in evolutionary biology, which were once considered nearly intractable because643

of limited genetic data (Mardis 2008). However, most published NGS studies have been644

primarily descriptive (e.g., transcriptome characterization; Vera et al. 2008; Parchman645

Page 29: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 29

et al. 2010) or have been forced to discard valuable data because of analytical limitations646

(Hohenlohe et al. 2010) and have not taken full advantage of the potential of the sequence647

data. To do so, researchers need robust and accessible methods and models that can be ap-648

plied to NGS data to test evolutionary hypotheses. The model presented here helps fill this649

gap, as it allows us to quantify heterogeneous genomic divergence among populations and650

identify genetic regions affected by selection. The Bayesian analysis of molecular variance651

illustrates the potential of combining appropriate models and NGS to address important652

questions in evolutionary biology and genetics, and is a critical step toward utilizing the653

growing abundance of sequence data for population genomics.654

ACKNOWLEDGEMENTS655

We would like to thank J. Fordyce, M. Forister, C. Nice, P. Nosil, T. Parchman, and R.656

Safran for discussion and comments on previous drafts of this manuscript. Comments from L.657

Excoffier and two anonymous reviewers helped to improve the manuscript. We are indebted658

to R. Williamson for his contributions to the development of the bamova software. This659

work was supported by NSF DBI award 0701757 to CAB.660

LITERATURE CITED661

Akey, J. M., G. Zhang, K. Zhang, L. Jin, and M. D. Shriver, 2002 Interrogating662

a high-density SNP map for signatures of natural selection. Genome Research 12: 1805–663

1814.664

Andolfatto, P., J. D. Wall, and M. Kreitman, 1999 Unusual haplotype structure665

at the proximal breakpoint of in(2L)t in a natural population of Drosophila melanogaster.666

Genetics 153: 1297–1311.667

Andres, A. M., M. J. Hubisz, A. Indap, D. G. Torgerson, J. D. Degenhardt,668

A. R. Boyko, R. N. Gutenkunst, T. J. White, E. D. Green, C. D. Bustamante,669

Page 30: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 30

A. G. Clark, and R. Nielsen, 2009 Targets of balancing selection in the human670

genome. Molecular Biology and Evolution 26: 2755–2764.671

Bamshad, M. and S. P. Wooding, 2003 Signatures of natural selection in the human672

genome. Nature Reviews Genetics 4: 99–111A.673

Barreiro, L. B., G. Laval, H. Quach, E. Patin, and L. Quintana-Murci, 2008674

Natural selection has driven population differentiation in modern humans. Nature Genetics675

40: 340–345.676

Beaumont, M. A., 2005 Adaptation and speciation: what can FST tell us? Trends in677

Ecology & Evolution 20: 435–440.678

Beaumont, M. A. and D. J. Balding, 2004 Identifying adaptive genetic divergence679

among populations from genome scans. Molecular Ecology 13: 969–980.680

Beaumont, M. A. and R. A. Nichols, 1996 Evaluating loci for use in the genetic681

analysis of population structure. Proceedings of the Royal Society B-Biological Sciences682

263: 1619–1626.683

Bersaglieri, T., P. C. Sabeti, N. Patterson, T. Vanderploeg, S. F. Schaffner,684

J. Drake, M. Rhodes, D. E. Reich, and J. N. Hirschhorn, 2004 Genetic signatures685

of strong recent positive selection at the lactase gene. American Journal of Human Genetics686

74: 1111–1120.687

Bustamante, C. D., A. Fledel-Alon, S. Williamson, R. Nielsen, M. T. Hu-688

bisz, S. Glanowski, D. Tanenbaum, T. White, J. Sninsky, R. Hernandez,689

D. Civello, M. Adams, M. Cargill, and A. Clark, 2005 Natural selection on690

protein-coding genes in the human genome. Nature 437: 1153–1157.691

Page 31: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 31

Carlson, C. S., D. J. Thomas, M. A. Eberle, J. E. Swanson, R. J. Livingston,692

M. Rieder, and D. Nickerson, 2005 Genomic regions exhibiting positive selection693

identified from dense genotype data. Genome Research 15: 1553–1565.694

Chen, H., N. Patterson, and D. Reich, 2010 Population differentiation as a test for695

selective sweeps. Genome Research 20: 393–402.696

Craig, D. W., J. V. Pearson, S. Szelinger, A. Sekar, M. Redman, J. J. Corn-697

eveaux, T. L. Pawlowski, T. Laub, G. Nunn, D. A. Stephan, N. Homer, and698

M. J. Huentelman, 2008 Identification of genetic variants using bar-coded multiplexed699

sequencing. Nature Methods 5: 887–893.700

Easton, R. M., H. Cho, K. Roovers, D. W. Shineman, M. Mizrahi, M. Forman,701

V. Lee, M. Szabolcs, R. de Jong, T. Oltersdorf, T. Ludwig, A. Efstratiadis,702

and M. Birnbaum, 2005 Role for Akt3/Protein kinase B gamma in attainment of normal703

brain size. Molecular and Cellular Biology 25: 1869–1878.704

Excoffier, L., T. Hofer, and M. Foll, 2009 Detecting loci under selection in a705

hierarchically structured population. Heredity 103: 285–298.706

Excoffier, L., P. E. Smouse, and J. M. Quattro, 1992 Analysis of molecular variance707

inferred from metric distances among DNA haplotypes: application to human mitochon-708

drial DNA restriction data. Genetics 131: 479–491.709

Flint, J., J. Bond, D. C. Rees, A. J. Boyce, J. M. Roberts-Thomson, L. Ex-710

coffier, J. Clegg, M. Beaumont, R. Nichols, and R. Harding, 1999 Minisatellite711

mutational processes reduce FST estimates. Human Genetics 105: 567–576.712

Foll, M. and O. Gaggiotti, 2008 A genome-scan method to identify selected loci ap-713

propriate for both dominant and codominant markers: A Bayesian perspective. Genetics714

180: 977–993.715

Page 32: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 32

Futschik, A. and C. Schlotterer, 2010 Massively parallel sequencing of pooled DNA716

samples–the next generation of molecular markers. Genetics p. genetics.110.114397.717

Gilad, Y., J. K. Pritchard, and K. Thornton, 2009 Characterizing natural variation718

using next-generation sequencing technologies. Trends in Genetics 25: 463–471.719

Gillespie, J. H., 2000 Genetic drift in an infinite population: The pseudohitchhiking720

model. Genetics 155: 909–919.721

Gompert, Z., M. L. Forister, J. A. Fordyce, C. C. Nice, R. Williamson, and722

C. A. Buerkle, 2010 Bayesian analysis of molecular variance in pyrosequences quantifies723

population genetic structure across the genome of Lycaeides butterflies. Molecular Ecology724

19: 2455–2473.725

Guo, F., D. K. Dey, and K. E. Holsinger, 2009 A Bayesian hierarchical model for anal-726

ysis of single-nucleotide polymorphisms diversity in multilocus, multipopulation samples.727

Journal of the American Statistical Association 104: 142–154.728

Hamblin, M. T. and A. Di Rienzo, 2000 Detection of the signature of natural selection729

in humans: Evidence from the Duffy blood group locus. The American Journal of Human730

Genetics 66: 1669–1679.731

Hamblin, M. T., E. E. Thompson, and A. Di Rienzo, 2002 Complex signatures of732

natural selection at the Duffy blood group locus. American Journal of Human Genetics733

70: 369–383.734

Hamm, D., B. S. Mautz, M. F. Wolfner, C. F. Aquadro, and W. J. Swanson,735

2007 Evidence of amino acid diversity-enhancing selection within humans and among736

primates at the candidate sperm-receptor gene PKDREJ. American Journal of Human737

Genetics 81: 44–52.738

Page 33: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 33

Harris, M. I., 2001 Racial and ethnic differences in health care access and health outcomes739

for adults with type 2 diabetes. Diabetes Care 24: 454–459.740

Hohenlohe, P. A., S. Bassham, P. D. Etter, N. Stiffler, E. A. Johnson, and741

W. A. Cresko, 2010 Population genomics of parallel adaptation in threespine stickleback742

using sequenced RAD tags. PLoS Genetics 6: e1000862.743

Holsinger, K. E. and B. S. Weir, 2009 Fundamental concepts in genetics: genetics in744

geographically structured populations: defining, estimating and interpreting FST . Nature745

Reviews Genetics 10: 639–650.746

Hudson, R. R., 2002 Generating samples under a Wright-Fisher neutral model of genetic747

variation. Bioinformatics 18: 337–338.748

Jakobsson, M., S. W. Scholz, P. Scheet, J. R. Gibbs, J. M. VanLiere, H.-C.749

Fung, Z. A. Szpiech, J. H. Degnan, K. Wang, R. Guerreiro, J. M. Bras, J. C.750

Schymick, D. G. Hernandez, B. J. Traynor, J. Simon-Sanchez, M. Matarin,751

A. Britton, J. van de Leemput, I. Rafferty, M. Bucan, H. M. Cann, J. A.752

Hardy, N. A. Rosenberg, and A. B. Singleton, 2008 Genotype, haplotype and753

copy-number variation in worldwide human populations. Nature 451: 998–1003.754

Kelley, J. L., J. Madeoy, J. C. Calhoun, W. Swanson, and J. M. Akey, 2006755

Genomic signatures of positive selection in humans and the limits of outlier approaches.756

Genome Research 16: 980–989.757

Kottgen, A., N. L. Glazer, A. Dehghan, S.-J. Hwang, R. Katz, M. Li, Q. Yang,758

V. Gudnason, L. J. Launer, T. B. Harris, A. V. Smith, D. E. Arking, B. C.759

Astor, E. Boerwinkle, G. B. Ehret, I. Ruczinski, R. B. Scharpf, Y.-D. I.760

Chen, I. H. de Boer, T. Haritunians, T. Lumley, M. Sarnak, D. Siscovick,761

E. J. Benjamin, D. Levy, A. Upadhyay, Y. S. Aulchenko, A. Hofman, F. Ri-762

vadeneira, A. G. Uitterlinden, C. M. van Duijn, D. I. Chasman, G. Pare,763

Page 34: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 34

P. M. Ridker, W. H. L. Kao, J. C. j. Witteman, J. c. Coresh, M. G. m. Shli-764

pak, and C. S. f. Fox, 2009 Multiple loci associated with indices of renal function and765

chronic kidney disease. Nature Genetics 41: 712–717.766

Kronholm, I., O. Loudet, and J. de Meaux, 2010 Influence of mutation rate on767

estimators of genetic differentiation - lessons from Arabidopsis thaliana. BMC Genetics768

11.769

Kullback, S. and R. A. Leibler, 1951 On information and sufficiency. Annals of Math-770

ematical Statistics 22: 79–86.771

Lavery, G. G., C. L. McTernan, S. C. Bain, T. A. Chowdhury, M. Hewison, and772

P. M. Stewart, 2002 Association studies between the HSD11B2 gene (encoding hu-773

man 11 beta-hydroxysteroid dehydrogenase type 2), type 1 diabetes mellitus and diabetic774

nephropathy. European Journal of Endocrinology 146: 553–558.775

Lewontin, R. C. and J. Krakauer, 1973 Distribution of gene frequency as a test of776

theory of selective neutrality of polymorphisms. Genetics 74: 175–195.777

Lohmueller, K. E., A. R. Indap, S. Schmidt, A. R. Boyko, R. D. Hernandez,778

M. J. Hubisz, J. J. Sninsky, T. J. White, S. R. Sunyaev, R. Nielsen, A. G.779

Clark, and C. D. Bustamante, 2008 Proportionally more deleterious genetic variation780

in European than in African populations. Nature 451: 994–997.781

Lynch, M., 2009 Estimation of allele frequencies from high-coverage genome-sequencing782

projects. Genetics 182: 295–301.783

Mardis, E. R., 2008 The impact of next-generation sequencing technology on genetics.784

Trends in Genetics 24: 133–141.785

Mardis, E. R., 2008 Next-generation DNA sequencing methods. Annual Review of Ge-786

nomics and Human Genetics 9: 387–402.787

Page 35: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 35

Maynard-Smith, J. and J. Haigh, 1974 Hitch-hiking effect of a favorable gene. Genetical788

Research 23: 23–35.789

Meyer, M. and M. Kircher, 2010 Illumina sequencing library preparation for790

highly multiplexed target capture and sequencing. Cold Spring Harbor Protocols 2010:791

pdb.prot5448–.792

Michel, A. P., S. Sim, T. H. Q. Powell, M. S. Taylor, P. Nosil, and J. L.793

Feder, 2010 Widespread genomic divergence during sympatric speciation. Proceedings794

of National Academy of Sciences 107.795

Nielsen, R., 2005 Molecular signatures of natural selection. Annual Review of Genetics796

39: 197–218.797

Nielsen, R., I. Hellmann, M. Hubisz, C. Bustamante, and A. G. Clark, 2007798

Recent and ongoing selection in the human genome. Nature Reviews Genetics 8: 857–868.799

Nielsen, R., M. J. Hubisz, I. Hellmann, D. Torgerson, A. M. Andres, A. Al-800

brechtsen, R. Gutenkunst, M. D. Adams, M. Cargill, A. Boyko, A. Indap,801

C. D. Bustamante, and A. G. Clark, 2009 Darwinian and demographic forces af-802

fecting human protein coding genes. Genome Research 19: 838–849.803

Novembre, J., T. Johnson, K. Bryc, Z. Kutalik, A. R. Boyko, A. Auton, A. In-804

dap, K. S. King, S. Bergmann, M. R. Nelson, M. Stephens, and C. D. Busta-805

mante, 2008 Genes mirror geography within Europe. Nature 456: 98–101.806

Parchman, T. L., K. S. Geist, J. A. Grahnen, C. W. Benkman, and C. A.807

Buerkle, 2010 Transcriptome sequencing in an ecologically important tree species: as-808

sembly, annotation, and marker discovery. BMC Genomics 11: 180.809

Puigserver, P. and J. T. Rodgers, 2006 FOXA2, a novel transcriptional regulator of810

insulin sensitivity. Nature Medicine 12: 38–39.811

Page 36: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 36

Quinkler, M. and P. M. Stewart, 2003 Hypertension and the cortisol–cortisone shuttle.812

Journal of Clinical Endocrinology and Metabolism 88: 2384–2392.813

Sabeti, P. C., D. E. Reich, J. M. Higgins, H. Levine, D. J. Richter, S. F.814

Schaffner, S. B. Gabriel, J. V. Platko, N. J. Patterson, G. J. McDon-815

ald, H. Ackerman, S. Campbell, D. Altshuler, R. Cooper, D. Kwiatkowski,816

R. Ward, and E. Lander, 2002 Detecting recent positive selection in the human genome817

from haplotype structure. Nature 419: 832–837.818

Slatkin, M., 1987 Gene flow and the geographic structure of natural-populations. Science819

236: 787–792.820

Slatkin, M. and T. Wiehe, 1998 Genetic hitch-hiking in a subdivided population. Ge-821

netical Research 71: 155–160.822

Stahl, J. M., A. Sharma, M. Cheung, M. Zimmerman, J. Q. Cheng, M. Bosen-823

berg, M. Kester, L. Sandirasegarane, and G. Robertson, 2004 Deregulated824

Akt3 activity promotes development of malignant melanoma. Cancer Research 64: 7002–825

7010.826

Stajich, J. E. and M. H. Hahn, 2005 Disentangling the effects of demography and827

selection in human history. Molecular Biology and Evolution 22: 63–73.828

Stephens, M., N. J. Smith, and P. Donnelly, 2001 A new statistical method for829

haplotype reconstruction from population data. American Journal of Human Genetics 68:830

978–989.831

Tajima, F., 1989 Statistical-method for testing the neutral mutation hypothesis by DNA832

polymorphism. Genetics 123: 585–595.833

Tishkoff, S. A., F. A. Reed, F. R. Friedlaender, C. Ehret, A. Ran-834

ciaro, A. Froment, J. B. Hirbo, A. A. Awomoyi, J.-M. Bodo, O. Doumbo,835

Page 37: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 37

M. Ibrahim, A. T. Juma, M. J. Kotze, G. Lema, J. H. Moore, H. Mortensen,836

T. B. Nyambo, S. A. Omar, K. Powell, G. S. Pretorius, M. W. Smith, M. A.837

Thera, C. Wambebe, J. L. Weber, and S. M. Williams, 2009 The genetic structure838

and history of Africans and African Americans. Science 324: 1035–1044.839

Tishkoff, S. A., F. A. Reed, A. Ranciaro, B. F. Voight, C. C. Babbitt, J. S.840

Silverman, K. Powell, H. M. Mortensen, J. B. Hirbo, M. Osman, M. Ibrahim,841

S. A. Omar, G. Lema, T. B. Nyambo, J. Ghori, S. Bumpstead, J. K. Pritchard,842

G. A. Wray, and P. Deloukas, 2007 Convergent adaptation of human lactase persis-843

tence in Africa and Europe. Nature Genetics 39: 31–40.844

Tishkoff, S. A. and B. C. Verrelli, 2003 Patterns of human genetic diversity: im-845

plications for human evolutionary history and disease. Annual Review of Genomics and846

Human Genetics 4: 293–340.847

Vera, J. C., C. W. Wheat, H. W. Fescemyer, M. J. Frilander, D. L. Craw-848

ford, I. Hanski, and J. H. Marden, 2008 Rapid transcriptome characterization for849

a nonmodel organism using 454 pyrosequencing. Molecular Ecology 17: 1636–1647.850

Voight, B. F., S. Kudaravalli, X. Q. Wen, and J. K. Pritchard, 2006 A map of851

recent positive selection in the human genome. PLoS Biology 4: 446–458.852

Wang, H., A. Zhao, L. Chen, X. Zhong, J. Liao, M. Gao, M. Cai, D.-H. Lee,853

J. Li, D. Chowdhury, Y.-g. Yang, G. P. Pfeifer, Y. Yen, and X. Xu, 2009854

Human RIF1 encodes an anti-apoptotic factor required for DNA repair. Carcinogenesis855

30: 1314–1319.856

Weir, B. S., L. R. Cardon, A. D. Anderson, D. Nielsen, and W. Hill, 2005 Mea-857

sures of human population structure show heterogeneity among genomic regions. Genome858

Research 15: 1468–1476.859

Page 38: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 38

Wolfrum, C., E. Asilmaz, E. Luca, J. Friedman, and M. Stoffel, 2004 FOXA2860

regulates lipid metabolism and ketogenesis in the liver during fasting and in diabetes.861

Nature 432: 1027–1032.862

Wright, S., 1951 The genetical structure of populations. Annals of Eugenics 15: 323–354.863

Wu, D.-D. and Y.-P. Zhang, 2010 Positive selection drives population differentiation in864

the skeletal genes in modern humans. Human Molecular Genetics 19.865

Xing, C., J. C. Cohen, and E. Boerwinkle, 2010 A weighted false discovery rate866

control procedure reveals alleles at FOXA2 that influence fasting glucose levels. American867

Journal of Human Genetics 86: 440–446.868

Page 39: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 39

−0.4 −0.2 0.0 0.2 0.4 0.6

05

1015

φST

Den

sity

HSD11B2FOXA2POLG2neutralgenome

Figure 1.—Locus-specific φST estimates for Africans and Europeans (SeattleSNPs data set).A point estimate of the genome-level φST distribution (based on the median from the posteriorprobability distributions of αST and βST is denoted with a solid black line. The posterior probabilitydistributions for the three outlier loci (colored lines) and 50 additional, randomly chosen geneticregions (gray lines) are also shown. These results are based on the first five SNPs in each gene;additional results are shown in Figure S3

Page 40: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 40

FS

T

0 2000 6000 10000

0.0

0.5

1.0

HSD11B2

1000 3000 5000

0.0

0.5

1.0

FOXA2

FS

T

0 5000 15000

0.0

0.5

1.0

POLG2

5000 15000 25000

0.0

0.5

1.0

FUT2

FS

T

0 5000 10000 20000

0.0

0.5

1.0

CPSF4

5000 10000 15000

0.0

0.5

1.0

IL1F6

Physical location in gene (bp)

FS

T

10000 30000 50000

0.0

0.5

1.0

IKBKB

Physical location in gene (bp)

0 5000 10000 20000

0.0

0.5

1.0

neutral

Figure 2.—Point estimates of FST along specific genes (SeattleSNPs data set). Point estimatesof FST are shown at each SNP for seven genes identified as outliers in the comparison of humanpopulations with African and European ancestry as well as three randomly chosen neutral genes.

Page 41: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 41

ΦST

Den

sity

−0.2 0.0 0.2 0.40

1020

30

A. Chromosome ΦST distribution

ΦST

Den

sity

0.08 0.12 0.16

050

100

150

200

B. Mean chromosome ΦST

Figure 3.—Chromosome-level estimates of φST for the large sample of human genetic diversity in33 populations (data from Jakobsson et al. 2008). Point estimates of each chromosome-level φST

distribution (based on the median from the posterior probability distributions of αST and βST ) aredenoted with solid black lines (autosomes) or a dashed orange line (X-chromosome; A). Posteriorprobability distributions for the mean chromosome-level φST for each chromosome (B).

Page 42: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 42

ΦS

T

−0.

200.

150.

50

1 124.1 247.2

1

−0.

200.

150.

50

0 121.2 242.4

2

−0.

200.

150.

50

0.4 99.8 199.2

3

−0.

200.

150.

50

0.1 94.7 189.3

4

ΦS

T

−0.

200.

150.

50

0.2 90.4 180.6

5

−0.

200.

150.

50

0.1 85.4 170.7

6

−0.

200.

150.

50

0.9 79.7 158.6

7

−0.

200.

150.

50

0.2 73.2 146.2

8

ΦS

T

−0.

200.

150.

50

0.4 70.3 140.1

9−

0.20

0.15

0.50

0.2 67.7 135.2

10

−0.

200.

150.

50

0.2 67 133.8

11

−0.

200.

150.

50

0.2 66.2 132.2

12

ΦS

T

−0.

200.

150.

50

18.7 66.4 114.1

13

−0.

200.

150.

50

19.6 62.3 105

14

−0.

200.

150.

50

20.4 60.2 100.1

15

−0.

200.

150.

50

0 44.3 88.6

16

ΦS

T

−0.

200.

150.

50

0.2 39.4 78.6

17

−0.

200.

150.

50

0.2 38.1 76.1

18−

0.20

0.15

0.50

0.2 32 63.8

19

−0.

200.

150.

50

0 31.2 62.4

20

ΦS

T

−0.

200.

150.

50

14.5 30.7 46.9

21

−0.

200.

150.

50

15.5 32.4 49.4

22

−0.

200.

150.

50

2.7 78.6 154.5

X

Chromosomal position (Mb) Chromosomal position (Mb) Chromosomal position (Mb) Chromosomal position (Mb)

Figure 4.—Estimates of φST across the genome of worldwide human populations (data fromJakobsson et al. 2008). Each panel depicts a different human chromosome and is labeled accord-ingly. The solid black line denotes the point estimate (median of the posterior distribution) of themean chromosome-level φST for a chromosome. Estimates of φST for individual genes are shownin grey for non-outlier genes and light (at a = 0.5) or dark blue (at a = 0.95) for outlier genes. Foreach gene the closed circle gives the median from the posterior distribution of φST for that locusand the bars denote the 95% ETPI.

Page 43: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 43

TABLE 1

Proportion of outlier loci in simulated neutral datam = 0 m = 2

a = 0.5 a = 0.95 a = 0.5 a = 0.95

Model No. loci τ low high low high low high low high

Known haplotypes model 25 0.25 0.012 0.028 0.000 0.000 0.016 0.020 0.000 0.000

0.50 0.028 0.028 0.000 0.000 0.016 0.012 0.000 0.000

1.00 0.024 0.004 0.004 0.000 0.012 0.016 0.000 0.000

500 0.25 0.0132 0.0212 0.0004 0.0036 0.0108 0.0170 0.0000 0.0012

0.50 0.0240 0.0164 0.0028 0.0004 0.0130 0.0158 0.0000 0.0012

1.00 0.0320 0.0032 0.0080 0.0000 0.0136 0.0188 0.0002 0.0024

NGS–individual model 25 0.25 0.012 0.012 0.000 0.000 0.020 0.016 0.000 0.000

0.50 0.004 0.024 0.000 0.000 0.020 0.012 0.000 0.000

1.00 0.020 0.024 0.000 0.000 0.040 0.008 0.000 0.000

500 0.25 0.0136 0.0242 0.0005 0.0042 0.0098 0.0156 0.0004 0.0020

0.50 0.0220 0.0202 0.0022 0.0042 0.0112 0.0182 0.0004 0.0020

1.00 0.0280 0.0074 0.0068 0.0006 0.0140 0.0176 0.0008 0.0036

NGS–population model 25 0.25 0.016 0.020 0.000 0.000 0.008 0.012 0.000 0.000

0.50 0.024 0.040 0.000 0.000 0.008 0.012 0.000 0.000

1.00 0.032 0.004 0.000 0.000 0.004 0.008 0.000 0.004

500 0.25 0.0112 0.0202 0.0002 0.0024 0.0054 0.0166 0.0000 0.0018

0.50 0.0190 0.0164 0.0006 0.0012 0.0100 0.0174 0.0000 0.0012

1.00 0.0340 0.0064 0.0050 0.0000 0.0088 0.0180 0.0000 0.0022

Mean proportion of loci over 10 replicates identified as outliers in the absence of selection foreach of the three different likelihoods (known haplotypes model, NGS–individual model, and NGS–

population model). Three times since divergence (τ) and two levels of migration (m) were simulated.Outlier loci were identified as low or high outliers relative to the genome-wide distribution and basedon two different quantiles (a = 0.5 or 0.95) of their posterior distribution.

Page 44: A hierarchical Bayesian model for next-generation ......Nonetheless, selection affecting just a single population was difficult 13 to detect and resulted in a high false negative

Z. Gompert and C. A. Buerkle 44

TABLE 2

Summary of worldwide human HapMap data and results

High φST Low φST

Chromosome Chrom. φST No. loci a = 0.5 a = 0.95 a = 0.5 a = 0.95

1 0.098 (0.095–0.103) 1402 59 28 8 12 0.101 (0.097–0.106) 956 32 11 5 13 0.099 (0.095–0.104) 723 29 13 2 04 0.100 (0.094–0.106) 563 22 12 1 05 0.095 (0.089–0.100) 565 20 10 3 26 0.091 (0.087–0.095) 710 23 8 6 07 0.098 (0.092–0.103) 679 25 14 2 08 0.092 (0.087–0.098) 438 24 6 2 09 0.095 (0.089–0.099) 569 22 9 2 0

10 0.092 (0.087–0.098) 526 20 6 3 211 0.096 (0.092–0.101) 686 24 7 1 012 0.097 (0.092–0.102) 683 36 14 2 013 0.090 (0.082–0.099) 243 8 7 1 014 0.100 (0.093–0.107) 388 16 8 2 015 0.109 (0.101–0.117) 380 20 10 0 016 0.113 (0.104–0.120) 426 21 9 1 017 0.107 (0.101–0.114) 603 27 12 1 018 0.092 (0.084–0.100) 225 8 4 3 019 0.099 (0.094–0.104) 729 34 10 4 020 0.101 (0.093–0.109) 367 15 8 0 021 0.083 (0.075–0.091) 158 6 3 1 022 0.102 (0.093–0.111) 283 11 6 1 0X 0.139 (0.128–0.149) 347 16 7 0 0

Summary of data and results for each chromosome, including chromosome–level φST (median and95% ETPI), the number of loci, and the number of loci classified as outliers (data from Jakobsson

et al. 2008).