

Copyright © 2007 by the Genetics Society of America
DOI: 10.1534/genetics.106.063677

Identification of Selective Sweeps Using a Dynamically Adjusted Number of Linked Microsatellites

Thomas Wiehe,* Viola Nolte,† Daniel Zivkovic* and Christian Schlötterer†,1

*Institut für Genetik, Universität zu Köln, Köln, Germany and †Veterinärmedizinische Universität Wien, Wien, 1210, Austria

Manuscript received July 18, 2006
Accepted for publication October 14, 2006

ABSTRACT

There is currently large interest in distinguishing the signatures of genetic variation produced by demographic events from those produced by natural selection. We propose a simple multilocus statistical test to identify candidate sites of selective sweeps with high power. The test is based on the variability profile measured in an array of linked microsatellites. We also show that the analysis of flanking markers drastically reduces the number of false positives among the candidates that are identified in a genomewide survey of unlinked loci and find that this property is maintained in many population-bottleneck scenarios. However, for a certain range of intermediately severe population bottlenecks we find genomic signatures that are very similar to those produced by a selective sweep. While in these worst-case scenarios the power of the proposed test remains high, the false-positive rate reaches values close to 50%. Hence, selective sweeps may be hard to identify even if multiple linked loci are analyzed. Nevertheless, the integration of information from multiple linked loci always leads to a considerable reduction of the false-positive rate compared to a genome scan of unlinked loci. We discuss the application of this test to experimental data from Drosophila melanogaster.

THE central role of adaptation for the evolution of natural populations is widely accepted. Nevertheless, until very recently no systematic approaches were available to uncover the genetic changes underlying adaptation processes in natural populations.

One approach relies on population genetic principles to identify beneficial mutations from patterns of natural variation and has been called hitchhiking mapping (Schlötterer 2003). The basic idea of hitchhiking mapping is that beneficial mutations increase in frequency until they become fixed in the population. Hereby, not only the selected site but also linked neutral variants are affected by such a selective sweep (Maynard Smith and Haigh 1974). As a consequence of the spread of a beneficial mutation, levels of variability are strongly reduced in the genomic region flanking the selected site. Hence, hitchhiking mapping does not require the exact location of the selected site to be known; it is sufficient to analyze linked neutral markers. This approach has already gained widespread interest and a number of studies have set out to survey genomewide levels of genetic variability to trace regions in the genome that may have been shaped by natural selection (Schlötterer et al. 1997; Payseur et al. 2002; Vigouroux et al. 2002; Glinka et al. 2003; Kauer et al. 2003; Kayser et al. 2003; Schöfl and Schlötterer 2004; Ihle et al. 2006; Pool et al. 2006).

In a genome screen for beneficial mutations, a large number of loci are analyzed. Because of their high level of polymorphism, microsatellites are well-suited markers (Schlötterer 2004). Their high informativeness and cost-effective typing have made microsatellites the marker of choice for many hitchhiking mapping studies. The emerging picture from these studies was that genome scans are a suitable tool for the identification of putatively selected genomic regions (Kohn et al. 2000; Harr et al. 2002; Vigouroux et al. 2002; Wootton et al. 2002; Kayser et al. 2003; Nair et al. 2003; Schlenke and Begun 2004; Schöfl and Schlötterer 2004; Ihle et al. 2006).

Typical hitchhiking mapping studies use markers that are distributed across the genome. Despite some recent progress toward a high-throughput analysis of markers, the density of characterized markers is still fairly low. Hence, in most cases a candidate region is identified on the basis of a single locus only. The problem with such large surveys is that each locus is tested for deviation from neutral expectations, and a large number of tests could result in substantial numbers of false positives. While statistical approaches, such as the Bonferroni correction (Sokal and Rohlf 1995) and false discovery rate (FDR) (Storey 2002), could account for this, the trade-off is that the sensitivity of the mapping strategy can be compromised.

Alternatively, after an initial genomewide survey one can perform a more detailed analysis of additional markers flanking a previously identified candidate locus. The rationale is that the genealogical history around a selected site is more uniform, and resembles the history of the selected site, than would be expected under neutrality. Hence, since flanking loci are also affected by a selective sweep, they should be useful to obtain further confidence that an identified genomic region has been exposed to a recent selective sweep. In agreement with this expectation experimental work demonstrated that variability is often reduced in genomic regions flanking a putative selective sweep (Harr et al. 2002; Nair et al. 2003; Schlenke and Begun 2004). Nevertheless, until recently no statistical tests were available for microsatellite data to evaluate the statistical significance of reduced variability at linked microsatellite loci.

1Corresponding author: Institut für Ökologie, Universität Innsbruck, Technikerstrasse 25, 6020 Innsbruck, Austria. E-mail: [email protected]

Genetics 175: 207–218 (January 2007)

Here, we describe a new test statistic for the identification of recent selective sweeps using linked microsatellites.

METHODS

Definition of ln Rθ: A selective sweep often reduces variability in the flanking sequences (Charlesworth 1992; Hudson 1994; Schlötterer 2003). Hence, a genome scan for regions affected by selective sweeps could focus on those regions with low levels of variability. If microsatellites are used as genetic markers in such genome scans, the complication arises that microsatellites have a large heterogeneity in mutation rates (Schlötterer 2000; Ellegren 2004). To distinguish between low variability due to a low mutation rate and a selective sweep, additional information is required. To overcome this limitation, it has been proposed to consider for each locus the ratio of the population variation estimators (θ = 4N_eμ) from two populations (Schlötterer 2002; Kauer et al. 2003). If the mutation rate μ is the same in both populations, it cancels out and an estimator Rθ is obtained that does not explicitly depend on the mutation rate but only on the heterozygosities in the two populations:

$$
R_\theta = \frac{\theta(\mathrm{Pop1})}{\theta(\mathrm{Pop2})}
= \frac{4\mu N_e(\mathrm{Pop1})}{4\mu N_e(\mathrm{Pop2})}
= \frac{\left(\dfrac{1}{1-H(\mathrm{Pop1})}\right)^{2}-1}{\left(\dfrac{1}{1-H(\mathrm{Pop2})}\right)^{2}-1}. \tag{1}
$$

In Equation 1, the first equality is the definition of Rθ and the last equality is derived from Ohta and Kimura's (1973) formula of the expected heterozygosity (H) in the stepwise mutation model.

For unlinked microsatellites extensive computer simulations showed that the distribution of the natural logarithm of Rθ is well approximated by a normal distribution whose mean corresponds to the ratio in effective population size of the two populations compared. The standard deviation was shown to be affected by mutation rate, sample size, and demography (Schlötterer 2002; Kauer et al. 2003; Schlötterer and Dieringer 2005). To test for deviation from neutral expectation, a set of neutrally evolving loci (reference loci) is required to estimate the mean (μ) of ln Rθ and its standard deviation (σ). Hence, altogether four data sets are required. The loci to be tested need to be genotyped in two populations (test loci), and a set of reference loci needs to be genotyped from the same two populations. The ln Rθ-values y_i of the test loci are transformed according to z_i = (y_i − μ)/σ, i = 1, 2, . . . , to obtain standard-normal-distributed variates. Throughout this article we assume that selection or a bottleneck occurred in population 1 (Pop1). Thus, negative ln Rθ-values are expected for loci that are linked to a selected site. Due to the above transformation the expected value of the standardized ln Rθ is 0 under a population bottleneck, since the reference loci are expected to be equally affected by the bottleneck as the test loci.
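The calculation of ln Rθ from heterozygosities and its standardization against reference loci can be sketched as follows. This is a minimal illustration, not the authors' software: the function names and the heterozygosity values are hypothetical, and the sample standard deviation is assumed for σ.

```python
import math
from statistics import mean, stdev

def theta_from_h(h):
    """Quantity proportional to theta under the stepwise mutation model.

    Uses (1 / (1 - H))^2 - 1 as in Equation 1; the constant factor
    cancels in the ratio. Requires 0 < H < 1.
    """
    return (1.0 / (1.0 - h)) ** 2 - 1.0

def ln_r_theta(h_pop1, h_pop2):
    """Natural log of the ratio of the two theta estimates (Pop1 / Pop2)."""
    return math.log(theta_from_h(h_pop1) / theta_from_h(h_pop2))

def standardize(test_values, reference_values):
    """z_i = (y_i - mu) / sigma, with mu and sigma taken from reference loci."""
    mu = mean(reference_values)
    sigma = stdev(reference_values)  # assumption: sample standard deviation
    return [(y - mu) / sigma for y in test_values]

# Hypothetical heterozygosities (Pop1, Pop2), invented for illustration only.
reference_h = [(0.60, 0.70), (0.55, 0.62), (0.72, 0.78), (0.50, 0.66), (0.65, 0.68)]
test_h = [(0.20, 0.70), (0.58, 0.65)]

reference = [ln_r_theta(h1, h2) for h1, h2 in reference_h]
test = [ln_r_theta(h1, h2) for h1, h2 in test_h]
z = standardize(test, reference)
print(z)
```

A locus with strongly reduced heterozygosity in Pop1, such as the first test locus here, yields the more negative standardized value.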

Principle of the test: Here, we extend the single-locus test to multiple linked loci. The rationale is that a selective sweep typically affects a genomic region that is large enough to contain multiple microsatellite loci. Thus, their joint analysis should provide more confidence about deviation from neutrality than an analysis based on a single microsatellite marker. The size of the region affected by a single selective sweep depends mainly on the strength of selection and the local recombination rate. A rough estimate is obtained by calculating the "half-life" of a selective sweep using Equation 19 from Stephan et al. (1992). The half-life (Δ) is the size of the region around the selected site in which variability is reduced by 50% or more of its neutral equilibrium value. This is

$$
\frac{1}{2} \approx \frac{2r\Delta}{s}\,(2Ns)^{-2r\Delta/s}\;\Gamma\!\left(-\frac{2r\Delta}{s},\ \frac{1}{2Ns}\right).
$$

For instance, if 2Ns = 1000 and r/s = 10^-6 the region in which variability is reduced by ≥50% has a size of ≈110 kb. Qualitatively similar estimates have been obtained by Kaplan et al. (1989). With the above parameters one would obtain an estimate of ≈370 kb (see Table 2 in Kaplan et al. 1989).
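The half-life expression can be solved numerically for the dimensionless quantity c = 2rΔ/s. The sketch below uses only the standard library; the names `c` and `alpha` (for 2Ns) and all helper functions are introduced here for illustration. The incomplete gamma function with a negative first argument is obtained from the standard recurrence Γ(a + 1, x) = aΓ(a, x) + x^a e^(−x), and the root is found by bisection, assuming the expression increases with c over the search interval.

```python
import math

def lower_gamma_series(s, x, terms=60):
    """Lower incomplete gamma for s > 0 via its power series in x."""
    total = 0.0
    term = 1.0  # holds (-x)^k / k!
    for k in range(terms):
        total += term / (s + k)
        term *= -x / (k + 1)
    return x ** s * total

def upper_gamma(a, x):
    """Upper incomplete gamma Gamma(a, x) for non-integer a > -1 and x > 0."""
    if a > 0:
        return math.gamma(a) - lower_gamma_series(a, x)
    # Recurrence: Gamma(a + 1, x) = a * Gamma(a, x) + x^a * exp(-x)
    return (upper_gamma(a + 1.0, x) - x ** a * math.exp(-x)) / a

def relative_variability(c, alpha):
    """Right-hand side of the half-life expression, with c = 2*r*Delta/s."""
    return c * alpha ** (-c) * upper_gamma(-c, 1.0 / alpha)

def solve_c(alpha, target=0.5, lo=1e-6, hi=0.5, iterations=80):
    """Bisection for the c at which relative variability equals target."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if relative_variability(mid, alpha) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

c = solve_c(alpha=1000.0)    # 2Ns = 1000
delta_bp = c / (2.0 * 1e-6)  # r/s = 1e-6 per bp
print(c, delta_bp)
```

With 2Ns = 1000 the root lies near c = 0.11. Converting it to a physical length requires deciding whether Δ is measured on one side of the selected site or across it; under the one-sided reading (an assumption made here, not stated in the text) the region spanning Δ on both sides comes out close to the ≈110 kb quoted above.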

The experimental design for a survey of linked loci is tricky, as the parameters of a selective sweep, and therefore the size of the affected genomic region, are not known a priori. One further complication arises from the variation in microsatellite density among genomic regions (Bachtrog et al. 2000). To avoid the problem that the number of analyzed and/or available markers affects the test results, we propose the following multilocus test strategy:

1. Standardize ln Rθ-values of the test loci as described above.

2. Identify the locus x* with the lowest ln Rθ-value.

3. Starting from x* include all upstream and downstream microsatellites for the test until, on either side of x*, the first locus with a positive value of ln Rθ is found or the terminus of the analyzed sequence is reached. Using this procedure, K + 1 loci are chosen (a number that may be different for different data sets).

4. Drop locus x* from further analysis; this results in a set of K loci to be analyzed.

5. Calculate the test statistic T(K) = Σ z_i, which is the sum of the standardized ln Rθ-values of the K loci.

6. Determine the P-value.
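The six steps can be sketched in a few lines. This is an illustrative reading of the procedure, not the authors' implementation: the function name and the example values are hypothetical, the first positive locus on each side is assumed to be included in the window (consistent with the remark in RESULTS that the outermost markers have positive values), and the P-value uses the normal approximation with mean 0 and standard deviation √K.

```python
from statistics import NormalDist

def multilocus_test(z):
    """Return (T, K, p) for standardized ln R-theta values ordered by position."""
    # Step 2: locus x* with the lowest value.
    x_star = min(range(len(z)), key=lambda i: z[i])

    # Step 3: extend to each side until the first positive value
    # (assumed to be included) or the end of the analyzed region.
    left = x_star
    while left > 0:
        left -= 1
        if z[left] > 0:
            break
    right = x_star
    while right < len(z) - 1:
        right += 1
        if z[right] > 0:
            break

    # Step 4: drop x* itself, leaving K loci.
    chosen = [i for i in range(left, right + 1) if i != x_star]
    k = len(chosen)
    if k == 0:
        raise ValueError("need at least one flanking locus")

    # Step 5: test statistic T(K).
    t = sum(z[i] for i in chosen)

    # Step 6: one-sided P-value under the normal approximation N(0, K).
    p = NormalDist(0.0, k ** 0.5).cdf(t)
    return t, k, p

# Hypothetical standardized values along a chromosome segment.
z_values = [0.8, 0.3, -0.5, -1.2, -2.9, -1.6, -0.4, 0.6, 1.1]
t, k, p = multilocus_test(z_values)
print(t, k, p)
```

Because the window is delimited by the data, K differs between regions, which is the sense in which the number of markers is adjusted dynamically.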

Under neutral scenarios, either with constant population size or with population bottleneck, and if the individual loci are not tightly linked, the distribution of T(K) is approximately normal with mean ≈0 and standard deviation √K. Tight linkage leads to a violation of the assumption of independence and in this case the distribution of T(K) is not known analytically.

Two different approaches can be used to determine if T(K) deviates from neutral expectation. The first one is based on computer simulations and the second relies on the simplifying assumption of independence of linked loci. Both approaches are discussed in detail below.

Simulating the distribution of the test statistic: We use neutral coalescent simulations to determine T(K) for 10,000 neutral data sets. Assuming that selection will generate more extreme (negative) T(K) values than expected under neutrality, we determined the lower α-quantile q_sim(α) of T(K) (for example, α = 0.05) below which we consider a test significant. Note that q_sim(α) does not depend on K, since it is obtained from averaging over a large number of data sets (10,000) and in each data set K may be different. A test may be significant either due to several slightly negative ln Rθ-values or due to a small number of strongly negative ln Rθ-values. The simulation approach is preferable for small populations and closely linked loci (see results).

Our computer simulations were performed with modified versions of the program "ms" of R. Hudson (Hudson 2002) and a program of Y. Kim (Kim and Stephan 2002). Both programs were originally written to simulate the distribution of segregating sites in DNA sequences. A modification was needed to account for microsatellite evolution. Furthermore, to avoid memory and runtime problems when simulating a large genomic region (of length 600 kb), we assumed that recombination scales linearly. For our purposes it is then equivalent to simulate a region for which the length is downscaled by some factor and simultaneously the recombination rate is upscaled by the same factor. Thus, we simulated a region of 600 bp with a 1000-fold increased recombination rate. Generation of the neutral and bottlenecked data sets N and B (see below) was performed with the modified version of ms, which is fast and permits the incorporation of demography, in particular, population bottlenecks. This program does not produce the genealogy of individual sites but partitions the unit interval into regions that share the same genealogy. We simulated microsatellite mutations according to an unbiased stepwise mutation model. Each mutation either added or removed one repeat unit with equal probability. At the end of the simulation the program provides the number of repeat units for each locus and chromosome.

To generate data set S (see below) we modified the program of Y. Kim (Kim and Stephan 2002) to simulate the genealogy of a positively selected site embedded in a genomic region of 600 bp. This program generates polymorphism data and provides the character state for each nucleotide in the region. Microsatellites were modeled by reinterpreting the evolution of nucleotides. Each site that coincided with a microsatellite position was treated as a microsatellite array and each mutation either added or removed one repeat unit with equal probability.
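The unbiased stepwise mutation model described above can be illustrated in isolation. The sketch below is not the modified ms or Kim program: it only applies a given number of ±1 steps to a repeat count and then computes a plain heterozygosity (1 − Σp², without sample-size correction) from the resulting repeat numbers. Function names, the starting repeat number, and the absence of a lower bound on repeat number are assumptions made for this example.

```python
import random
from collections import Counter

def mutate_stepwise(repeats, n_mutations, rng):
    """Apply n_mutations unbiased single-step changes to a repeat count."""
    for _ in range(n_mutations):
        repeats += rng.choice((-1, 1))  # add or remove one unit, equal probability
    return repeats

def heterozygosity(repeat_counts):
    """1 - sum of squared allele frequencies (no sample-size correction)."""
    n = len(repeat_counts)
    counts = Counter(repeat_counts)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

rng = random.Random(1)
# Hypothetical sample: 60 chromosomes, each carrying 8 mutations since a
# common 20-repeat ancestor (a star-like genealogy, for illustration only).
sample = [mutate_stepwise(20, 8, rng) for _ in range(60)]
print(heterozygosity(sample))
```

In a coalescent simulation the mutations would be placed on the branches of the simulated genealogy rather than applied independently to each chromosome as done here.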

Both programs produced consistent results for neutral simulations of constant-size populations when corresponding parameters were used. We further checked the software with a different microsatellite evolution software (Schlötterer 2002) and also obtained consistent results.

Simulated data sets: Neutral, constant population size (data set N): Assuming neutrality and a constant population size, we generated various data sets differing in the scaled recombination rate (R = 4Nr), the scaled mutation rate (θ = 4Nμ), and sample size (n). Each data set consisted of 10,000 replicate simulations.

Bottlenecked populations (data set B): Population bottlenecks were simulated assuming a three-phase model: a population of constant size N is reduced to a constant fraction of N (the bottleneck depth) for a fixed time interval (the bottleneck duration). The population size before and after the bottleneck was assumed to be identical. We further assumed that the reduction and the increase in population size were instantaneous. The depth and the duration of the bottleneck were combined into the bottleneck-severity parameter γ = duration/depth. The severity was varied from 0.1 to 10. Furthermore, we also varied the time t when the bottleneck was completed (measured in 2N generations, going backward from the present). The population recombination rate was set to 0.02/bp/generation and the scaled mutation rate was fixed at θ = 5. Samples of n = 60 chromosomes were simulated.

Selective sweep (data set S): Data sets with selected loci were obtained by the modified version of Kim's program (Kim and Stephan 2002). We varied the selection coefficient 2Ns from 10 to 2000. For all simulations we assumed a very recent selective sweep that was completed t = 0.001 generations ago, where time is measured in units of 2N. We set ε to 10^-4 (ε is a parameter in a sweep scenario that determines from which minimal frequency onward the dynamics of the beneficial allele are treated as deterministic rather than stochastic; see Kim and Stephan 2002 for a detailed discussion on the choice of ε).

Candidate regions (data sets NL and BL): To mimic a typical first-pass hitchhiking mapping study, we considered a subset of either the neutral runs N or bottleneck runs B: we selected from these simulations the 5% of the runs with the lowest ln Rθ-values at a particular locus (say x*). These subsets are called NL and BL, respectively. The rationale is that a significantly reduced value of ln Rθ at a locus would be falsely interpreted as the trace of a selective sweep in a one-locus test; i.e., for data sets NL and BL the false-positive rate would be 100%. It is our goal to reduce the false-positive rate in such cases by adding information from flanking loci.
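Selecting the candidate subsets amounts to ranking the simulated runs by their value at the focal locus and keeping the lowest 5%. A minimal sketch, assuming each run is stored as a list of ln Rθ-values ordered by position (the data layout and the function name are hypothetical):

```python
def lowest_fraction(runs, focal_index, fraction=0.05):
    """Keep the runs whose value at the focal locus is among the lowest fraction."""
    n_keep = max(1, round(len(runs) * fraction))
    ranked = sorted(runs, key=lambda run: run[focal_index])
    return ranked[:n_keep]

# Toy input: 100 runs of three loci each, focal locus at index 1.
toy_runs = [[0.0, float(i), 0.0] for i in range(100)]
candidates = lowest_fraction(toy_runs, focal_index=1)
print(len(candidates))
```

Every run in the returned subset would count as a false positive in a one-locus test, which is why these subsets serve as the baseline against which the multilocus test is judged.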

Approximate test based on the assumption of a normal distribution: The second approach applies to large neutrally evolving populations and relies on the observation that levels of heterozygosity are correlated only among very tightly linked sites (Figure 1). In contrast, a selective sweep leads to a drastic increase of the correlation of heterozygosity levels among linked microsatellite loci. As a simplifying null model we therefore treat neutrally evolving multiple microsatellites as independent even if there is some linkage. Since the sum of k independent standard-normal-distributed random variables is also normal with mean 0 and standard deviation √k, it is easy to construct a one-sided test on the basis of this distribution. More explicitly, one compares T(k) with the α-quantile q_theo(α, k) and rejects the null hypothesis ("no selective sweep") if T(k) < q_theo(α, k), which is the α-quantile of the normal distribution with mean 0 and variance k. In contrast to the simulation approach described before, the critical value q_theo(α, k) here is based on the number k of loci that are included in a particular test. For different tests, k may be different.
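The critical value of the approximate test follows directly from the normal quantile function. A short sketch using the Python standard library (the function names are chosen here for illustration):

```python
from statistics import NormalDist

def q_theo(alpha, k):
    """alpha-quantile of the normal distribution with mean 0 and variance k."""
    return NormalDist(0.0, k ** 0.5).inv_cdf(alpha)

def reject_neutrality(t, k, alpha=0.05):
    """One-sided test: reject 'no selective sweep' if T(k) < q_theo(alpha, k)."""
    return t < q_theo(alpha, k)

for k in (1, 4, 9):
    print(k, q_theo(0.05, k))
```

The critical value scales with √k, so a region with more included loci needs a more negative sum to reach significance.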

Multiple testing: A practical consequence of this method is that it can be applied to any number of linked loci. Irrespective of the number of loci surveyed, the locus with the smallest ln Rθ-value is identified and k flanking loci are then chosen for the test. Note that only a single test is performed per region, since the k flanking loci are tested jointly. Thus, no adjustment for multiple testing is required. Recall that the approximate analytical test rests upon the assumption that the genealogical histories of the microsatellite loci are independent.

Calculation of ln Rθ and standardization: Similar to the ln Rθ test based on a single locus or unlinked loci, the multilocus test also requires data from two populations, population 1 and population 2. They need to be typed for the same set of loci to calculate the ln Rθ-values. Throughout all analyses population 2 is assumed to evolve neutrally at a constant population size (N = 5 × 10^5). Once ln Rθ-values are obtained, they are standardized. This requires two additional data sets from different loci for the same two populations to estimate mean (μ) and standard deviation (σ) of ln Rθ. For this we simulated an additional 10,000 unlinked loci with otherwise the same population parameters and calculated μ and σ. Data set S is standardized with values obtained from neutral simulations.

Marker spacing: All test statistics that are based on multiple, linked loci depend on the number of loci, their physical spacing, and the recombination rate. Thus, an almost infinite number of possible parameter combinations could be tested. We have therefore focused on the analysis of some representative examples either to evaluate the influence of one parameter or to indicate a general trend. We considered different marker distributions along a 600-kb genomic region. In one case (marker topology A, see Figure 2) we assumed that the loci were evenly spaced and that the target of selection coincided with the position of one of the analyzed loci. In another set (marker topology B) we assumed the position of the selected site to be located between two adjacent neutral markers, all of them evenly spaced. In a third case (marker topology C) the marker spacing was obtained from an experimental data set of 15 microsatellites (Harr et al. 2002).

Experimental data from Drosophila melanogaster: We applied the new multilocus test to a recently identified sweep region on the third chromosome of Drosophila melanogaster (Harr et al. 2002). For the standardization of the data we were particularly cautious to use test and reference loci from the same populations. We used 40 third-chromosomal reference loci (Kauer et al. 2003) for one African (Kisoro, Uganda) and one European (Katovice, Poland) population. We genotyped 15 microsatellites covering the sweep region for 15 Kisoro and 30 Katovice females that were first-generation descendants from freshly collected flies, using standard typing protocols (Schlötterer 1998). All loci are located on autosomes. Gene diversity was calculated with the "Microsatellite Analyzer" software (Dieringer and Schlötterer 2003).

RESULTS

The multilocus test statistic T(K) rests on the assumption that the pattern of variability at linked loci is more correlated under selection than under neutrality. However, the increase of correlation as well as the reduction of variability produced by a selective sweep is transitory and depends on recombination rate and selection coefficient as well as on the time since the selective sweep was completed. Our approach focuses on the expectation that a recent selective sweep leaves reduced levels of variability in a genomic region flanking the target of selection. Therefore, such a region should harbor more linked loci with reduced variability than expected under a neutral scenario. To account for the large variation in microsatellite density across a genome, we propose a dynamically adjusted, rather than a fixed, number of markers to be tested. The details are described in materials and methods.

Data set N provides the (simulated) distribution of T(K) and the critical value q_sim(α). Due to the construction of the test, the distribution's mean and median are not equal to zero, but are slightly shifted to a positive value (Figure 3A, shaded solid line). This is due to the way in which multiple markers are selected for the test: both the leftmost and the rightmost markers are required to have positive ln Rθ-values, thereby creating a slight upward bias of the distribution of T(K). The false-positive rate of the test for data sets NL and B is determined on the basis of q_sim(α).
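The Monte Carlo critical value and the resulting false-positive rate can be computed as an empirical quantile and a rejection frequency. The sketch below assumes the T(K) values have already been obtained from simulated data sets; the function names and the toy numbers are hypothetical, and the order-statistic convention for the quantile is one of several reasonable choices.

```python
def q_sim(neutral_t, alpha=0.05):
    """Empirical lower alpha-quantile of simulated neutral T(K) values."""
    ordered = sorted(neutral_t)
    index = min(round(alpha * len(ordered)), len(ordered) - 1)
    return ordered[index]

def rejection_rate(t_values, critical):
    """Fraction of data sets whose T(K) falls below the critical value."""
    return sum(t < critical for t in t_values) / len(t_values)

# Toy stand-in for simulated neutral T(K) values.
neutral = [float(i) for i in range(-500, 500)]
critical = q_sim(neutral, alpha=0.05)
print(critical, rejection_rate(neutral, critical))
```

Applied to the neutral data set itself, the rejection rate equals α by construction; applied to NL, B, or BL it gives the false-positive rate, and applied to S it gives the power.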

While computer simulations yield an accurate estimate for the probability that a given data set is consistent with neutral expectations, they suffer from the disadvantage that a priori assumptions have to be made for a set of parameters, such as the mutation and recombination rates, θ = 4Nμ and R = 4Nr. Therefore, we also use an approximate analytical procedure that assumes independence between linked loci and standard normality of the statistic ln Rθ for single loci. Figure 1 shows the correlation coefficient between the ln Rθ-values for a pair of loci with varying distance. Even for markers as close as 1 kb the correlation coefficient is quite low (≈0.1). This shows that under neutrality only a very small proportion of the variation at a given marker can be explained by a linked one, even if linkage is tight. We determined the power and the false-positive rate of the test with two test strategies, the one relying on the simulated distribution of T(K) (Monte Carlo strategy) and the one relying on an analytical approximation (Tables 1–3). Both test strategies produced highly consistent results. For most cases we observed that the theoretical false-positive rate was below the simulated one.

The various data sets and their descriptions together with the symbols used are listed in Table 4.

Power estimates: The statistical power to detect de-viation from neutrality is assessed with the help ofdata set S in which data for a range of different selectionintensities and recombination rates were simulated.Overall, our test had a high statistical power with manyscenarios having a 100% detection rate of selective

sweeps. The most important factor influencing thepower of our test is the strength of selection, with smallselection coefficients (2Ns , 10) resulting in a very lowpower. Similarly, high recombination rates (R . 2 3

10�2) also resulted in a loss of power to detect a selectivesweep. Consistent with previous results for unlinked loci(Schlotterer and Dieringer 2005), the mutation ratehad almost no impact on the statistical power. The sameholds for the sample size n, except for extremely smallsamples (Table 1). We also found the tests based ongene diversity (heterozygosity) to be more powerful

Figure 1.—Correlation between ln Ru1 and ln Ru2 of a pairof markers vs. their distance (in kilobases) under a neutralconstant-population-sizemodel.Parametersare:recombinationrate R¼ 2 3 10�3, mutation rate u¼ 5, sample size n¼ 60. Thecorrelation coefficient between 23 pairs of loci at distancesranging from 1 to 500 kb was calculated on the basis of 10,000replicates. At distances .5 kb the average correlation is ,0.05.

TABLE 1

Power of the multilocus test (data set S)

Selection coefficient Recombination rate Mutation rate Sample size

s Simul Theor 4Nr Simul Theor u Simul Theor n Simul Theor

10�5 6.60 4.20 2 3 10�4 100.00 100.00 5 99.40 99.60 10 85.20 85.205 3 10�5 6.20 4.00 10�3 99.60 99.60 7.5 99.60 99.60 20 96.40 96.8010�4 13.60 13.00 2 3 10�3 99.40 99.60 10 100.00 100.00 30 98.80 99.005 3 10�4 93.40 93.60 10�2 43.25 41.00 50 100.00 100.00 40 99.00 99.2010�3 99.40 99.60 50 99.80 100.002 3 10�3 100.00 100.00 60 99.40 99.60

100 99.60 99.80200 99.40 99.60

Recombination rate R ¼ 4Nr, selection coefficient s, mutation rate u, and sample size n are varied one at atime, while the others are held fixed at the reference values R ¼ 2 3 10�3, s ¼ 10�3, u ¼ 5, t ¼ 0.001, and n ¼ 60(italic entries). Effective population size was assumed to be N ¼ 5 3 105. In all cases 29 loci are analyzed, whichare distributed at equal distances of 20 kb across a genomic region of 600 kb. The locus at position x* ¼ 500 kbis assumed to coincide with the target of selection (marker topology A, see Figure 2). Simul, simulated; Theor,theoretical.


than those based on the variance in repeat number (data not shown). Another parameter that strongly influences the detection rate of selective sweeps is the time t since a selective sweep was completed. Recent sweeps are easy to detect, while older ones are not. We observe a dramatic loss in power when the ratio s/t < 0.1; sweeps for which the ratio s/t < 0.01 are virtually undetectable (results not shown).

False positives: A set of linked microsatellites can be analyzed in two different experimental contexts. Either

TABLE 2

False-positive rate (%): constant population size

Recombination rate            Mutation rate           Sample size
4Nr          Simul   Theor    θ      Simul   Theor    n      Simul   Theor

a. In the genome-scan scenario and constant population size (data set N)
2 × 10⁻⁴      5.00    6.72    5       5.00    3.42    10      5.00    3.73
10⁻³          5.00    3.56    7.5     5.00    3.38    20      5.00    3.39
2 × 10⁻³      5.00    3.42    10      5.00    3.21    30      5.00    3.30
10⁻²          5.00    2.50    50      5.00    3.46    40      5.00    3.30
                                                      50      5.00    3.10
                                                      60      5.00    3.42
                                                      100     5.00    3.04
                                                      200     5.00    2.96

b. In the candidate region scenario and constant population size (data set NL)
2 × 10⁻⁴      6.00   10.60    5       7.60    5.80    10      8.00    6.60
10⁻³          6.80    6.00    7.5     6.80    5.80    20      7.00    6.20
2 × 10⁻³      7.60    5.80    10      9.60    7.00    30      7.60    5.60
10⁻²          9.20    6.60    50     10.00    8.60    40      6.00    4.40
                                                      50      8.00    6.60
                                                      60      7.60    5.80
                                                      100     8.60    5.40
                                                      200     8.80    6.40

See Table 1 legend for details.

TABLE 3

False-positive rate (%): bottleneck

Bottleneck severity (duration^b/depth)

              10               4               2               1             0.5             0.2             0.1
Onset^a    Simul   Theor   Simul   Theor   Simul   Theor   Simul   Theor   Simul   Theor   Simul   Theor   Simul   Theor

a. In the genome-scan scenario with population 1 exposed to a bottleneck (data set B)
0.002       5.51    3.73    7.43    3.41   48.41   40.21   26.46   25.16   14.64   13.91    8.60    7.11    6.17    4.40
0.01        5.06    2.91    6.81    3.50   38.96   31.50   24.62   23.30   15.09   13.90    8.56    7.00    5.63    4.50
0.02        5.43    3.37    6.62    3.92   27.86   21.21   24.92   23.40   14.56   13.40    8.08    6.54    5.74    3.74
0.05        4.71    3.13    5.76    3.70   17.35   12.26   22.15   20.60   13.92   12.41    7.27    5.69    6.25    4.23
0.1         4.52    2.78    5.48    3.57   12.38    8.50   17.51   15.00   11.54    9.54    6.46    4.81    5.63    3.85
0.15        4.48    2.90    5.01    3.31    8.90    6.14   12.83   10.66    9.41    7.32    6.15    4.21    5.13    2.98
0.2         4.53    3.10    5.45    3.70    7.64    5.19   10.27    7.96    7.62    5.90    5.47    3.74    4.98    3.31

b. In the candidate region scenario with population 1 exposed to a bottleneck (data set BL)
0.002       9.82    6.70   15.81    5.84   66.80   58.55   53.63   53.23   38.00   37.20   17.20   15.40   11.20    8.80
0.01        6.40    4.20    8.00    5.40   56.60   48.60   53.40   52.60   36.20   34.40   16.60   16.20   13.00   11.00
0.02        9.40    7.80    8.80    6.80   40.80   32.60   53.20   52.60   37.20   36.60   20.00   17.20    8.00    6.60
0.05        8.20    6.00    6.00    4.60   25.00   21.40   50.40   50.00   32.40   31.20    14.0   12.60    9.60    7.20
0.1         8.00    5.40    9.40    9.00   20.40   16.80   35.80   32.00   28.00   24.60   13.40   11.00   10.80    8.60
0.15        6.00    4.00    8.80    7.40   13.60   11.40   28.00   24.40   20.00   16.80   11.40    9.20   10.20    6.80
0.2         6.20    5.00    8.40    6.20   11.60   10.60   21.80   20.60   16.40   15.20    7.00    5.60    8.80    6.20

See Table 1 legend for details.
^a End of the bottleneck, where time t (in units of 2N generations) is measured backward from the present to the past.
^b Duration d of the bottleneck was held fixed at 10⁻⁵ generations (in units of 2N) and depth d was modified.


the genomic region was chosen in the absence of a priori information (first-pass genome scan) or previous work suggested that a surveyed region might not evolve neutrally and a dense marker analysis was subsequently performed for that region (candidate region analysis). Any statistical test using linked microsatellite data has to distinguish between these two scenarios.

False positives in first-pass genome scans: We determined the false-positive rate for a first-pass genome scan using linked microsatellites by the comparison of two neutral data sets (data set N). In this case the false-positive rate for the Monte Carlo test strategy is by construction equal to 5%, since the Monte Carlo simulations are used only to determine the critical value. However, the false-positive rate for data set N can be determined for the approximate analytical test strategy. We find that, irrespective of the mutation and recombination rates, the number of false positives is close to the expected value of 5% (Table 2a). Only in the case of very low recombination rates (2 × 10⁻⁴) is the false-positive rate slightly >5%.

False positives in a candidate region analysis: The analysis of linked microsatellite data for a fine-scale candidate region analysis is more complicated. Depending on the choice of the level α of the test, each genome scan will identify a fraction of putatively selected loci, even if both populations have been evolving neutrally. Thus, an important question is whether the analysis of flanking microsatellite loci could reduce the number of false positives. We used a neutral data set consisting of 10,000 simulations of linked microsatellites and selected those runs for which the ln Rθ-values of one particular locus fell in the lower 5% tail (data set NL, as explained in Materials and Methods). The resulting 500 simulations all had low ln Rθ-values at the same locus and were subject to the multilocus test. The results show that the analysis of linked loci drastically reduces the number of false positives. Compared to an analysis of linked loci with no a priori information ("first-pass genome-scan" scenario), the false-positive rate increased generally less than twofold (Table 2b). Interestingly, even for the case of low recombination the approximate analytical test resulted only in 10.6% false positives. Hence, the analysis of flanking markers can provide an efficient tool to enhance the specificity of hitchhiking mapping studies.
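The logic of the two scenarios can be illustrated with a minimal Python sketch. It is not the simulation procedure used for Table 2: as a deliberate simplification, every locus contributes an independent standard normal z-value (no linkage and no coalescent simulation), and the names `simulate_neutral`, `empirical_quantile`, and the choice `K = 10` are hypothetical. The critical value is taken as the empirical 5% quantile of T over all neutral replicates (Monte Carlo strategy), and the same test is then applied to the 500 replicates with the smallest focal-locus value (candidate region scenario).

```python
import random
from statistics import NormalDist

ALPHA = 0.05
K = 10  # number of flanking loci entering T(K); hypothetical choice


def simulate_neutral(n_reps, n_flank, rng):
    """Toy neutral data: index 0 is the focal locus, the rest are flanking loci.

    ASSUMPTION: all z-values are independent N(0, 1); linkage is ignored.
    """
    return [[rng.gauss(0.0, 1.0) for _ in range(n_flank + 1)]
            for _ in range(n_reps)]


def t_statistic(replicate):
    """Sum of the flanking z-values; the focal locus is excluded."""
    return sum(replicate[1:])


def empirical_quantile(values, alpha):
    """Value below or at which a fraction alpha of the sample lies."""
    ordered = sorted(values)
    return ordered[max(1, round(alpha * len(ordered))) - 1]


rng = random.Random(1)
neutral = simulate_neutral(10_000, K, rng)
t_neutral = [t_statistic(r) for r in neutral]

# Monte Carlo strategy: critical value from the simulated null distribution.
crit_mc = empirical_quantile(t_neutral, ALPHA)
# Approximate analytical strategy: T(K) ~ N(0, K) for independent loci.
crit_theo = NormalDist(0.0, K ** 0.5).inv_cdf(ALPHA)

# Genome-scan scenario: rejection rate over all neutral replicates.
fpr_scan = sum(t <= crit_mc for t in t_neutral) / len(t_neutral)

# Candidate region scenario: keep the 500 replicates with the smallest focal value.
candidates = sorted(neutral, key=lambda r: r[0])[:500]
fpr_candidate = sum(t_statistic(r) <= crit_mc for r in candidates) / len(candidates)

print(f"critical value: Monte Carlo {crit_mc:.3f}, analytical {crit_theo:.3f}")
print(f"false-positive rate: scan {fpr_scan:.3f}, candidate {fpr_candidate:.3f}")
```

In this toy setting the genome-scan rate equals the nominal level by construction, and conditioning on the focal locus leaves the rate near 5% because the loci are independent. The modest increase reported in Table 2b presumably reflects residual correlation among linked loci, which this sketch does not model.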

TABLE 4

Description of simulated data sets and symbols used

Symbol   Short description           Description

N        Neutral data sets           Data sets for varying sample sizes, recombination and mutation rates, and different marker topologies and with 10,000 replicates each. All simulations are carried out under the assumption of constant population size and selective neutrality.
NL       Extreme 5% of sets N        Subsets of sets N generated by selecting 500 replicates with the most extreme, i.e., smallest, ln RH-values of a particular marker (at position 500 kb).
B        Bottleneck data sets        Data sets for varying sample sizes, recombination and mutation rates, and population bottleneck parameters and with 10,000 replicates each. All simulations are carried out under the assumption of a three-phase population bottleneck model and selective neutrality.
BL       Extreme 5% of sets B        Subsets of B, generated by selecting 500 replicates with the most extreme, i.e., smallest, ln RH-values of a particular marker.
S        Selective sweep data sets   Data sets for varying sample sizes, recombination and mutation rates, and selection coefficients and with 500 replicates each. All simulations are carried out under the assumption of constant population size and a selective sweep at a particular marker.

N                                    Effective, haploid population size.
n                                    No. of sampled chromosomes (sample size).
r                                    Rate of recombination per site per generation.
R        R = 4Nr                     Scaled recombination rate.
μ                                    Microsatellite mutation rate per locus per generation.
θ        θ = 4Nμ                     Scaled microsatellite mutation rate.
d                                    Factor by which population size N is reduced during a population bottleneck ("depth").
d                                    Duration of bottleneck in units of 2N generations.
g        g = d/d                     Severity of bottleneck (duration/depth).
s                                    Selection coefficient for codominant alleles.
t                                    Time (in units of 2N) since completion of selective sweep or since completion of population bottleneck.
T(k)                                 Test statistic for k linked microsatellite loci.


Dependence on the number and spacing of microsatellites: In the analyses described above, we considered 29 loci evenly spaced over 600 kb of genomic DNA and the target of selection coincided with one of the microsatellite loci (Figure 2, marker topology A). We also tested how the number and distribution of microsatellites affect the power and false positives. First, we analyzed the case that the target of selection falls in the middle between two microsatellites. The region was again of length 600 kb and contained 30 microsatellite loci spaced at a distance of 20 kb (Figure 2, marker topology B). We observed a slight reduction in power. The loss in power is more pronounced only for the case of high recombination and the approximate analytical test strategy. On the other hand, the false-positive rate is slightly increased for the case of low recombination. In general, the loss in power and increase in false positives do not severely compromise the test (supplemental Table S3 at http://www.genetics.org/supplemental/). Thus, the position of markers relative to the target of selection has no major effect on the test statistic.

Second, we varied the physical distance of microsatellites. Although this analysis is equivalent to varying the recombination rate, we included it as a guideline for experimental design as experimentalists are often interested to what extent the additional expenses of an increased marker density are paid off by an increase in statistical power. Our analysis of different recombination rates already indicated that more densely distributed microsatellites might yield higher power. We investigated this effect in detail by fixing the recombination rate, the selection coefficient, and the number of markers (supplemental Table S2 at http://www.genetics.org/supplemental/). Consistent with our results for different recombination rates, we also observe an increase in power with marker density. The trade-off is an increase in the false-positive rate of the test for tightly spaced markers. However, even for a marker distance as close as 5 kb the false-positive rate in the candidate region scenario is only 10.8% when the approximate analytical test strategy is applied.

Third, we determined the influence of the number of loci genotyped. We analyzed an increasing number of loci (5, 10, 15, and 20 loci), but the spacing of the loci remained constant at 10 kb. As expected, we observe no reduction in statistical power when fewer loci are available. The false-positive rate also remains under 7% in the candidate region scenario and when the approximate analytical test strategy is applied. For the Monte Carlo test strategy, the false-positive rate increases almost fourfold in the case of only five available markers (supplemental Table S1 at http://www.genetics.org/supplemental/).

Finally, we determined power and false-positive rate of the multilocus test for a set of 15 microsatellites with physical distances as in the experimental data set from Harr et al. (2002) (Figure 2, marker topology C). In line with the above results we find that the power of the test depends most critically on the recombination rate, with low power for high recombination rates. The number of false positives remains under 10.6% for all cases investigated (supplemental Tables S4a–S4c at http://www.genetics.org/supplemental/).

Impact of demography: Until now, we have considered only stable populations with no changes in population size. Most genome scans for selected genomic regions are, however, performed in populations that experienced a recent reduction in population size. While genome scans using the ln Rθ-test statistic for unlinked loci are not strongly affected by changes in population size, an analysis of linked loci needs to account for demographic events.

Typically, three parameters are used to characterize population bottlenecks: the depth of the bottleneck (d), the duration of the bottleneck (d), and the time t when the bottleneck was completed, looking backward from the present. We also call the latter the time of onset. The depth is the factor by which the original population size N is reduced during the bottleneck. To shrink the parameter space, we use the ratio g = d/d of bottleneck duration to depth, called severity, and fix the duration at d = 10⁻⁵ (in units of 2N). While this simplification is unproblematic for typical parameter values, it should be noted that it becomes incorrect for very long and shallow bottlenecks. However, as far as our test is concerned such cases do not inflate the false-positive rate (see supplemental Figure S1 at http://www.genetics.org/supplemental/). Very recent and very severe bottlenecks result in a high proportion of invariable loci (>15%). For such demographic scenarios the

Figure 2. Schematic of three different microsatellite marker topologies that have been used to generate the simulated data sets. Topology A: equidistant marker spacing of 29 markers at a distance of 20 kb. The selected site (vertical bar at position 500 kb) coincides with one of the markers. Topology B: equidistant marker spacing of 30 markers at a distance of 20 kb. The selected site (vertical bar at position 500 kb) is in the middle between two adjacent markers. Topology C: 15 markers positioned as those in the experimental study by Harr et al. (2002). Results for topology A are given in Tables 1–3. Results for topologies B and C are given in supplemental Tables S1–S4e at http://www.genetics.org/supplemental/.


ln Rθ-statistic is not applicable since the high number of invariant loci makes ln Rθ nonnormally distributed (Schlötterer 2002). Nevertheless, such extreme bottlenecks can be easily distinguished from selective sweeps by the large number of invariant loci occurring throughout the entire genome. Furthermore, the multilocus test did not result in an increased number of false positives for old bottlenecks, irrespective of their severity. Recent bottlenecks with a low severity (g < 0.2) were also unproblematic. In contrast, bottlenecks with intermediate severities (g between 0.5 and 2) produce a signal that the multilocus test often misinterprets as deviation from neutrality. In the worst case (g = 2, t = 0.002) we detected 40.2% false positives (Table 3a, Figure 3B, and supplemental Table S4d at http://www.genetics.org/supplemental/). We also tested to what extent population bottlenecks compromised the ability to reduce the number of false positives in a set of candidate loci derived from a first-pass genome scan. As for the constant population case described above, we selected those data sets for which the ln Rθ-values of one particular locus fell in the lower 5% tail. In contrast to the previous analysis, the low 5% were taken from bottleneck simulations. As expected for old bottlenecks and low severity values, the analysis of linked loci resulted in a considerable reduction of false positives. Bottlenecks with intermediate severities remain difficult to distinguish from selective sweeps. Nevertheless, even in the worst case the analysis of linked loci reduces the number of false positives from 100 to 58.6% (Table 3b, Figure 3B, and supplemental Table S4e at http://www.genetics.org/supplemental/).

Application: We analyzed a genomic region that has previously been identified as a putative target of a selective sweep. Importantly, although linked microsatellites were genotyped, only a single locus was found to deviate from neutral expectations. We genotyped the same 15 microsatellites that were analyzed previously by Harr et al. (2002) in one additional European (Poland) and one African (Uganda) population. Figure 4 indicates the ln Rθ-values for all 15 loci covering a 600-kb region on the third chromosome in D. melanogaster. Similar to the results of Harr et al. (2002) we found one locus with a pronounced reduction in variability. Our multilocus test statistic was T(10) = −5.795. For a level of α = 0.05 the quantile is q_theo(α, 10) = −5.201 and the P-value is P = 0.0334 [i.e., the cumulative probability of the normal distribution with mean 0 and variance 10 evaluated at T(10) = −5.795]. Therefore, the reduction of variability at the 10 loci adjacent to the one with the strongest reduction of ln RH provided sufficient information to reject the null hypothesis of neutrality.
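The quantile and P-value quoted above follow directly from the normal approximation with mean 0 and variance K. A minimal Python sketch, using only the standard library, makes the arithmetic explicit; the function names `multilocus_p_value` and `critical_value` are hypothetical and chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist


def multilocus_p_value(t_stat, k):
    """Lower-tail probability of T(K) under a normal null with mean 0, variance k."""
    return NormalDist(mu=0.0, sigma=sqrt(k)).cdf(t_stat)


def critical_value(alpha, k):
    """Lower alpha-quantile of the same null distribution."""
    return NormalDist(mu=0.0, sigma=sqrt(k)).inv_cdf(alpha)


# Values reported in the text: T(10) = -5.795 at level alpha = 0.05.
p_value = multilocus_p_value(-5.795, 10)
q_theo = critical_value(0.05, 10)
print(f"P = {p_value:.4f}, q_theo(0.05, 10) = {q_theo:.3f}")
```

Because T(10) lies below the 5% quantile, the test rejects neutrality at this level, in agreement with the values given in the text.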

Given that the habitat expansion of D. melanogaster was associated with a pronounced population bottleneck, it is important to consider its impact on the multilocus test. For instance, the population bottleneck

Figure 3. (A) Constant population size: plot of the cumulative distribution functions of the test statistic T(K) for neutral evolution (lines with dark and light shading) and a selective sweep (solid line). Simulations are for a region of 600 kb with 15 microsatellite markers (with spacings as shown in Figure 4) and an average distance of 37.6 kb. Additional parameters: recombination rate R = 2 × 10⁻³, mutation rate θ = 5, sample size n = 60, selection coefficient s = 10⁻³, effective population size N = 5 × 10⁵, time of completion of sweep t = 0.001 (in units of 2N). Dashed vertical line: 5% quantile of the distribution of T(K) for the neutral data set N (constant population size, 10,000 samples). Dashed horizontal lines: power (99.8%) obtained on data set S (recent selective sweep, 500 samples) and false-positive rate (15.8%) on data set NL (candidate region scenario, 500 samples; see Materials and Methods). (B) Bottlenecked population: plot of the cumulative distribution functions of the test statistic T(K) for a bottlenecked population (dashed lines with dark and light shading) and a selective sweep (solid line). For comparison also the distribution for a neutral constant-size population is shown (solid line with light shading). Parameters are g = 2 (bottleneck severity) and t = 0.05 (end of bottleneck; in units of 2N); other parameters are as in A. The false-positive rates are 15.8% for data set B (10,000 samples) and 37.2% for data set BL (candidate region scenario, 500 samples). Note that for this case the false-positive rates obtained by the analytical approach are only 7.9 and 24.0%, respectively (see supplemental Tables S4d and S4e at http://www.genetics.org/supplemental/).


parameters that had been inferred by Haddrill et al. (2005) for non-African populations of D. melanogaster would yield a false-positive rate of 20.6% in the genome-scan scenario and of 49.6% in the candidate locus scenario. Assuming that a similar parameter range may apply to European populations of D. melanogaster, there would be an odds ratio of ~1:2 for the observed data to be due to a bottleneck instead of a selective sweep.

DISCUSSION

Neutrality tests based on microsatellites focus mainly on unlinked loci and are based on either reduced variability (ln Rθ) or population differentiation (FST). Tests of linkage disequilibrium are rarely used for microsatellite data due to the difficulty of haplotype inference of multiallelic loci. Furthermore, tests of linkage disequilibrium are extremely sensitive to the demographic history. In contrast, our proposed test considers ln Rθ of several linked loci for the inference of selection. While the single-locus ln Rθ-test is largely independent of the demographic past, the additional power of linked loci is balanced by the cost of an increasing dependence on the demographic past.

Previously, an alternative method using linked microsatellites for the identification of selection was proposed (Pollinger et al. 2005). The authors show that a pronounced reduction in variability at three linked microsatellite loci provides a very strong signal of a selective sweep. However, with this test strategy it is not clear how to account for a theoretical null distribution. A general method for the identification of selection at linked microsatellites should be more flexible and cope with the following challenges:

1. The method should be independent of the number of microsatellites genotyped, as the target of selection is often not known.

2. The heterogeneity in microsatellite density could result in a clustering of markers and thus lead to false positives.

3. The mutation rate of microsatellites often differs by orders of magnitude; to distinguish whether low variability is due to selection or due to a low mutation rate, it is important to filter out mutation rate differences.

4. It should be suitable for a candidate region study to confirm or dismiss a selective sweep detected by a single-locus test in a genome scan.

Our proposed multilocus test is designed to meet these criteria. Its key property is the independence from an a priori choice of the number of loci to be typed. Through a dynamic, a posteriori, determination of the number of loci, the multilocus test can be applied to any number of available loci, allowing for an arbitrary spacing of markers. Nevertheless, if the genotyped markers do not adequately cover the selected region, the power of the test statistic may be reduced, as only a fraction of the loci carrying the signal of the selective sweep are included. Furthermore, because the ratio of variability in two populations is considered for each locus, mutational differences between loci are mostly eliminated (Equation 1), at least as far as first-order moments are concerned. Therefore, the test statistic $T(K) = \sum_{i=1,\ldots,K} z_i$, as described above, depends only marginally on the mutation rate θ. We have analyzed heterozygosity (ln RH in the notation of Schlötterer 2002) as well as the variance in repeat number at microsatellites (ln RV) as measures of variability. We find qualitatively corresponding results for ln RV and ln RH. However, the multilocus test generally is more powerful and has a similar or even lower false-positive rate for ln RH than for ln RV. Therefore, we decided to report here only the results for ln RH. Finally, as our multilocus test is based only on markers flanking the candidate locus, the false-positive rate can be kept low and is in fact close to the theoretical level α of the test. While this strategy is conservative it leads to a minor loss in power compared to an approach that uses all markers. Nevertheless, the power to detect recent sweeps remains close to 100%, except in regions of high recombination. The power also drastically decreases with the age of selective sweeps.
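How a statistic of this form is assembled from a variability profile can be sketched in a few lines of Python. The sketch is illustrative only: the z-values are made up, the function name `flanking_statistic` is hypothetical, and the number of flanking loci `k` is supplied by hand, with neighbours chosen by index distance from the most extreme locus. The dynamic, a posteriori rule by which the test itself determines K is not reproduced here.

```python
from math import sqrt
from statistics import NormalDist


def flanking_statistic(z_values, k):
    """Sum the z-values of the k loci nearest (by index) to the most extreme locus.

    ASSUMPTIONS: k is chosen by the caller and neighbours are ranked by index
    distance; both are stand-ins for the dynamic rule of the actual test.
    The focal (most extreme) locus is excluded from the sum.
    """
    focal = min(range(len(z_values)), key=lambda i: z_values[i])
    neighbours = sorted(
        (i for i in range(len(z_values)) if i != focal),
        key=lambda i: (abs(i - focal), i),
    )[:k]
    if not neighbours:
        raise ValueError("at least one flanking locus is required")
    t_stat = sum(z_values[i] for i in neighbours)
    # Normal approximation for independent loci: mean 0, variance = number of loci.
    p_value = NormalDist(0.0, sqrt(len(neighbours))).cdf(t_stat)
    return focal, sorted(neighbours), t_stat, p_value


# Made-up standardized values along a region; index 4 carries the strongest reduction.
z = [0.3, -0.4, -1.1, -1.6, -3.2, -1.4, -0.9, 0.2, 0.5]
focal, used, t_stat, p_value = flanking_statistic(z, k=4)
print(focal, used, round(t_stat, 3), round(p_value, 4))
```

Excluding the focal locus mirrors the design described above: the locus that flagged the region in the first place contributes no evidence to the confirmation step.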

As in all tests that use information from linked sites, the most difficult problem remains to disentangle the effects of a selective sweep from a moderately severe, recent

Figure 4. Variability profile of 15 microsatellite loci located on chromosome 3L of D. melanogaster. x-axis: position on chromosome in base pairs. y-axis: ln Rθ as in Equation 1 for population 1 from Poland and population 2 from Uganda (see Materials and Methods). Boxes denote the ln Rθ-values at 15 microsatellites, and solid boxes refer to the subset of 10 loci that had been included in the multilocus test. This subset was selected around the one locus with the most extreme ln Rθ-value (open circle), which itself is excluded from the test statistic. The multilocus test applied to these data results in rejection of the null hypothesis ("neutral evolution") with P = 0.0334. For comparison with the single-locus test, the lower 5% quantile of the standard normal distribution is shown as a dotted horizontal line.


bottleneck. If the demographic past and other population parameters are known, it is possible to use the Monte Carlo test strategy and determine the distribution of T(K) by simulations. As this information is often not available, we focused on an alternative test strategy. Assuming no linkage among the microsatellites, we determined the increase in false positives when one population experienced a change in population size. We found that bottlenecks of intermediate severity resulted in the most pronounced increase in false positives. Interestingly, a recent reanalysis of DNA sequence polymorphism data in D. melanogaster (Haddrill et al. 2005) also identified a demographic scenario that was compatible with data for which previously multiple selective sweeps were detected (Glinka et al. 2003). Recent population bottlenecks with a severity roughly between 0.5 and 2 can produce a signature in the genome that is easily misinterpreted as selection. Similar results have been obtained for sequence variability data instead of microsatellites. Jensen et al. (2005) note that the composite-likelihood-ratio test of Kim and Stephan (2002) may yield a false-positive rate of up to 90% for certain parameter combinations.

It can be shown that a high false-positive rate of our multilocus test for the mentioned range of bottleneck severities is associated with a high variance in heterozygosity at the microsatellite loci. Recent population bottlenecks of intermediate severity can lead to a more than fourfold increase of the variance of heterozygosity compared to the variance in a constant-size population (Figure 5). In a very similar manner, selective sweeps, while decreasing average heterozygosity, also lead to an increase of the variance of heterozygosity.
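The quantity V(H) can be made concrete with a short Python sketch. It assumes the commonly used unbiased estimator of gene diversity, H = n/(n − 1) · (1 − Σ pᵢ²), and takes the sample variance of H across loci; the estimator actually used for Figure 5 is not restated in this section, and the function names and example alleles are hypothetical.

```python
from collections import Counter
from statistics import variance


def gene_diversity(alleles):
    """Unbiased gene diversity for one locus from a list of sampled alleles.

    ASSUMPTION: H = n/(n-1) * (1 - sum of squared allele frequencies).
    """
    n = len(alleles)
    if n < 2:
        raise ValueError("need at least two sampled chromosomes")
    sum_sq = sum((count / n) ** 2 for count in Counter(alleles).values())
    return n / (n - 1) * (1.0 - sum_sq)


def heterozygosity_variance(loci):
    """Sample variance of gene diversity across loci."""
    return variance([gene_diversity(alleles) for alleles in loci])


# Made-up repeat-number data for three loci, each with n = 60 chromosomes.
loci = [
    [10] * 20 + [11] * 20 + [12] * 20,  # three equally frequent alleles
    [10] * 60,                          # invariant locus, H = 0
    [10] * 30 + [11] * 30,              # two equally frequent alleles
]
h_values = [gene_diversity(a) for a in loci]
v_h = heterozygosity_variance(loci)
print([round(h, 4) for h in h_values], round(v_h, 4))
```

A mixture of invariant and highly variable loci, as in this toy example, is exactly the kind of pattern that inflates V(H) relative to a set of loci with similar diversity.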

Our test assumes that variability data from two genetically isolated populations are compared. Migration between populations generally results in a more distant common ancestor and increased genetic variability within a population. Therefore, migration might reduce the power but should not increase the false-positive rate of our test. Nevertheless, the details of the impact of extended gene flow upon false-positive rate and power of the test need further investigation. Furthermore, in classical hitchhiking models (and also in the one considered here) it is assumed that the selected allele is codominant. As has been pointed out by Teshima et al. (2006), recessivity or dominance may have a nonnegligible effect upon tests for selective sweeps.

We thank M. Thomas for many stimulating discussions and two anonymous reviewers for helpful comments. This work has been supported by grants from the Fonds zur Förderung der wissenschaftlichen Forschung to C.S. and from the German Ministry of Education and Research (FK 0312705A) and the German Science Foundation (DFG-SFB680) to T.W.

LITERATURE CITED

Bachtrog, D., M. Agis, M. Imhof and C. Schlötterer, 2000 Microsatellite variability differs between dinucleotide repeat motifs – evidence from Drosophila melanogaster. Mol. Biol. Evol. 17: 1277–1285.

Charlesworth, B., 1992 Evolutionary biology. New genes sweep clean. Nature 356: 475–476.

Dieringer, D., and C. Schlötterer, 2003 Two distinct modes of microsatellite mutation processes: evidence from the complete genomic sequences of nine species. Genome Res. 13: 2242–2251.

Ellegren, H., 2004 Microsatellites: simple sequences with complex evolution. Nat. Rev. Genet. 5: 435–445.

Glinka, S., L. Ometto, S. Mousset, W. Stephan and D. De Lorenzo, 2003 Demography and natural selection have shaped genetic variation in Drosophila melanogaster: a multilocus approach. Genetics 165: 1269–1278.

Haddrill, P. R., K. R. Thornton, B. Charlesworth and P. Andolfatto, 2005 Multilocus patterns of nucleotide variability and the demographic and selection history of Drosophila melanogaster populations. Genome Res. 15: 790–799.

Harr, B., M. Kauer and C. Schlötterer, 2002 Hitchhiking mapping: a population-based fine-mapping strategy for adaptive mutations in Drosophila melanogaster. Proc. Natl. Acad. Sci. USA 99: 12949–12954.

Hudson, R. R., 1994 How can the low levels of DNA sequence variation in regions of the Drosophila genome with low recombination rates be explained? Proc. Natl. Acad. Sci. USA 91: 6815–6818.

Hudson, R. R., 2002 Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18: 337–338.

Ihle, S., I. Ravaoarimanana, M. Thomas and D. Tautz, 2006 An analysis of signatures of selective sweeps in natural populations of the house mouse. Mol. Biol. Evol. 23: 790–797.

Jensen, J. D., Y. Kim, V. B. DuMont, C. F. Aquadro and C. D. Bustamante, 2005 Distinguishing between selective sweeps and demography using DNA polymorphism data. Genetics 170: 1401–1410.

Kaplan, N. L., R. R. Hudson and C. H. Langley, 1989 The "hitchhiking effect" revisited. Genetics 123: 887–899.

Kauer, M. O., D. Dieringer and C. Schlötterer, 2003 A microsatellite variability screen for positive selection associated with the "out of Africa" habitat expansion of Drosophila melanogaster. Genetics 165: 1137–1148.

Kayser, M., S. Brauer and M. Stoneking, 2003 A genome scan to detect candidate regions influenced by local natural selection in human populations. Mol. Biol. Evol. 20: 893–900.

Kim, Y., and W. Stephan, 2002 Detecting a local signature of genetic hitchhiking along a recombining chromosome. Genetics 160: 765–777.

Kohn, M. H., H. J. Pelz and R. K. Wayne, 2000 Natural selection mapping of the warfarin-resistance gene. Proc. Natl. Acad. Sci. USA 97: 7911–7915.

Figure 5. Variance of heterozygosity, V(H), at unlinked microsatellites under a population bottleneck of varying severity. x-axis: severity of bottleneck g = d/d (duration/depth; duration d in units of 2N). y-axis: V(H); for comparison, the shaded horizontal line shows V(H) under a constant-size population. Parameters are θ = 5, n = 60, R = 2 × 10⁻³; V(H) is calculated for 10,000 independent loci.


Maynard Smith, J., and J. Haigh, 1974 The hitch-hiking effect of a favorable gene. Genet. Res. 23: 23–35.

Nair, S., J. T. Williams, A. Brockman, L. Paiphun, M. Mayxay et al., 2003 A selective sweep driven by pyrimethamine treatment in southeast Asian malaria parasites. Mol. Biol. Evol. 20: 1526–1536.

Ohta, T., and M. Kimura, 1973 A model of mutation appropriate to estimate the number of electrophoretically detectable alleles in a finite population. Genet. Res. 22: 201–204.

Payseur, B. A., A. D. Cutter and M. W. Nachman, 2002 Searching for evidence of positive selection in the human genome using patterns of microsatellite variability. Mol. Biol. Evol. 19: 1143–1153.

Pollinger, J. P., C. D. Bustamante, A. Fledel-Alon, S. Schmutz, M. M. Gray et al., 2005 Selective sweep mapping of genes with large phenotypic effects. Genome Res. 15: 1809–1819.

Pool, J. E., V. Bauer DuMont, J. L. Mueller and C. F. Aquadro, 2006 A scan of molecular variation leads to the narrow localization of a selective sweep affecting both Afrotropical and cosmopolitan populations of Drosophila melanogaster. Genetics 172: 1093–1105.

Schlenke, T. A., and D. J. Begun, 2004 Strong selective sweep associated with a transposon insertion in Drosophila simulans. Proc. Natl. Acad. Sci. USA 101: 1626–1631.

Schlötterer, C., 1998 Microsatellites, pp. 237–261 in Molecular Genetic Analysis of Populations: A Practical Approach 2/e, edited by A. R. Hoelzel. Oxford University Press, Oxford.

Schlötterer, C., 2000 Evolutionary dynamics of microsatellite DNA. Chromosoma 109: 365–371.

Schlötterer, C., 2002 A microsatellite-based multilocus screen for the identification of local selective sweeps. Genetics 160: 753–763.

Schlötterer, C., 2003 Hitchhiking mapping – functional genomics from the population genetics perspective. Trends Genet. 19: 32–38.

Schlötterer, C., 2004 The evolution of molecular markers – Just a matter of fashion? Nat. Rev. Genet. 5: 63–69.

Schlötterer, C., and D. Dieringer, 2005 A novel test statistic for the identification of local selective sweeps based on microsatellite gene diversity, pp. 55–64 in Selective Sweep, edited by D. Nurminsky. Kluwer Academic/Plenum Publishers, New York.

Schlötterer, C., C. Vogl and D. Tautz, 1997 Polymorphism and locus-specific effects on polymorphism at microsatellite loci in natural Drosophila melanogaster populations. Genetics 146: 309–320.

Schöfl, G., and C. Schlötterer, 2004 Patterns of microsatellite variability among X chromosomes and autosomes indicate a high frequency of beneficial mutations in non-African D. simulans. Mol. Biol. Evol. 21: 1384–1390.

Sokal, R. R., and F. J. Rohlf, 1995 Biometry: The Principles and Practice of Statistics in Biological Research. W. H. Freeman, New York.

Stephan, W., T. Wiehe and M. Lenz, 1992 The effect of strongly selected substitutions on neutral polymorphism: analytical results based on diffusion theory. Theor. Popul. Biol. 41: 237–253.

Storey, J. D., 2002 A direct approach to false discovery rates. J. R. Stat. Soc. Ser. B 64: 479–498.

Teshima, K. M., G. Coop and M. Przeworski, 2006 How reliable are empirical genomic scans for selective sweeps? Genome Res. 16: 702–712.

Vigouroux, Y., M. McMullen, C. T. Hittinger, K. Houchins, L. Schulz et al., 2002 Identifying genes of agronomic importance in maize by screening microsatellites for evidence of selection during domestication. Proc. Natl. Acad. Sci. USA 99: 9650–9655.

Wootton, J. C., X. Feng, M. T. Ferdig, R. A. Cooper, J. Mu et al., 2002 Genetic diversity and chloroquine selective sweeps in Plasmodium falciparum. Nature 418: 320–323.

Communicating editor: R. Nielsen
