THE UNIVERSITY OF CHICAGO
POPULATION GENETIC APPROACHES TO THE STUDY OF SPECIATION
A DISSERTATION SUBMITTED TO
THE FACULTY OF THE DIVISION OF THE BIOLOGICAL SCIENCES AND
THE PRITZKER SCHOOL OF MEDICINE
IN CANDIDACY FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
DEPARTMENT OF HUMAN GENETICS
BY
CELINE BECQUET
CHICAGO, ILLINOIS
AUGUST 2008
The origin of isolating mechanisms is . . . a problem of fundamental importance,
and the paucity of our knowledge on this subject is felt as a glaring defect in the
whole doctrine of evolution.
−Theodosius Dobzhansky and Pius Charles Koller (1938)
Nothing in biology makes sense except in the light of evolution.
−Theodosius Dobzhansky 1973
Thank God for Evolution!
−Michael Dowd 2007
ABSTRACT
The mechanisms of speciation still escape our full understanding despite over a cen-
tury of research. Here, I take a population genetic approach to learn about the
processes by which populations diverge and eventually become distinct species, illus-
trating its application in fields varying from conservation biology to human evolution.
As a background to the dissertation, I present in Chapter 1 the state of research in
these fields at the time I started my PhD and provide brief introductions to the
projects described in the thesis. In Chapter 2, I use extensive variation data from
multiple loci to unravel the complex relationships and evolutionary history of chim-
panzee species and subspecies. In Chapter 3, I introduce a computational approach,
MIMAR, which uses summary statistics of polymorphism at multiple independently-
evolving loci and allows for intra-locus recombination to estimate the parameters of
simple speciation models. I apply this method to data from the species and subspecies
of chimpanzee, refining and completing the results of Chapter 2, as well as from the
other non-human ape species and populations. In Chapter 4, I study the behavior
of my and related computational methods when data are generated under realistic
deviations from the simple models considered to date. This work underscores the
limitation of computational approaches in resolving the question of whether the early
stages of speciation can occur in the presence of gene flow. Finally, in Chapter 5, I use
my method to learn more about the demographic history of six human populations.
ACKNOWLEDGEMENTS
I would not be at this stage of my scientific development without the guidance of
my adviser Molly Przeworski. Molly has provided me with the opportunity to work
on exciting and stimulating scientific projects, leading to my actual (even though
admittedly humble) contributions to the population genetics and evolutionary biology
communities. She has also given me countless comments and valuable advice leading
to the noticeable improvement (I hope) of my writing and presentation style to a
level that will allow me to grow into a, if not worthy, at least acceptable, scientist.
She has been a patient mentor and I hope to remember and apply during my future
experiences in the academic world everything she has taught me. Importantly, she
has been at times a remarkably understanding colleague and friend; I cannot thank
her enough for such things.
I thank all the teachers of Brown University and University of Chicago who put
up with me during the early years of my PhD. Thank you to Carol Ober, whose care,
scientific input and human understanding smoothed the emotionally rough transfer
from Brown to U of C. Thank you to all the members of the Przeworski, Pritchard
and Stephens labs (and friends of PPS, in particular the Gilad lab) for the discussions,
social framework and dynamic and friendly work environment they provided me these
last three years. In this group, I need to single out my wonderful office-mate Graham
Coop, who has put up with my flavorful swearing, aromatic food, dumb questions,
erratic moods and so much more for so long. Despite all this, Graham has always been
an extremely helpful, competent, interested, sympathetic, understanding and caring
colleague and friend. Thank you to Kevin Bullaughey for his availability in providing
technical support and helpful discussions. Thanks to Dick Hudson for agreeing to
sit as an ad hoc thesis committee member during my thesis defense, his friendly and
helpful discussions and his widely used program ms, which has been the basis of my
work since I was a baby population geneticist. I thank the members of my thesis
committee for their helpful comments and trustful support regarding my abilities in
completing this PhD: Anna Di Rienzo, Jonathan Pritchard, Matthew Stephens, and
Chung-I Wu.
Thank you to all my friends from Brown, and specifically to Carol and Walter
Casper who welcomed me as one of their own, and helped a lot in making the
transition from Europe to the US painless. Thank you to all my friends from U of C (most
of them colleagues in Human Genetics, Ecology and Evolution or in the Biological
Sciences Division) and from the Hyde Park community for the social life, relaxing
outings and more generally well needed human warmth. Thank you to Margarida
Cardoso-Moreira and Joanna Kelley for their valuable comments on earlier versions
of this dissertation and Molly again, who must have read this thesis a million times.
Thanks Lucky for the much needed and well-appreciated distractions and moral sup-
port during my thesis writing.
TABLE OF CONTENTS
ABSTRACT
ACKNOWLEDGEMENTS
LIST OF FIGURES
LIST OF TABLES
1 INTRODUCTION
2 GENETIC STRUCTURE OF CHIMPANZEE POPULATIONS
  2.1 Abstract
  2.2 Author summary
  2.3 Introduction
  2.4 Results/Discussion
    2.4.1 Cluster analysis
    2.4.2 Principal components analysis
    2.4.3 Testing for additional populations
    2.4.4 Evidence for inbreeding
    2.4.5 First and second generation hybrids
    2.4.6 Allele frequency differentiation
    2.4.7 Central and Eastern chimpanzees are most closely related in time
    2.4.8 Population separation times
  2.5 Conclusions
  2.6 Materials and Methods
  2.7 Acknowledgments
  2.8 Appendix A: Testing for inbreeding
3 A NEW APPROACH TO ESTIMATE PARAMETERS OF SPECIATION MODELS, WITH APPLICATION TO APES
  3.1 Abstract
  3.2 Introduction
  3.3 Results
    3.3.1 Performance of MIMAR under the allopatric model
    3.3.2 Comparison to IM for the case of no recombination
    3.3.3 Assessing the evidence for gene flow
    3.3.4 Sensitivity to intra-locus recombination rates
    3.3.5 Application to ape data
  3.4 Discussion
    3.4.1 Advantages and limitations of MIMAR
    3.4.2 Analyses of ape polymorphism data
  3.5 Methods
    3.5.1 Model
    3.5.2 Data summaries
    3.5.3 Estimation method
    3.5.4 Simulated data and performance analyses
    3.5.5 Analysis of ape polymorphism data
  3.6 Acknowledgments
  3.7 Appendix A: Supplemental Materials
  3.8 Appendix B: Supplemental Figures
4 CAN WE LEARN ABOUT MODES OF SPECIATION BY COMPUTATIONAL APPROACHES?
  4.1 Abstract
  4.2 Introduction
  4.3 Method
    4.3.1 The isolation model and violations
    4.3.2 The isolation-migration model and violations
    4.3.3 Estimating the parameters of the isolation-migration model
    4.3.4 Goodness-of-fit test
  4.4 Results
    4.4.1 Performance of MIMAR and IM under the isolation and isolation-migration models
    4.4.2 Effect of violations of the isolation model (Figure 4.1a)
    4.4.3 Effect of violations of the isolation-migration model (Figure 4.1b): Parapatry with gene flow only at an early stage
    4.4.4 Detecting loci with unusual history
  4.5 Discussion
  4.6 Acknowledgments
  4.7 Appendix A: Supplemental Materials
  4.8 Appendix B: Supplemental Tables
  4.9 Appendix C: Supplemental Figure
5 ESTIMATING THE DEMOGRAPHIC PARAMETERS OF HUMAN POPULATIONS
  5.1 Abstract
  5.2 Introduction
  5.3 Methods
    5.3.1 Raw data description
    5.3.2 Data processing
    5.3.3 Analyses
    5.3.4 Goodness-of-fit test
  5.4 Results
    5.4.1 Split between African populations
    5.4.2 Split between African and non-African populations
    5.4.3 Split between non-African populations
  5.5 Conclusions
  5.6 Acknowledgments
  5.7 Appendix A: Supplemental Figures
6 CONCLUSIONS
REFERENCES
LIST OF FIGURES
2.1 STRUCTURE analysis, blinded to population labels, recapitulates the reported population structure of the chimpanzees.
2.2 PCA, without using population labels, divides the 84 chimpanzees into four apparently discontinuous populations of Western, Central, Eastern, and bonobo.
2.3 The significant fourth eigenvector from the analysis of all 84 chimpanzees is correlated to the first eigenvector from analysis of Western chimpanzees only.
2.4 Goodness-of-fit of the SMM.
2.5 Distribution of mean heterozygosity within individuals (Hw, orange).
2.6 Distribution of mean squared difference in repeat units within individuals (Rw, orange).
3.1 The “isolation-migration” model.
3.2 Performance of MIMAR (x-axis) and IM (y-axis).
3.3 Performance of MIMAR in the presence of gene flow.
3.4 Sensitivity of MIMAR to intra-locus recombination.
3.5 Smoothed marginal posterior distributions estimated by MIMAR from bonobo and common chimpanzee polymorphism data.
3.6 Smoothed marginal posterior distributions estimated by MIMAR from the common chimpanzee subpopulations polymorphism data.
3.7 Smoothed marginal posterior distributions estimated by MIMAR from the gorilla (a) and orangutan (b) subspecies polymorphism data.
3.8 Smoothed posterior distributions estimated by MIMAR from simulated data sets.
3.9 Smoothed posterior distributions estimated by IM (black) and MIMAR (grey) from simulated data sets.
3.10 Goodness-of-fit of the isolation-migration model for the ape species and subspecies data.
4.1 Simple models of speciation.
4.2 Estimates provided by MIMAR and IM from data simulated under the isolation (Figure 4.1a) and isolation-migration models (Figure 4.1b).
4.3 Estimates provided by MIMAR (a and c) and IM (b and d) from data simulated under models of isolation from a structured ancestral population (Figure 4.1c).
4.4 Estimates provided by MIMAR (a and c) and IM (b and d) from data simulated under models of divergence in isolation followed by secondary contact (Figure 4.1d).
4.5 Estimates provided by MIMAR (a and c) and IM (b and d) when data are simulated under models of isolation with migration only at an early stage (Figure 4.1e).
4.6 Estimates provided by MIMAR when the data at one locus with a different history are simulated with models i (blue) and ii (red).
5.1 Raw data description.
5.2 Cartoon summarizing pairwise models estimated by MIMAR for six human populations.
5.3 Estimated model from the Biaka Pygmies and Mandenka population data.
5.4 Estimated model from the Biaka Pygmies and Tsumkwe San population data.
5.5 Estimated model from the Mandenka and Tsumkwe San population data.
5.6 Estimated model from the French Basque and Biaka Pygmies population data.
5.7 Estimated model from the Han Chinese and Biaka Pygmies population data.
5.8 Estimated model from the Nan Melanesian and Biaka Pygmies population data.
5.9 Estimated model from the French Basque and Mandenka population data.
5.10 Estimated model from the Han Chinese and Mandenka population data.
5.11 Estimated model from the Nan Melanesian and Mandenka population data.
5.12 Estimated model from the French Basque and Tsumkwe San population data.
5.13 Estimated model from the Han Chinese and Tsumkwe San population data.
5.14 Estimated model from the Nan Melanesian and Tsumkwe San population data.
5.15 Estimated model from the French Basque and Han Chinese population data.
5.16 Estimated model from the French Basque and Nan Melanesian population data.
5.17 Estimated model from the Han Chinese and Nan Melanesian population data.
5.18 Isolation-migration models estimated by MIMAR with a poor fit to the human population data.
LIST OF TABLES
2.1 Details of the 84 samples in this study.
2.2 Individuals with >5% ancestry from more than one cluster.
2.3 Information on the top 30 microsatellites used for this study ranked by informativeness.
2.4 Genetic differentiation among populations.
2.5 Eastern and Central chimpanzees are phylogenetically most closely related.
2.6 Estimates of divergence time from ASD.
2.7 Mean heterozygosity within individuals from each population.
2.8 Mean squared difference in repeat units within individuals in each population.
3.1 Performance of MIMAR and IM.
3.2 Performance of MIMAR when detecting gene flow.
3.3 Results for chimpanzee species.
3.4 Results for chimpanzee subspecies.
3.5 Results for gorilla and orangutan subspecies.
4.1 Proportion of analyses in which non-zero gene flow was detected.
4.2 Results of the locus-specific goodness-of-fit tests.
4.3 Proportion of MIMAR and IM analyses with parameter estimates within two fold of their true value when the data are simulated under the isolation and isolation-migration models (Figure 4.1a−b in main text).
4.4 Proportion of MIMAR analyses with parameter estimates within two fold of their true value when data are simulated under models of isolation from a structured ancestral population (Figure 4.1c in main text).
4.5 Proportion of IM analyses with parameter estimates within two fold of their true value when data are simulated under models of isolation from a structured ancestral population (Figure 4.1c in main text).
4.6 Proportion of MIMAR analyses with parameter estimates within two fold of their true value when data are simulated under models of isolation followed by secondary contact (Figure 4.1d in main text).
4.7 Proportion of IM analyses with parameter estimates within two fold of their true value when data are simulated under models of isolation followed by secondary contact (Figure 4.1d in main text).
4.8 Proportion of MIMAR analyses with parameter estimates within two fold of their true value when data are simulated under models of isolation with migration only at an early stage (Figure 4.1e in main text).
4.9 Proportion of IM analyses with parameter estimates within two fold of their true value when data are simulated under models of isolation with migration only at an early stage (Figure 4.1e in main text).
4.10 Proportion of MIMAR analyses with parameter estimates within two fold of their true value when data at a locus with an unusual history are simulated under models i and ii.
4.11 Results of the best locus-specific goodness-of-fit tests.
5.1 Estimates of the descendant effective population sizes.
5.2 Estimates of the ancestral effective population sizes.
5.3 Estimates of the split time (lower half) and gene flow rate (upper half) for each population pair.
5.4 Estimates from joint posterior distributions for the African populations.
5.5 Estimates from joint posterior distributions between the Biaka Pygmies and non-African populations.
5.6 Estimates from joint posterior distributions between the Mandenka and non-African populations.
5.7 Estimates from joint posterior distributions between the Tsumkwe San and non-African populations.
5.8 Estimates from joint posterior distributions for the non-African populations.
CHAPTER 1
INTRODUCTION
The biological species concept, “groups of interbreeding natural populations that
are reproductively isolated from other such groups”, introduced by Mayr (1963) is
now widely accepted. However, this definition is not always applicable, and
controversy remains about whether certain groups of individuals are species,
subspecies or just different populations of the same species. In particular, when
groups of individuals are geographically isolated, it is not always clear whether they
fit the definition of species (Price, 2007), i.e., whether they are reproductively iso-
lated and thus would not merge into a homogeneous group upon secondary contact
(Coyne and Orr, 2004d). Whether they can interbreed successfully can sometimes
be evaluated if there is a recent hybrid zone between the two species. For instance,
human disturbance may bring into contact what turn out to be “true” species. Since
reproductive isolation will lead to reduced hybrid fitness, the hybrid zone thus
created should remain stable over time or quickly disappear if reinforcement increases
the reproductive isolation (Coyne and Orr, 1989). In other cases, human disturbance
can bring into contact groups in the process of speciating and not fully reproductively
isolated, thus threatening biodiversity if the genetic differentiation is lost upon
the merging of the nascent species. Laboratory crosses can also be performed to
measure the fitness of the hybrids between species and study the processes of repro-
ductive isolation. Although molecular research has yielded important insight on the
processes of speciation (Orr, 2005), the reproductive isolation of some species may be
impossible to study in laboratory conditions. Notable examples are cases of ecological
speciation, when the ecology cannot be modeled in the laboratory (Coyne and Orr,
2004b) or cases when research is prevented because of ethical and practical reasons,
as in the case of the great apes.
The great apes are an example of controversial species and subspecies. They are
classified into five accepted species: human, gorilla, orangutan, as well as two species
of chimpanzee: the common chimpanzee (Pan troglodytes) and its sister species,
the Pygmy chimpanzee or bonobo (P. paniscus), which occur on either side of the
Congo River. This body of water has never experienced a major drought since its
formation 1.5−3.5 million years ago (Mya) (Beadle, 1981; Myers Thompson, 2003),
so that − given that apes cannot swim − it represents a complete barrier to gene
flow. But should chimpanzee and bonobo be considered different species? There
have been reports of hybrids born in captivity and there is no clear evidence that the
hybrids have reduced fitness (de Waal, 1997), as would be expected if they were true
species. In addition, the non-human great ape species are often subdivided further
into groups that are defined, depending on the classification, as species, subspecies
or populations. For example, the common chimpanzee species is usually described
with three subspecies labels: the Western (P. t. verus), Central (P. t. troglodytes),
and Eastern (P. t. schweinfurthii) chimpanzee (Hill, 1969). Until my and others'
contributions to the field (Becquet et al., 2007), this classification was mostly based
on their separation by major bodies of water and other geographical barriers, with
little evidence from morphological, behavioral or even genetic studies. This question
of what constitutes a species as opposed to a locally adapted population is not only
semantic but is central to conservation biology and the attempt to focus conservation
efforts on endangered species such as the chimpanzees and all other non-human great
apes.
Linked to the question of whether some groups of individuals are true species are
unresolved and highly debated questions in evolutionary biology, such as how many
loci are involved in the emergence of reproductive isolation between nascent species
and whether speciation can be initiated in the presence of gene flow (Wu, 2001b; Mayr,
2001; Orr, 2001; Wu, 2001a). Until recently, allopatric speciation was widely accepted
as the only or at least most common mode of speciation. Allopatry assumes that
divergence is initiated by and requires a phase of total geographical isolation between
the emerging species. As there is no gene flow during the early stage of speciation
under this model, the genomes are assumed to diverge homogeneously and thus many
genes may become involved in reproductive isolation as a by-product of divergence.
In contrast, the recently proposed “genic view of speciation” assumes that only a
few genes are required for species to become reproductively isolated (Wu, 2001b), as
suggested by the functions of the examples of reproductive isolation factors that have
been characterized to date (Wu, 2001a; Orr et al., 2004). This view allows for a pair of
species to experience some gene flow at the early stage of their divergence, as in the
parapatric model of speciation, in which two species diverge while occupying adjacent
geographical areas. In this model of species formation, natural selection is the force
that actively leads to the accumulation of reproductive isolation factors, a hypothesis
that has recently received support from the observation that the evolution of the few
characterized reproductive isolation factors was driven by positive Darwinian selection
(e.g., Orr, 2005).
Whether the early stages of speciation can occur with gene flow remains contro-
versial. Few clear-cut and undisputed cases of parapatric speciation have been docu-
mented, in part because when gene flow is detected, it is difficult to rule out a model
of allopatry followed by secondary contact (Coyne and Orr, 2004a). A large portion of
my PhD research focused on developing and using computational approaches to help
understand speciation mechanisms in an attempt to answer some of these enduring
questions in evolutionary biology.
When I started my PhD, I was interested in deciphering the population genetics
and demographic history of ape species and populations (Chapters 2 and 3). At
the time, the largest data set available for the common chimpanzee was composed of
about 50 short loci sampled for genetic variation in fewer than 20 individuals of the
three commonly defined subspecies of chimpanzee (Yu et al., 2003). In contrast, the
data available for humans at the time included hundreds of highly variable markers
genotyped in a thousand individuals from 52 populations (Rosenberg et al., 2002).
Thus, while the extent of population differentiation and structure was starting to
be well characterized in humans, the same was far from true in our closest living
evolutionary relatives, the chimpanzees.
The second Chapter of this dissertation is an article published in PLoS Genetics
(Becquet et al., 2007), which describes the largest data set collected to date in
chimpanzees: approximately 300 microsatellite loci genotyped in over 80 chimpanzees and
bonobos. We collected the data specifically to inquire about the Genetic Structure of
Chimpanzee Populations. This work was supervised by Molly Przeworski and David
Reich and was done in collaboration with Nick Patterson and Anne Stone. I per-
formed most of the data analyses and wrote sections of the manuscript describing the
methods and results. Specifically, I applied the program STRUCTURE (Pritchard
et al., 2000) to these data and showed that the subspecies labels of chimpanzee cor-
respond to well defined genetic populations, with little evidence of recent admixture
between them in the wild. I also used simple statistical approaches to gain insights
about the demographic history of the chimpanzee populations and to estimate the
divergence times between the chimpanzee subspecies and species. The data suggested
that the Eastern and Central populations are more closely related to each other than either is to the
Western chimpanzee. However, these results are only qualitative because microsatel-
lite data provide unreliable estimates of demographic parameters such as divergence
times and effective population sizes owing to their complex, partly unknown − and
thus difficult to model − mutational mechanisms.
Demographic parameters are more reliably estimated when computational ap-
proaches are applied to nucleotide polymorphism data sampled at multiple, independ-
ently-evolving and ideally neutral loci. At the time I started my PhD, several such
computational methods had been developed to estimate the parameters of simple di-
vergence models in an attempt to learn about speciation mechanisms (Wakeley and
Hey, 1997; Kliman et al., 2000; Nielsen and Wakeley, 2001; Hey and Nielsen, 2004;
Leman et al., 2005; Putnam et al., 2007). These methods used genetic variation across
loci, usually polymorphism data sampled in a pair of closely related populations or
recently diverged species, and fit an isolation model (a simplification of allopatry, e.g.,
Wakeley and Hey, 1997) or an isolation-migration model (a simplification of parap-
atry, e.g., Hey and Nielsen, 2004) to the data. However, all the available methods
had important drawbacks and limitations, ranging from their use of a single locus to
ignoring gene flow between populations since their split and/or not allowing for intra-
locus recombination, all of which can lead to biased estimates of the demographic
parameters of interest (Takahata and Satta, 2002; Hey and Nielsen, 2004).
Chapter 3 is an article published in Genome Research (Becquet and Przeworski,
2007), which describes a new approach to estimate parameters of speciation mod-
els, with application to apes. The purpose of this study was to develop a computa-
tional method that would not have the limitations of previous approaches. I devel-
oped the program MIMAR (Markov Chain Monte Carlo estimation of the Isolation-
Migration model Allowing for Recombination) to estimate the parameters of the
isolation-migration model (Hey and Nielsen, 2004): the effective population sizes
of two recently diverged populations and of their common ancestor, their divergence
time and the constant rate of gene flow since they split. MIMAR considers summaries
of polymorphism data sampled in two populations adapted from statistics known
to contain information about the parameters of interest (Wakeley and Hey, 1997).
Importantly, in contrast to other approaches that can also use data from multiple
independently-evolving loci, MIMAR allows for intra-locus recombination.
I carried out a simulation study and encouragingly found that when there is no
intra-locus recombination the method performs similarly to IM, an approach that uses
the full polymorphism data from non-recombining loci (Hey and Nielsen, 2004). I also
confirmed the tendency of computational methods to provide biased estimates of the
parameters of interest when intra-locus recombination is ignored. I further illustrated
the potential of MIMAR by applying it to data from the species and subspecies of great
apes. The results suggested that the isolation-migration model provides a reasonable
approximation to the demographic histories of the great ape populations and species.
In accordance with the results of Chapter 2, I found that Western chimpanzees
diverged approximately 440 thousand years ago (Kya), before the split between Central
and Eastern chimpanzees. However, while the study presented in Chapter 2 did
not provide evidence of recent migration between the chimpanzee subspecies, MIMAR
suggested that these populations have experienced some historical gene flow. I fur-
ther refined previous estimates that bonobo and chimpanzee diverged in allopatry
∼850 Kya (Won and Hey, 2005; Fischer et al., 2006), and that Western and Eastern
gorilla subspecies split ∼90 Kya with subsequent gene flow (Fischer et al., 2006; Thal-
mann et al., 2006). Finally, I provided the first population parameter estimates for
the Bornean and Sumatran orangutan subspecies, which appear to have experienced
gene flow since they split over 250 Kya.
In Chapter 3, I also observe that when computational methods such as MIMAR and
IM are applied to real data, they often provide much larger estimates of the effective
size of the ancestral population than of the descendant populations. Chapter 4 is a
manuscript in preparation for Evolution, entitled: Can we learn about modes of speci-
ation by computational approaches? In this study, I inquired whether the observations
from Chapter 3 may be due to violated assumptions of the model considered by the
methods, specifically the assumptions of panmixia in the ancestral population and
constant gene flow since the split. More generally, I conducted a simulation study to
assess the reliability of MIMAR and IM when a realistic complication is ignored in the
model. I found that when there is population structure in the ancestral population
or gene flow only at the early stage of divergence, IM tends to provide estimates of
the ancestral effective population size that are biased upwards. In turn, when there
is structure in the ancestral population or a phase of isolation followed by a recent
secondary contact, both methods detect some gene flow, potentially lending spurious
support for parapatric speciation. I further introduce a goodness-of-fit test that could
potentially detect when the estimated model is inappropriate; unfortunately, this
simple test often suggests that the estimated model fits the data even when it is
incorrect. Taken together, these results suggest that existing computational approaches
may be of limited use in distinguishing parapatry from allopatry.
However, for cases where a parapatric model of speciation is appropriate, I introduce
a simple locus-specific goodness-of-fit test that can help identify loci linked to candi-
date reproductive isolation factors.
Despite the limitations of available computational approaches, they can still en-
lighten us about the demographic history of recently diverged populations, including
our own species (e.g., Hey, 2005). As an example, in Chapter 5, I present a study
aimed at Estimating the demographic parameters of human populations. This Chapter
is an unpublished analysis, which is the first step of a collaboration with Jeff Wall at
UCSF, who provided the data. In this study, I applied MIMAR to data from six human
populations and found estimates roughly consistent with what was known about human
demography (e.g., Cavalli-Sforza and Feldman, 2003). In particular, I estimate
the divergence time between African and non-African populations to be ∼40−60 Kya.
MIMAR also detected evidence of extensive migration between populations, even those
that are geographically distant, suggesting that gene flow should not be ignored in models
of human demographic history. In light of the results from Chapter 4, it is not clear
whether these results reflect ongoing gene flow, range expansion, or recent migration.
The following chapters describe the research projects that compose my PhD. Un-
fortunately, these chapters do not describe the pleasure that I had and the knowledge
that I gained in the completion of these projects. I feel very fortunate to have worked
with remarkable scientists on exciting and relevant questions about the demographic
histories of the great apes, including humans (Chapters 2 and 5). The development of
the program MIMAR represents the bulk of my PhD and is my main contribution to
the scientific community (Chapter 3). Despite the limitations of this − and similar
− methods that I explore in Chapter 4, MIMAR remains the only method available
that estimates parameters of isolation-migration models and allows for intra-locus
recombination. MIMAR might remain useful for some time in fields of study for which
there are no fully sequenced and annotated genomes. It has been especially satisfying
and rewarding to see MIMAR used increasingly to answer new population genetics and
speciation questions and to discuss with users ways to improve the method. I hope
the reader will enjoy this dissertation as much as I took pleasure in the completion
of my PhD.
CHAPTER 2
GENETIC STRUCTURE OF CHIMPANZEE
POPULATIONS
2.1 Abstract
Little is known about the history and population structure of our closest living rela-
tives, the chimpanzees, in part because of an extremely poor fossil record. To address
this, we report the largest genetic study of the chimpanzees to date, examining 310
microsatellites in 84 common chimpanzees and bonobos. We infer three common
chimpanzee populations, which correspond to the previously defined labels of “Western,”
“Central,” and “Eastern,” and find little evidence of gene flow between them.
There is tentative evidence for structure within Western chimpanzees, but we do
not detect distinct additional populations. The data also provide historical insights,
demonstrating that the Western chimpanzee population diverged first, and that the
Eastern and Central populations are more closely related in time.
2.2 Author summary
Common chimpanzees have been traditionally classified into three populations: West-
ern, Central, and Eastern. While the morphological or behavioral differences are very
small, genetic studies of mitochondrial DNA and the Y chromosome have supported
the geography-based designations. To obtain a crisp picture of chimpanzee popula-
tion structure, we gather far more data than previously available: 310 microsatellite
markers genotyped in 78 common chimpanzees and six bonobos, allowing a high reso-
lution genetic analysis of chimpanzee population structure analogous to recent studies
that have elucidated human structure. We show that the traditional chimpanzee pop-
ulation designations − Western, Central, and Eastern − accurately label groups of
individuals that can be defined from the genetic data without any prior knowledge
about where the samples were collected. The populations appear to be discontinuous,
and we find little evidence for gradients of variation reflecting hybridization among
chimpanzee populations. Regarding chimpanzee history, we demonstrate that Central
and Eastern chimpanzees are more closely related to each other in time than either
is to Western chimpanzees.
2.3 Introduction
Standard taxonomies recognize two species of chimpanzees: bonobos (Pan paniscus)
and common chimpanzees (P. troglodytes), whose current ranges in Africa do not over-
lap. Common chimpanzees have been classified further into three populations or sub-
species based on their separation by geographic barriers (generally rivers): Western
(P. troglodytes verus), Central (P. t. troglodytes), and Eastern (P. t. schweinfurthii)
(Hill, 1969; Groves, 2001). While there are no or only slight morphological or behav-
ioral differences among the common chimpanzees (Albrecht and Miller, 1993; Shea
et al., 1993; Fischer et al., 2006), genetic studies of mitochondrial DNA (mtDNA)
(Morin et al., 1994; D’Andrade and Morin, 1996) and the Y chromosome (Stone
et al., 2002) have supported the geography-based population designations (Morin
et al., 1994; Stone et al., 2002), and mtDNA studies have led to the proposal of a
fourth common chimpanzee subspecies, P. t. vellorosus, around the Sanaga river in
Cameroon (Gonder et al., 1997, 2006). However, studies of single loci provide at best
partial information about history and population subdivision (Hudson and Coyne,
2002); for example, analyses of X and Y chromosome datasets (Kaessmann et al.,
1999) suggest that genetic diversity is highest in Central and lowest in Western chim-
panzees, while mtDNA suggests a different pattern (Stone et al., 2002). Resequenc-
ing and microsatellite-based datasets have also provided inconsistent evidence about
whether Eastern chimpanzees are more diverse than bonobos (Fischer et al., 2006;
Reinartz et al., 2000). To obtain a clear picture of chimpanzee population structure,
a large number of independently-evolving regions should be studied simultaneously.
The most comprehensive study of chimpanzees to date − including multiple loci
and samples from Western, Central, and Eastern chimpanzees and bonobos − found
few fixed genetic differences among chimpanzee populations and estimated autosomal
FST values between populations of 0.09−0.32, overlapping the range of differentiation
seen in humans. Fischer et al. (2006) argued from these results that there are no
chimpanzee subspecies and suggested instead that chimpanzee variation might be
characterized by continuous gradients of gene frequencies, with ongoing gene flow
across groups. This and the other multi-locus datasets that have been published to
date (Yu et al., 2003; Reinartz et al., 2000; Fischer et al., 2004) are small compared
with recent genetic assessments of human structure (Rosenberg et al., 2002), however,
and have not yet provided a clear picture. For example, mtDNA and Y chromosome
data have been interpreted as showing discontinuity among chimpanzee populations
(Morin et al., 1994; D’Andrade and Morin, 1996; Stone et al., 2002; Gonder et al.,
1997, 2006), potentially at odds with the model proposed by Fischer et al. (2006).
An accurate picture of chimpanzee population structure is also crucial for under-
standing their history. For example, Won and Hey (2005) estimated that common
chimpanzees and bonobos split ∼0.9 million years ago (Mya), and Western and Cen-
tral chimpanzees split ∼0.42 Mya, with low levels of migration from Western to Cen-
tral since that time. This analysis, which assumed that the populations split from a
common ancestor, would need to be reevaluated if the data were better described by
a model of stable isolation-by-distance (Fischer et al., 2006).
To clarify chimpanzee population structure, we gathered an order-of-magnitude
larger dataset than has previously been available. This allowed us to test whether
genetic data alone can be used to assign chimpanzees to the categories of Western,
Central, and Eastern chimpanzees, whether there is evidence for substantial admix-
ture between groups, and whether there is unrecognized substructure among the
chimpanzees (Gagneux, 2002).
2.4 Results/Discussion
We analyzed data from 310 polymorphic microsatellites in 84 individuals: 78 com-
mon chimpanzees and six bonobos. These samples were chosen to include multiple
representatives of each putative population. Of the common chimpanzees, 41 were
reported as Western, 16 as Central, seven as Eastern, three as hybrids, and 11 did not
have a reported subpopulation (Table 2.1). This dataset was designed to include a
similar number of genetic markers (and in fact included many of the same markers) as
the dataset analyzed by Rosenberg et al. (2002) to elucidate human population struc-
ture. Because of high mutation rates, microsatellite alleles often have arisen multiple
times, and hence it is difficult to resolve the genealogy at any locus. A benefit of the
high mutation rate, however, is that microsatellites provide more information about
recent historical events per locus compared with resequencing data (Rosenberg et al.,
2003).
Table 2.1: Details of the 84 samples in this study. F, female; M, male; IPBIR, Integrated Primate Biomaterials and Information Resource; ISIS, International Species Identification System; Pt, P. troglodytes.
ID | Other Identifier(s) | Sex | Sample Source | Reported Category | After Genetic Analysis | Reported Birthplace | Classification Based on mtDNA/Y Chromosome Genotype
1 | Amelie | F | Leipzig | Central | Central | Haut-Ogooue |
2 | Chiquita | F | Leipzig | Central | Central | Haut-Ogooue |
3 | Berthe | F | Leipzig | Central | Central | Captive born |
4 | Bakoumba | M | Leipzig | Central | Central | Haut-Ogooue | Y, Central
5 | Noemie | F | Leipzig | Central | Central | Estuaire |
6 | Clara | F | Leipzig | Central | Central | Gabon |
7 | Minkebe | M | Leipzig | Central | Central | Captive born |
8 | Masuku | F | Leipzig | Central | Central | Captive born |
9 | Gemini | F | Leipzig | Central | Central | Estuaire |
10 | Henri | M | Leipzig | Central | Central | Nyanga | Y, Central
11 | Ivindo | M | Leipzig | Central | Central | Ogooue-Ivindo | Y, Central
12 | Moanda | M | Leipzig | Central | Central | Haut-Ogooue | Y, Central
13 | Lalala | F | Leipzig | Central | Central | Estuaire |
14 | Makata | M | Leipzig | Central | Central | Haut-Ogooue/Ogooue-Ivindo | Y, Central
15 | Makokou | F | Leipzig | Central | Central | Captive born |
16 | Pt 197, stud number 277, IPBIR 496 | M | Arizona | Central | Central | Wild caught, origin unknown | mtDNA, Central
17 | Akila | F | Leipzig | Eastern | Mostly or all Eastern | Burundi |
18 | Alley | F | Leipzig | Eastern | Eastern | Southeast Congo |
19 | Amizero | F | Leipzig | Eastern | Eastern | Burundi |
20 | Annie | F | Leipzig | Eastern | Eastern | Northeast Congo |
21 | Judy | F | Leipzig | Eastern | Eastern | Southeast Congo |
22 | Mimi | F | Leipzig | Eastern | Eastern | Northeast Congo |
23 | Pt 169, ISIS number 3850 | F | Arizona | Eastern | Western/Eastern | Captive born | mtDNA, Eastern
24 | Annaclara | F | Leipzig | Western | Western | Captive born |
25 | Frits | M | Leipzig | Western | Western | Sierra Leone |
26 | Hilko | M | Leipzig | Western | Western | Captive born |
27 | Lisbeth | F | Leipzig | Western | Western | Sierra Leone |
28 | Louise | F | Leipzig | Western | Western | Captive born |
29 | Marco | M | Leipzig | Western | Western | Sierra Leone |
30 | Oscar | M | Leipzig | Western | Western | Captive born |
31 | Regina | F | Leipzig | Western | Western | Sierra Leone |
32 | Socrates | M | Leipzig | Western | Western | Captive born |
33 | Sonja | F | Leipzig | Western | Western | Sierra Leone |
34 | Yoran | M | Leipzig | Western | Western | Captive born |
35 | Yvonne | F | Leipzig | Western | Western | Sierra Leone |
36 | Pt 81, studbook number 380 | F | Arizona | Western | Western | Sierra Leone | mtDNA, Western; Y, Western
37 | Pt 82, studbook number 341 | M | Arizona | Western | Western | Sierra Leone | mtDNA, Western; Y, Western
38 | Pt 83, studbook number 459 | F | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western
39 | Pt 87, ISIS number 1149 | M | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
40 | Pt 88, ISIS number 1144 | M | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
41 | Pt 90, ISIS number 1339 | M | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
42 | Pt 97, ISIS number 2036 | M | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
43 | Pt 98, ISIS number 2724 | M | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
44 | Pt 99, studbook number 561 | M | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
45 | Pt 100, ISIS number 3000 | M | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
46 | Pt 101, ISIS number 3214 | M | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
47 | Pt 102, ISIS number 1068 | M | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
48 | Pt 103, ISIS number 3340 | M | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
49 | Pt 104, ISIS number 3339 | M | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
50 | Pt 105, ISIS number 2435 | M | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
51 | Pt 106, stud number 430, ISIS 2377 | M | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
52 | Pt 107, stud number 142, ISIS 2474 | F | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western
53 | Pt 112, stud number 314 | M | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
54 | Pt 114, ISIS number 2412 | M | Arizona | Western | Western/Central | Wild caught, origin unknown | mtDNA, Nigerian; Y, Western
55 | Pt 115, ISIS number 2738 | M | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
56 | Pt 117, ISIS number 1641 | M | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
57 | Pt 120, ISIS number 2216 | M | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
58 | Pt 121, ISIS number 2549 | M | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
59 | Pt 122, ISIS number 2417 | M | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
60 | Pt 124, ISIS number 2404 | M | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
61 | Pt 125, ISIS number 2554 | M | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
62 | Pt 126, ISIS number 1818 | M | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
63 | Coriell NA03448 | M | Coriell/IPBIR | Western | Western | Captive born | mtDNA, Western; Y, Western
64 | Coriell NA03450 | M | Coriell/IPBIR | Western | Western | Captive born | mtDNA, Western; Y, Western
65 | Marilyn (Coriell NS03612) | F | Coriell/IPBIR | Unreported | Western/Central | Captive born | mtDNA, Western
66 | Kipper (Coriell NS03629) | M | Coriell/IPBIR | Unreported | Western | Captive born | mtDNA, Western; Y, Western
67 | Gay (Coriell NS03639) | F | Coriell/IPBIR | Unreported | Western | Captive born | mtDNA, Nigerian
68 | Juan (Coriell NS03641) | M | Coriell/IPBIR | Unreported | Western/Central | Captive born | Y, Western
69 | Lizzie (Coriell NS03646) | F | Coriell/IPBIR | Unreported | Mostly or all Western | Captive born | mtDNA, Western
70 | Sheena (Coriell NS03650) | F | Coriell/IPBIR | Unreported | Western | Captive born | mtDNA, Western
71 | Jimoh (Coriell NS03657) | M | Coriell/IPBIR | Unreported | Western | Captive born | mtDNA, Western; Y, Western
72 | Alicia (Coriell NS03659) | F | Coriell/IPBIR | Unreported | Western | Captive born |
73 | Garbo (Coriell NS03660) | F | Coriell/IPBIR | Unreported | Western | Captive born |
74 | Tank (Coriell NS03623) | M | Coriell/IPBIR | Unreported | Western | Captive born | mtDNA, Western; Y, Western
75 | Kasey (Coriell NS03656) | F | Coriell/IPBIR | Unreported | Western | Captive born | mtDNA, Western
76 | Pt 13 | M | Arizona | Hybrid | Mostly or all Central | Captive born | mtDNA, Central; Y, Eastern
77 | Pt 113, stud number 721 | M | Arizona | Hybrid | Western/Central | Captive born | mtDNA, Central; Y, Western
78 | Pt 123, stud number 662 | M | Arizona | Hybrid | Mostly Central | Captive born | mtDNA, Nigerian; Y, Central
79 | Ulindi | F | Leipzig | Bonobo | Bonobo | Captive born |
80 | Yasa | F | Leipzig | Bonobo | Bonobo | Captive born |
81 | IPBIR number 092 | F | Coriell/IPBIR | Bonobo | Bonobo | Captive born |
82 | IPBIR number 251 | M | Coriell/IPBIR | Bonobo | Bonobo | Captive born |
83 | IPBIR number 367 | F | Coriell/IPBIR | Bonobo | Bonobo | Captive born |
84 | IPBIR number 661 | M | Coriell/IPBIR | Bonobo | Bonobo | Captive born |
2.4.1 Cluster analysis
To explore the genetic evidence for subdivision among chimpanzees, we first applied
the program STRUCTURE to the dataset (Materials and Methods; Pritchard et al.,
2000). Each STRUCTURE analysis requires a hypothesized number of populations
and assigns individuals to these populations − without using any pre-assigned popula-
tion labels − in a way that minimizes the amount of Hardy-Weinberg disequilibrium
and linkage disequilibrium across widely separated markers. The analysis strongly
supports the division of the samples of common chimpanzees and bonobos into at
least four discontinuous subpopulations. Although the software does not provide a
formal statistical procedure for choosing the number of clusters, Pritchard et al. (2000)
suggest using the model with the highest likelihood. When we ran the software as-
suming models of one to six clusters (averaging results for three random number seeds
for each model), the likelihood of the data for four clusters was higher than for any
other model. The inferred clusters correspond remarkably well to the reported labels
of Western, Central, Eastern, and bonobo, and also agree well with the assignments
based on mtDNA or Y chromosome haplotypes (Figure 2.1; Table 2.1).
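The model-selection step described above (run each candidate number of clusters with several random seeds, average, and keep the model with the highest likelihood) can be sketched as follows. This is a minimal illustration, not the procedure as implemented in STRUCTURE: the function name `best_k` and all log-likelihood values are hypothetical and were chosen only so that four clusters come out on top, as reported in the text.

```python
from statistics import mean

# Hypothetical log-likelihoods, ln P(data | K), for three random seeds per K.
# These numbers are invented for illustration; they are not results of this study.
log_likelihoods = {
    1: [-31250.4, -31251.0, -31249.8],
    2: [-29810.2, -29812.7, -29809.9],
    3: [-29120.5, -29118.3, -29121.1],
    4: [-28640.0, -28642.6, -28639.1],
    5: [-28705.3, -28820.9, -28699.4],
    6: [-28790.8, -28911.2, -28850.0],
}


def best_k(runs):
    """Average the log-likelihood over seeds and return the best K with the averages."""
    averaged = {k: mean(values) for k, values in runs.items()}
    return max(averaged, key=averaged.get), averaged


chosen_k, averaged = best_k(log_likelihoods)
print(chosen_k)
```

Averaging over seeds before comparing models reduces the chance that a single poorly converged run determines the chosen number of clusters.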
The multi-locus dataset also provides power to identify individuals with multiple
ancestries and to assess their ancestry proportions. This cannot be done reliably
using studies of single loci such as the Y chromosome or mtDNA, because individuals
can in fact be descendants of multiple ancestral populations without carrying DNA
from some of the populations at the locus being studied. The STRUCTURE analysis
identified nine individuals as having more than 5% genetic ancestry from two clusters
(Table 2.2).
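The criterion used here, more than 5% estimated ancestry from two clusters, is simple enough to state as code. The sketch below assumes ancestry estimates are stored as fractions in a dictionary per individual; the function name `admixed` is hypothetical, and the three example individuals use the ancestry proportions quoted in the text for numbers 54, 78, and 67.

```python
def admixed(ancestry, threshold=0.05):
    """Return True if more than `threshold` of the ancestry comes from at least two clusters."""
    return sum(1 for fraction in ancestry.values() if fraction > threshold) >= 2


# Ancestry proportions as quoted in the text for three individuals.
estimates = {
    54: {"Western": 0.55, "Central": 0.45},
    78: {"Western": 0.16, "Central": 0.84},
    67: {"Western": 1.00},
}

flags = {individual: admixed(ancestry) for individual, ancestry in estimates.items()}
print(flags)
```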
Of the individuals identified by STRUCTURE as likely hybrids, seven were born
in captivity, and just two were wild-caught, consistent with what would be expected
if there were low rates of migration between Central and Western chimpanzees in
the wild (Table 2.2; see also Won and Hey, 2005). Interestingly, individual num-
ber 54, one of two wild-caught individuals identified as a hybrid by this analysis,
has an mtDNA haplotype hypothesized to correspond to P. t. vellorosus (Gonder
et al., 1997). The two captive-born chimpanzees with the putative P. t. vellorosus
haplotype, however, have markedly different estimates of ancestry proportions, and
thus there is no evidence from the STRUCTURE analysis that these individuals form
a distinct population: the population ancestry estimates are 45% Central and 55%
Western for number 54; 84% Central and 16% Western for number 78; and 100%
Western for number 67.
We also used STRUCTURE to validate a minimal set of markers that could be
useful for classifying chimpanzees in conservation studies (Table S1, Microsatellites
used for this study, found at doi:10.1371/journal.pgen.0030066.st001). The top
30 markers (ranked by informativeness for assigning individuals to populations; see
Rosenberg et al., 2003) provide excellent power for classification (Table 2.3). Of 75
chimpanzees estimated as having 100% ancestry in one group by all markers, we found
that 71 were classified identically by the top 30 markers (by the criterion that at least
90% of the ancestry is assigned to the same group). Of nine individuals identified as
hybrids with all the markers, six were also detected as hybrids with the reduced set.
In addition to quantitative precision, the microsatellite panel also has a qualitative
advantage over single marker studies in classifying chimpanzee hybrids: mtDNA and
Y chromosome analyses cannot detect first generation female hybrids (Table 2.1) or
reliably classify hybrids of the second or higher generation.
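The concordance criterion for the reduced marker set can be illustrated with a short sketch. It assumes that an individual counts as classified identically when the reduced panel assigns at least 90% of its ancestry to the group inferred from all markers. The function name `concordant` and the three reduced-panel estimates are hypothetical; only the counts 71 of 75 and 6 of 9 come from the text.

```python
def concordant(full_group, reduced_ancestry, min_fraction=0.90):
    """True if the reduced panel assigns at least `min_fraction` of ancestry to `full_group`."""
    return reduced_ancestry.get(full_group, 0.0) >= min_fraction


# Hypothetical reduced-panel estimates for three individuals that all markers
# place entirely in one group (group from all markers, reduced-panel ancestry).
reduced = [
    ("Western", {"Western": 0.97, "Central": 0.03}),
    ("Central", {"Central": 0.92, "Eastern": 0.08}),
    ("Eastern", {"Eastern": 0.81, "Central": 0.19}),  # below the 90% criterion
]
n_concordant = sum(concordant(group, ancestry) for group, ancestry in reduced)

# Proportions implied by the counts reported in the text.
unadmixed_rate = 71 / 75  # unadmixed individuals classified identically
hybrid_rate = 6 / 9       # hybrids also detected with the reduced set
print(n_concordant, round(unadmixed_rate, 3), round(hybrid_rate, 3))
```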
Figure 2.1: STRUCTURE analysis, blinded to population labels, recapitulates the reported population structure of the chimpanzees. Individuals 76−78 are reported hybrids. Only two individuals with a >5% proportion of ancestry in more than one inferred cluster are wild born: number 54 and number 17. Red, Central; blue, Eastern; green, Western; yellow, bonobo. [Plot labels: Central, Eastern, Western, Bonobo.]
Table 2.2: Individuals with >5% ancestry from more than one cluster. All nine individuals in this table are indicated by STRUCTURE to have >5% ancestry from at least two populations. Of these, two are wild born: number 17 and number 54. PCA confirms the mixed ancestry of six individuals (number 23, number 54, number 65, number 68, number 77, and number 78) (compare Figures 2.1 and 2.2). F, female; M, male.
ID | Sex | Reported Status | STRUCTURE Analysis: W | C | E | PCA (Estimate of Percentage Is Qualitative) | Other Genetic Information from mtDNA and Y Chromosome
17 | F | Eastern | | 9 | 91 | All Eastern |
23 | F | Eastern | 49 | 1 | 50 | ∼50% Eastern, ∼50% Western | mtDNA, Eastern
54 | M | Western | 55 | 45 | | ∼50% Western, ∼50% Central | mtDNA, vellorosus
65 | F | Unknown | 39 | 61 | | ∼50% Western, ∼50% Central | mtDNA, Western
68 | M | Unknown | 74 | 26 | | ∼65% Western, ∼35% Central | Y, Western
69 | F | Unknown | 89 | 11 | | All Western | mtDNA, Western
76 | M | Hybrid | | 81 | 19 | All Central | mtDNA, Central; Y, Eastern
77 | M | Hybrid | 50 | 50 | | ∼50% Western, ∼50% Central | mtDNA, Western; Y, Central
78 | M | Hybrid | 15 | 82 | 3 | ∼15% Western, ∼85% Central | mtDNA, vellorosus; Y, Central
Table 2.3: Information on the top 30 microsatellites used for this study ranked by informativeness.
Name | Chromosome | Locus or alias | Rep | Physical Map location on chromosome | Informativeness | # alleles | Allele size range
GATA32C10 | Y | DYS391 | 4 | 13112029 | 1.07515 | 5 | 289-317
GGAA4B09N | 3 | D3S2403 | 4 | 13147403 | 1.01563 | 17 | 204-269
GATA104 | 7 | | 4 | 143115408 | 0.997608 | 27 | 173-227
GATA129H04 | 1 | D1S3721 | 4 | 41139784 | 0.965041 | 29 | 188-255
GATA164B08P | 3 | D3S4545 | 4 | 8500000 | 0.901269 | 27 | 213-258
GATA11A06 | 18 | D18S542 | 4 | 11482759 | 0.898777 | 30 | 167-210
GATA61E03 | 6 | D6S1051 | 4 | 36679852 | 0.894083 | 12 | 207-251
ATTT030 | 6 | | 4 | 9540286 | 0.881354 | 12 | 104-136
GATA176C01 | 2 | D2S2972 | 4 | 102193472 | 0.864309 | 29 | 198-270
GATA71H05 | 16 | D16S769 | 4 | 26125312 | 0.863574 | 25 | 242-300
ATA27A06P | 12 | D12S1042 | 3 | 28000000 | 0.862575 | 10 | 116-146
GATA43A04 | 1 | D1S1653 | 4 | 155149568 | 0.856855 | 29 | 116-229
GATA14E09 | 8 | D8S2324 | 4 | 74304288 | 0.854694 | 13 | 184-218
GATA116B01N | 2 | D2S2952 | 4 | 8099777 | 0.852129 | 20 | 142-207
GATA50G06 | 15 | D15S643 | 4 | 57429880 | 0.842507 | 19 | 187-259
ATA25F04 | Y | DYS392 | 3 | 21478942 | 0.840013 | 7 | 244-265
UT7544 | 19 | D19S559 | 4 | 50022064 | 0.839055 | 17 | 148-177
GATA43C11 | 7 | D7S1804 | 4 | 131695864 | 0.836512 | 20 | 214-298
GGAA21G11L | 14 | D14S617 | 4 | 90192832 | 0.835032 | 15 | 131-187
GGAA3A07M | 1 | D1S1612 | 4 | 7827531 | 0.810623 | 26 | 123-185
GATA29A01 | 6 | D6S1959 | 4 | 20020072 | 0.80535 | 9 | 146-178
GATA28F03 | 4 | D4S3248 | 4 | 60024876 | 0.795523 | 13 | 239-267
TCTA017M | 9 | | 4 | 116500000 | 0.795131 | 18 | 158-200
AGAT120 | 22 | SNP343411 | 4 | 19665174 | 0.791061 | 22 | 251-293
GATA25A04 | 17 | D17S1299 | 4 | 39367576 | 0.790268 | 19 | 172-218
GATA69C12 | X | DXS6810 | 4 | 41949628 | 0.772492 | 17 | 195-234
GATA91H06M | 12 | D12S1301 | 4 | 42348912 | 0.765451 | 17 | 87-129
GATA7F05 | 3 | D3S3039 | 4 | 73763232 | 0.765164 | 9 | 280-312
GATA129D11N | 21 | D21S2052 | 4 | 27740436 | 0.759255 | 6 | 109-129
GATA8C04 | 17 | D17S974 | 4 | 10719277 | 0.756707 | 10 | 185-217
2.4.2 Principal components analysis
We next carried out principal components analysis (PCA). This approach has been
shown to have similar power to capture population structure as STRUCTURE, but
also provides a formal way of assigning statistical significance to population subdi-
vision (Patterson et al., 2006a). When the PCA is applied to the chimpanzee data,
the results support four discontinuous populations into which almost all chimpanzees
and bonobos can be classified. The first three principal components (eigenvectors) are
all highly statistically significant (p < 10^−12) and nearly perfectly separate Western,
Central, and Eastern chimpanzees, and bonobos (Figure 2.2). Only six chimpanzees
fall visually outside of the clusters, a subset of the nine identified by STRUCTURE as
having at least 5% genetic contribution from more than one population (Table 2.2).
The fourth eigenvector (p = 0.011) is also significant, and the fifth is not significant
(p = 0.44).
The eigenvectors are strongly correlated to the population labels. We used non-
parametric analysis (Kruskal-Wallis tests) to explore whether the values of each sam-
ple along the four significant eigenvectors were significantly correlated to the four
pre-existing population labels. The overall statistic is highly significant (p < 10^−10)
for the first three eigenvectors but not significant for the fourth (p = 0.97), indicating
that this eigenvector is capturing population subdivision that is different from the
traditional Western/Central/Eastern/bonobo designations.
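To make the test above concrete, the Kruskal-Wallis statistic can be computed by ranking the pooled values along one eigenvector and comparing rank sums across population labels. The sketch below is a minimal pure-Python version under stated assumptions: tied values receive average ranks, the tie-correction factor is omitted, and the eigenvector coordinates are invented for illustration (three groups of three samples), not taken from this study.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic for a list of groups (average ranks for ties, no tie correction)."""
    pooled = sorted((value, index) for index, group in enumerate(groups) for value in group)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    i = 0
    while i < n:
        # Find the run of tied values starting at position i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        for k in range(i, j):
            rank_sums[pooled[k][1]] += average_rank
        i = j
    weighted = sum(r * r / len(g) for r, g in zip(rank_sums, groups))
    return 12.0 / (n * (n + 1)) * weighted - 3.0 * (n + 1)


# Hypothetical coordinates along one eigenvector, grouped by population label.
eigenvector_by_label = [
    [-0.9, -0.8, -0.7],  # label A
    [0.0, 0.1, 0.2],     # label B
    [0.8, 0.9, 1.0],     # label C
]
h_statistic = kruskal_wallis_h(eigenvector_by_label)
print(round(h_statistic, 3))
```

Under the null hypothesis, H is approximately chi-square distributed with (number of groups − 1) degrees of freedom, although that approximation is rough for samples as small as those in this toy example.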
To explore whether the fourth eigenvector might reflect an as-yet-undefined
chimpanzee population, we carried out analyses separately on the Western chimpanzee
(n = 49), Central chimpanzee (n = 16), Eastern chimpanzee (n = 6), and bonobo
(n = 6) samples (including all individuals that were clearly classified by both PCA
and STRUCTURE). Western chimpanzees are the only population with evidence for
internal substructure (p = 5.5 × 10^−5). The first eigenvector obtained when Western
chimpanzees are analyzed by themselves strongly correlates to the fourth eigenvector
in the main analysis (Figure 2.3, r^2 = 0.92; p < 10^−12), indicating that the fourth
eigenvector describes subdivision within Western chimpanzees.
Figure 2.2: PCA, without using population labels, divides the 84 chimpanzees into four apparently discontinuous populations of Western, Central, Eastern, and bonobo. Plots of eigenvectors 1 versus 2, and eigenvectors 2 versus 3, show clustering into populations, with the expected assignments for the 75 individuals identified as all of one ancestry by STRUCTURE (solid circles). The nine individuals identified by STRUCTURE as hybrids (open circles) are for the most part identified as hybrids by PCA as well. There are two individuals (red open circles) reported as being of a particular population but that in fact appear to be hybrids: number 23, reported as Eastern but in fact a Western-Eastern hybrid, and number 54, a wild-born individual reported as Western but in fact a Western-Central hybrid.
Although the fourth eigenvector seems to be detecting real structure, it does
not mark out discontinuous subpopulations of Western chimpanzees (Figure 2.3).
The failure to reveal the details of the structure is evident not only in the PCA,
but also in an application of STRUCTURE to the Western chimpanzees only, in
which a model of only one cluster is most likely. There is also no pattern to the
classification of Western chimpanzees even when we consider a model of two clusters
Figure 2.3: The significant fourth eigenvector from the analysis of all 84 chimpanzees is correlated to the first eigenvector from analysis of Western chimpanzees only.
Here, we present the correlation ($r^2 = 0.92$) for the 49 individuals that are clearly identified as Western chimpanzees by both STRUCTURE and PCA, demonstrating that these eigenvectors are revealing the same population structure. [Scatter plot. Axes: fourth eigenvector from analysis of all 84 samples; first eigenvector from Western-only analysis.]
(unpublished data). The most likely explanation is that there is not enough data
to assign individuals to different ancestries. Understanding of the fourth eigenvector
in the PCA will require more genetic data and better information about geographic
origin. In particular, we note that the only wild-caught Western samples for which
we had geographic information are from one location (Sierra Leone), thus we could
not perform a test for correlation with geography.
2.4.3 Testing for additional populations
Could there be additional population structure among the chimpanzees that we have
not yet detected (Gagneux, 2002)? A particular concern is that our sample size
is limited, decreasing our power to detect further structure especially among non-
Western chimpanzees.
To place an upper bound on further structure especially among the Central chim-
panzees, we considered the possibility that, among the 16 Central chimpanzees, a
subset is from a different population. We performed PCA on 10 Central/6 Eastern,
11 Central/5 Eastern, 12 Central/4 Eastern, 13 Central/3 Eastern and 14 Central/2
Eastern chimpanzees and assessed what fraction of 1,000 random resamplings of Central
and Eastern chimpanzees showed evidence of structure (p < 0.05). This allowed
us to assess power to detect an additional population as diverged as the Eastern
chimpanzees.
The resampling analysis found that 6, 5, 4, 3, and 2 Eastern chimpanzees could be
detected from amidst the Central chimpanzees with 100%, 100%, 99%, 54%, and 7%
probability, respectively. Since the FST between Central and Eastern chimpanzees
is 0.05−0.09 (Table 2.2 and Fischer et al., 2006), this allowed us to place an upper
bound on the undetected structure that might exist among Central chimpanzees given
that we did not detect further structure. If the three samples with the P. t. vellerosus
mtDNA haplotype in our study constitute members of a distinct population, their
differentiation from Central chimpanzees is likely to be FST ≤ 0.09, lower than those
observed between some pairs of human populations (Cavalli-Sforza et al., 1994). An
important caveat is that we have no power to detect population structure for chim-
panzees missed by our sampling (we also have little power if there are fewer than
three individuals from a population). Thus, a more geographically systematic survey,
including more animals from a denser grid in Africa, may detect further structure.
2.4.4 Evidence for inbreeding
To test for inbreeding among the chimpanzees, we examined whether heterozygosity
within individuals was significantly lower than would be expected from random mat-
ing in the population (Materials and Methods). Western and Central chimpanzees
both show evidence for a reduced number of heterozygous genotypes (p < 0.05; Appendix A;
we had too few Eastern and bonobo samples to perform an informative
test). A caveat is that misscoring of heterozygous genotypes, or the presence of
polymorphisms under the primers used for genotyping, could both result in an arti-
factual excess of homozygotes. To follow up this initial evidence of inbreeding among
chimpanzees, further analyses could search for multimegabase contiguous stretches of
homozygosity (Carothers et al., 2006).
2.4.5 First and second generation hybrids
To test for first and second generation hybrids, we calculated the likelihood of the
data under the hypothesis that an individual is an F1 hybrid, compared with the
alternative hypothesis of an older 50%−50% mixture of the ancestral populations.
To test whether the individual is an F2/backcross − a mixture of an F1 with an
unadmixed individual − we compared the likelihood of this model with that of the
alternative hypothesis of an older 75%−25% mixture of the two ancestral populations
(Materials and Methods).
Of the nine putative hybrids identified by STRUCTURE, the F1 hybrid test
identifies captive-born individual number 23 (an approximately 50%−50% Eastern-
Western hybrid by STRUCTURE analysis) as an F1, with a likelihood ratio (LR) of
∼24,000,000:1. The F2/backcross test identifies the captive-born individual number
68 (a 74%−26% Western-Central hybrid by STRUCTURE analysis) as an F2/backcross,
with an LR of ∼37:1 (the evidence is weaker because the signal of an F2/backcross is
more subtle). There are no other hybrids identified by either the F1 or F2/backcross
test, suggesting that the other animals in Table 2.2 could descend from third gener-
ation or older admixture events, or be members of as-yet unidentified populations.
The F1 test produced a particularly intriguing pattern in number 54, a wild-
caught individual with mtDNA that has been hypothesized to be diagnostic of P.
t. vellerosus origin (Gonder et al., 1997, 2006). Individual number 54 is estimated
to be a 55%−45% Western-Central mixture (Table 2.2) and shows an LR of 7:1 in
favor of being an old mixture, compared with the alternative of a first generation
F1 hybrid. However, a careful examination shows that the pattern of variation at
number 54 fits neither the hypothesis of a first generation hybrid nor that of an older mixture.
To demonstrate this, we simulated 100 different Western-Central F1 hybrids and
100 older Western-Central mixtures by random sampling from the population allele
frequencies. Simulated older mixtures always generated an LR of >100,000:1 relative
to the alternative hypothesis of an F1. Simulated F1 hybrids always gave an LR
<1:2. The LR for individual number 54 of 7:1 falls outside of either expectation.
This individual fits neither model, suggesting ancestry from an as-yet undetermined
population.
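The logic of the simulation check above can be illustrated with a minimal Python sketch. It is not the code used in the study: it assumes allele frequencies are known exactly (the actual test places a Dirichlet prior on them), and the function names and toy frequencies (FREQS_A, FREQS_B) are hypothetical. An F1 hybrid receives one allele from each parental population, whereas an older 50%−50% mixture draws both alleles from the pooled frequencies.

```python
import math
import random

def genotype_prob_f1(a, b, p1, p2):
    """P(genotype {a, b}) when one allele comes from each parental population."""
    if a == b:
        return p1[a] * p2[a]
    return p1[a] * p2[b] + p1[b] * p2[a]

def genotype_prob_old_mix(a, b, p1, p2, w=0.5):
    """P(genotype {a, b}) when both alleles are drawn from a w:(1-w) pooled mixture."""
    qa = w * p1[a] + (1 - w) * p2[a]
    qb = w * p1[b] + (1 - w) * p2[b]
    return qa * qa if a == b else 2 * qa * qb

def log_lr_old_vs_f1(genotypes, freqs1, freqs2):
    """Sum over markers of log[P(data | older mixture) / P(data | F1)].

    Assumes all allele frequencies are strictly positive (no log of zero).
    """
    total = 0.0
    for (a, b), p1, p2 in zip(genotypes, freqs1, freqs2):
        total += math.log(genotype_prob_old_mix(a, b, p1, p2))
        total -= math.log(genotype_prob_f1(a, b, p1, p2))
    return total

def draw_allele(p, rng):
    """Sample one allele index according to the frequency vector p."""
    return rng.choices(range(len(p)), weights=p)[0]

def simulate_f1(freqs1, freqs2, rng):
    """One allele per marker from each parental population."""
    return [(draw_allele(p1, rng), draw_allele(p2, rng))
            for p1, p2 in zip(freqs1, freqs2)]

def simulate_old_mix(freqs1, freqs2, rng, w=0.5):
    """Both alleles per marker from the pooled (mixed) frequencies."""
    genotypes = []
    for p1, p2 in zip(freqs1, freqs2):
        q = [w * x + (1 - w) * y for x, y in zip(p1, p2)]
        genotypes.append((draw_allele(q, rng), draw_allele(q, rng)))
    return genotypes

# Hypothetical toy data: 300 biallelic markers, strongly differentiated populations.
FREQS_A = [[0.9, 0.1]] * 300
FREQS_B = [[0.1, 0.9]] * 300

if __name__ == "__main__":
    rng = random.Random(2008)
    f1 = simulate_f1(FREQS_A, FREQS_B, rng)
    old = simulate_old_mix(FREQS_A, FREQS_B, rng)
    print("log LR, simulated F1:", log_lr_old_vs_f1(f1, FREQS_A, FREQS_B))
    print("log LR, simulated older mixture:", log_lr_old_vs_f1(old, FREQS_A, FREQS_B))
```

With these toy frequencies, simulated F1 hybrids yield negative log ratios and simulated older mixtures yield positive ones, so an individual whose ratio falls between the two simulated distributions fits neither model well.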
2.4.6 Allele frequency differentiation
To estimate the degree of allele frequency differentiation between chimpanzee groups,
we computed the RST statistic, a microsatellite-based estimator of FST (Slatkin,
1995; Schneider et al., 2000). RST assumes the stepwise mutation model (SMM),
Location    Eastern        Central        Bonobo
Western     0.31 (0.32)    0.25 (0.29)    0.68 (0.68)
Eastern     −              0.05 (0.09)    0.57 (0.54)
Central     −              −              0.51 (0.49)
Table 2.4: Genetic differentiation among populations. Pairwise RST (versus FST from Fischer et al., 2006) is reported here. RST (a microsatellite-based estimator of FST, Slatkin, 1995) is calculated using the Arlequin software (Schneider et al., 2000) comparing 49 Western, six Eastern, and 16 Central chimpanzees, and six bonobos. Analysis is restricted to autosomal loci with <5% missing data (leaving 220 markers in all cases). For the 49 Western chimpanzees, we used the 48 individuals identified as Western by STRUCTURE plus individual number 54, which was born in the Western range. All RST values are significantly different from zero (p < 0.002), as determined by 10,000 permutations. The values in parentheses are quoted from the SNP-based study of Fischer et al. (2006). Our study has less sampling error but relies on imperfect assumptions about the microsatellite mutation process, and so is more subject to systematic error. The close agreement between the two studies is encouraging.
in which the number of repeats changes by one or two or more units with an equal
probability of increasing or decreasing. A goodness-of-fit test suggests that this simple
model provides a reasonable match to our data (Figure 2.4). Encouragingly, the
RST estimates of population differentiation, obtained based on assuming this model,
are very similar to estimates of FST from a smaller multi-locus dataset based on
resequencing (Table 2.4 and Fischer et al., 2006).
A particularly intriguing feature of the allele frequency differentiation results is
that the allele frequency differentiation between bonobos and Western chimpanzees is
higher than that between bonobos and Central or Eastern chimpanzees (Table 2.4).
This likely reflects greater genetic drift in the Western lineage since divergence, as
has also been suggested by an analysis of resequencing data (Won and Hey, 2005).
Figure 2.4: Goodness-of-fit of the SMM.
Distributions of the expected $\sigma_i^2$ over markers, for data sets observed and simulated under the SMM. The distributions of $E(\sigma_i^2)$ were computed separately for (a) 221 tetra-, (b) 62 tri- and (c) 11 dinucleotides genotyped in the 49 Western samples (shown in blue). In red are the results of 500 simulations for each class of microsatellites. In all three cases, the observed distribution is not significantly different from the expected distribution, as assessed by a permutation test (see Materials and Methods for more details). [Panels: simulated vs. observed $E(\sigma_i^2)$ for autosomal (a) tetranucleotides, (b) trinucleotides, and (c) dinucleotides; x-axis: $E(\sigma_i^2)$; legend: observed, simulated.]
2.4.7 Central and Eastern chimpanzees are most closely related in
time
The high frequency differentiation of Western chimpanzees compared with other
groups (Table 2.4) is consistent with them having been the first population to di-
verge, but does not prove it. An alternative explanation for the data is that there has
been a smaller effective population size on the Western lineage since their divergence,
resulting in high genetic drift in this population. We therefore applied a formal test
to assess which pair of populations is most closely related.
We approached the problem by testing whether three unrooted phylogenetic trees
are consistent with the data for chimpanzees: (1) Western-Central and bonobo-
Eastern forming clades, (2) Western-Eastern and bonobo-Central forming clades, and
(3) Eastern-Central and bonobo-Western forming clades. If a tree provides a good de-
scription of the history of the population, then the allele frequency differences between
two populations should only reflect changes since they split. For example, the differ-
ence in allele frequency between Central and Eastern chimpanzees should have arisen
entirely since their divergence from a common ancestral population and so should
be uncorrelated to the allele frequency differences between Western chimpanzees and
bonobos.
To implement this idea, we calculated the difference in frequency within clades
for all alleles and then tested for a correlation across clades. When we carry out this
analysis for the first and second hypothesized trees, a correlation is observed, rejecting
these trees at a significance of p = 0.00025 and p = 0.0027, respectively (Table 2.5).
The hypothesized Central-Eastern/bonobo-Western clade is the only one consistent
with the data (p = 0.37). Thus, our analysis does not find any evidence for gene flow
between Western and Central chimpanzees since their initial split, as has previously
been hypothesized (Won and Hey, 2005). If gene flow did occur, it would have had
to be sufficiently low to fall below the threshold of detection with our present data
size and the test we applied. We conclude that Eastern and Central are more closely
related in time than either population is to Western chimpanzees.
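The correlation test described in this section can be sketched in Python as follows. This is an illustration under stated assumptions rather than the analysis code of the study: it uses a simple delete-one-marker jackknife (the study used a weighted jackknife), and the population labels A to D, the function names, and the toy simulator are all hypothetical.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def clade_correlation(freqs, clade1, clade2, skip=None):
    """Correlate within-clade allele frequency differences across two clades.

    freqs[pop][m][k] is the frequency of allele k at marker m in population pop.
    Marker `skip`, if given, is left out (used by the jackknife).
    """
    (a, b), (c, d) = clade1, clade2
    d1, d2 = [], []
    for m in range(len(freqs[a])):
        if m == skip:
            continue
        for k in range(len(freqs[a][m])):
            d1.append(freqs[a][m][k] - freqs[b][m][k])
            d2.append(freqs[c][m][k] - freqs[d][m][k])
    return pearson(d1, d2)

def jackknife_test(freqs, clade1, clade2):
    """Return (correlation, two-sided p-value) from a delete-one-marker jackknife."""
    n = len(freqs[clade1[0]])
    r = clade_correlation(freqs, clade1, clade2)
    loo = [clade_correlation(freqs, clade1, clade2, skip=m) for m in range(n)]
    mean = sum(loo) / n
    se = math.sqrt((n - 1) / n * sum((t - mean) ** 2 for t in loo))
    z = r / se  # approximately standard normal if the tree is correct
    return r, math.erfc(abs(z) / math.sqrt(2))

def simulate_tree(n_markers, rng, internal_sd=0.1, tip_sd=0.05):
    """Toy biallelic frequencies for four populations with true tree ((A,B),(C,D))."""
    freqs = {pop: [] for pop in "ABCD"}
    for _ in range(n_markers):
        root = rng.uniform(0.3, 0.7)
        ab = root + rng.gauss(0, internal_sd)  # drift on the branch leading to (A,B)
        cd = root + rng.gauss(0, internal_sd)  # drift on the branch leading to (C,D)
        for pop, anc in (("A", ab), ("B", ab), ("C", cd), ("D", cd)):
            f = min(1.0, max(0.0, anc + rng.gauss(0, tip_sd)))
            freqs[pop].append([f, 1.0 - f])
    return freqs

if __name__ == "__main__":
    toy = simulate_tree(300, random.Random(7))
    print("true tree  :", jackknife_test(toy, ("A", "B"), ("C", "D")))
    print("wrong tree :", jackknife_test(toy, ("A", "C"), ("B", "D")))
```

Under the correct partition the within-clade differences reflect only drift on separate branches and are uncorrelated, whereas a wrong partition shares the internal branch between the two differences and produces a clear positive correlation.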
2.4.8 Population separation times
To estimate the times of genetic divergence among chimpanzees, we used the average
squared distance (ASD) statistic (Goldstein and Pollock, 1997). For microsatellites
evolving under the SMM, the expected time since the most recent common ancestor
(tMRCA) of two chromosomes is ASD/(2µ), where µ is the
mutation rate per year per locus, averaged across loci. Because allele lengths change
according to a random walk, the ASD between allele lengths in two chromosomes is
expected to increase linearly with time and is thus expected to act like heterozygosity
in sequence comparisons. By averaging ASD over pairs of chromosomes within and
across populations, we can estimate the average tMRCA within and across popula-
tions.
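As a concrete illustration of the ASD calculation, the following minimal Python sketch averages squared repeat-number differences over loci and over pairs of chromosomes, then converts the result to a tMRCA using ASD/(2µ). The function names and the toy repeat counts are hypothetical, and missing data are not handled.

```python
def asd(chroms_a, chroms_b=None):
    """Average squared distance in repeat number.

    Each chromosome is a list of repeat counts, one per locus.
    With one argument, all distinct pairs within chroms_a are compared;
    with two, every chromosome in chroms_a is compared with every one in chroms_b.
    """
    if chroms_b is None:
        pairs = [(x, y) for i, x in enumerate(chroms_a) for y in chroms_a[i + 1:]]
    else:
        pairs = [(x, y) for x in chroms_a for y in chroms_b]
    n_loci = len(pairs[0][0])
    total = 0.0
    for x, y in pairs:
        # Mean over loci of the squared difference in repeat number.
        total += sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / n_loci
    return total / len(pairs)

def tmrca_years(asd_value, mu_per_year):
    """Expected tMRCA under the SMM: ASD / (2 * mu)."""
    return asd_value / (2.0 * mu_per_year)

if __name__ == "__main__":
    pop_a = [[10, 12], [11, 12]]   # two chromosomes, two loci (toy values)
    pop_b = [[13, 12]]
    print("within A :", asd(pop_a))
    print("A vs B   :", asd(pop_a, pop_b))
    print("tMRCA (years), mu = 5e-5:", tmrca_years(asd(pop_a, pop_b), 5e-5))
```

The factor of two in the denominator reflects that mutations accumulate along both lineages leading back to the common ancestor.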
The results confirm that genetic diversity (heterozygosity) is least for Western
chimpanzees and bonobos, higher for Eastern chimpanzees, and highest for Central
chimpanzees, consistent with results obtained from a nucleotide resequencing dataset
(Fischer et al., 2006). To estimate the absolute tMRCA within and across populations,
we used two estimates of the microsatellite mutation rate. The first, $\mu = 6.57 \times 10^{-3}$
per year, relies on a 7 Mya estimated average tMRCA between humans and
chimpanzees and the observation that two Western chimpanzees are ∼14.8 times less
genetically diverged than humans and chimpanzees (Patterson et al., 2006b). We
also obtained a second estimate, $4.71 \times 10^{-5}$ per year, based on estimated rates of
microsatellite mutation in humans, and assuming 15 years per generation (Table 2.6).
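The arithmetic behind the two calibrations can be checked with a few lines of Python. The constants are taken from the text and from the legend of Table 2.6 (7 Mya, a 14.8-fold divergence ratio, 7.06 × 10−4 mutations per generation, 15 years per generation); the variable names are only illustrative.

```python
# Calibration 1: scale the human-chimpanzee tMRCA by the relative divergence.
HUMAN_CHIMP_TMRCA_MYA = 7.0   # assumed human-chimpanzee tMRCA
DIVERGENCE_RATIO = 14.8       # human-chimp divergence / Western-Western divergence
western_tmrca_mya = HUMAN_CHIMP_TMRCA_MYA / DIVERGENCE_RATIO

# Calibration 2: convert a per-generation mutation rate to a per-year rate.
MU_PER_GENERATION = 7.06e-4   # human microsatellite estimate quoted in Table 2.6
YEARS_PER_GENERATION = 15.0
mu_per_year = MU_PER_GENERATION / YEARS_PER_GENERATION

print(f"Western-Western tMRCA: {western_tmrca_mya:.2f} Mya")  # 0.47 Mya
print(f"Mutation rate: {mu_per_year:.2e} per year")           # 4.71e-05 per year
```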
The absolute values of the time estimates for these tMRCAs should be treated
with caution because of uncertainty about the microsatellite mutation rate process
and the calibrations used to obtain absolute dates. Nevertheless, the tMRCA esti-
mates are consistent with previous results from smaller resequencing based datasets
(Stone et al., 2002). We note that the Central-Eastern, Central-Western, and Central-
Central tMRCAs are all similar, which appears at first to contradict the claim that
the populations split at different times. However, most of the genetic divergence re-
flects ancestral diversity, which is shared among all the chimpanzees (explaining why
the tMRCA estimates are substantially older than estimates of population splitting
times; Reinartz et al., 2000; Won and Hey, 2005). More refined analyses are needed,
such as the allele frequency correlation analysis presented in Table 2.5, or model-
based approaches, to detect the subtle patterns that indicate the splitting order of
the chimpanzee populations.
Clade 1            Clade 2            Correlation between Allele Frequency    p-Value
                                      Differences in Each Clade               (Two-Sided)
Central-Western    Eastern-bonobo     0.09                                    0.00025
Eastern-Western    Central-bonobo     0.065                                   0.0027
Central-Eastern    Western-bonobo     −0.013                                  0.37
Table 2.5: Eastern and Central chimpanzees are phylogenetically most closely related. There are three possible unrooted trees relating to the four populations. If the clades into which the trees are partitioned correctly capture the population relationships, the allele frequencies should be uncorrelated when comparing clade 1 and clade 2. We observe significant correlation across clades for all phylogenetic trees other than the one in which Central and Eastern chimpanzees cluster. To correct for the non-independence of microsatellite alleles, we calculated significance by a weighted jackknife analysis removing each marker in turn to generate normally distributed Z-scores; these were then converted to p-values.
Time Since the Most Recent          tMRCA in Mya, Assuming 7 Mya for     tMRCA in Mya (Using Microsatellite
Common Genetic Ancestor (tMRCA)     Human-Chimp Genetic Divergence       Mutation Rate Estimates from
                                    (Calibration Time)a                  Humans)a
Within-Western                      0.47                                 0.71 (0.62−0.81)
Within-Central                      0.85 (0.75−0.98)                     1.29 (1.15−1.45)
Within-Eastern                      0.73 (0.61−0.86)                     1.09 (0.93−1.28)
Within-bonobo                       0.63 (0.53−0.76)                     0.95 (0.81−1.10)
Central-Eastern                     0.89 (0.77−1.03)                     1.35 (1.20−1.53)
Central/Eastern-Western             0.84 (0.73−0.97)                     1.30 (1.16−1.43)
Central/Eastern/Western-bonobo      1.56 (1.29−1.90)                     2.36 (1.97−2.79)
Table 2.6: Estimates of divergence time from ASD. tMRCA represents the average time to the most recent common ancestor of a pair of chromosomes sampled in the same or different populations. It can be substantially older than the split time, as it also reflects differences accumulated in the ancestral population (90% confidence intervals from 10,000 bootstraps).
a An absolute tMRCA for Western chimpanzees is obtained by assuming that the human-chimpanzee tMRCA is ∼7 Mya. We calibrate the tMRCA for Western chimpanzees at 0.47 Mya, since human-chimpanzee sequence divergence is estimated to be 14.8 times higher than Western-Western divergence (Patterson et al., 2006b). An alternative estimate of the absolute dates comes from setting the mutation rate for microsatellites for humans to be $7.06 \times 10^{-4}$ per generation and assuming 15 years per generation. This is obtained from direct estimates of mutation rates in humans for tetra-, tri-, and dinucleotides (Zhivotovsky et al., 2003), adjusting for the relative proportions of each type of marker in our data set: 222 tetranucleotide (including marker D12S297, which was observed to have an unusually high mutation rate), 62 trinucleotide, and 11 dinucleotide.
2.5 Conclusions
We have carried out the largest analysis of chimpanzee genetic variation to date, which
shows that the Western, Central, and Eastern chimpanzee subspecies designations
correspond to clusters of individuals with similar allele frequencies that can be defined
from the genetic data without regard to the population labels. Moreover, we find little
evidence for admixture between groups in the wild. Our analysis also provides the
first formal test showing that the Central and Eastern chimpanzee populations are
more closely related to each other in time than either is to Western chimpanzees.
PCA further suggests population structure within Western chimpanzees.
However, more data − more samples, genetic markers, and information about ge-
ographic origin − would be needed to understand this structure. We find no support
for the proposed fourth population of common chimpanzees P. t. vellerosus. However,
the failure to detect a distinct population cluster for these individuals could simply
reflect a lack of power. Our analysis does allow us to state that even if P. t. vellerosus
is a distinct population, its level of allele differentiation from either Western, Central,
or Eastern chimpanzees is not likely to exceed FST = 0.09.
We finally emphasize that although we attempted to include chimpanzees from as
wide a range of sites in Africa as possible, the geographic sampling of the chimpanzees
that we studied was likely nonrandom. The fact that our study did not include
chimpanzees from some regions − including where chimpanzees are now extinct −
could create the appearance of too much discontinuity (Kittles and Weiss, 2003).
Future studies including chimpanzees across a denser grid of populations within Africa
could, in principle, identify intermediate populations of chimpanzees and demonstrate
more graded patterns of variation.
2.6 Materials and Methods
Data collection. The samples for this study were assembled from four sources:
DNA collections at the Max Planck Institute (Leipzig, Germany), Anne Stone’s lab-
oratory at Arizona State University (Stone et al., 2002), the Coriell Cell Reposi-
tories (Coriell Institute for Medical Research, Camden, New Jersey, United States.
Available: http://locus.umdnj.edu/primates/species_summ.html. Accessed 20
March 2007), and the Integrated Primate Biomaterial and Information Resource
(Coriell Institute for Medical Research, Camden, New Jersey, United States. Avail-
able: http://www.ipbir.org/ipbir_cgi/tax.cgii?mode=8&id=init&lvl=0. Ac-
cessed 20 March 2007). A total of 91 samples were genotyped, and 84 were included
in analysis after filtering (below). We had information about the approximate geo-
graphic origin of 25 wild-caught chimpanzees (Table 2.1). For most remaining sam-
ples, we had a population designation, and sometimes Y chromosome and mtDNA
genotypes (A. Stone, unpublished data). As far as possible, we attempted to use
pedigree information to remove related individuals from the captive chimpanzees.
We assembled all the DNA samples at a single site (the Broad Institute) and car-
ried out whole-genome DNA amplification (WGA) for all samples to generate a quan-
tity sufficient for analysis (Hosono et al., 2003). The WGA samples were shipped to a
company that specializes in genotyping microsatellite markers for human disease gene
mapping studies (PreventionGenetics, http://www.preventiongenetics.com). The
microsatellite markers all contain tandem repeats of two, three, or four nucleotides
that vary in repeat number across individuals. For example, at a single marker, dif-
ferent individuals might have between three and 11 contiguous repeats of a GATA
sequence. The assays used for genotyping were designed for humans. However, we
hypothesized that many of them would work for chimpanzees because of the ∼98.8%
sequence similarity (The Chimpanzee Sequencing and Analysis Consortium, 2005).
Assays were attempted for 470 microsatellites. Most came from the Marshfield
Screening Set 13 (designed for linkage screens in humans; Ghebranious et al., 2003,
and Mammalian Genotyping Service [Internet], 2007. Marshfield (W. I.): National
Heart, Lung, and Blood. Available: http://research.marshfieldclinic.org/
genetics. Accessed 20 March 2007) and were supplemented with markers from a
separate mapping study. Genotyping quality was assessed by specialists at Preven-
tionGenetics using standard measures of genotyping performance. A score of one to
four was given to each marker (with one being the best and four the worst). Markers
were scored as > 2 because of uncertainty in allele assignment, or an excessive num-
ber of missing genotypes, or an excess in the numbers of homozygotes or non-integer
alleles (defined as alleles with non-integer length differences compared with frequent
alleles). We used the 310 markers that were designated as of highest or second-highest
quality because the two sets produced indistinguishable inferences on our data (un-
published). For all analyses other than the use of STRUCTURE, we considered only
autosomal or pseudoautosomal markers, since these could be treated uniformly. This
resulted in 295 markers; we also excluded two additional pseudoautosomal markers
for the PCA and F1/F2 hybrid analyses. Genotypes for all markers are available in
Dataset S1 (Raw genotype data in a format appropriate for STRUCTURE analysis,
found at doi:10.1371/journal.pgen.0030066.sd001).
A subset of 84 of the 91 genotyped samples were chosen for further study after
removing two due to a high missing data rate, one due to evidence for contamination
(more than two genotypes at many loci), and four due to evidence of genetic related-
ness: two accidental duplicates, and two apparent first degree relatives. For each pair
of related individuals, we dropped the one with the lower success rate. The duplicate
individuals allowed us to assess genotyping quality. For the two individuals studied
in duplicate, 1.18% of genotyping calls differed, suggesting an error rate per genotype
of ∼0.59% (i.e., we estimate that on average approximately two loci were mistyped
per individual).
Data analysis. We applied two complementary methods to characterize popula-
tion structure in chimpanzees. First, we used the software STRUCTURE in the
“admixture” mode, so that individuals were allowed to have ancestry from multiple
populations. We used a model of correlated allele frequencies, a “burn-in” of 100,000
Markov Chain Monte Carlo (MCMC) iterations, and 1,000,000 follow-on MCMC
iterations.
We also analyzed the data using a new implementation of PCA (Patterson et al.,
2006a) available online in the EIGENSOFT software package (Boston: Reich Laboratory,
Harvard Medical School, Department of Genetics. Available:
http://genepath.med.harvard.edu/~reich/Software.htm. Accessed 20 March 2007). Briefly, the
analysis begins with a rectangular matrix, with the 84 rows corresponding to the
individuals, and the columns corresponding to the alleles (the cells give the number
of copies of each allele for each individual: zero, one, or two). To analyze the data,
we perform a singular value decomposition, a procedure that produces eigenvectors
and eigenvalues. The first eigenvector separates the samples in a way that explains
the largest amount of variability, while the second and subsequent ones explain lesser
amounts of variability. Thus, the first eigenvector distinguishes individuals from the
population that is most differentiated, and each subsequent eigenvector separates the
next most differentiated population. Eigenvalues above a significance cut-off repre-
sent significant population structure for the associated eigenvector (Patterson et al.,
2006a).
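To make the matrix construction concrete, here is a small Python sketch of the idea. It is a simplification and not the EIGENSOFT procedure: columns are only mean-centered (no per-allele normalization or significance test), and the leading eigenvector is found by power iteration on the individual-by-individual similarity matrix, which yields the same leading axis as a singular value decomposition of the centered matrix. The toy genotype matrix and function names are hypothetical.

```python
import math

def center_columns(matrix):
    """Subtract each column (allele) mean from the allele-count matrix."""
    n = len(matrix)
    means = [sum(row[j] for row in matrix) / n for j in range(len(matrix[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in matrix]

def gram_matrix(x):
    """X X^T: similarity between every pair of individuals (rows)."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in x] for r1 in x]

def top_eigenvector(g, iterations=200):
    """Leading eigenvector and eigenvalue of a symmetric matrix by power iteration."""
    n = len(g)
    v = [1.0 / (i + 1) for i in range(n)]  # fixed, non-constant start vector
    for _ in range(iterations):
        w = [sum(g[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(t * t for t in w))
        v = [t / norm for t in w]
    eigenvalue = sum(v[i] * sum(g[i][j] * v[j] for j in range(n)) for i in range(n))
    return v, eigenvalue

# Toy data: rows are individuals, columns are alleles, cells are copy numbers (0, 1, 2).
GENOTYPES = [
    [2, 0, 2, 0], [2, 0, 1, 1], [1, 1, 2, 0],   # three individuals from one group
    [0, 2, 0, 2], [0, 2, 1, 1], [1, 1, 0, 2],   # three individuals from another group
]

if __name__ == "__main__":
    vec, val = top_eigenvector(gram_matrix(center_columns(GENOTYPES)))
    print("leading eigenvalue:", val)
    print("individual coordinates:", [round(t, 3) for t in vec])
```

In this toy example the coordinates of the first three individuals share one sign and those of the last three share the opposite sign, which is how the first eigenvector separates the most differentiated group.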
To examine evidence for inbreeding in these samples, we used the output of
STRUCTURE to assign individuals to populations, excluding individuals with >5%
ancestry from more than one population and focusing our study on the population
samples with more than ten individuals (treating captive and wild-caught individuals
separately). Thus, we analyzed 48 Western chimpanzees, including 34 wild-caught
and 14 captive-born individuals. We also analyzed 13 wild-caught Central chimpanzees
(Table 2.1). We considered two statistics: Hw, the average heterozygosity
within an individual's two chromosomes, and Rw, calculated as
$R_w = \frac{1}{M} \sum_{m} (m_1 - m_2)^2$, where
m1 and m2 are the numbers of repeat units of the two alleles at marker m within an individual,
and M is the total number of markers considered. We used the average value of Hw
and Rw over individuals in a population to test the hypothesis of random mating and
assessed significance by a permutation test. Specifically, we generated 1,000 samples
of n individuals, randomly assigning to each of them two alleles from the pool, then
counted how often Hw or Rw was as small as or smaller than observed. Hw was significantly
lower than expected for all samples considered (at the p < 0.05 level), while
Rw was significantly lower in wild-caught Western chimpanzees (Appendix A).
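The permutation test can be sketched in Python as follows. This is one possible reading of the procedure, not the original implementation: alleles are pooled per marker and reshuffled without replacement among individuals, and the function names and data layout (a list of (allele1, allele2) repeat counts per individual) are assumptions made for illustration.

```python
import random

def hw(individual):
    """Fraction of markers at which the individual's two alleles differ."""
    return sum(a != b for a, b in individual) / len(individual)

def rw(individual):
    """Mean squared difference in repeat number between the two alleles."""
    return sum((a - b) ** 2 for a, b in individual) / len(individual)

def population_mean(stat, individuals):
    """Average of a per-individual statistic over a population sample."""
    return sum(stat(ind) for ind in individuals) / len(individuals)

def permutation_p_value(individuals, stat, n_perm=1000, seed=0):
    """One-sided p-value: how often randomly mated samples are as low as observed."""
    rng = random.Random(seed)
    observed = population_mean(stat, individuals)
    n_markers = len(individuals[0])
    # Pool the alleles of all individuals separately for each marker.
    pools = [[allele for ind in individuals for allele in ind[m]]
             for m in range(n_markers)]
    hits = 0
    for _ in range(n_perm):
        shuffled = []
        for pool in pools:
            perm = pool[:]
            rng.shuffle(perm)
            shuffled.append(perm)
        # Deal the shuffled alleles back out, two per individual per marker.
        randomized = [[(shuffled[m][2 * i], shuffled[m][2 * i + 1])
                       for m in range(n_markers)]
                      for i in range(len(individuals))]
        if population_mean(stat, randomized) <= observed:
            hits += 1
    return hits / n_perm

if __name__ == "__main__":
    # Toy sample: ten fully homozygous individuals carrying two different alleles.
    inbred = [[(10 + 2 * (i % 2),) * 2 for _ in range(5)] for i in range(10)]
    print("p (Hw):", permutation_p_value(inbred, hw))
    print("p (Rw):", permutation_p_value(inbred, rw))
```

A small p-value indicates that observed within-individual heterozygosity (or squared allele-size difference) is lower than expected if alleles were paired at random, which is the signal of inbreeding tested here.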
For other analyses in which we were interested in studying only individuals iden-
tified unambiguously as being from one population, we excluded the captive-born
individuals defined as putative hybrids by STRUCTURE.
Application of the stepwise mutation model. We examined whether our data
fit the SMM in the 49 individuals that are identified as Western chimpanzees by both
STRUCTURE and PCA. We focused on Western chimpanzees because they have the
largest sample size and hence provide us with the most power to detect a departure
from the SMM. Under the one-step SMM, $\sigma_i^2$, the variance of an allele with i repeats,
is an estimator of 2Nµ, the population mutation parameter (Valdes et al., 1993). For
each marker, we calculated $E(\sigma_i^2)$, the expectation over all alleles of a given marker,
thus obtaining an estimate of 2Nµ for the 221 tetra-, 62 tri-, and 11 dinucleotides
included in other analyses. Averaging this estimate over all markers of the same type
and dividing by N = 10,000 (Won and Hey, 2005), we obtain an estimate of µ for
each type of marker, $\mu = 3.77 \times 10^{-3}$, $1.91 \times 10^{-3}$, and $2.17 \times 10^{-3}$, respectively.
These estimates are roughly similar to independent estimates (Zhivotovsky et al.,
2003) based on microsatellites in humans: $6.40 \times 10^{-4}$, $7.10 \times 10^{-4}$, and $1.51 \times 10^{-3}$,
respectively.
To assess the goodness-of-fit of the SMM, we compared the observed distribution
of $\sigma_i^2$ to the expected distribution, obtained by using the coalescent simulator SIMCOAL2
(Excoffier et al., 2000). We generated 500 independent replicates for each
type of marker under a standard neutral equilibrium model, with an effective population
size of 10,000 (Won and Hey, 2005), a sample size of 98 chromosomes, and
a mutation rate per generation set to µ. The range constraints for the number of
repeat units were set to be equal to the maximum repeat observed in the sample for
each type of marker. We tested whether there was a significant difference in the dis-
tributions by an asymptotic Wilcoxon rank sum test, carrying out the test separately
for each type of marker. The observed distributions do not differ significantly from
the expectation under the SMM (Figure 2.4).
Tests for F1 and F2/backcross hybrids. We calculated a log Bayes factor to
test the hypothesis that a chimpanzee is an F1 hybrid of two known populations
versus the alternative that it is a 50%−50% mixture (i.e., it is an older hybrid).
For a given autosomal marker, one can compute a log-factor under the assumption
that the allele frequencies are known; these log-factors can then be summed across
all markers. In practice, our population samples are small, and so allele frequency
estimates are imprecise. To account for uncertainty in the allele frequencies, we used
a hierarchical Bayesian model for the unknown frequencies, with a Dirichlet prior
distribution for the frequencies (the details of this calculation are similar to those
described by Lockwood and Roeder, 2001). We verified the performance of the test
by simulation (see text).
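One simple way to see how a Dirichlet prior handles imprecise allele frequency estimates is the posterior mean under a symmetric prior, sketched below in Python. This is a plug-in illustration only and is not claimed to match the hierarchical model used in the study; the symmetric prior weight alpha and the function name are assumptions.

```python
def posterior_mean_frequencies(counts, alpha=1.0):
    """Posterior mean allele frequencies under a symmetric Dirichlet(alpha) prior.

    counts[k] is the number of sampled chromosomes carrying allele k.
    Alleles never seen in a small sample still receive non-zero frequency,
    which keeps downstream likelihoods finite.
    """
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]

if __name__ == "__main__":
    # Toy sample of four chromosomes at a marker with three known alleles.
    print(posterior_mean_frequencies([3, 0, 1]))
```

For a single allele drawn from a population, this posterior mean coincides with the posterior predictive probability of that allele, so unobserved alleles are down-weighted rather than ruled out.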
2.7 Acknowledgments
We thank the Albuquerque Biological Park, Detroit Zoo, Lincoln Park Zoo, Riverside
Zoo, Sunset Zoo, the Primate Foundation of Arizona, New Iberia Research Cen-
ter, and the Southwest Foundation for Biomedical Research for sharing chimpanzee
samples. We thank Gavin McDonald for sample processing and Jim Weber at Pre-
ventionGenetics for personally ensuring that the genotypes for the chimpanzees were
carefully scored. We thank Jean Wickings, Svante Pääbo, Philip Morin, Anne Fischer,
and four anonymous reviewers, for sharing their samples and/or comments on
earlier versions of the manuscript. We are also grateful to Jennifer Caswell, Graham
Coop, Sridhar Kudaravalli, Daniel Lieberman, Simon Myers, John Novembre, David
Pilbeam, Alkes Price, and Maryellen Ruvolo for useful discussions.
Author contributions. MP and DR conceived and designed the experiments.
CB, MP, and DR performed the experiments and wrote the paper. CB, NP, MP, and
DR analyzed the data. CB, NP, ACS, MP, and DR contributed reagents/materials/analysis
tools. MP and DR co-supervised the work.
Funding. This work was supported by a National Institutes of Health K-01 ca-
reer transition award to NP, an Alfred P. Sloan Fellowship to MP, and a Burroughs
Wellcome Career Development Award to DR. Chimpanzee-sampling and subspecies
identification analyses were supported by a grant from the National Science Founda-
tion (BCS-0073871) to AS.
2.8 Appendix A: Testing for inbreeding
            Wild-caught              Captive-born
Western     0.606 ($< 10^{-3}$)      0.611 ($< 10^{-3}$)
Central     0.697 ($< 10^{-3}$)      −
Table 2.7: Mean heterozygosity within individuals from each population. The samples are 34 wild-caught Western chimpanzees, 14 captive-born Western chimpanzees and 13 wild-caught Central chimpanzees. The p-value for being lower than expected, calculated from 1,000 permutations, is indicated in parentheses (see Methods).
            Wild-caught      Captive-born
Western     61.9 (0.001)     65.5 (0.208)
Central     117.6 (0.186)    −
Table 2.8: Mean squared difference in repeat units within individuals in each population. See details in Supplemental Table 2.7.
Figure 2.5: Distribution of mean heterozygosity within individuals (Hw, orange). We carry out a separate analysis for (a) 34 wild-caught, (b) 14 captive-born Western chimpanzees, and (c) 13 wild-caught Central chimpanzees. We assigned individuals to subpopulations according to the results of STRUCTURE, excluding individuals with > 5% ancestry in more than one population. The red vertical lines are the mean Hw in each population and the blue lines are the mean Hw in permutations. The mean Hw is significantly lower (at the p < 0.05 significance level) than expected by the randomized Hw in all populations (see Supplemental Table 2.7). [Panels: (a) wild-caught Western chimpanzees, (b) captive-born Western chimpanzees, (c) wild-caught Central chimpanzees; x-axis: Hw; legend: Hw, mean observed Hw, mean randomized Hw.]
[Figure 2.6: three histogram panels, (a) wild-caught Western chimpanzees, (b) captive-born Western chimpanzees, (c) wild-caught Central chimpanzees; x-axis: Rw; legend: Rw, mean observed Rw, mean randomized Rw.]
Figure 2.6: Distribution of mean squared difference in repeat units within individuals (Rw, orange).
See details in legend of Figure 2.5. The red vertical lines are the mean Rw in each population and the blue lines are the mean Rw from permutation. Mean Rw is significantly lower (p < 0.05) than expected by the randomized Rw in wild-caught Western and Central chimpanzees (see Supplemental Table 2.8).
CHAPTER 3
A NEW APPROACH TO ESTIMATE PARAMETERS OF
SPECIATION MODELS, WITH APPLICATION TO APES
3.1 Abstract
How populations diverge and give rise to distinct species remains a fundamental ques-
tion in evolutionary biology, with important implications for a wide range of fields,
from conservation genetics to human evolution. A promising approach is to estimate
parameters of simple speciation models using polymorphism data from multiple loci.
Existing methods, however, make a number of assumptions that severely limit their
applicability, notably, no gene flow after the populations split and no intra-locus
recombination. To overcome these limitations, we developed a new Markov Chain
Monte Carlo method to estimate parameters of an isolation-migration model. The
approach uses summaries of polymorphism data at multiple loci surveyed in a pair of
diverging populations or closely related species and, importantly, allows for intra-locus
recombination. To illustrate its potential, we applied it to extensive polymorphism
data from populations and species of apes, whose demographic histories are largely
unknown. The isolation-migration model appears to provide a reasonable fit to the
data. It suggests that the two chimpanzee species became reproductively isolated
in allopatry ∼850 Kya, while Western and Central chimpanzee populations split ap-
proximately 440 Kya, but continued to exchange migrants. Similarly, Eastern and
Western gorillas and Sumatran and Bornean orangutans appear to have experienced
gene flow since their splits ∼90 and over 250 Kya, respectively.
3.2 Introduction
Although central to evolutionary biology, the question of how species form remains
largely open. In fact, the very definition of species is a subject of active debate (Hey,
2006a). The most common definition is the “biological” one, in which species are
defined as groups of interbreeding organisms that are reproductively isolated from
other populations. The introduction of this concept over 60 years ago transformed
the study of speciation into a research program to examine the conditions under which
reproductive isolation emerges and to uncover its genetic architecture (Mayr, 1963).
Accumulating evidence suggests that incipient species arise primarily in popula-
tions with restricted gene flow, as alleles (or combinations of interacting alleles) that
contribute to reproductive isolation reach fixation (e.g., Wittbrodt et al., 1989; Sawa-
mura et al., 1993; Ting et al., 1998; Wang et al., 1999; Fossella et al., 2000; Barbash
et al., 2003; Presgraves et al., 2003; Coyne and Orr, 2004a). The speciation process
initiates after two populations become completely isolated from one another (i.e., are
in allopatry) or as they continue to exchange migrants (i.e., in parapatry).
Under a model of allopatric speciation, the process occurs through the homoge-
neous divergence of the genome. Shortly after the split, the two populations share
alleles due to the persistence of ancestral polymorphism (more so if the ancestral pop-
ulation size is large). Eventually, however, the shared alleles are lost or reach fixation
and the two populations start to accumulate fixed differences, either by genetic drift
or due to differential adaptation (Coyne and Orr, 2004d). Under a simple allopatric
model with no selection, it will take approximately 9−12N generations (where N
is the effective size of the descendant population) for the genealogies of >95% of
loci to be reciprocally monophyletic, and hence for the two populations not to share
alleles that are identical by descent (Hudson and Coyne, 2002). Given these assump-
tions, humans and common chimpanzees should almost never share alleles (as they
are thought to have diverged ∼20−25N generations ago; Wall, 2003; Hobolth et al.,
2006; Patterson et al., 2006b), while bonobos and common chimpanzees are expected
to share alleles at about 50% of loci (since they are estimated to have diverged ∼4N
generations ago, Won and Hey, 2005).
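The figure of ∼4N generations for bonobos and common chimpanzees can be checked with simple arithmetic, using values reported elsewhere in this chapter: a split ∼850 Kya, 20 years per generation and an effective population size of ∼10,000. The function name below is hypothetical and the calculation is only a rough consistency check.

```python
def split_time_in_units_of_n(split_years, years_per_generation, effective_size):
    """Express a split time in multiples of N generations."""
    generations = split_years / years_per_generation
    return generations / effective_size


# Bonobos and common chimpanzees: ~850 Kya, 20 years per generation, N ~ 10,000.
ratio = split_time_in_units_of_n(850_000, 20, 10_000)
print(ratio)  # 4.25
```

A value close to 4N is well below the 9−12N generations needed for most loci to become reciprocally monophyletic, which is why shared alleles are still expected at a substantial fraction of loci.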
If the incipient species are in parapatry, however, divergence is not believed to oc-
cur homogeneously across the genome, but instead in a number of stages (Wu, 2001b).
First, alleles that cause a decrease in hybrid fertility or viability reach fixation in the
parental populations. The populations may continue to exchange migrants but in the
genomic regions carrying functionally divergent or incompatible alleles, gene flow is
selected against and hence effectively restricted. By contrast, in unlinked (or loosely
linked) genomic regions, alleles can be brought in by migrants with no associated
fitness costs. Thus, at neutral loci, populations share alleles longer than expected
under allopatric speciation. Eventually, reproductive isolation factors accumulate in
sufficient numbers as to prevent gene flow throughout the genome − the final stage of
speciation. This model predicts variation in the number of shared alleles and levels of
divergence along the genomes of closely related species. While shared alleles are also
expected under a model of recent allopatric speciation, greater variation is expected
along the genome under parapatry, such that, with enough data, the two scenarios
should be distinguishable.
In these simple speciation models, the salient parameters are the split times, ef-
fective population sizes, and, in the case of parapatry, the gene flow rates. Thus,
learning about these parameters should greatly deepen our understanding of spe-
ciation. This realization has motivated the development of statistical methods to
estimate the parameters from multi-locus patterns of polymorphism in closely related
species.
Existing methods all assume that genetic variation data are available from both
populations, at a number of independently-evolving loci, but differ in their assump-
tions about gene flow and recombination, and in whether they use all the polymor-
phism data or summaries. Loosely, they can be classified into two groups. The first
set assumes an extreme model of allopatry, in which a panmictic (i.e. randomly
mating) ancestral population instantaneously splits into two panmictic descendant
populations, with no subsequent gene flow. In this model, there are four parameters:
the three effective population sizes and the split time. The parameters are estimated
using summaries of the polymorphism data, either by a moment estimator (Wakeley
and Hey, 1997; Kliman et al., 2000), or by maximum likelihood (Leman et al., 2005;
Putnam et al., 2007). While, in its current version, the method of Leman et al. (2005)
can only be applied to one, non-recombining locus, other methods can be applied to
multiple loci, and incorporate recombination (Wakeley and Hey, 1997; Putnam et al.,
2007). They use highly summarized versions of the data, however, at the potential
cost of much information. Moreover, in the presence of gene flow after the split,
their estimates will be biased − the ancestral effective population size will tend to be
over-estimated (Wall, 2003) and the split time under-estimated (Leman et al., 2005).
The second set of methods considers a more general model, often called the
“isolation-migration” model, in which there is gene flow between incipient species
throughout the genome, either at fixed (Hey and Nielsen, 2004) or at locus-specific
rates (Won and Hey, 2005). The parameters are estimated from all the polymor-
phism data at a single locus (Nielsen and Wakeley, 2001) or at multiple loci (Hey and
Nielsen, 2004), using Markov Chain Monte Carlo (MCMC). The Hey and Nielsen
method, henceforth called IM, has been applied to a number of species, from He-
liconius (Bull et al., 2006) to cichlids (Hey et al., 2004; Won et al., 2005). These
applications suggest that speciation often occurs in the presence of some gene flow
(Hey, 2006a).
While IM considers a wide range of models, it assumes that haplotypes are known
and that there is no intra-locus recombination. Although not ideal, the first assump-
tion is not restrictive, as a two-step procedure can be used, in which haplotype phase
is inferred (e.g., using the program PHASE, Stephens et al., 2001), then IM is run on
the phased data. In contrast, the assumption of no recombination is more limiting,
because the method can only be applied to autosomal loci by excluding segments or
haplotypes with evidence for recombination. This practice is likely to bias estimates
of the parameters, as excluding segments with visible recombination will tend to lead
to shorter genealogical histories (Hey and Nielsen, 2004). Moreover, if intra-locus
recombination is not taken into account, a small variance in divergence times across
segments may be confounded with a small ancestral effective population size (Taka-
hata and Satta, 2002). The assumption of no intra-locus recombination represents an
especially severe limitation in species in which the ratio of recombination to mutation
is thought to be high (e.g., Drosophila melanogaster, Andolfatto and Wall, 2003; or
Papilio glaucus, Putnam et al., 2007). In such species, any segment with polymor-
phisms in a sample is likely to have experienced numerous recombination events in
its genealogical history, making recombination hard to ignore (Hudson and Kaplan,
1985; Nordborg and Tavare, 2002).
To overcome this limitation, we developed a new Bayesian approach to estimate
parameters of an isolation-migration model from recombining loci. We have in mind
data sets similar to the ones most commonly collected to date: short non-coding
sequences distributed throughout the genome. Our approach is to summarize the
polymorphism data at each locus by four statistics known to be sensitive to the
parameters of interest (Wakeley and Hey, 1997; Leman et al., 2005). We then estimate
the posterior probability of the parameters given these summaries using MCMC.
Simulations suggest that, in the absence of recombination, our method performs as
well or almost as well as the full likelihood approach. Moreover, the approach presents
the advantage of being quite flexible in the demographic model that it can consider,
and in allowing for intra-locus recombination.
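To make the data summaries concrete, the sketch below counts, for one locus, the four classes of sites that we understand Wakeley and Hey (1997) to use: polymorphisms exclusive to population 1 (S1) or to population 2 (S2), polymorphisms shared by both samples (Ss), and fixed differences (Sf). It is a simplified illustration that assumes aligned sequences with at most two alleles per site and ignores the outgroup; the function name and the input layout are hypothetical.

```python
def wakeley_hey_counts(pop1, pop2):
    """Classify each aligned site of two samples of sequences (lists of strings)."""
    counts = {"S1": 0, "S2": 0, "Ss": 0, "Sf": 0}
    for site in range(len(pop1[0])):
        alleles1 = {seq[site] for seq in pop1}
        alleles2 = {seq[site] for seq in pop2}
        polymorphic1 = len(alleles1) > 1
        polymorphic2 = len(alleles2) > 1
        if polymorphic1 and polymorphic2:
            counts["Ss"] += 1      # segregating in both samples
        elif polymorphic1:
            counts["S1"] += 1      # segregating only in population 1
        elif polymorphic2:
            counts["S2"] += 1      # segregating only in population 2
        elif alleles1 != alleles2:
            counts["Sf"] += 1      # fixed for different alleles
    return counts


pop1 = ["AACGT", "ATCGT", "ATCGA"]
pop2 = ["AACCT", "AACCA", "AAGCA"]
print(wakeley_hey_counts(pop1, pop2))
```

In this toy alignment, each of the four classes is represented by exactly one site, and the first site is invariant.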
We illustrate the potential of our method by applying it to multi-locus polymor-
phism data from non-coding loci in chimpanzees, gorillas and orangutans. Very little
is known about the evolutionary history of great apes, in part because of a poor
fossil record. Chimpanzees, the closest living relatives of humans, are classified into
two species, common chimpanzees (Pan troglodytes) and bonobos (P. paniscus), both
found exclusively in Africa. The two chimpanzee species were thought to have di-
verged as a result of the formation of the River Congo 1.5−3.5 million years ago
(Mya) (Beadle, 1981; Myers Thompson, 2003), but recent estimates of their split
time based on genetic data appear to be too recent for this scenario to be plausi-
ble (Fischer et al., 2004; Won and Hey, 2005). Common chimpanzees are usually
subdivided further into three (or sometimes four) “subspecies”, including Eastern
(P. troglodytes schweinfurthii), Central (P. troglodytes troglodytes), and Western (P.
troglodytes verus) (Hill, 1969). The meaning of the term “subspecies” is unclear, at
least to us, but the labels are thought to correspond to the most pronounced popula-
tion structure within the species. This view is supported by a recent analysis of 310
microsatellites, which found three populations within common chimpanzees, which
correspond to the three subspecies, and little evidence of recent gene flow between
them (Becquet et al., 2007).
Gorillas, in turn, are classically subdivided into two subspecies: Western (Gorilla
gorilla) and Eastern gorilla (G. beringei), found in western and central African forest,
respectively (Groves, 1970). Some controversy surrounds this classification; the range
of the two populations does not currently overlap in the wild, but on the basis of
morphological and genetic diversity, it has been proposed that the subspecies should
be elevated to the rank of species (e.g., Grubb et al., 2003). Here, we refer to Western
and Eastern gorillas as subspecies or populations. A recent application of IM to
polymorphism data from the two gorilla populations suggests that they split between
0.08 and 1.6 Mya, and experienced low levels of gene flow since (Thalmann et al.,
2006).
Even less is known about the history of orangutans (Pongo pygmaeus), currently
found only in Indonesia and Malaysia, but whose range is thought to have spanned
much of South East Asia until recently (Smith and Pilbeam, 1980). Some taxonomies
consider Sumatran (P. p. abelii) and Bornean (P. p. pygmaeus) orangutans to be
subspecies (e.g., Groves, 1971), and others to be species (e.g., Zhi et al., 1996). Again,
these populations do not overlap in their range, so that the classification is based on
morphology and behavior, as well as on patterns of genetic diversity. The islands
of Sumatra and Borneo were fully formed 500 thousand years ago (Kya), but were
reconnected by land bridges during the two last glaciations, ∼130−200 Kya and
∼10−100 Kya, respectively (Muir et al., 2000; Hughes et al., 2006). Estimates of
the average time to the most recent common ancestor for both populations based on
mitochondrial DNA (mtDNA) and a small number of microsatellites and autosomal
and X-linked loci are 1.5−2.5 Mya (Zhi et al., 1996; Kaessmann et al., 2001; Zhang
et al., 2001), but to our knowledge, there are no published estimates of the population
split time.
Here, we analyze a compilation of multi-locus polymorphism data recently pub-
lished in the three great ape species (Yu et al., 2003; Fischer et al., 2006; Thalmann
et al., 2006), refining population parameter estimates for chimpanzees and gorillas
and providing the first estimates for orangutans.
3.3 Results
We developed a method that estimates the demographic parameters of an
“isolation-migration” model from recombining loci (Figure 3.1). There are five param-
eters of interest: the population mutation rates for the two descendant populations,
θ1 and θ2, and the ancestral population, θA, the time since the populations split
in generations, T , and the migration rate, m. To estimate these parameters, the
method requires resequencing data from two populations (or closely related species)
at independently-evolving loci, and an outgroup sequence. Briefly, the polymorphism
data for each locus are summarized by the four statistics studied by Wakeley and Hey
(1997), as these carry information about the divergence time and other parameters
of interest (Wakeley and Hey, 1997; Leman et al., 2005). We choose the parame-
ters of the model from prior distributions and for each locus, we generate a set of
genealogies under a model with those parameters. We then estimate the likelihood
by calculating the probability of the data summaries at all the loci given the set of
genealogies and the parameters. Finally, we obtain a sample from the posterior dis-
tribution of the parameters given the data summaries using MCMC (see Methods).
Thus, our method follows a number of Bayesian approaches that use summaries of
the data, but differs in that we update the parameters using MCMC (see Methods
for more details). Hereafter, we refer to our method as MIMAR: MCMC estimation of
the isolation-migration model allowing for recombination.
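The general structure of such an analysis can be sketched on a much simpler problem. The code below is not MIMAR: it is a toy example, under stated assumptions, that estimates a single per-locus population mutation rate θ for one panmictic population from the number of segregating sites at several non-recombining loci. It mirrors the steps described above: for each proposed parameter value, genealogies are simulated, the likelihood is estimated as the average probability of the data summary given each genealogy, and the parameter is updated with a Metropolis-Hastings step under a uniform prior. All function names, prior bounds and tuning values are hypothetical.

```python
import math
import random


def total_tree_length(n, rng):
    """Total branch length of a standard coalescent genealogy for n sequences.

    Time is in units of 4N generations, so while k lineages remain the
    waiting time to the next coalescence is exponential with rate k(k-1).
    """
    return sum(k * rng.expovariate(k * (k - 1)) for k in range(n, 1, -1))


def log_poisson_pmf(count, mean):
    return count * math.log(mean) - mean - math.lgamma(count + 1)


def estimated_log_likelihood(theta, seg_sites, n, n_genealogies, rng):
    """Monte Carlo estimate of log P(data | theta), averaging over genealogies."""
    total = 0.0
    for s in seg_sites:
        # Given a genealogy of total length L, the number of segregating sites
        # is Poisson with mean theta * L (infinite sites, no recombination).
        probs = [
            math.exp(log_poisson_pmf(s, theta * total_tree_length(n, rng)))
            for _ in range(n_genealogies)
        ]
        total += math.log(max(sum(probs) / n_genealogies, 1e-300))
    return total


def metropolis_hastings(seg_sites, n, n_iter=1500, n_genealogies=30,
                        prior=(0.5, 20.0), step=0.8, seed=1):
    """Sample theta from its posterior under a uniform prior on `prior`."""
    rng = random.Random(seed)
    theta = rng.uniform(*prior)
    loglik = estimated_log_likelihood(theta, seg_sites, n, n_genealogies, rng)
    chain = []
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)
        # Proposals outside the support of the prior are rejected outright.
        if prior[0] <= proposal <= prior[1]:
            new_loglik = estimated_log_likelihood(
                proposal, seg_sites, n, n_genealogies, rng)
            if rng.random() < math.exp(min(0.0, new_loglik - loglik)):
                theta, loglik = proposal, new_loglik
        chain.append(theta)
    return chain
```

Calling `metropolis_hastings(counts, n=10)` on a list of per-locus counts of segregating sites returns the sampled values of θ; the first part of the chain would be discarded as burn-in before summarizing the posterior. In this toy model the genealogies do not depend on θ, whereas under the isolation-migration model they depend on the split time and migration rate, which is why they are regenerated at every step here.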
Figure 3.1: The “isolation-migration” model.
Two populations diverged T generations ago from a common ancestral population. The parameters θ1, θ2 and θA are the population mutation rates per base pair for populations 1, 2 and the ancestral population, respectively. The split time in generations is T and m is the symmetrical migration rate between populations per generation (see Methods for details).
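The five parameters can be gathered in a small container, together with the conversions to more familiar units. This is a sketch under the standard assumption that θ = 4Nµ for a diploid autosomal locus, where µ is the mutation rate per base pair per generation; the class and method names are hypothetical, and the input values in the example are round illustrative numbers rather than estimates.

```python
from dataclasses import dataclass


@dataclass
class IsolationMigrationParams:
    theta1: float             # population mutation rate per bp, population 1
    theta2: float             # population mutation rate per bp, population 2
    theta_a: float            # population mutation rate per bp, ancestor
    split_generations: float  # T, time since the split in generations
    migrants: float           # M = 4 * N1 * m

    def effective_sizes(self, mu):
        """Return (N1, N2, NA), assuming theta = 4 * N * mu."""
        return tuple(
            theta / (4.0 * mu)
            for theta in (self.theta1, self.theta2, self.theta_a)
        )

    def split_years(self, years_per_generation):
        return self.split_generations * years_per_generation

    def migration_rate(self, mu):
        """Return m per generation, from M = 4 * N1 * m."""
        n1 = self.theta1 / (4.0 * mu)
        return self.migrants / (4.0 * n1)


# Illustrative values only, with mu = 2e-8 and 20 years per generation.
params = IsolationMigrationParams(0.0008, 0.0008, 0.0024, 42_500, 1.0)
n1, n2, na = params.effective_sizes(mu=2e-8)
print(round(n1), round(n2), round(na))
print(params.split_years(20))
print(params.migration_rate(mu=2e-8))
```

With these inputs the descendant populations have effective sizes of about 10,000, the ancestral population about 30,000, and the split is dated to 850,000 years ago.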
3.3.1 Performance of MIMAR under the allopatric model
In order to assess the performance of our method, we generated 30 simulated data
sets, each consisting of 20 non-recombining loci, with parameter values applicable
to Drosophila species in which related studies have been conducted (Llopart et al.,
2005, see Methods). Supplemental Figure 3.8 (Appendix B) shows the 30 posterior
distribution samples for the four parameters of interest. As can be seen, the posterior
distributions estimated by MIMAR for θ1, θ2 and T are centered around their true
values with relatively little variance, suggesting that the summaries that we use carry
much information to estimate these parameters precisely and accurately. However, for
these parameters, 20 non-recombining loci do not seem to contain as much information
about the ancestral effective population size, leading to a wider posterior distribution
estimate for θA. This does not appear to be a feature of our statistics, since the use
of IM yields similar results, even though it is based on the full polymorphism data
set (results not shown). As expected, our estimates of θA become more precise with
larger data sets (results not shown).
3.3.2 Comparison to IM for the case of no recombination
              Mean absolute error          Mean of the estimate divided by the true value
Parameters    MIMAR          IM            MIMAR      IM
θ1            0.0003         0.0002        1.000      0.983
θ2            0.0004         0.0003        1.001∗     0.968∗
θA            0.0027         0.0037        0.927      0.875
T             5.19 × 10^5    5.94 × 10^5   1.004      0.980
Table 3.1: Performance of MIMAR and IM.
Precision and accuracy for the four parameters of the allopatric model (using the mode as a point estimate). MIMAR and IM were applied to 30 multi-locus data sets simulated under the allopatric model (see Methods for details).
∗ The biases in θ2 estimates from IM and MIMAR are significantly different at the 5% level, after Bonferroni correction (p = 0.006 using a Wilcoxon signed rank test).
Next, we studied the performance of MIMAR by generating 30 simulated data sets
under the allopatric model for 20 non-recombining loci, but this time drawing the
parameters from prior distributions (see Methods for details); the parameter values
are, as above, applicable to Drosophila species. The results confirm that the estimates
of θ1, θ2 and T are precise and have very little bias, while the estimates of θA are less
precise (Table 3.1 and Figure 3.2).
We analyzed the same simulated data sets with IM to compare the estimates from
MIMAR, which are based on summaries, to a full likelihood approach (since IM does
not allow for recombination, we set the intra-locus recombination rate to 0 when
generating the 30 data sets). We found that the two methods perform similarly well,
in terms of both accuracy and precision (see Table 3.1). For example, the mean
absolute error over the 30 simulated data sets for the estimate of T is 5.19 × 10^5
using MIMAR and 5.94 × 10^5 using IM. Similarly, if we consider the estimate divided
by the true value as a measure of bias, the mean over the 30 data sets is 1.004 for
MIMAR and 0.980 for IM. Similar results were obtained for all parameters, with the
possible exception of the current effective population sizes, for which MIMAR appears
to yield slightly more reliable estimates (see Table 3.1). Moreover, we found that
the two methods have similar coverage: For both, the central 95th percentiles of the
marginal posterior distribution sample for T included the true value in 29 out of 30
cases; for θA, this occurred in 29 out of 30 cases for IM and 30 out of 30 cases for
MIMAR. We also compared the results of MIMAR and IM on larger simulated data sets
of 100 loci, and found that, in this case, IM outperformed MIMAR. However, with such
large data sets, both methods provided highly accurate and precise estimates (results
not shown).
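The three performance measures used in this comparison are simple to compute once point estimates and posterior samples are available. The sketch below shows one way to obtain the mean absolute error, the mean of the estimate divided by the true value, and the number of data sets whose central 95% posterior interval contains the true value. The function names are hypothetical and the interval is taken from order statistics of the posterior sample, which is an assumption about implementation detail.

```python
def mean_absolute_error(estimates, true_values):
    """Precision: average distance between estimate and truth."""
    return sum(abs(e - t) for e, t in zip(estimates, true_values)) / len(estimates)


def mean_relative_estimate(estimates, true_values):
    """Accuracy: average of estimate / truth (1 means no bias)."""
    return sum(e / t for e, t in zip(estimates, true_values)) / len(estimates)


def central_interval(sample, level=0.95):
    """Central interval of a posterior sample, from its order statistics."""
    ordered = sorted(sample)
    k = int(len(ordered) * (1.0 - level) / 2.0)
    return ordered[k], ordered[len(ordered) - 1 - k]


def coverage(posterior_samples, true_values, level=0.95):
    """Number of data sets whose central interval contains the true value."""
    hits = 0
    for sample, truth in zip(posterior_samples, true_values):
        low, high = central_interval(sample, level)
        hits += low <= truth <= high
    return hits
```

For a well-calibrated method, the coverage over 30 data sets is expected to be close to 28 or 29, in line with the counts reported above.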
In the comparison, we ran both methods long enough for them to appear to have
reached convergence (Supplemental Figure 3.9). For the same number of iterations
of the MCMC, IM was 2−3 times faster than MIMAR (results not shown).
[Figure 3.2: four scatter plots of MIMAR (x-axis) against IM (y-axis); left panels: θ̂A − θA (upper) and T̂ − T (lower); right panels: θ̂A/θA (upper) and T̂/T (lower).]
Figure 3.2: Performance of MIMAR (x-axis) and IM (y-axis).
Precision and accuracy of the estimates of θA (upper panel) and T (lower panel) obtained from the mode of the marginal posterior distribution samples. Each data point is the result from one of 30 data sets simulated under the allopatric model, with parameter values sampled from prior distributions (see Methods for details). If both methods provided estimates with perfect precision, all the data points in the left panels would be located at (0, 0). If one method provided systematically more precise estimates, the data points would align along its axis: x-axis for MIMAR or y-axis for IM. Similarly, if both methods provided estimates with perfect accuracy, all the data points in the right panels would be at the intersection of the two grey lines (1, 1). If one method provided more accurate estimates, the data points would align along the vertical grey line for MIMAR or the horizontal grey line for IM. As can be seen, both methods perform similarly well (see Table 3.1; p-value from Wilcoxon signed rank test >5%, after Bonferroni correction for multiple tests).
3.3.3 Assessing the evidence for gene flow
[Figure 3.3: six panels plotted against data set number (x-axis: # data set); left panels: θ̂A − θA, T̂ − T and M̂ − M; right panels: θ̂A/θA, T̂/T and M̂/M.]
Figure 3.3: Performance of MIMAR in the presence of gene flow.
Precision and accuracy of the estimates of θA (upper panel), T (middle panel) and M (lower panel) obtained from the mode of the marginal posterior distribution samples. Each data point is the result from one of 20 data sets (randomly numbered along the x-axis) simulated under the parapatric model with intra-locus recombination and parameter values sampled from prior distributions. The black lines are the means over the 20 simulated data sets (see Table 3.2 and Methods for details).
To assess our ability to distinguish a model with gene flow from one without, we
generated 20 simulated data sets (each consisting of 40 recombining loci) under both
an allopatric and a parapatric model, with parameter values applicable to apes. In
the parapatric model, we fixed the expected number of migrants, M = 4N1m, to 1,
which corresponds to an average of 11 migration events in the history of the sample.
We applied MIMAR to the 40 data sets, allowing for recombination and sampling the
expected number of migrants from the prior ln(M) ∼ U [−5, 2] (see Methods for
details). When applied to data sets generated under a model with no gene flow,
              Mean absolute error          Mean of the estimate divided by the true value
Parameters    M > 0          M = 0         M > 0      M = 0
θ1            0.0008         0.0005        1.144      1.153
θ2            0.0008         0.0005        1.092      1.085
θA            0.0003         0.0004        1.000      0.880∗
T             1.81 × 10^4    5.66 × 10^3   0.721∗     0.965
M             1.0436         0.487         1.293      NA
Table 3.2: Performance of MIMAR when detecting gene flow.
Precision and accuracy for the five parameters of the isolation-migration model (using the mode as a point estimate). MIMAR was applied to 20 multi-locus data sets simulated under parapatric and allopatric models (see Methods for details). When M = 0, the mean estimate of θA is significantly lower than the true value (p = 0.0003, using a Wilcoxon signed rank test). This can be explained as follows: The prior on M does not include 0 (the true value) so M is necessarily an over-estimate and consequently, θA tends to be under-estimated slightly. This problem is likely to apply to IM as well, since the prior on M is likewise exclusive of 0. When M = 1, the mean estimate of T is significantly lower than the true value (p = 0.005, using a Wilcoxon signed rank test). This can be explained by the fact that, whenever M and/or θA are slightly under-estimated, T tends to be under-estimated (see Figure 3.3).
∗ A significant bias in the estimates at the 5% level, after Bonferroni correction for multiple tests.
MIMAR suggested no migration (using the criterion that the mode of the marginal
posterior distribution, M , be <0.1) in 14 out of 20 cases; moreover, in 1 out of the 6
cases in which M ≥ 0.1, most of the posterior probability mass for M was close to 0
(results not shown). For the data sets simulated with gene flow, there was evidence
of migration (i.e., M ≥ 0.1) in 17 out of 20 cases. Other parameter estimates were
generally accurate and precise (see Table 3.2 and Figure 3.3), although the estimates
of θA were slightly under-estimated in data sets generated with M = 0, and the
estimates of T were slightly under-estimated in data sets generated with M = 1 (see
Table 3.2 for possible explanations).
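The prior on M and the decision rule used above can be written out explicitly. In the sketch below, M is drawn such that ln(M) is uniform on [−5, 2], and a data set is taken to show evidence of gene flow when the mode of the marginal posterior sample for M is at least 0.1. The mode is located with a simple histogram, which is an assumption made for illustration; the function names and the number of bins are hypothetical.

```python
import math
import random


def sample_migration_prior(rng, low=-5.0, high=2.0):
    """Draw M with ln(M) uniform on [low, high]; M = 0 is never sampled."""
    return math.exp(rng.uniform(low, high))


def histogram_mode(sample, n_bins=50):
    """Midpoint of the most populated bin of an equal-width histogram."""
    lowest, highest = min(sample), max(sample)
    if highest == lowest:
        return lowest
    width = (highest - lowest) / n_bins
    counts = [0] * n_bins
    for x in sample:
        counts[min(int((x - lowest) / width), n_bins - 1)] += 1
    return lowest + (counts.index(max(counts)) + 0.5) * width


def suggests_gene_flow(posterior_m, threshold=0.1):
    """Apply the criterion that the posterior mode of M is >= threshold."""
    return histogram_mode(posterior_m) >= threshold
```

Because the support of this prior runs from about 0.007 to about 7.4, a posterior mode below the threshold is the closest the analysis can come to supporting M = 0.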
When we applied either MIMAR or IM to smaller simulated data sets (i.e., 20 loci
and no intra-locus recombination), estimates of the split times and migration rates
provided by both methods were much less reliable (results not shown).
3.3.4 Sensitivity to intra-locus recombination rates
Intra-locus recombination rates are often unknown, or estimated with substantial
error. To assess how this might affect the reliability of MIMAR, we generated 16
data sets under an allopatric model, with parameter values applicable to Drosophila
(see above). Each data set consisted of 10 recombining loci, with the locus-specific
recombination rates chosen from an exponential distribution with mean c/µ = 10.
These data sets were analyzed using MIMAR by fixing all the parameters but T to their
true values, and i) setting the locus-specific recombination rates to their true values,
ii) sampling the recombination rates from the same prior as used when generating the
simulated data and iii) setting the intra-locus recombination rates to 0 (see Methods
for details). The results from i) and ii) were virtually identical, suggesting that error
in the locus-specific recombination rates does not have much effect on the results,
so long as intra-locus recombination is taken into account. In contrast, when we
assumed no recombination in our analysis of recombining loci, the estimates of the
split time were significantly less accurate and precise (see Figure 3.4). These results
highlight the importance of allowing for intra-locus recombination when estimating
demographic parameters.
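The three treatments of the recombination rate can be summarized in a few lines. The sketch below draws locus-specific ratios c/µ from an exponential distribution with mean 10 and returns the values that would be supplied to the analysis under settings i), ii) and iii). The function names and the setting labels are hypothetical.

```python
import random


def sample_recombination_ratios(n_loci, rng, mean_ratio=10.0):
    """Draw c/mu for each locus from an exponential with the given mean."""
    # random.expovariate takes the rate, i.e. one over the mean.
    return [rng.expovariate(1.0 / mean_ratio) for _ in range(n_loci)]


def analysis_ratios(true_ratios, setting, rng, mean_ratio=10.0):
    """Recombination ratios assumed in the analysis under each setting."""
    if setting == "true":      # i) fixed to the values used in simulation
        return list(true_ratios)
    if setting == "prior":     # ii) redrawn from the same distribution
        return sample_recombination_ratios(len(true_ratios), rng, mean_ratio)
    if setting == "zero":      # iii) intra-locus recombination ignored
        return [0.0] * len(true_ratios)
    raise ValueError("unknown setting: " + setting)


rng = random.Random(0)
true_ratios = sample_recombination_ratios(10, rng)
for setting in ("true", "prior", "zero"):
    print(setting, [round(r, 1) for r in analysis_ratios(true_ratios, setting, rng)])
```

Settings i) and ii) differ only in whether the per-locus values are known exactly, whereas setting iii) misspecifies the model itself.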
[Figure 3.4: two panels plotted against data set number (x-axis: # data set); upper panel: T̂ − T; lower panel: T̂/T.]
Figure 3.4: Sensitivity of MIMAR to intra-locus recombination.
Precision (upper panel) and accuracy (lower panel) of the estimate of T obtained from the mode of the marginal posterior distribution samples. Each data point is the result from one of 16 data sets (randomly numbered along the x-axis) simulated under the allopatric model with intra-locus recombination and parameter values sampled from prior distributions. These data sets were analyzed by fixing all the parameters but T to their true values, and i) fixing the locus-specific recombination rates to their true value (red squares), ii) sampling the recombination rates from the same prior as used when generating the simulated data (blue discs) and iii) setting the intra-locus recombination rates to 0 (grey triangles). The lines are the means over the 16 simulated data sets (see Methods for details). The precision and bias in the estimates of T from iii) and i) (or ii) are significantly different at the 5% level (p = 6 × 10^-5 and p = 0.002, respectively, using paired Wilcoxon signed rank tests).
3.3.5 Application to ape data
We compiled a set of recently published resequencing data in bonobo and common
chimpanzee, gorilla and orangutan populations (Yu et al., 2003; Fischer et al., 2006;
Thalmann et al., 2006). Won and Hey (2005) had previously reported evidence for
intra-locus recombination at some of the loci included in this study, and we found
further evidence of recombination, in spite of low power to do so (given the small sam-
ple sizes). We therefore analyzed these data sets with MIMAR allowing for intragenic
recombination (see Methods). For these analyses, we assumed that the recombination
rate is exponentially distributed across loci, but constant within a locus. This model
seems sensible for the short fragments (∼650 bp on average) that we considered, but
may not be appropriate for longer loci.
Chimpanzee species (Pan paniscus and P. troglodytes) and subspecies (P.
t. verus, P. t. troglodytes and P. t. schweinfurthii)
In Figures 3.5−3.6 are the marginal posterior distributions for the parameters of
the model, averaging the results for two independent runs (see Methods for details).
We considered each pair of populations in turn. Running MIMAR under a model that
allows for gene flow strongly suggests that the bonobo and the common chimpanzee
populations split without subsequent migration (Table 3.3). In contrast, there is
evidence of gene flow since the split of Western, Central and Eastern chimpanzee
populations (Table 3.4, see also Won and Hey, 2005). Figure 3.5 shows the posterior
distribution estimates for the parameters of the model for bonobo and common chim-
panzee populations and Figure 3.6, for Western, Central and Eastern chimpanzee
populations. We note slight support for gene flow between Eastern chimpanzees and
bonobos (see Figure 3.5c), whose ranges are closer together than those of bonobos
[Figure 3.5: three columns of panels, (a), (b) and (c), each showing posterior densities for population size (legend: bonobos; Western, Central or Eastern chimpanzees; ancestral population), migration and time.]
Figure 3.5: Smoothed marginal posterior distributions estimated by MIMAR from bonobo and common chimpanzee polymorphism data.
See Methods for details. The range of the x-axis corresponds to the support of the prior. The distributions are for the analyses of a) bonobos and Western chimpanzees, b) bonobos and Central chimpanzees and c) bonobos and Eastern chimpanzees.
and other chimpanzee subspecies. However, more data and more precise geographic
information are needed to evaluate this possibility, especially in light of the relatively
unreliable estimates of migration from small data sets (see simulation results above).
[Figure 3.6: three columns of panels, (a), (b) and (c), each showing posterior densities for population size (legend: the two chimpanzee populations compared and the ancestral population), migration and time.]
Figure 3.6: Smoothed marginal posterior distributions estimated by MIMAR from the common chimpanzee subpopulations polymorphism data.
See Methods and legend of Figure 3.5 for details. The distributions are for the analyses of a) Western and Central chimpanzees, b) Western and Eastern chimpanzees and c) Central and Eastern chimpanzees.
Assuming 20 years per generation and a mutation rate of 2 × 10^−8 per base pair per
generation (see Methods), the estimates of the effective population sizes of bonobos
and Western chimpanzees are ∼10,000 in all analyses involving these populations. In
turn, the estimates of split time for bonobos and common chimpanzee populations
range from 790 to 920 Kya, and the estimates of the ancestral effective population
size are ∼30,000. These estimates are consistent with those obtained by Won and
Hey (2005), who applied IM to a smaller data set, which overlaps with ours. The
only exception is that they estimated a smaller ancestral effective population size
than we did, but the confidence intervals overlap slightly. These results confirm that
polymorphism data from bonobos and common chimpanzees are consistent with an
allopatric speciation model and that the divergence between the chimpanzee species
occurred more recently than the estimated formation of the River Congo.
In the analyses of Western and Central chimpanzees and Western and Eastern
chimpanzees, the time estimates range from 280−440 Kya, the ancestral effective
population sizes from 11,000−15,000, and the migration rate, M = 4N1m, from
0.32−0.43 (where N1 is the effective population size of Western chimpanzees). These
results are roughly consistent with those of Won and Hey (2005): using a model that
allowed for asymmetric migration rates, they estimated that M ≈ 0.28 from Western
to Central chimpanzees, but did not find evidence for gene flow in the opposite
direction.
For the analyses of Central and Eastern chimpanzees, the split time estimate is
∼220 Kya, the ancestral effective population size is ∼46,000 and the migration rate,
M = 4N1m, is ∼0.80 (where N1 is the effective population size of Central
chimpanzees). Thus, it appears that the split time for Central and Eastern chimpanzees
is about half that of Western and Central (or Eastern) chimpanzees.
Table 3.3: Results for chimpanzee species.

Analysis(a)                      Loci(b)  n1(c)    n2       Estimate         N1(d)    N2       NA       T*(e)      M(f)
Bonobos × Western chimpanzees    69       18 (16)  20 (12)  Mode             9,790    9,790    33,300   873,000    0.007
                                                            5th percentile   8,360    7,820    25,200   681,000    0.007
                                                            95th percentile  12,000   11,700   44,300   1,070,000  0.031
Bonobos × Central chimpanzees    68       18 (16)  20 (10)  Mode             9,900    21,900   33,800   918,000    0.008
                                                            5th percentile   7,870    18,300   27,300   759,000    0.007
                                                            95th percentile  11,300   27,000   46,800   1,170,000  0.036
Bonobos × Eastern chimpanzees    26       18       20       Mode             11,500   19,900   31,600   785,000    0.062
                                                            5th percentile   9,150    15,300   22,200   616,000    0.001
                                                            95th percentile  15,200   25,600   48,700   1,350,000  0.100

(a) Estimates are obtained from two independent runs (see Methods).
(b) Number of loci used in the analyses.
(c) n1 and n2 are the number of chromosomes in the first and second population of the analysis, respectively (the sample size varies because we pooled loci from multiple studies and because of missing data).
(d) NA, N1 and N2 are the estimates of the effective population size for the ancestral, first and second population of the analysis, respectively.
(e) T* is the estimate of the time since the populations split in years.
(f) M = 4N1m, where N1 is the effective population size of the first population of the analysis.
Table 3.4: Results for chimpanzee subspecies. See legend of Table 3.3 for details.

Analysis                         Loci     n1       n2       Estimate         N1       N2       NA       T*         M
Western × Central chimpanzees    68       20 (12)  20 (10)  Mode             9,750    33,000   15,000   439,000    0.315
                                                            5th percentile   7,690    24,200   6,140    325,000    0.097
                                                            95th percentile  12,900   59,700   22,400   1,100,000  0.523
Western × Eastern chimpanzees    26       20       20       Mode             10,800   24,700   11,000   282,000    0.425
                                                            5th percentile   8,040    18,600   2,270    230,000    0.143
                                                            95th percentile  21,100   71,800   21,900   1,210,000  2.622
Central × Eastern chimpanzees    26       20       20       Mode             14,400   8,590    46,000   219,000    0.797
                                                            5th percentile   8,560    5,070    33,500   143,000    0.084
                                                            95th percentile  22,300   12,700   75,100   1,400,000  1.389
While the estimates are generally consistent across pairwise analyses, the effective
population size estimates for Central and Eastern chimpanzees are not. In both analyses
of Central chimpanzees and bonobos and of Central and Eastern chimpanzees,
the effective population size of Central chimpanzees is estimated to be 15,000−22,000
(consistent with the results of Won and Hey, 2005). However, a larger population size
estimate is obtained from the analysis of Western and Central chimpanzees. Similarly,
in both analyses of bonobos and Eastern chimpanzees and of Western and Eastern
chimpanzees, estimates of the effective population size of Eastern chimpanzees are
20,000−25,000, while in the analysis of Eastern and Central chimpanzees, the
estimate is smaller. These discrepancies may reflect complex histories of chimpanzee
populations not captured by the model (see the goodness-of-fit test below). For
example, analyses of other data sets suggest that Central chimpanzees may have
experienced a recent population expansion (Fischer et al., 2004; D. Reich, personal
communication).
[Figure 3.7, panels (a) and (b): for each analysis, posterior densities of the population sizes (the two sampled populations and the ancestral population), the migration rate and the split time.]
Figure 3.7: Smoothed marginal posterior distributions estimated by MIMAR from the gorilla (a) and orangutan (b) subspecies polymorphism data. See Methods and legend of Figure 3.5 for details. a) Distributions for the analysis of Western and Eastern gorillas. The apparent multi-modality of the marginal posterior distribution estimated for the split time was also noted by Thalmann et al. (2006). b) Distributions for the analysis of Sumatran and Bornean orangutans. Note that the posterior distribution for the split time is rather flat, suggesting that the data do not contain enough information about this parameter.
Gorilla subspecies, Western (Gorilla gorilla) and Eastern gorillas (G.
beringei)
Figure 3.7a shows the posterior distributions of the five parameters of the parapatric
model of speciation. Assuming 15 years per generation and a mutation rate of
2 × 10^−8 per base pair per generation (see Methods), the estimates of the effective
population sizes for Western and Eastern gorillas and their ancestral population are
∼9,000, ∼8,000 and ∼27,000, respectively (see Table 3.5). The divergence time estimate
between Western and Eastern gorilla subspecies is ∼92 Kya and the migration
rate, M = 4N1m, is ∼0.87 (where N1 is the effective population size of Western
gorillas).
To compare our estimates to those recently obtained by Thalmann et al. (2006)
using IM, we considered their mutation rate estimate (1.44 × 10^−8 per base pair per
generation). Our estimates of the effective population sizes of Western gorillas and
ancestral population and the split time are of the same order (∼13,000 vs. 17,500,
∼37,000 vs. 42,000 and 92 vs. 78 Kya), but our estimate of the effective population
size of Eastern gorillas is larger (11,000 vs. 3,000). Whether this discrepancy reflects
differences in the use of summaries vs. the whole data, in allowing vs. ignoring
intra-locus recombination or in the prior distributions is unclear.
Table 3.5: Results for gorilla and orangutan subspecies. See legend of Table 3.3 for details.

Analysis                         Loci     n1       n2       Estimate         N1       N2       NA       T*         M
Western × Eastern gorillas       15       30       6 (2)    Mode             9,130    8,140    26,400   91,500     0.867
                                                            5th percentile   5,090    3,570    5,990    84,300     0.282
                                                            95th percentile  14,100   18,100   49,100   1,440,000  2.059
Sumatran × Bornean orangutans    19       12       20 (18)  Mode             17,200   10,200   86,900   1,390,000  0.868
                                                            5th percentile   10,200   6,230    52,400   254,000    0.361
                                                            95th percentile  26,600   15,000   143,000  1,900,000  2.235
Orangutan subspecies, Sumatran (Pongo pygmaeus abelii) and Bornean
orangutans (P. p. pygmaeus)
The posterior distributions of the five parameters of the parapatric model of speciation
are shown in Figure 3.7b. Assuming 20 years per generation and a mutation
rate of 2 × 10^−8 per base pair per generation (see Methods), the estimates of the
effective population sizes for Sumatran and Bornean orangutans and their ancestral
population are ∼17,000, ∼10,000 and ∼87,000, respectively (see Table 3.5). The
estimate of the symmetrical migration rate, M = 4N1m, is ∼0.87 (where N1 is the
effective population size of Sumatran orangutans). The data further suggest that the
split time for Sumatran and Bornean orangutan populations is likely to be older than
250 Kya. However, the data (19 loci) do not appear to carry much information about
this parameter (see the posterior distribution estimate in Figure 3.7b), and in particular,
the mode of the posterior distribution, 1.4 Mya, is likely to be an unreliable
estimate of the split time.
Since the islands of Borneo and Sumatra were connected during the last two
glaciations, ∼130−200 Kya and ∼10−100 Kya, it is not surprising to find evidence
of gene flow between those two populations. Our results further suggest that the
Sumatran and Bornean orangutan populations diverged before the second to last
Ice Age. To our knowledge, this analysis provides the first estimates of population
parameters for the two orangutan subspecies.
Goodness-of-fit test
To examine whether the isolation-migration model is an appropriate description of
the history of the ape species and subspecies, we generated simulated data sets for pa-
rameters sampled from the posterior distributions estimated by MIMAR, and compared
the simulated data to what is observed for a number of statistics. Encouragingly, the
isolation-migration model appears to provide a reasonable fit to the four statistics
used in the inferences of MIMAR as well as to the mean FST, π and Tajima’s D across
loci (Supplemental Figure 3.10, and see Methods for details). The one exception is
for Central and Eastern chimpanzees (Supplemental Figure 3.10f): there is a poor fit
to FST and to Tajima’s D for the Central chimpanzees (see also Supplemental Figure
3.10b). This suggests either that an isolation-migration model is not appropriate for
these subspecies or that a crucial demographic feature is missing from the model.
Given the proximity of Central and Eastern chimpanzees and their low FST, one
possibility is that, rather than a split model, a model of isolation-by-distance is more
appropriate (Fischer et al., 2006). Interestingly, though, there does not appear to be
substantial gene flow between the Eastern and Central ranges (see the estimates of
the migration rate in this study and Becquet et al., 2007). We also find that, while
the model fits most aspects of the bonobo data quite well, the observed Tajima’s
D is lower than expected (Supplemental Figure 3.10a−c), perhaps reflecting recent
demographic events in bonobos not included in the model.
3.4 Discussion
3.4.1 Advantages and limitations of MIMAR
We have developed a new method to estimate parameters of simple allopatric and
parapatric speciation models. It considers summaries of the polymorphism data from
each locus, rather than the entire data set. Extensive simulations, and comparisons to
IM for the case of no recombination, suggest that the use of these summaries provides
accurate and precise estimates of parameters of interest from data sets comparable
in size to those analyzed to date (see Table 3.1).
The method presents the important advantage of allowing for intra-locus recom-
bination. This feature makes the approach applicable to autosomal data, even in
species where the ratio of recombination to mutation events is high (ρ/θ ≫ 1), such
as in Drosophila (Andolfatto and Wall, 2003) and Papilio (Putnam et al., 2007) and
hence where any segment containing polymorphisms is likely to have experienced re-
combination in its genealogical history. In contrast, when applied to recombining
regions, IM requires one to exclude loci that show evidence of recombination and
assumes that no recombination occurred at the other loci, potentially biasing the
estimates.
In other respects, the model of speciation that we consider is more restrictive than
the one used in IM. Mutation rates for each locus are estimated from divergence data
and then fixed, rather than co-estimated along with other parameters (see Methods).
We set the migration rate, m, to be symmetric between populations, which may be
inappropriate. Finally, we assume that the distribution of coalescent times only varies
across loci due to differences in the mode of inheritance, and therefore that it can be
specified a priori. In contrast, IM allows one to estimate inheritance scalars for each
locus from the data, which may be important if a subset of loci have experienced
recent selection (Hey and Nielsen, 2004). Our model could readily be extended to
allow for these features, notably for asymmetric migration rates (in fact, the MIMAR
program that we make available already allows for this feature). However, the data
from a given locus carry limited information, and it is unclear how many parameters
can reliably be estimated, even using all the information. Indeed, our simulations
suggested that IM and MIMAR estimates of the migration rate from a small data set
can be unreliable even in the absence of these complications (see Results).
In its current implementation, our method is also limited in the type of data
that it can consider, as it is not applicable to surveys of variation that suffer from
ascertainment bias. Moreover, it assumes an infinite sites model, so only two alleles
can be present at a given site. As long as the ascertainment bias and mutation model
are known, however, it should be reasonably straightforward to extend the model to
consider these cases (Nielsen and Signorovitch, 2003). MIMAR is further intended for
use on resequencing data from short, independently-evolving loci, in which there is
little information about how genealogies change along the genome, or viewed another
way, about linkage disequilibrium (McVean et al., 2004), and for which it is reasonable
to assume that recombination rates are uniform. Applying MIMAR to longer stretches
of sequence may require a change in the model of recombination to capture fine-
scale heterogeneity in recombination rates. In that setting, it may also be helpful to
consider summaries of linkage disequilibrium in addition to the four statistics used
here. More generally, our approach could be extended to consider a number of other
aspects of the data. For instance, one could consider the number of singletons in each
population (in addition to the four current statistics), or the joint frequency spectrum
in two population samples.
In addition to improving the inference method, it will also be important to consider
more realistic models of speciation. For example, detailed studies of closely related
species reveal that many apparent cases of parapatry may in fact reflect allopatric
speciation followed by secondary contact (Coyne and Orr, 2004a; Llopart et al., 2005).
One approach to distinguishing between the two scenarios might be to allow migration
between diverging populations to stop at different time points, and estimate which
times are most likely given the polymorphism data. Similarly, for sets of species (or
populations) that split over a short time period, it may be important to consider
more than two species at a time (Wall, 2000; Degnan and Rosenberg, 2006; Pollard
et al., 2006).
Another salient feature, ignored in existing methods, may be population structure
in the ancestral population. Indeed, in many of our analyses of ape data, as well as
in most analyses of the isolation-migration model published to date (e.g., Hey et al.,
2004; Hey, 2005; Won and Hey, 2005; Thalmann et al., 2006), the estimate of the
ancestral effective population size is larger than that of the descendant populations.
Since it seems unlikely that so many populations have shrunk over time, this suggests
that a salient and fairly common demographic feature is being ignored. One possibility
is that the assumption of a panmictic ancestral population is inappropriate. If so, it
may be relevant to consider a model of population structure in which a geographic
barrier becomes stronger over time (e.g., Innan and Watanabe, 2006). In this respect,
an attractive feature of our method is that it is easy to generalize to other demographic
settings (see Methods).
Finally, our approach could also be extended to scan the genome for regions that
contribute to reproductive isolation (Won et al., 2005; Bull et al., 2006; Geraldes et al.,
2006; Miller et al., 2006). Indeed, models of parapatric speciation predict that loci
involved in the formation of species will experience no or little gene flow since the split,
and therefore have more fixed differences and fewer shared alleles than do background
loci. Moreover, theoretical results suggest that, in this setting, and unless selection is
very strong, regions of marked differentiation should be relatively short (Barton and
Bengtsson, 1986). Thus, identifying regions with evidence for decreased gene flow
should be an effective way to find the specific loci that contribute to reproductive
isolation. This idea has been implemented by estimating gene flow for each locus
separately (Won et al., 2005). However, this approach may have limited power to
detect loci with reduced gene flow. An alternative may be to use the goodness-of-fit
test results for individual loci to identify outliers that behave as expected if they
contributed to reproductive isolation.
3.4.2 Analyses of ape polymorphism data
Analyses of genetic polymorphism data from apes can help to characterize the ge-
ographic distribution of variation (e.g., Becquet et al., 2007), shed light on their
demographic history and place the evolutionary history of humans in context (Stone
and Verrelli, 2006). Here, we considered the largest set of polymorphism data to date
for all three species of non-human great apes, and estimated parameters of a simple
isolation-migration model. Using a goodness-of-fit test, we find that this model pro-
vides a reasonable point of departure for analyzing ape data, other than for Eastern
and Central common chimpanzees.
The use of the model suggests that the effective population sizes of the ape
populations range from 8,000 to 33,000, on the same order as estimates for human
populations (10,000−15,000; Frisse et al., 2001; Voight et al., 2005). In contrast, the
subspecies split times appear to be older than those of human populations (Cavalli-
Sforza and Feldman, 2003; Goebel, 2007), ranging from 92−440 Kya.
We find no evidence for gene flow since the split for chimpanzee species (with the
possible exception of Eastern chimpanzees and bonobo), consistent with the results
of Won and Hey (2005), but do detect limited migration (M≤1) for all ape sub-
species. The split time estimate for chimpanzee species is 790−920 Kya, suggesting
that speciation occurred after the formation of the River Congo, 1.5−3.5 Mya. These
estimates do not take into account possible error in the mutation rate per year. But
even if we consider a time to the most recent common ancestor between human and
chimpanzee at the upper limit of what has been estimated so far, 8 Mya, and a generation
time of only 15 years, the central 95th percentile for the split time is 1−2.3
Mya. Moreover, the recent finding of a chimpanzee fossil in Kenya indicates that
common chimpanzees may have occupied a much wider range than inferred on the
basis of their current distribution (McBrearty and Jablonski, 2005). Thus, existing
data support a more recent speciation time for common chimpanzees and bonobos,
which may have occurred outside of their current habitats.
More generally, this application illustrates how the increasing availability of multi-
locus polymorphism data sets, together with development of novel statistical ap-
proaches, can yield insights into speciation, both in apes and in other organisms.
3.5 Methods
3.5.1 Model
We consider a neutral model in which an ancestral population suddenly splits into
two populations, which either diverge in isolation or continue to exchange migrants
(Figure 3.1). We further assume that n1 and n2 chromosomes have been sampled from
two populations and fully resequenced at Y randomly chosen, independently-evolving
loci.
The population model, often called “isolation-migration”, is described by the pop-
ulation split time in generations, T , and three population mutation rates, θ1 = 4N1µ,
θ2 = 4N2µ and θA = 4NAµ (Figure 3.1). Throughout, the subscripts 1, 2 and A refer
to parameters that describe populations 1, 2 and the ancestral population, respec-
tively. Following IM, we assume that there is an independent estimate of the average
mutation rate across loci, µ, which can be used to estimate the effective population
sizes from the population mutation rates (e.g., as N1 = θ1/4µ). In addition, there
is a symmetric migration rate, m, which corresponds to the fraction of a population
that is replaced by migrants from the other population each generation.
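
As a worked illustration of these definitions, the following minimal Python sketch converts a per-base-pair population mutation rate into an effective population size (N = θ/4µ), a split time in generations into years, and a per-generation migration fraction into the scaled rate M = 4N1m. The function names and the input values of θ1, T and m are hypothetical and chosen only to make the arithmetic easy to follow; the mutation rate and generation time are the values assumed for chimpanzees in the Results.

```python
def effective_size(theta_per_bp, mu_per_bp_per_gen):
    """Effective population size from theta = 4 * N * mu."""
    return theta_per_bp / (4.0 * mu_per_bp_per_gen)


def split_time_years(t_generations, years_per_generation):
    """Convert a split time in generations into years."""
    return t_generations * years_per_generation


def scaled_migration(n1, m):
    """Scaled migration rate M = 4 * N1 * m."""
    return 4.0 * n1 * m


mu = 2e-8          # mutation rate per base pair per generation (from the text)
theta1 = 8e-4      # hypothetical per-base-pair theta for population 1
n1 = effective_size(theta1, mu)           # close to 10,000
t_years = split_time_years(40000, 20)     # hypothetical T of 40,000 generations
big_m = scaled_migration(n1, 2e-5)        # hypothetical m of 2e-5 per generation
print(round(n1), t_years, round(big_m, 3))
```

Because all three quantities scale with the assumed mutation rate or generation time, any error in those values carries over proportionally to the estimates reported in years or in numbers of individuals.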
The parameters θ1, θ2 and θA are defined per base pair and are chosen from
uniform distributions; the time in generations, T , is also chosen from a uniform
distribution. The prior for the migration rate is on the expected number of individuals
in population 1 replaced by migrants (backward in time), M = 4N1m, where N1 is
obtained from θ1 by dividing by 4µ̂ (µ̂ is the estimate of µ). Specifically, ln(M) is
chosen from a uniform distribution.
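
The prior described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the dictionary keys and the bounds passed in are hypothetical and are not the settings or interface of the MIMAR program. The sketch draws the three θ values and T from uniform distributions, draws ln(M) uniformly, and recovers the per-generation migrant fraction m from M using N1 = θ1/4µ̂.

```python
import math
import random


def sample_prior(rng, theta_bounds, t_bounds, ln_m_bounds):
    """Draw one parameter set: uniform thetas and T, uniform ln(M)."""
    theta1 = rng.uniform(*theta_bounds)
    theta2 = rng.uniform(*theta_bounds)
    theta_a = rng.uniform(*theta_bounds)
    t_gen = rng.uniform(*t_bounds)
    m_scaled = math.exp(rng.uniform(*ln_m_bounds))  # uniform on the log scale
    return {"theta1": theta1, "theta2": theta2, "thetaA": theta_a,
            "T": t_gen, "M": m_scaled}


def migrant_fraction(m_scaled, theta1, mu_hat):
    """m = M / (4 * N1), with N1 = theta1 / (4 * mu_hat)."""
    n1 = theta1 / (4.0 * mu_hat)
    return m_scaled / (4.0 * n1)


rng = random.Random(1)
# Hypothetical bounds, for illustration only.
draw = sample_prior(rng, (1e-4, 1e-2), (1e3, 1e5),
                    (math.log(0.01), math.log(10.0)))
print(draw, migrant_fraction(draw["M"], draw["theta1"], 2e-8))
```

A uniform prior on ln(M) spreads prior mass evenly across orders of magnitude of M, which is one way to treat very low and moderate migration rates even-handedly.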
In addition to the five demographic parameters, there are a number of locus-
specific parameters. We assume that each locus follows the infinite sites mutation
model (Kimura, 1969), then define an inheritance scalar, u, which, for example, is
equal to 1 for autosomal, 3/4 for X-linked and 1/4 for Y- and mtDNA-linked loci.
To allow for mutation rate variation among loci with the same mode of inheritance,
we introduce an additional scalar, v, for each locus. Given this parameterization,
the locus-specific mutation rate in population 1 is given by uvZθ1, where Z is the
length of the locus in base pairs; the locus-specific population mutation rates for other
populations are defined analogously.
The population recombination rate per base pair is defined as ρ = 4N1c, where c
is the per base pair per generation recombination rate. We ignore gene conversion,
treating all recombination as crossovers alone. We also define an inheritance scalar
for recombination, w (w = 0 for the mtDNA and Y, 2/3 for X and 1 for autosomes).
We then consider three options to specify the locus-specific population recombination
rate. The first is to fix ρ across loci, such that the population recombination rate at a
locus is wZρ. The second, if an estimate, ρ̂, of the population recombination rate is
available for each locus, is to set the scalar w to the inheritance scalar for recombination
multiplied by ρ̂ to incorporate this knowledge in the estimation. The final option is
to allow rates to vary for each locus, in which case the locus-specific population
recombination rate is r · wZθ1 and we draw the ratio r = ρ/θ1 from an exponential
distribution with mean λ for each locus. Thus, we allow for rate variation among loci,
but assume a constant rate within a locus. This model should be a sensible description
of the rate variation if the loci are short (e.g., 1 kb), as in most data sets collected
to date. The set of locus-specific population recombination rates, (ρ1, · · · , ρY ), is
referred to as P.
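
The locus-specific rates defined in this subsection can be illustrated with a short Python sketch. It computes the locus-specific population mutation rate uvZθ1 and, for the third recombination option, draws the ratio r = ρ/θ1 from an exponential distribution with mean λ to obtain r · wZθ1. The function names and the example values are hypothetical.

```python
import random


def locus_mutation_rate(u, v, z, theta1):
    """Locus-specific population mutation rate u * v * Z * theta1."""
    return u * v * z * theta1


def locus_recombination_rate(w, z, theta1, lam, rng):
    """Draw r = rho / theta1 from an exponential with mean lam; return r * w * Z * theta1."""
    r = rng.expovariate(1.0 / lam)  # expovariate takes the rate, i.e. 1 / mean
    return r * w * z * theta1


rng = random.Random(42)
# Hypothetical X-linked locus of 1 kb: u = 3/4 and w = 2/3, as defined in the text.
theta_locus = locus_mutation_rate(0.75, 1.0, 1000, 0.001)
rho_locus = locus_recombination_rate(2.0 / 3.0, 1000, 0.001, 1.0, rng)
print(theta_locus, rho_locus)
```

With w = 0 (mtDNA or Y-linked loci) the recombination rate is zero whatever value of r is drawn, so non-recombining loci need no special case.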
3.5.2 Data summaries
Our goal is to estimate the parameters of the isolation-migration model illustrated in
Figure 3.1. We do so by estimating the posterior distribution π(Θ|D) ∝ p(D|Θ)p(Θ),
where Θ = (θ1, θ2, θA, T,M,P), D is the data and p(Θ) denotes the prior distribu-
tion. Unfortunately, when D is the entire polymorphism data set under our model,
estimating the likelihood of the data given the parameters, p(D|Θ), is computation-
ally extremely intensive, and becomes prohibitive when recombination is included
(Nielsen and Wakeley, 2001; Hey and Nielsen, 2004). In their program IM, Hey
and Nielsen (2004) address this problem by considering the full data set and using
a Markov Chain Monte Carlo (MCMC) approach, but restricting themselves to a
model with no intra-locus recombination (i.e., P = 0). Instead, we focus on a model
with intra-locus recombination, but summarize the polymorphism data from each lo-
cus with the summary statistics described below. To do so, we initially explored an
importance sampling approach, which provided reliable estimates but was inefficient.
We then implemented an MCMC algorithm, which is more efficient than our initial
algorithm when the prior and posterior distributions differ substantially.
To summarize the data, we use the statistics introduced by Wakeley and Hey
(1997) for this type of inference problem: for each locus, we consider the number of
polymorphisms unique to the samples from populations 1 and 2 (S1 and S2 respec-
tively), the number of shared alleles between the two samples (S3), and the number
of fixed alleles in either sample (S4). Previous work has shown that these statistics
contain considerable information about the demographic parameters of the isolation-
migration model (e.g., Clark, 1997; Wakeley and Hey, 1997; Hudson and Coyne,
2002; Leman et al., 2005). In what follows, we refer to the vector of summaries, Sk,
k ∈ [1, 4], for locus y as Dy. In turn, we refer to the set of statistics for the Y loci as
D = (D1, · · · ,DY).
In calculating these statistics, we assume that an outgroup sequence is available
and can be used to determine which allele is derived without error. We note that,
in practice, it may be advisable to use two outgroup sequences to minimize error in
inferring the ancestral state. We assign each polymorphic site to one of the statistics
depending on the frequency of the derived allele in population i, fi. Specifically,
if 0 < fi ≤ 1 in each population sample, the allele is shared; if fi = 0 and fj = 1, i ≠ j,
the allele is fixed in sample j; and if fi = 0 and fj < 1, i ≠ j, the allele is specific
to sample j. The statistics are easy to calculate, and do not require determination of
haplotypes.
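
The classification rules above can be sketched in a few lines of Python. The input format (per-site counts of the derived allele in each sample) and the function name are assumptions made for illustration. The sketch also assumes that sites where the derived allele is absent from both samples, or fixed in both, are not polymorphic and are skipped.

```python
def summarize_locus(counts1, counts2, n1, n2):
    """Return (S1, S2, S3, S4) from derived-allele counts at each site.

    counts1[i] and counts2[i] are the numbers of chromosomes carrying the
    derived allele at site i in samples of size n1 and n2.
    """
    s1 = s2 = s3 = s4 = 0
    for c1, c2 in zip(counts1, counts2):
        if c1 == 0 and c2 == 0:
            continue  # derived allele absent: not a polymorphic site
        if c1 == n1 and c2 == n2:
            continue  # derived allele fixed in both samples (assumed skipped)
        if c1 > 0 and c2 > 0:
            s3 += 1   # derived allele present in both samples: shared
        elif c1 == n1 or c2 == n2:
            s4 += 1   # absent from one sample, fixed in the other
        elif c1 > 0:
            s1 += 1   # specific to sample 1
        else:
            s2 += 1   # specific to sample 2
    return s1, s2, s3, s4


# Hypothetical locus with six sites, n1 = 4 and n2 = 5 chromosomes.
print(summarize_locus([3, 0, 2, 4, 0, 4], [0, 1, 1, 0, 0, 5], 4, 5))
```

Only allele counts per sample are needed, which is why the statistics do not require haplotype determination.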
3.5.3 Estimation method
Our goal is to sample from the posterior distribution, π(Θ|D) ∝ p(D|Θ)p(Θ), which
is the likelihood of the data summaries given the parameters times the prior dis-
tributions of the parameters. The parameters are initially chosen from these prior
distributions and subsequently updated using MCMC, which requires information
about the likelihood of the data given the parameters. Very briefly, our strategy is
to estimate the likelihood of the data summaries at all the loci for a chosen set of
parameters by, for each of the Y loci, i) Generating a set of X ancestral recombi-
nation graphs (ARGs) (Hudson, 1983) given the parameters and ii) Calculating the
probability of the data summaries given the set of ARGs. Specifically, we estimate
the likelihood p(D|Θ) as:
\[ \prod_{y=1}^{Y} \frac{1}{X} \sum_{x=1}^{X} p(D_y \mid G_{yx}, \Theta)\, p(G_{yx} \mid \Theta), \tag{3.1} \]
where Gyx is the xth ARG at locus y (Hudson, 1983). In other words, we estimate
p(D|Θ) by taking the average of p(Dy|Gyx,Θ) over X ARGs, then taking the product
over loci (since they are assumed to be independent). The term p(Gyx|Θ) is given
by the coalescent, using a modified version of the program ms (Hudson, 2002).
We can readily calculate p(Dy|Gyx,Θ). Given a coalescent genealogy, Gyx, we
compute the sum of the lengths of all the branches (in coalescent units), which would
lead to unique polymorphisms in samples 1 and 2 (L1 and L2, respectively), alleles
shared by both samples (L3) and alleles fixed in either sample (L4). Given the
infinite sites mutation model, the number of mutations, Sk, randomly placed along the
branches of type k ∈ [1, 4] is Poisson distributed with mean LkuvZθ1. Conditional
on a genealogy, the probabilities of observing S1, S2, S3 and S4 are independent, so the
probability of the data Dy for locus y is given by:
\[ p(D_y \mid G_{yx}, \Theta) = \prod_{k=1}^{4} \mathrm{Po}(S_k \mid L_k u v Z \theta_1). \tag{3.2} \]
Equation 3.2 also applies to a recombining locus, but in this case, Gyx is an ARG and
Lk is computed as follows: with recombination, a locus of size Z has R segments of
length Zj, j ∈ [1, R], with different genealogical histories. The genealogy of segment j
has branch length Ljk, such that Lk = ∑_{j=1}^{R} Ljk Zj/Z for the ARG.
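
Equation 3.2 and the segment-weighted branch lengths can be illustrated with the following Python sketch. The branch lengths are supplied directly as hypothetical inputs; in MIMAR they come from genealogies simulated with a modified version of ms, which is not reproduced here. The function names are likewise illustrative.

```python
import math


def poisson_pmf(k, mean):
    """Poisson probability of k events; a zero mean is handled explicitly."""
    if mean == 0.0:
        return 1.0 if k == 0 else 0.0
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))


def weighted_branch_lengths(segments, z):
    """L_k = sum_j L_jk * Z_j / Z over segments given as (Z_j, (L_j1, ..., L_j4))."""
    totals = [0.0, 0.0, 0.0, 0.0]
    for z_j, lengths in segments:
        for k in range(4):
            totals[k] += lengths[k] * z_j / z
    return totals


def summary_likelihood(s, lengths, u, v, z, theta1):
    """p(D_y | G_yx, Theta): product over k of Po(S_k | L_k * u * v * Z * theta1)."""
    p = 1.0
    for s_k, l_k in zip(s, lengths):
        p *= poisson_pmf(s_k, l_k * u * v * z * theta1)
    return p


# Hypothetical 1 kb locus whose ARG has two segments of 400 and 600 bp.
segments = [(400, (1.0, 2.0, 0.0, 0.0)), (600, (2.0, 1.0, 0.0, 1.0))]
lengths = weighted_branch_lengths(segments, 1000)
print(lengths, summary_likelihood((2, 1, 0, 1), lengths, 1.0, 1.0, 1000, 0.001))
```

When a genealogy has no branch of a given type (here L3 = 0) but the corresponding statistic is positive, the probability is exactly zero, so that genealogy contributes nothing to the average over ARGs.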
Our prior distributions for the parameters, p(Θ), are uniform over a bounded
support (except for P and a uniform prior on ln(M)). For the MCMC, we use random
walk Metropolis transition kernels to propose parameter values, so that the proposed
value of a parameter is taken from a normal distribution with mean its previous value
and variance defined to maximize the acceptance rate (after exploratory simulations;
Gilks et al., 1996). If a parameter value lies outside the support of the prior, the
proposed set of parameters is rejected. In turn, P is a nuisance parameter and its
values are either fixed (when ρ is fixed), or drawn from the distribution described
above (see Model, subsection 3.5.1); in the MCMC, the values of P are sampled
independently at each step from the prior.
Our approach follows a number of Bayesian methods based on summaries of the
data, developed in other contexts (e.g., Tavare et al., 1997; Pritchard et al., 1999;
Beaumont et al., 2002; Przeworski, 2003). It differs in that we update the parameters
using MCMC rather than sampling them independently from the prior. This gen-
eral approach was described by Beaumont (2003). As pointed out to us by Matthew
Stephens (personal communication), our approach can also be viewed as a MCMC on
the set of all genealogies, G = (G11, · · · , G1X , · · · , GY 1, · · · , GY X), and the param-
eters. In this case, the X ARGs are independent samples from the coalescent prior
across the Y independent loci. Thus, for the MCMC, the set of ARGs is updated
using the transition kernel q(G → G′) = p(G′|Θ), while the parameters of interest
are updated using Metropolis transition kernels. We sample sets (G,Θ) from the
following target distribution:
\[ \pi(G, \Theta) \propto \left[ \prod_{y=1}^{Y} \frac{1}{X} \sum_{x=1}^{X} p(D_y \mid G_{yx}, \Theta) \right] p(\Theta)\, p(G \mid \Theta). \tag{3.3} \]
The marginal distribution of sampled values of Θ from the target distribution is
π(Θ|D) (as shown in the Supplemental Materials; see also the Appendix of Beaumont,
2003). A nice feature of viewing our approach in this way is that it demonstrates
that the stationary distribution of the Markov Chain is the correct distribution, i.e.,
that we are exploring the true posterior distribution rather than an approximation.
MIMAR − MCMC estimation of the Isolation-Migration model Allowing for
Recombination
To sample from the target distribution π(G,Θ), we use an MCMC approach
(MIMAR). In the initial step, Θ is chosen from the prior, p(Θ), and G is sampled
from the coalescent with those parameters. Subsequent sets (G,Θ) are updated
following a Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970).
More specifically, we proceed as follows:
I1. If now at (G,Θ), propose a move to (G′,Θ′) according to the transition kernels
q(Θ → Θ′) and q(G → G′) (i.e., generate X ARGs given the parameters Θ′
for each of the Y loci).
I2. For the yth locus:
a. Calculate p(Dy|G′yx,Θ′) for each of the X ARGs using Equation 3.2.
b. If the average of p(Dy|G′yx,Θ′) over the X ARGs is 0, record (G,Θ) and
go to I1 ; else go to I2a for the locus y + 1.
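I3. Calculate:
\[
h = \min\left(1, \frac{A'}{A}\right), \tag{3.4}
\]
where
\[
A' = \left[\prod_{y=1}^{Y} \frac{1}{X} \sum_{x=1}^{X} p(D_y \mid G'_{yx},\Theta')\right] p(\Theta')\, p(G' \mid \Theta')\, q(\Theta' \to \Theta)\, q(G' \to G),
\]
\[
A = \left[\prod_{y=1}^{Y} \frac{1}{X} \sum_{x=1}^{X} p(D_y \mid G_{yx},\Theta)\right] p(\Theta)\, p(G \mid \Theta)\, q(\Theta \to \Theta')\, q(G \to G').
\]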
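I4. Move to (G′,Θ′) with probability h (i.e., record (G′,Θ′)) or else record (G,Θ).
Return to I1.
Our choice of proposal distribution for G and P, of normal kernel distributions and of
uniform prior distributions for the parameters of interest simplifies Equation 3.4.
Because genealogies are proposed from the coalescent prior, q(G → G′) = p(G′|Θ′), the
p(G|Θ) and q(G → G′) terms cancel; the normal kernels are symmetric, so the
q(Θ → Θ′) terms cancel; and the uniform priors are constant over their support.
Equation 3.4 therefore reduces to: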
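\[
h = \min\left(1, \frac{\prod_{y=1}^{Y} \left(\frac{1}{X} \sum_{x=1}^{X} p(D_y \mid G'_{yx},\Theta')\right)}{\prod_{y=1}^{Y} \left(\frac{1}{X} \sum_{x=1}^{X} p(D_y \mid G_{yx},\Theta)\right)}\right). \tag{3.5}
\]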
In practice, we consider X = 100 (or X = 50 or 5, see below), thus generating
100 (50 or 5) ARGs given the locus-specific parameters. Generating so many ARGs
is computationally demanding, but we find that this approach has improved mixing
over X = 1.
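Steps I1−I4 and Equation 3.5 can be illustrated with a minimal Python sketch. It is not the MIMAR implementation: the coalescent simulation of ARGs is replaced by a toy latent variable (one draw g ~ Normal(Θ, 1) per "ARG"), p(Dy|Gyx,Θ) by a normal density, and all function names, the toy data and the tuning values are assumptions made for illustration only. The sketch shows how the likelihood averaged over X latent draws enters the Metropolis-Hastings ratio under a uniform prior and a symmetric normal kernel.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Normal density; a toy stand-in for p(D_y | G_yx, Theta)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def averaged_likelihood(data, theta, n_latent, rng):
    """Product over loci of the likelihood averaged over n_latent latent draws.

    Each latent draw g ~ Normal(theta, 1) plays the role of one ARG sampled
    from p(G | Theta). Returns 0.0 as soon as one locus averages to 0
    (compare step I2b).
    """
    product = 1.0
    for d_y in data:
        total = 0.0
        for _ in range(n_latent):
            g_yx = rng.gauss(theta, 1.0)
            total += normal_pdf(d_y, g_yx, 1.0)
        product *= total / n_latent
        if product == 0.0:
            return 0.0
    return product


def run_chain(data, n_steps, n_latent, prior_lo, prior_hi, kernel_sd, seed):
    """Metropolis-Hastings on theta with the acceptance ratio of Equation 3.5."""
    rng = random.Random(seed)
    # Initial step: theta from the uniform prior on (prior_lo, prior_hi].
    theta = prior_hi - (prior_hi - prior_lo) * rng.random()
    current = averaged_likelihood(data, theta, n_latent, rng)
    recorded = []
    for _ in range(n_steps):
        proposal = rng.gauss(theta, kernel_sd)  # symmetric normal kernel
        if prior_lo < proposal <= prior_hi:     # prior density is 0 outside
            candidate = averaged_likelihood(data, proposal, n_latent, rng)
            # Accept with probability min(1, candidate / current),
            # written without a division so that current == 0 is safe.
            if rng.random() * current < candidate:
                theta, current = proposal, candidate
        recorded.append(theta)
    return recorded


# Hypothetical toy data: one observation per "locus".
toy_data = [0.8, 1.2, 1.0, 0.9, 1.1]
chain = run_chain(toy_data, n_steps=5000, n_latent=10,
                  prior_lo=-4.0, prior_hi=6.0, kernel_sd=0.5, seed=1)
kept = chain[1000:]  # discard burn-in steps
posterior_mean = sum(kept) / len(kept)
print(posterior_mean)
```

In this toy model the marginal likelihood is available in closed form (each observation is Normal(Θ, variance 2)), so the sampled values of Θ can be compared with the exact posterior, which is centered near the mean of the toy data.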
We note that our approach presents the advantage of being flexible, since it can
easily be extended to consider any summaries for which p(Dy|Gyx,Θ) can be readily
calculated, such as the allele frequency spectrum at each locus.
MIMAR and its documentation are available at http://mplab.bsd.uchicago.edu/dataNprograms.htm
and http://pps-spud.uchicago.edu/~cbecquet/download.html.
Assessing the performance of the estimator
To assess the performance of our method, we ran MIMAR on simulated data sets
with two independent seeds (see below). We considered that MIMAR reached convergence
when the posterior distributions from the two independent runs were highly
similar (e.g., Supplemental Figure 3.9). (In the documentation provided with MIMAR,
we describe a number of other criteria that can be used to assess convergence and
proper mixing.) We took the mode and the central 95% interval of the marginal
posterior distribution averaged over the two independent runs as the point estimate
and measure of uncertainty, respectively.
3.5.4 Simulated data and performance analyses
We generated simulated data sets under the isolation-migration model using a modified
version of the program ms (Hudson, 2002), called MIMARsim and available at
http://pps-spud.uchicago.edu/~cbecquet/download.html. Unless otherwise indicated,
we considered 20 loci of 1 kb each, and sampled 20 chromosomes from each of
two populations.
Performance of MIMAR under allopatry. We generated 30 simulated data sets
with no recombination and fixed parameter values relevant for Drosophila yakuba and
D. santomea (Llopart et al., 2005), assuming a per base pair per generation mutation
rate of µ = 2 × 10^-9 and 20 generations per year (Andolfatto and Przeworski, 2000).
We analyzed the 30 simulated data sets for 60 hours with 1 × 10^5 burn-in steps and
prior distributions as indicated (see Supplemental Figure 3.8).
Comparison to IM under allopatry. In order to compare our estimates with
those generated by IM (Hey and Nielsen, 2004), which does not allow for intra-locus
recombination, we set the population recombination rate, ρ, to 0. To be comparable to
IM, we also chose uniform prior distributions with 0 as the lower limit. We generated
30 simulated data sets with parameters relevant for Drosophila species as above,
setting M to 0 and drawing the other parameters from prior distributions: θ1 and
θ2 from U(0, 0.01] and θA from U(0, 0.02] per base pair and T from U(0, 1.5 × 10^7]
generations. We analyzed those 30 simulated data sets with MIMAR and IM using the
same prior distributions as used when simulating the data sets, 4 × 10^6 recorded steps
and 5 × 10^5 burn-in steps.
Assessing the evidence for gene flow. We generated 40 data sets, consisting of
40 recombining loci with parameter values relevant for apes (see below). We assumed
that µ = 2 × 10^-8 per base pair per generation to translate coalescent time units
into generations (Nachman and Crowell, 2000). This mutation rate estimate is also
obtained assuming a most recent common ancestor of human and chimpanzee of 7
Mya and an average nucleotide divergence of 1.28% (The Chimpanzee Sequencing and
Analysis Consortium, 2005). The intra-locus recombination rate was set for each locus
by choosing r = c/µ from the prior exp(1/0.6) (assuming that the mean c is 1.2 × 10^-8;
Kong et al., 2002). The other parameter values were sampled from the following
prior distributions: θ1, θ2 and θA from U(0.0006, 0.006] per base pair and T from
U(0, 1 × 10^5] generations. M was either fixed to 0 (for 20 data sets simulated under
the allopatric model) or to 1 (for 20 data sets simulated with parapatric divergence).
We analyzed the 40 simulated data sets with MIMAR, choosing ln(M) from U[−5, 2]
and the other parameters from the same prior distributions as used when simulating
the data sets, the number of ARGs per locus set to X = 50, 4 × 10^6 recorded steps
and 5 × 10^5 burn-in steps.
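As a rough illustration of the set-up above, the following Python sketch draws one parameter set from the stated priors and checks the order of magnitude of the mutation rate. It is not MIMARsim: the function names are hypothetical, the generation time of 20 years used in the arithmetic is an assumption made here, and the divergence calculation ignores polymorphism in the ancestral population.

```python
import random


def uniform_half_open(low, high, rng):
    """Draw from U(low, high]; rng.random() lies in [0, 1), so low is
    excluded (up to floating-point rounding)."""
    return high - (high - low) * rng.random()


def draw_ape_parameters(rng, migration, n_loci):
    """One hypothetical parameter set drawn from the priors quoted in the text."""
    return {
        "theta1": uniform_half_open(0.0006, 0.006, rng),  # per base pair
        "theta2": uniform_half_open(0.0006, 0.006, rng),
        "thetaA": uniform_half_open(0.0006, 0.006, rng),
        "T": uniform_half_open(0.0, 1e5, rng),            # generations
        "M": migration,                                   # fixed to 0 or 1
        # r = c / mu for each locus, exponential with mean 0.6
        # (expovariate takes the rate, i.e., 1 / mean).
        "r": [rng.expovariate(1.0 / 0.6) for _ in range(n_loci)],
    }


rng = random.Random(7)
params = draw_ape_parameters(rng, migration=0.0, n_loci=40)

# Order-of-magnitude check of the mutation rate from human-chimpanzee divergence:
# divergence = 2 * mu_per_year * split_time, ignoring ancestral polymorphism.
divergence = 0.0128            # average nucleotide divergence (1.28%)
split_years = 7e6              # 7 Mya
years_per_generation = 20.0    # assumption made for this sketch
mu_per_year = divergence / (2.0 * split_years)
mu_per_generation = mu_per_year * years_per_generation
print(params["theta1"], params["T"], mu_per_generation)
```

Under these assumptions the per-generation rate comes out slightly below 2 × 10^-8, in line with the value used in the text.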
Effect of uncertainty in the intra-locus recombination rates. We generated
16 simulated data sets, consisting of 10 recombining loci with parameter values relevant
for Drosophila species. The intra-locus recombination rate was set for each
locus by choosing r = c/µ from the prior exp(1/10) (assuming that the mean c is
2 × 10^-10; Andolfatto and Przeworski, 2000). M was fixed to 0 and the other parameter
values were sampled from the following prior distributions: θ1, θ2 and θA from
U(0.001, 0.01] per base pair and T from U(0, 1 × 10^6] generations. We then analyzed
the data sets with MIMAR in three ways: i) the locus-specific population recombination
rates were fixed to their true values, ii) the locus-specific population recombination
rates were sampled from the same prior as used when generating the simulated data
and iii) the locus-specific population recombination rates were set to 0. For the three
sets of analyses, we fixed θ1, θ2 and θA to their true values and used the same prior
distribution for T as when generating the simulated data. MIMAR was run with X = 5
(i and ii) or X = 100 (iii), 4.5 × 10^5 recorded steps and 5 × 10^4 burn-in steps.
3.5.5 Analysis of ape polymorphism data
Polymorphism data. We analyzed the ape polymorphism data reported in Fischer
et al. (2006), Thalmann et al. (2006) and Yu et al. (2003). The first set was kindly provided
by A. Fischer (Max Planck Institute for Evolutionary Anthropology, Leipzig,
Germany); we downloaded the two other data sets from GenBank
(http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=Nucleotide). The data
from Fischer et al. (2006) (and Thalmann et al., 2006) and Yu et al. (2003) consisted
of loci of median length ∼780 bp and ∼470 bp, respectively. The data sets were
as follows (see Tables 3.3−3.5): 69 loci surveyed in nine unrelated bonobos (pygmy
chimpanzee, Pan paniscus), 26 loci in ten and 43 loci in six Western chimpanzees
(P. t. verus), 26 loci in ten and 42 in five Central chimpanzees (P. t. troglodytes),
26 loci in ten Eastern chimpanzees (P. t. schweinfurthii), 15 loci in 15 Western
gorillas (Gorilla gorilla) and three Eastern gorillas (G. beringei), and 19 loci in six
Sumatran orangutans (Pongo pygmaeus abelii) and ten Bornean orangutans (P.
p. pygmaeus).
For each locus, we obtained two outgroup sequences. For the chimpanzee data sets,
one orangutan and one human sequence were available for 26 and 19 loci, respectively
(Fischer et al., 2006); one human (Yu et al., 2002) and one gorilla sequence (G. g.
gorilla, Yu et al., 2004) were obtained for 43 loci. For the seven remaining loci, we
used BLASTN (http://www.ncbi.nlm.nih.gov/BLAST; Altschul et al., 1990) to find and
download a homologous human sequence. For the gorilla data set,
one orangutan and one human sequence were available for all loci; for the orangutan
data set, one chimpanzee and one human sequence were available for all loci (Fischer
et al., 2006). We used CLUSTALW in MEGA3.1 (Thompson et al., 1994; Kumar
et al., 2001) to align the resequencing data and the two outgroup sequences. We
then wrote a Perl script that infers the ancestral state at each site only when both
outgroup sequences agree, thus minimizing error in the reconstruction of
the ancestral state. We ignored sites with gaps, missing data or more than two
variants. (There was only one site with three alleles in the entire gorilla data set,
three in the orangutan data, and six in the chimpanzee/bonobo data.) We used a
Perl script to calculate for each locus the four statistics S1, S2, S3 and S4 (see above)
and FST (Hudson et al., 1992) for pairs of populations, as well as the mean pairwise
differences (π; Nei and Li, 1979) and Tajima’s D (Tajima, 1989) in each population.
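The site filtering and counting described above can be sketched as follows. This is an illustrative Python reimplementation, not the Perl scripts used for the analyses; the function names are hypothetical. As a simplifying assumption, sites are classified in the style of Wakeley and Hey (1997), i.e., by whether they are polymorphic in the first sample only (S1), in the second sample only (S2), in both (S3), or fixed for different alleles (S4); the exact definitions used by MIMAR may differ in how they use the inferred ancestral state.

```python
VALID_BASES = set("ACGT")


def ancestral_state(outgroup_a, outgroup_b):
    """Return the ancestral base only if both outgroups agree on a valid base."""
    if outgroup_a == outgroup_b and outgroup_a in VALID_BASES:
        return outgroup_a
    return None


def classify_sites(sample1, sample2, outgroup_a, outgroup_b):
    """Count S1, S2, S3 and S4 over aligned sequences of equal length."""
    counts = {"S1": 0, "S2": 0, "S3": 0, "S4": 0}
    for i in range(len(outgroup_a)):
        alleles1 = {seq[i] for seq in sample1}
        alleles2 = {seq[i] for seq in sample2}
        observed = alleles1 | alleles2
        if not observed <= VALID_BASES:
            continue  # gap or missing data at this site
        if len(observed) > 2:
            continue  # more than two variants
        ancestral = ancestral_state(outgroup_a[i], outgroup_b[i])
        if ancestral is None:
            continue  # outgroups disagree: ancestral state not inferred
        if len(observed) == 2 and ancestral not in observed:
            continue  # ancestral base would be a third variant
        poly1 = len(alleles1) == 2
        poly2 = len(alleles2) == 2
        if poly1 and poly2:
            counts["S3"] += 1      # shared polymorphism
        elif poly1:
            counts["S1"] += 1      # polymorphic in sample 1 only
        elif poly2:
            counts["S2"] += 1      # polymorphic in sample 2 only
        elif alleles1 != alleles2:
            counts["S4"] += 1      # fixed difference between samples
    return counts


# Hypothetical six-site alignment; the last two sites are filtered out.
sample1 = ["ACGTAA", "TCATGA", "ACGTA-"]
sample2 = ["ACGCAA", "AGACAA"]
counts = classify_sites(sample1, sample2, "ACGTAA", "ACGTGA")
print(counts)
```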
Estimates of mutation rate variation. To allow for variation in mutation rates,
we used the scalars v described above. To do so, we calculated the mean pairwise
divergence per site between a human sequence and an ape sequence (div), using a Perl
script. We obtained the expected locus divergence given the number of base pairs, Z,
as E(div) · Z, where E(div) is the mean divergence per base pair over the loci, and
performed a goodness-of-fit test using Pearson’s χ^2 (Frisse et al., 2001). The gorilla
and orangutan data did not deviate significantly from expectation (p-value = 0.24 and
0.85, respectively); therefore, we set v = 1 for all loci in the analysis of these two data
sets. However, data from the three common chimpanzee populations and the bonobo
rejected the null hypothesis of a homogeneous mutation rate across loci (at the 5%
level). Thus, for a pair of chimpanzee populations or species, we set v at a locus to
the observed divergence per base pair divided by E(div).
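The calculation of the test statistic and of the scalars v can be sketched in Python as follows. This is an illustration rather than the script used for the analyses: the function name and the toy numbers are hypothetical, E(div) is taken here to be the unweighted mean of the per-locus divergences per base pair (one possible reading of the text), and the number of degrees of freedom (number of loci minus one) is likewise an assumption.

```python
def mutation_rate_scalars(differences, lengths):
    """Pearson chi-square statistic and per-locus scalars v.

    differences[y]: number of human-ape differences at locus y
    lengths[y]:     number of aligned base pairs Z at locus y
    """
    per_bp = [d / z for d, z in zip(differences, lengths)]
    mean_div = sum(per_bp) / len(per_bp)             # E(div), assumed unweighted
    expected = [mean_div * z for z in lengths]       # E(div) * Z
    chi_square = sum((d - e) ** 2 / e for d, e in zip(differences, expected))
    dof = len(lengths) - 1                           # assumed degrees of freedom
    scalars = [div / mean_div for div in per_bp]     # v = div / E(div)
    return chi_square, dof, scalars


# Hypothetical toy data: three loci of 1 kb each.
chi_square, dof, v = mutation_rate_scalars([8, 12, 10], [1000, 1000, 1000])
print(chi_square, dof, v)
```

With this definition of E(div), the scalars v average to 1 across loci. A p-value can then be obtained from the upper tail of a χ^2 distribution with the chosen degrees of freedom (for example with scipy.stats.chi2.sf, if SciPy is available).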
Recombination rates. Won and Hey (2005) found evidence of recombination in
bonobos, Central and Western chimpanzees in ten of the 43 short segments surveyed
by Yu et al. (2003) and used in this study. We estimated the locus-specific recombination
rate in the data sets using MAXDIP (http://genapps.uchicago.edu/maxdip/index.html;
Hudson, 2001), setting 0.005 as the initial value and the gene conversion
rate to 0. From each species, we chose the subspecies with the largest estimate
of the mean recombination rate across loci, which were Central chimpanzees, Western
gorillas and Sumatran orangutans. Then, to assess whether the point estimates
were significantly greater than 0, we simulated 1,000 data sets using ms (Hudson,
2002), setting the number of segregating sites to what was observed and ρ to 0. We
ran MAXDIP on the simulated data sets and calculated how often the MAXDIP
estimate of ρ under the standard neutral model was equal to or larger than the
observed estimate. By this approach, we rejected ρ = 0 for four out of 35 loci in Central
chimpanzees, one out of seven loci in Western gorillas and three out of 15 loci in
Sumatran orangutans (at the 5% level). Given the small sample sizes, our power
to detect recombination was limited. Nonetheless, our results suggest that ignoring
recombination will result in a loss of data − even in species in which ρ/θ is rela-
tively small. In the analyses of the ape data, we chose r = ρ/θ1 for each locus from
exp (1/0.6) (see above). We chose this distribution because it has been shown to be
a good description of fine-scale recombination rate variation in humans and may also
apply to a number of other organisms, notably to other apes (Coop and Przeworski,
2007).
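The simulation-based test of ρ = 0 described above amounts to an empirical one-sided p-value. A minimal Python sketch follows; the function name and the numbers are hypothetical placeholders, and in practice the simulated estimates would come from running MAXDIP on data sets simulated with ms.

```python
def empirical_p_value(observed_estimate, null_estimates):
    """Fraction of null simulations whose estimate is equal to or larger
    than the observed estimate."""
    n_extreme = sum(1 for value in null_estimates if value >= observed_estimate)
    return n_extreme / len(null_estimates)


# Placeholder values standing in for MAXDIP estimates from 1,000 simulations
# with rho = 0; they are not results from the thesis.
null_estimates = [0.0] * 960 + [0.004] * 40
p_value = empirical_p_value(0.003, null_estimates)
print(p_value)
```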
Analyses. We ran MIMAR for 2 × 10^7 recording steps with 1 × 10^6 burn-in steps,
X = 50, recording the parameters every 50 steps and using prior distributions chosen
after preliminary analyses. We repeated our analyses for two independent seeds, and
considered that convergence was reached when the posterior distributions of both
runs were very similar (results not shown). Results reported are for the average from
the two independent runs. We obtained estimates of the effective population sizes
and split times in years for all the ape species and subpopulations using µ = 2 × 10^-8
per base pair per generation and assuming 20 years per generation for chimpanzees
and orangutans (Gage, 1998; Fischer et al., 2004) and 15 years per generation for
gorillas (Thalmann et al., 2006).
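The unit conversion can be illustrated with a short Python sketch. It assumes the usual scaling θ = 4Nµ per base pair and a split time T already expressed in generations; the function names and input values are hypothetical and are not estimates from this study.

```python
MU = 2e-8  # mutation rate per base pair per generation, as in the text


def effective_size(theta_per_bp, mu=MU):
    """Effective population size, assuming theta = 4 * N * mu per base pair."""
    return theta_per_bp / (4.0 * mu)


def split_time_years(t_generations, years_per_generation):
    """Convert a split time from generations to years."""
    return t_generations * years_per_generation


# Hypothetical inputs: theta = 0.0008 per base pair, T = 50,000 generations.
n_e = effective_size(0.0008)
years_chimpanzee = split_time_years(50000, 20)  # 20 years per generation
years_gorilla = split_time_years(50000, 15)     # 15 years per generation
print(n_e, years_chimpanzee, years_gorilla)
```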
Goodness-of-fit test. We investigated how well the data fit the estimated model
by generating the posterior predictive distributions of the four statistics S1, S2, S3,
and S4 summed over all loci, the mean FST (Hudson et al., 1992), and, in each
population, the mean pairwise differences (π; Nei and Li, 1979) and the mean Tajima’s
D (Tajima, 1989) across loci. To do so, we simulated data sets under the isolation-
migration model, sampling the parameters from the posterior distribution estimated
by MIMAR. We then compared the observed values of the statistics to the simulated
distribution (see Supplemental Figure 3.10), conservatively considering the model to
be a poor fit if the observed value of any statistic fell in either 2.5% tail of its
simulated distribution. We note that, since this goodness-of-fit test takes into account the
uncertainties associated with the estimates, it is similar to the Bayesian posterior
predictive p-value (e.g., Meng, 1994).
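The comparison of observed statistics with their posterior predictive distributions can be sketched in Python as follows. This is a schematic illustration only: the function names and the toy distributions are hypothetical, and in practice the simulated values would come from data sets generated under the isolation-migration model with parameters sampled from the posterior.

```python
def tail_fraction(observed, simulated):
    """Smaller of the two tail fractions of the simulated distribution
    at the observed value."""
    n = len(simulated)
    below = sum(1 for s in simulated if s <= observed) / n
    above = sum(1 for s in simulated if s >= observed) / n
    return min(below, above)


def poorly_fit_statistics(observed_stats, simulated_stats, tail=0.025):
    """Names of the statistics whose observed value lies in either tail."""
    return [name for name, value in observed_stats.items()
            if tail_fraction(value, simulated_stats[name]) < tail]


# Toy posterior predictive distributions (placeholders, not thesis results).
simulated_stats = {
    "Fst": [i / 1000.0 for i in range(1000)],
    "D": [i / 1000.0 for i in range(1000)],
}
observed_stats = {"Fst": 0.5, "D": 0.005}
flagged = poorly_fit_statistics(observed_stats, simulated_stats)
print(flagged)
```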
3.6 Acknowledgments
Many thanks to G. Coop, R. Hudson, J. Novembre, J. Pritchard, D. Reich, M.
Stephens, K. Teshima and K. Zeng, as well as three reviewers for helpful discussions
and/or comments on earlier versions of the manuscript. This work was supported by
an Alfred P. Sloan fellowship in Computational Molecular Biology to M.P. C.B. would
also like to acknowledge support from the Summer Institute in Statistical Genetics
(2006).
3.7 Appendix A: Supplemental Materials
Recall that our goal is to sample from the posterior distribution:
π(Θ|D) ∝ p(D|Θ)p(Θ).
To do so, we perform MCMC on the set of all genealogies,
G = (G_11, ..., G_1X, ..., G_Y1, ..., G_YX),
and the parameters, Θ = (θ1, θ2, θA, T, M, P), by sampling sets (G,Θ) from the target
distribution, π(G,Θ) (see Equation 3.3 in Methods). By integrating Equation
3.3 over genealogies G, we show that the marginal distribution of sampled values of
Θ from the target distribution is the posterior distribution of interest, π(Θ|D):
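\[
\begin{aligned}
\int \pi(G,\Theta)\,dG &\propto \int \left[\prod_{y=1}^{Y} \frac{1}{X} \sum_{x=1}^{X} p(D_y \mid G_{yx},\Theta)\right] p(\Theta)\, p(G \mid \Theta)\, dG \\
&\quad \text{(realizing that the integration is over independent } G_{yx}\text{)} \\
&= \left[\prod_{y=1}^{Y} \frac{1}{X} \sum_{x=1}^{X} \int p(D_y \mid G_{yx},\Theta)\, p(G_{yx} \mid \Theta)\, dG_{yx}\right] p(\Theta) \\
&= \left[\prod_{y=1}^{Y} \frac{1}{X} \sum_{x=1}^{X} p(D_y \mid \Theta)\right] p(\Theta) \\
&= p(D \mid \Theta)\, p(\Theta).
\end{aligned}
\]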
3.8 Appendix B: Supplemental Figures
[Figure 3.8(a): four panels of smoothed posterior densities, for θ1, θ2, θA and T.]
Figure 3.8: Smoothed posterior distributions estimated by MIMAR from simulated data sets. The distributions are the average over two independent runs. Each panel (a−c) shows results for 10 simulated data sets consisting of 20 loci surveyed at 20 chromosomes from each population, with no intra-locus recombination or migration and θ1 = θ2 = 0.005 per base pair, a) with θA = 0.005 per base pair and T = 2.5 × 10^6 generations, b) with θA = 0.005 and T = 5 × 10^6 and c) with θA = 0.008 and T = 2.5 × 10^6. The range of the x-axis corresponds to the support of the prior and the red vertical line denotes the true value of the parameter (see Methods in main text for details).
[Figure 3.8(b): four panels of smoothed posterior densities, for θ1, θ2, θA and T.]
Figure 3.8 − continued: Results for 10 data sets simulated with no intra-locus recombination or migration, θ1 = θ2 = θA = 0.005 per base pair and T = 5 × 10^6 generations.
[Figure 3.8(c): four panels of smoothed posterior densities, for θ1, θ2, θA and T.]
Figure 3.8 − continued: Results for 10 data sets simulated with no intra-locus recombination or migration, θ1 = θ2 = 0.005 and θA = 0.008 per base pair and T = 2.5 × 10^6 generations.
[Figure 3.9(a): four panels of smoothed posterior densities, for θ1, θ2, θA and T.]
Figure 3.9: Smoothed posterior distributions estimated by IM (black) and MIMAR (grey) from simulated data sets. Results from two independent runs are shown. Both methods ran for the same number of steps and were smoothed similarly. a) and b) are the results for two data sets chosen from the 30 data sets simulated under the allopatric model without intra-locus recombination to show a case in which IM (a) or MIMAR (b) performed better (see Figure 3.2 in main text). The range of the x-axis corresponds to the support of the prior and the red vertical line denotes the true value of the parameter (see Methods in main text for details).
[Figure 3.9(b): four panels of smoothed posterior densities, for θ1, θ2, θA and T.]
Figure 3.9 − continued: A case in which MIMAR performed better.
[Figure 3.10(a): four panels showing simulated distributions of the S statistics (S1, bonobo-specific; S2, Western-specific; S3, shared; S4, fixed), Fst, π and Tajima’s D for bonobos and Western chimpanzees.]
Figure 3.10: Goodness-of-fit of the isolation-migration model for the ape species and subspecies data. Distributions of the four statistics used by MIMAR (the polymorphisms specific to the first and second sample and shared and fixed between the two samples, S1, S2, S3 and S4, respectively), as well as Fst, π and Tajima’s D (see Methods in main text for details). The vertical lines correspond to the observed values. Shown are results for a) bonobos (blue) and Western chimpanzees (red), b) bonobos and Central chimpanzees, c) bonobos and Eastern chimpanzees, d) Western and Central chimpanzees, e) Western and Eastern chimpanzees, f) Central and Eastern chimpanzees, g) Western and Eastern gorillas and h) Sumatran and Bornean orangutans. In this case, the estimated model does not seem to provide a good fit to D for bonobos.
[Figure 3.10(b): four panels showing simulated distributions of the S statistics (S1, S2, S3, S4), Fst, π and Tajima’s D for bonobos and Central chimpanzees.]
Figure 3.10 − continued: Results for bonobos (blue) and Central chimpanzees (red). In this case, the estimated model does not seem to provide a good fit to D for both bonobos and Central chimpanzees, as well as to π for Central chimpanzees.
[Figure 3.10(c): four panels showing simulated distributions of the S statistics (S1, S2, S3, S4), Fst, π and Tajima’s D for bonobos and Eastern chimpanzees.]
Figure 3.10 − continued: Results for bonobos (blue) and Eastern chimpanzees (red). In this case, the estimated model does not seem to provide a good fit to D for bonobos.
[Figure 3.10(d): four panels showing simulated distributions of the S statistics (S1, S2, S3, S4), Fst, π and Tajima’s D for Western and Central chimpanzees.]
Figure 3.10 − continued: Results for Western (blue) and Central chimpanzees (red).
[Figure 3.10(e): four panels showing simulated distributions of the S statistics (S1, S2, S3, S4), Fst, π and Tajima’s D for Western and Eastern chimpanzees.]
Figure 3.10 − continued: Results for Western (blue) and Eastern chimpanzees (red).
[Figure 3.10(f): four panels showing simulated distributions of the S statistics (S1, S2, S3, S4), Fst, π and Tajima’s D for Central and Eastern chimpanzees.]
Figure 3.10 − continued: Results for Central (blue) and Eastern chimpanzees (red). In this case, the estimated model appears to provide a poor fit to the observed FST and to D for Central chimpanzees.
[Figure 3.10(g): four panels showing simulated distributions of the S statistics (S1, S2, S3, S4), Fst, π and Tajima’s D for Western and Eastern gorillas.]
Figure 3.10 − continued: Results for Western (blue) and Eastern (red) gorillas.
[Figure 3.10(h): four panels showing simulated distributions of the S statistics (S1, S2, S3, S4), Fst, π and Tajima’s D for Sumatran and Bornean orangutans.]
Figure 3.10 − continued: Results for Sumatran (blue) and Bornean (red) orangutans.
CHAPTER 4
CAN WE LEARN ABOUT MODES OF SPECIATION BY
COMPUTATIONAL APPROACHES?
4.1 Abstract
An enduring debate in evolutionary biology centers on the question of whether the
early stages of speciation can occur in the presence of gene flow. With this question
in mind, a number of recent papers have used multi-locus polymorphism data from
closely related species to estimate parameters of simple speciation models, usually an
isolation-migration model. Application of such computational approaches to a variety
of species has yielded extensive evidence for migration, with the results interpreted
as supporting the widespread occurrence of parapatric speciation. Here, we conduct
a simulation study to assess the reliability of such inferences, using a program that
we recently developed (MIMAR) as well as the program IM of Hey and Nielsen (2004).
We find that when one of many assumptions of the isolation-migration model is
violated, the methods tend to yield biased estimates of the parameters, potentially
lending spurious support for parapatric speciation. To some extent, this problem
can be avoided by carefully testing the fit of estimated models to the data. When a
parapatric model is appropriate, we propose a test that may help to pinpoint regions
involved in reproductive isolation.
4.2 Introduction
The biological species concept proposed by Mayr (1963) defines species as “groups
of interbreeding natural populations that are reproductively isolated from other such
groups”. Reproductive isolation can act either prezygotically, through mating pref-
erences for example, or postzygotically, in which case it stems from decreased hybrid
fitness (Coyne and Orr, 2004d). The decrease in hybrid fitness, in turn, may be due
to extrinsic environmental interactions, such as a higher predation on hybrids, or to
intrinsic genetic incompatibilities, i.e., genetic interactions between heterospecific al-
leles present in a hybrid genome that directly reduce the fitness of hybrids (so-called
Dobzhansky-Muller (DM) incompatibilities; Dobzhansky, 1937; Muller, 1940). To
date, the specific mechanisms by which loci involved in pre- or post-zygotic reproduc-
tive isolation fix between nascent species remain poorly understood. In particular,
an enduring question is how often reproductive isolation factors begin to accumulate
between populations in the face of gene flow rather than in complete isolation (e.g.,
Orr, 2005).
A common model for the origin of new species assumes that the early stage of
divergence occurs in the absence of gene flow, i.e., in allopatry. Many cases of al-
lopatric divergence have been documented and, until recently, it was believed to be
the only − or at least main − process by which species originate (e.g., Coyne and
Orr, 2004a). In this model, an exogenous barrier (e.g., a river or a mountain) leads to
the isolation of two populations, and divergence then occurs homogeneously as the
genomes of the species accumulate differences. Both DM incompatibilities that fix
through genetic drift (i.e., neutrally) and loci that underlie local adaptation will contribute
to the reproductive isolation of species should they come back into secondary
contact. Importantly, in this scenario, reproductive isolation is a by-product of
divergence rather than its cause (Mayr, 2001; Orr, 2001).
Alternatives are that species start diverging while occupying either the same geo-
graphical area (i.e., they are in sympatry) or non-overlapping, contiguous areas (i.e.,
in parapatry). Because they are in close geographic proximity, the populations ex-
change migrants and form hybrids. In this context, divergence occurs in a number of
stages, which have recently been characterized as the “genic view of speciation” (Wu,
2001b). First, natural selection leads to the fixation of alleles that contribute to the
differential adaptation of the two populations. These differentially adapted loci are
sometimes referred to as “speciation genes” because they actively reduce the fitness
of the hybrids, either through differential geographic adaptation or by contributing
to DM incompatibilities. These alleles create an effective barrier to gene flow in the
linked genomic regions, while other, unlinked regions can introgress without causing
decreased hybrid fitness (Coyne and Orr, 1998; Wu, 2001a,b; Orr et al., 2004; Coyne
and Orr, 2004c; Wu and Ting, 2004). Thus, some regions of the genome experience
a reduced effective gene flow rate between populations relative to others. As the
populations become increasingly diverged and adapted to their local environments,
more genes contributing to reproductive isolation accumulate − until the final stage
of speciation, when gene flow is prevented throughout the genome. Thus, in contrast
to the classical models of allopatric speciation, models where speciation occurs in the
presence of gene flow assume that natural selection plays a major role in the accumu-
lation of reproductive isolation loci between species. They predict that the genomes
of closely related species should be mosaics of recently introgressed and highly di-
verged regions, leading to far greater heterogeneity than expected under a model of
allopatric speciation.
Theoretical modeling suggests that speciation can occur in the presence of high levels
of gene flow (e.g., Dieckmann and Doebeli, 1999; Navarro and Barton, 2003), and the
possibility of gene flow at the initial stage of speciation is supported by the observation
that species parapatrically or sympatrically distributed tend to show pre-mating
isolation more often than species distributed allopatrically (Coyne and Orr, 1989).
However, these alternative models of speciation remain controversial (Wu, 2001a,b; Mayr,
2001; Orr, 2001; Mallet, 2005), in part because of the difficulty in distinguishing parapatry
or sympatry from secondary contact after an allopatric phase (e.g., Goodman
et al., 1999; Llopart et al., 2002; Austin et al., 2004; Aagaard et al., 2005; Hall, 2005;
Knaden et al., 2005; Petren et al., 2005; Geraldes et al., 2006; Fritz et al., 2007). There
is indirect evidence that the early stages of speciation may often occur in the presence
of some gene flow (e.g., Dixon et al., 2007; Nosil, 2008). Notably, molecular studies
have recently characterized several genes that contribute to the reproductive isolation
of sister species in model organisms such as Drosophila (Barbash et al., 2003; Presgraves
et al., 2003; Ting et al., 1998; Wang et al., 1999; Sawamura et al., 1993), Mus
(Fossella et al., 2000; Greene-Till et al., 2000) and Xiphophorus (Wittbrodt et al.,
1989; Kallman and Kazianis, 2006), and the reproductive isolation factors all show evidence
of strong positive selection (Wu, 2001a; Orr, 2005). While it remains unclear if
these factors drove speciation or accumulated since the species formed, these findings
suggest that Darwinian selection is the major force in the emergence of reproductive
isolation (Orr, 2005), consistent with models of parapatric speciation. Unfortunately,
fewer than ten reproductive isolation factors have been identified to date, limiting the
generality of the conclusions that can be reached.
An alternative approach, facilitated by recent technical advances, is to examine
patterns of genetic variation at multiple loci surveyed in closely related species or
recently diverged populations. Such analyses suggest that divergence levels are highly
heterogeneous, consistent with the notion that speciation often occurs in the presence of
gene flow (e.g., Machado et al., 2002; Emelianov et al., 2004; Mallet, 2005; Turner
et al., 2005; Barluenga et al., 2006; Savolainen et al., 2006; Stadler et al., 2008).
However, it remains unclear whether the detected introgressions occurred in the early
or late stages of speciation (Barton, 2001), and whether the variation in divergence
rates across loci is due to introgression or to the persistence of polymorphism that
was present in the common ancestor of the two populations.
Computational approaches have recently been developed to help distinguish be-
tween these two cases by estimating the parameters of divergence models (e.g., Wake-
ley, 1996; Wakeley and Hey, 1997; Kliman et al., 2000). The simplest such model,
referred to as the isolation model, is illustrated in Figure 4.1a: It describes the case
where an ancestral population suddenly split into two populations, which subse-
quently diverged neutrally in total isolation, and assumes panmixia and constant
effective population sizes. Under this cartoon of allopatric divergence, the descen-
dant populations will differ by an increasing number of fixed polymorphisms and
will share fewer polymorphisms as the divergence time increases, with some variation
across loci due to genetic drift. For example, after approximately 9−12N generations
(where N is the effective size of the descendant population), more than 95% of loci
will not share polymorphisms across the descendant populations (Hudson and Coyne,
2002). A second model, termed isolation-migration, also assumes a sudden
split from an ancestral population, but with the two descendant populations experiencing
a constant rate of gene flow since the split (Figure 4.1b). This scenario predicts that regions
that experienced gene flow in the history of the sample will harbor shared polymor-
phisms and few fixed polymorphisms, while regions that did not experience gene flow
will have more fixed polymorphisms, with any shared polymorphisms resulting from
the persistence of ancestral polymorphisms. Thus, in the isolation-migration model,
the variation in number of shared and fixed polymorphisms across loci is expected to
be much greater than in the isolation model. In that sense, the isolation-migration
model can be thought of as a neutral approximation to parapatric (or with a high
enough migration rate, sympatric) speciation.
Currently, there are two main methods, IM (Hey and Nielsen, 2004) and MIMAR
(Becquet and Przeworski, 2007) to estimate parameters of the isolation-migration
model (Figure 4.1b). Both use multi-locus polymorphism data collected in two species
or populations and Markov Chain Monte Carlo to estimate posterior distributions of
the parameters. The first method, IM, relies on the full polymorphism data from
a variety of genetic markers (e.g., AFLP, microsatellites and resequencing data) and
allows for a wide range of complex demographic models, including population bottlenecks
and growth. However, it assumes no intra-locus recombination, so that regions
with evidence of recombination in the polymorphism data need to be excluded and
non-detectable recombination is ignored. This limitation can lead to biased estimates
of the isolation-migration model parameters (Takahata and Satta, 2002; Hey and
Nielsen, 2004; Becquet and Przeworski, 2007). IM has nonetheless been widely used
to analyze autosomal data, by considering subsets of the data with no apparent re-
combination (Hey and Nielsen, 2004). While some studies did not detect gene flow
(e.g., Dolman and Moritz, 2006), most did (e.g., Llopart et al., 2005; Won et al., 2005;
Bull et al., 2006; Geraldes et al., 2006; Kronforst et al., 2006; Carling and Brumfield,
2008; Mazzoni et al., 2008). The authors usually interpreted the gene flow as recent
or as reflecting secondary contact (e.g., Llopart et al., 2005; Geraldes et al., 2006),
under the assumption that the early stages of speciation likely occur in the absence
of gene flow. However, two recent reviews have taken the evidence to mean that the
gene flow occurred in the first stages of species formation, concluding that parapatric
speciation may be common (Hey, 2006b; Niemiller et al., 2008).
To circumvent the assumption of no intra-locus recombination made by IM, we
recently developed a second approach, MIMAR (Becquet and Przeworski, 2007). In
contrast to IM, the method uses summaries of the polymorphism data (from each
of the independently-evolving loci) rather than the full data. Specifically, it uses
four statistics known to be sensitive to the parameters of the isolation-migration
model (Wakeley and Hey, 1997; Leman et al., 2005). In simulations, MIMAR performs
comparably to IM for medium to large size data sets, providing reliable estimates of
the parameters and reasonable power to detect gene flow. MIMAR, like the original
version of IM, assumes a constant and uniform gene flow rate since the split, so
cannot distinguish between models of speciation with gene flow occurring at an early
or late stage. Thus, interpreting the results in terms of modes of speciation is not
straightforward.
In drawing inferences about speciation from computational methods, a further
complication is that the approaches may not be robust to realistic departures from the
assumptions of the models. For instance, in many analyses of the isolation-migration
model published to date (e.g., Hey et al., 2004; Hey, 2005; Won and Hey, 2005;
Thalmann et al., 2006; Becquet and Przeworski, 2007), the estimate of the ancestral
effective population size is more than twofold larger than that of the descendant
populations. Although in some cases these results may reflect a true reduction in
effective population sizes after the split, this is unlikely to be the case for so many
populations. Instead, the large estimates of ancestral population size may indicate
that a salient and fairly common demographic feature is being ignored. For instance,
geographical structure in the ancestral population or changing gene flow rates since
the split could lead to biased estimates of the parameters when not taken into account
(e.g., Innan and Watanabe, 2006).
Motivated by these considerations, we wanted to assess the reliability of these two
methods and in particular to investigate how realistic violations of model assumptions
affect parameter estimates. We find that parameter estimates tend to be biased in
the presence of realistic complications. In particular, non-zero gene flow is often
detected under models that would not be considered cases of parapatric or sympatric
speciation. When a model of parapatry is appropriate, we propose a simple approach
to identify candidate regions that contribute to reproductive isolation between species.
4.3 Method
4.3.1 The isolation model and violations
Figure 4.1a depicts an isolation model in which, T generations ago, a panmictic an-
cestral population of constant effective population size NA suddenly split into two
panmictic populations of constant effective population sizes N1 and N2, respectively,
which subsequently diverged neutrally in total reproductive isolation (i.e., the pro-
portion of a population that is replaced by migrants from the other population each
generation, m, is 0). This model can be viewed as representing a simple case of
allopatric speciation.
We wanted to assess the reliability of the methods with parameters that are
roughly similar to those of real data sets analyzed to date. For example, a num-
ber of papers have analyzed common chimpanzee and bonobo polymorphism data,
two species whose divergence time is estimated to be approximately 1 million years
and whose effective population sizes are on the order of N = 10,000 (Won and Hey,
2005; Becquet and Przeworski, 2007). Assuming 20 years per generation (Gage, 1998;
Fischer et al., 2004), these great ape species therefore split ∼5N generations ago. In
turn, one of the most studied cases of Drosophila speciation is that of D. santomea and
D. yakuba, which are thought to have split roughly 500 thousand years ago (Kya), or
5 million generations ago (assuming 10 generations per year). Their effective
population sizes are about N = 2 × 10^6 (Bachtrog et al., 2006), so they split
roughly 2N generations ago. Motivated by these two cases, we arbitrarily considered
T = 3.2N . With this parameter value, on average ∼25−34% of loci will have complete
lineage sorting (i.e., will not share polymorphisms) between the two species under the
isolation model (Hudson and Coyne, 2002). We further set all effective population
sizes to be the same, at N1 = N2 = NA = 6.25 × 10^5. Computational burden limited
our ability to consider other parameters, but we did run a subset of the models with
T = 0.32N generations, and results were qualitatively similar (results not shown).
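The conversion of split times from years into multiples of N generations, used above to motivate T = 3.2N, can be sketched as follows. This is only an illustration of the arithmetic; the function name is hypothetical and the inputs are the rounded values quoted in the text, which give 5N for the apes and about 2.5N for the Drosophila pair.

```python
def split_time_in_units_of_n(years_since_split, generations_per_year, effective_size):
    """Return the split time expressed as a multiple of N generations."""
    generations = years_since_split * generations_per_year
    return generations / effective_size

# Common chimpanzee and bonobo: ~1 million years, 20 years per generation, N ~ 10,000.
chimp_bonobo = split_time_in_units_of_n(1e6, 1 / 20, 1e4)

# D. santomea and D. yakuba: ~500 thousand years, 10 generations per year, N ~ 2e6.
drosophila = split_time_in_units_of_n(5e5, 10, 2e6)

print(f"chimpanzee-bonobo split: {chimp_bonobo:.1f}N generations")
print(f"Drosophila split:        {drosophila:.1f}N generations")
```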
For the performance of IM and MIMAR to be comparable, we set the intra-locus
recombination rate to 0 (since IM does not allow for intra-locus recombination).
Thus, in what follows, we consider data sets of 20 independently-evolving and non-
recombining 1kb loci. In real data, there usually is intra-locus recombination so a
data set of this size would generally contain more information than modeled here.
Given these parameters, we generated 20 data sets under the isolation model. We
further simulated 10 data sets for each parameter combination under two models in which
one of the assumptions of the isolation model is violated. The parameters were as above,
other than in the following respects: We took T1 = 1/0.75, 2 or 4·T (Figure 4.1c) or
T2 = 0.25, 0.50 or 0.75·T (Figure 4.1d), where the split time, T, was fixed in all cases
to 3.2N generations, and the gene flow rate after T1 (or T2) was m1 (or m2) = 4 × 10^−7
or 4 × 10^−6, while directly after the split m = 0. For these gene flow rates, the average
number of migrants exchanged each generation by the two populations, M1 = 4N1m1 (or
M2 = 4N1m2), is 1 or 10, respectively. We used a modification of the program ms (Hudson, 2002) to
generate the simulated data sets. The command lines used to generate each case are
given in the Supplemental Materials (Appendix A).
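As a check on the parameter values above, the following sketch recomputes the scaled migration rates and the event times. It is not the command line used for the simulations; the helper names are hypothetical, the first T1 multiplier is read as 1/0.75 (an assumption, chosen because it matches the 4.3N label in Figure 4.3), and the 4N time scaling is the convention of ms.

```python
N = 6.25e5            # N1 = N2 = NA in the simulations described above
T = 3.2 * N           # split time, in generations

def scaled_migration(n1, m):
    """Average number of migrants per generation, M = 4 * N1 * m."""
    return 4 * n1 * m

def generations_to_4n_units(generations, n):
    """Express a time in units of 4N generations (the scaling used by ms)."""
    return generations / (4 * n)

# Onset of ancestral structure (Figure 4.1c); first multiplier assumed to be 1/0.75.
t1_values = [T / 0.75, 2 * T, 4 * T]
# Onset of secondary contact (Figure 4.1d).
t2_values = [0.25 * T, 0.50 * T, 0.75 * T]

for m in (4e-7, 4e-6):
    print(f"m = {m:g}: M = {scaled_migration(N, m):g}")
print("T1 in units of N:", [round(t / N, 1) for t in t1_values])
print("T2 in units of N:", [round(t / N, 1) for t in t2_values])
print("T in units of 4N:", generations_to_4n_units(T, N))
```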
[Figure 4.1 appears here: five panels, (a) to (e), each drawn on a Past-to-Present axis
with an ancestral population of size NA splitting at time T into populations of sizes
N1 and N2. Panel labels: (a) m = 0 since T; (b) m > 0 since T; (c) m1 > 0 between T1
and T, m = 0 since T; (d) m = 0 from T to T2, m2 > 0 since T2; (e) m > 0 from T to
T2, m2 = 0 since T2.]
Figure 4.1: Simple models of speciation. a) The isolation model, b) The
isolation-migration model, c) A model of isolation from a structured ancestral
population, d) A model of isolation followed by secondary contact and e) A model of
isolation with migration only at an early stage. Note that in c) the population
structure for T < t < T1 differs from that for t < T. The models displayed in a), c)
and d) are neutral cartoons of allopatric speciation, while models b) and e) have been
viewed as neutral cartoons of parapatric speciation. m, m1 and m2 denote the
generational fraction of a population that is replaced by migrants from the other
population. N1, N2 and NA are the effective population sizes, T is the divergence time
and T1 and T2 are times of other events in generations; see Methods for further detail.
Simulations under a model of isolation from a structured ancestral pop-
ulation (Figure 4.1c). We generated simulated data sets under a neutral model
of isolation from a structured ancestral population. Specifically, we assumed that,
T1 generations ago, an ancestral population became structured into two populations
that exchanged migrants at constant rate m1 > 0. Then, T generations ago, this
structured ancestral population suddenly split into two descendant populations that
subsequently diverged in total reproductive isolation (i.e., m = 0). Note that the
population structure in the ancestral population does not correspond to the two
populations that subsequently diverged. This scenario can be viewed as allopatric
speciation from a structured ancestral population, since the split leading to the two
contemporary populations occurs in the absence of gene flow.
Simulations under a model of isolation followed by secondary contact (Fig-
ure 4.1d). We generated simulated data sets under a neutral model of isolation
followed by secondary contact as follows: Starting T generations ago, the descendant
populations diverged in total reproductive isolation (i.e., m = 0) until T2 generations
ago, when they experienced gene flow at constant rate m2 > 0. This scenario can be
viewed as allopatric speciation followed by secondary contact, since the split initially
occurs in the absence of gene flow.
4.3.2 The isolation-migration model and violations
In Figure 4.1b, we represent an isolation-migration model, in which T generations
ago, the ancestral population suddenly split into two populations, which subsequently
diverged in the presence of a constant rate of gene flow (i.e., a constant m > 0). This
model can be viewed as representing a neutral cartoon of parapatric speciation (see
Introduction). We generated 20 simulated data sets under this model, each consisting
of 20 non-recombining 1 kb loci (Note that to make the results comparable, the
results for this model in Figure 4.5 are for a randomly chosen subset of 10 data
sets). The parameter values were as described above for the isolation model, but
with m = 4 × 10^−7 since the split.
Simulations under a model of isolation with migration at an early stage
(Figure 4.1e). We generated simulated data sets under an isolation-migration
model, but with changing gene flow rates over time: Starting T generations ago,
the descendant populations diverged in the presence of gene flow at constant rate
m > 0 until T2 generations ago, when they subsequently diverged in total reproduc-
tive isolation (i.e., m2 = 0). This scenario can be viewed as a neutral representation
of parapatric speciation, since the split initially occurs in the presence of gene flow.
Six sets of 10 data sets were simulated under this model, one for each combination of
T2 = 0.25, 0.50 or 0.75·T and m = 4 × 10^−7 or 4 × 10^−6.
Simulations under an isolation-migration model with a locus with an un-
usual history. We generated simulated data sets, consisting of 20 non-recombining
1 kb loci, under the isolation-migration model, but varying the introgression rates
among loci. Specifically, the polymorphism data for 19 of the loci were simulated
under the isolation-migration model with the same parameters as described above for
the simple isolation-migration model (i.e., the model depicted in Figure 4.1b with m = 4 × 10^−7
since the split). The remaining locus was used to mimic a region closely linked to a
reproductive isolation factor, which experienced restricted effective migration because
of selection against introgression. To do so, we simply generated polymorphism data
at the remaining locus under the same model as the 19 other loci, but with the gene
flow rate, m, set to 0 (model i, see Figure 4.1a).
We were also interested in a model in which the data include a locus
closely linked to the target of a recent selective sweep (e.g., a reproductive isolation
factor). We mimicked the sweep by simulating a reduced effective population size in one
of the populations (Galtier et al., 2000). In this model ii, the data were generated with
the same parameters as above for the simple isolation-migration model (i.e., the model
depicted in Figure 4.1b with m = 4 × 10^−7 since the split), but the data at one of the
20 loci were simulated with N1 = N2/10, while the data at the other 19 were generated with
N1 = N2.
4.3.3 Estimating the parameters of the isolation-migration model
From the simulated data, we calculated the statistics used by MIMAR, namely for each
locus: the number of derived polymorphisms unique to the samples from populations
1 and 2 (S1 and S2, respectively), the number of shared derived alleles between the
two samples (Ss), and the number of fixed derived alleles in either sample (Sf ),
assuming a known ancestral state (Becquet and Przeworski, 2007).
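A minimal sketch of how the four statistics could be tallied at one locus is given below. It is not MIMAR's own code. The input format (haplotypes as strings of '0' for ancestral and '1' for derived) and the function name are assumptions, as is the handling of edge cases: a site counts towards S1 or S2 when it segregates in one sample and is monomorphic in the other, and towards Sf when the derived allele is fixed in one sample and absent from the other.

```python
def classify_sites(sample1, sample2):
    """Count S1, S2, Ss and Sf at one locus.

    sample1, sample2: lists of haplotypes, each a string with one character
    per site ('0' = ancestral allele, '1' = derived allele).
    """
    n1, n2 = len(sample1), len(sample2)
    counts = {"S1": 0, "S2": 0, "Ss": 0, "Sf": 0}
    for site in range(len(sample1[0])):
        k1 = sum(h[site] == "1" for h in sample1)  # derived copies in sample 1
        k2 = sum(h[site] == "1" for h in sample2)  # derived copies in sample 2
        poly1 = 0 < k1 < n1
        poly2 = 0 < k2 < n2
        if poly1 and poly2:
            counts["Ss"] += 1          # segregating in both samples
        elif poly1:
            counts["S1"] += 1          # segregating in sample 1 only
        elif poly2:
            counts["S2"] += 1          # segregating in sample 2 only
        elif (k1 == n1 and k2 == 0) or (k1 == 0 and k2 == n2):
            counts["Sf"] += 1          # derived allele fixed in one sample only
    return counts

# Toy locus with four sites: one fixed, one unique to each sample, one shared.
example = classify_sites(["1101", "1000", "1001"], ["0011", "0001", "0000"])
print(example)
```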
We then analyzed the summarized simulated data sets using MIMAR and the full
simulated data sets using IM to estimate the five parameters of the isolation-migration
model (see Figure 4.1b). For the two methods to be comparable, the five parameters of
the model, including m, were sampled from the same uniform prior distributions with
0 as the lower limit (as required by IM). Each set of 10 (or 20) data sets generated
under a specific model was analyzed using the same uniform prior distributions. In
a subset of cases, we increased the range of the priors after preliminary analyses
to ensure that marginal posterior distributions estimated by both methods fit in
the prior ranges for all 10 (or 20) data sets. Specifically, we set the upper limit of
the uniform prior distributions to be either 1.25 × 10^6 or 2.5 × 10^6 for N1 and N2,
2 × 10^−5, 4 × 10^−6 or 4 × 10^−5 for m, 2.5 × 10^6 for NA and 1 × 10^7 generations
for T (i.e., either 4N1 or 8N1). We ran both MIMAR and IM with two different seeds
for each analysis, to gauge whether the methods had converged properly. We also
varied the number of burn-in and recorded steps to this end. We note that when
the assumptions of the isolation model are violated, MIMAR sometimes takes much
longer to reach convergence than when the model is valid. The problem is even more
acute with IM, which often requires more burn-in and recorded steps to reach convergence,
presumably because the approach uses the full polymorphism data. For instance,
we ran the analyses with MIMAR on a cluster of 2.60 GHz dual and 2.40 GHz quad
AMD processors, using 5 × 10^5 burn-in and 4 × 10^6 recorded steps. When the data
were simulated under an isolation model (Figure 4.1a), the analyses took about 3
days on average; in turn, when the data were simulated under a model of isolation
followed by secondary contact (Figure 4.1d), MIMAR ran for about 5 and 13 days when
m2 = 4 × 10^−7 and 4 × 10^−6, respectively. We ran IM on a cluster of 2.40 GHz Intel
processors, which is about 3.8 times slower than the cluster described above. IM
required 5 × 10^7 burn-in and 5 × 10^7 recorded steps for most of the analyses to reach
convergence. When the data were simulated under an isolation model (Figure 4.1a),
IM analyses took about 9.5 days on average; in turn, when the data were simulated
under a model of isolation followed by secondary contact (Figure 4.1d), IM required
5 independent heated chains to converge and took about 22.5 and 25.5 days when
m2 = 4 × 10^−7 and 4 × 10^−6, respectively (see also Results). In this particular case,
we do not report results for 11 of the 60 simulated data sets, as the estimates of the
posterior distribution did not seem to have converged, even after 5 × 10^7 burn-in and
5 × 10^7 recorded steps for 5 independent heated chains.
We used the mode of the marginal posterior distribution estimated by MIMAR or
IM as our point estimate of the parameter. We considered that the method detected
evidence of gene flow when the point estimate of m was ≥ 4 × 10^−8 (corresponding to an estimated M = 4N1m ≥ 0.1).
We introduced this admittedly arbitrary criterion in previous work (Becquet and
Przeworski, 2007), because it captures the approach taken more informally in other
studies (e.g., Hey and Nielsen, 2004; Won et al., 2005; Bull et al., 2006; Geraldes
et al., 2006; Kronforst et al., 2006; Carling and Brumfield, 2008; Mazzoni et al.,
2008). Other criteria, such as whether most of the density of the estimated marginal
posterior distribution of m clusters around 0 or peaks away from 0 (e.g., Hey and
Nielsen, 2004) yield the same qualitative results (not shown).
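The point estimate and the detection criterion can be illustrated with a short sketch. It assumes that the posterior is available as a list of sampled values within the prior range and approximates the mode by the midpoint of the fullest histogram bin; the function names and the number of bins are hypothetical choices, not those of MIMAR or IM.

```python
def posterior_mode(samples, lower, upper, n_bins=50):
    """Approximate the mode as the midpoint of the fullest histogram bin.

    Assumes every sample lies in [lower, upper], i.e., within the prior range.
    """
    width = (upper - lower) / n_bins
    bins = [0] * n_bins
    for x in samples:
        index = min(int((x - lower) / width), n_bins - 1)
        bins[index] += 1
    best = max(range(n_bins), key=bins.__getitem__)
    return lower + (best + 0.5) * width

def detects_gene_flow(m_hat, threshold=4e-8):
    """Apply the criterion used in the text: point estimate of m >= 4e-8."""
    return m_hat >= threshold

# Toy posterior sample for m, with a uniform prior on [0, 4e-6].
toy_samples = [1.3e-7, 1.4e-7, 1.5e-7, 9.0e-7, 2.5e-6]
m_hat = posterior_mode(toy_samples, 0.0, 4e-6, n_bins=40)
print(m_hat, detects_gene_flow(m_hat))
```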
4.3.4 Goodness-of-fit test
Brief description of the test. We previously described the use of posterior predic-
tive probabilities to test whether the model estimated by MIMAR provides an adequate
fit to the data (Becquet and Przeworski, 2007). In this study, we used this goodness-
of-fit test on models estimated by MIMAR and IM from a simulated data set (hereafter
referred to as the “observed” data). We first summarized the polymorphism data
at each locus by the four statistics described above (S1, S2, Ss and Sf ). We also
calculated the following statistics: a measure of the differentiation between the two
populations, FST (Hudson et al., 1992), and, in each population, the number of pair-
wise differences, π (Nei and Li, 1979) and Tajima’s D (Tajima, 1989). From those, we
obtained a set {S1+, S2+, Ss+, Sf+, FST, π1, π2, D1, D2}, where each element is the sum
(X+, for the four S statistics) or the mean value (X̄, for FST, π and D) of statistic X
across the 20 loci in the data set. We wanted
to use these statistics to gauge how well the estimated isolation-migration model fits
the data. To do so, we generated 5,000 or more data sets by sampling the parameters
from the posterior distributions estimated by MIMAR (or IM) from the observed data
set. From these simulations, we obtained the distributions of the nine statistics under
the estimated model. These distributions can be used to obtain “posterior predictive”
p-values (Meng, 1994), i.e., to calculate the probability of the “observed statistic” or
a more extreme value under the estimated model.
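Two of the additional statistics, π and Tajima's D, can be computed per population as sketched below using the standard formulas of Tajima (1989). The haplotype-string input and the function names are assumptions made for illustration, and returning 0 when there are no segregating sites is a convention adopted here, since D is undefined in that case.

```python
from math import sqrt

def pairwise_diversity(haplotypes):
    """Average number of pairwise differences (pi) among haplotype strings."""
    n = len(haplotypes)
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += sum(a != b for a, b in zip(haplotypes[i], haplotypes[j]))
    return total / (n * (n - 1) / 2)

def tajimas_d(pi, num_segregating, n):
    """Tajima's D from pi, the number of segregating sites and the sample size n."""
    if num_segregating == 0:
        return 0.0  # convention used here; D is undefined without polymorphism
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    s = num_segregating
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))

# Toy sample of four haplotypes carrying three singletons: D should be negative.
toy = ["100", "010", "001", "000"]
toy_pi = pairwise_diversity(toy)
toy_d = tajimas_d(toy_pi, 3, len(toy))
print(toy_pi, toy_d)
```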
Some of the statistics that we considered are discrete and take on few values, so
that the p-values obtained in this way are not continuous. We therefore calculated
randomized probabilities, PR, as follows: We obtained the probabilities P(X < X∗)
and P(X ≤ X∗), where X∗ is the value of the statistic X in the “observed” data
set, then drew a randomized probability uniformly on (P(X < X∗), P(X ≤ X∗)).
We conservatively considered the model to be a poor fit if the observed value of any
statistic fell in either 2.5% tail of its simulated distribution, i.e., if PR(X < X∗) < 0.025
or PR(X < X∗) > 0.975.
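The randomized probability described above can be sketched as follows, assuming the posterior predictive simulations for one statistic are available as a list of values. The function names are hypothetical, and the random number generator is passed in so that results can be reproduced.

```python
import random

def randomized_probability(simulated, observed, rng=random):
    """Draw PR uniformly between P(X < X*) and P(X <= X*)."""
    n = len(simulated)
    p_less = sum(x < observed for x in simulated) / n
    p_less_equal = sum(x <= observed for x in simulated) / n
    return rng.uniform(p_less, p_less_equal)

def is_poor_fit(pr, tail=0.025):
    """Flag a statistic whose observed value lies in either tail."""
    return pr < tail or pr > 1 - tail

# Toy example: a discrete statistic with many ties at the observed value.
simulated_values = [0, 1, 1, 2, 3]
pr = randomized_probability(simulated_values, 1, rng=random.Random(1))
print(pr, is_poor_fit(pr))
```

With ties, P(X < X∗) and P(X ≤ X∗) differ (0.2 and 0.6 in the toy example), so the draw spreads the probability mass of the tied value over that interval.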
Locus-specific goodness-of-fit test. We took a simple approach to identify loci
with an unusual history. The strategy is the same as described for the goodness-
of-fit of the model, but here, the idea is to apply the test to each locus, i.e., use
the simulated distribution of the nine statistics for a single locus (as opposed to
the full data sets of 20 loci) to look for unusual patterns. To do so, we considered
the randomized probabilities PR(X < X_l) and PR(Y > Y_l), where X_l and Y_l
are the values of the statistics X and Y at locus l, X ∈ {S1, S2, Ss, π1, π2, D1, D2}
and Y ∈ {Sf, FST}. We focused on these p-values to detect potentially interesting
loci because one expects fewer shared and more fixed polymorphisms and larger FST
values at loci involved in the reproductive isolation of two species (e.g., Hey, 1991,
2006b), and fewer polymorphisms and lower π and D values at loci that have been
subject to recent positive selection (Maynard Smith and Haigh, 1974).
A concern is whether such probabilities behave as regular p-values, i.e., are uni-
formly distributed under the null model. To assess this, we performed a locus-specific
goodness-of-fit test at each of 20 loci in 20 data sets simulated under the isolation
and isolation-migration model (Figure 4.1a and b and see above), and obtained the
posterior predictive probabilities of the nine statistics for each locus for both models.
We then assessed whether the posterior predictive p-values for a given statistic are
approximately uniformly distributed (using a Kolmogorov-Smirnov test). We found
that while they are not strictly uniform, the fit is reasonable (results not shown) and
we therefore used these Bayesian posterior predictive p-values as we would regular
p-values, considering a locus to be significant when PR(X) < 0.05. We also investigated
the power to detect loci with an unusual history by combining the statistics.
To do so, we took the product of the randomized probabilities PR(X < X_l) and
PR(Y > Y_l) even though the statistics are not independent (following Voight et al.,
2005).
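The two steps described above, checking uniformity and combining probabilities, can be sketched as follows. The first function computes only the one-sample Kolmogorov-Smirnov distance to Uniform(0, 1), not its significance, and the second simply multiplies the probabilities; both function names are hypothetical.

```python
from math import prod

def ks_statistic_uniform(p_values):
    """Kolmogorov-Smirnov distance between p_values and Uniform(0, 1)."""
    xs = sorted(p_values)
    n = len(xs)
    distance = 0.0
    for i, x in enumerate(xs):
        # Compare the empirical CDF just after and just before x with the
        # uniform CDF, which equals x on [0, 1].
        distance = max(distance, (i + 1) / n - x, x - i / n)
    return distance

def combined_score(probabilities):
    """Product of per-statistic probabilities for one locus."""
    return prod(probabilities)

print(ks_statistic_uniform([0.25, 0.75]))
print(combined_score([0.5, 0.1]))
```

Because the statistics are not independent, the product is best treated as a ranking score across loci rather than as a p-value in its own right.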
4.4 Results
We previously simulated data sets under the isolation-migration model and used
MIMAR to estimate parameters, finding that the method provides reliable estimates
of the parameters, and that, in the absence of recombination, the performance is
comparable to that of IM for medium to large data sets. We further found that
MIMAR has reasonable power to distinguish between a model with and without gene
flow since the split (Tables 1 and 2 of Becquet and Przeworski, 2007). In the present
study, we investigated the reliability of such methods when the data do not conform
to all the assumptions of the model.
4.4.1 Performance of MIMAR and IM under the isolation and
isolation-migration models
We began by investigating the reliability of MIMAR and IM under the simple isolation
(Figure 4.1a) and isolation-migration models (Figure 4.1b). We confirmed our previ-
ous results that MIMAR reliably distinguishes between a model with or without gene
flow, usually estimating m < 4 × 10^−8 for the isolation model and m ≥ 4 × 10^−8 for
the isolation-migration model (Methods and Supplemental Table 4.3 in Appendix B;
see also Table 2 of Becquet and Przeworski, 2007); as expected, IM also distinguishes
well between the two models. Both methods usually provide comparable and reliable
estimates of the parameters of interest. In particular, the estimates of the ancestral
effective population size tend to be fairly precise and are close to unbiased
(Figure 4.2 and Supplemental Table 4.3). In turn, the gene flow rates are
over-estimated slightly more by MIMAR than by IM under the isolation model (result not
shown), but tend to be well estimated under the isolation-migration model (results
not shown).
The least reliable estimate appears to be that of the split time, which is under-
estimated by MIMAR (and to a lesser extent, IM) under the isolation-migration model
(Supplemental Table 4.3 and Figure 4.2; see also Table 2 of Becquet and Przeworski,
2007). Because the split time, ancestral effective population size and gene flow rate
are correlated, estimates of other parameters also tend to be off in cases where T is
poorly estimated.
We performed a goodness-of-fit test of the estimated models and found that in
over 90% of the simulations, the models estimated by MIMAR or IM fit the data. In
the remaining few cases, the models did not fit well for Tajima’s D calculated in
each population.
[Figure 4.2 appears here: panels (a) and (b); x-axis: Method (MIMAR, IM) for m = 0 and
m = 4 × 10^−7; y-axis: estimate of NA (a) or T (b) divided by the true value.]
Figure 4.2: Estimates provided by MIMAR and IM from data simulated under
the isolation (Figure 4.1a) and isolation-migration models (Figure 4.1b).
The y-axis shows the estimate provided by the mode of the marginal posterior
distributions divided by the true value. The results are shown for two parameters: the
ancestral effective population size, NA (a), and the divergence time, T (b), for 20 data
sets simulated under the isolation model (m = 0, black) and the isolation-migration
model (m = 4 × 10^−7, blue).
In summary, under the isolation-migration model, the models estimated by MIMAR
and IM are usually comparable, providing concordant marginal posterior distributions
(results not shown), and a good fit to the data. Both methods detect whether or not
the data were generated in the presence of a constant gene flow rate since the split
and provide close to unbiased estimates of the ancestral effective population size.
However, MIMAR tends to under-estimate the time of divergence when there is some
gene flow since the split.
4.4.2 Effect of violations of the isolation model (Figure 4.1a)
Allopatric divergence from a structured ancestral population
Isolation-migration models estimated by MIMAR or IM from empirical data often
yield an estimate of the ancestral effective population size that is much larger than
that of either descendant population (e.g., Hey et al., 2004; Hey, 2005; Won and
Hey, 2005; Thalmann et al., 2006; Becquet and Przeworski, 2007). Yet, MIMAR and
IM do not tend to over-estimate this parameter when the data are generated under
the isolation-migration model (Figure 4.2a and Supplemental Table 4.3; and see Table
2 of Becquet and Przeworski, 2007), so these results do not simply reflect a bias in
the estimators. Instead, they suggest that an aspect of real data not taken into
account by the models considered by the two methods leads to large estimates of the
ancestral effective population size. We investigated whether such results could reflect
geographical structure in the ancestral population. To do so, we simulated data sets
under a model of isolation from a structured ancestral population (Figure 4.1c), for
a range of parameter values. We then applied MIMAR and IM to these data sets to
estimate the parameters of the isolation-migration model (see Methods).
As shown in Figure 4.3, the estimates provided by MIMAR and IM tend to be less
reliable and precise than when the data are simulated under a simple isolation model
(case T1 = T in Figure 4.3; see also Supplemental Table 4.3). For instance, MIMAR
can yield over-estimates of the ancestral effective population size and large under-
or over-estimates of the divergence time, depending on the parameter values (Figure
4.3). In contrast to MIMAR, IM tends to systematically over-estimate the divergence
time but can either under- or over-estimate the ancestral effective population size. We
note that the marginal posterior distributions of the parameters provided by IM and
MIMAR are often markedly dissimilar in this case (results not shown). Perhaps most
importantly, MIMAR and IM reject the isolation model in all cases (i.e., m ≥ 4 × 10^−8),
even though the only gene flow in the model occurs in the ancestral populations, before
the split that led to the two descendant populations (Table 4.1).
[Figure 4.3 appears here: panels (a) MIMAR and (b) IM show the estimate of NA divided by
the true value; panels (c) MIMAR and (d) IM show the estimate of T divided by the true
value. x-axis: T1 = 3.2N (T1 = T), then 4.3N, 6.4N and 12.8N for m1 = 4 × 10^−7 and for
m1 = 4 × 10^−6.]
Figure 4.3: Estimates provided by MIMAR (a and c) and IM (b and d) from
data simulated under models of isolation from a structured ancestral population
(Figure 4.1c). See legend of Figure 4.2 for details. The results are for NA (a−b) and
T (c−d) under six models of isolation from a structured ancestral population (see
Methods) in which T1 = 1/0.75, 2 or 4·T and m1 = 4 × 10^−7 (blue) or 4 × 10^−6 (red).
For comparison, we show the estimates provided by the methods for a randomly chosen
subset of 10 data sets generated under the correct model, i.e., the isolation model
(Figure 4.1a, in which T1 = T). Note, these are preliminary results as all analyses
were not finished or did not reach convergence in time for this draft.
                                 m1 = 4 × 10^−7            m1 = 4 × 10^−6
T1 in generations              4.3N    6.4N    12.8N     4.3N    6.4N    12.8N
Model c: Ancestral structure
  MIMAR                        1.00    1.00    1.00      1.00    1.00    1.00
  IM                           1.00    1.00    1.00      NA      1.00    NA

                                 m or m2 = 4 × 10^−7       m or m2 = 4 × 10^−6
T2 in generations              0.8N    1.6N    2.4N      0.8N    1.6N    2.4N
Model d: Secondary contact
  MIMAR                        1.00    1.00    1.00      1.00    1.00    1.00
  IM                           1.00    1.00    1.00      1.00    1.00    1.00
Model e: Early gene flow
  MIMAR                        0.80    0.20    0.00      1.00    0.40    0.00
  IM                           0.20    0.00    0.00      0.00    0.00    0.00

Table 4.1: Proportion of analyses in which non-zero gene flow was detected.
Proportion of analyses for which the estimate of the gene flow rate m ≥ 4 × 10^−8 (see
Methods and Becquet and Przeworski, 2007, for details). We show the results provided by
MIMAR and IM on data sets simulated under three models illustrated in Figures 4.1c−e
with each of the six combinations of parameters that we considered (see Methods). Note,
the results for model c) are preliminary as all analyses were not finished or did not
reach convergence in time for this draft and for model d) we show the results of the 49
of 60 IM analyses that reached convergence.
For some parameters, the goodness-of-fit test (notably using FST ) could detect
that the models estimated by MIMAR do not fit data simulated under models of isolation
from a structured ancestral population. But for others (e.g., T1 = T/0.75 ≈ 4.3N
generations and m1 = 4 × 10^−7), our goodness-of-fit test usually did not detect
that the estimated model is invalid (Table 4.1 and Supplemental Tables 4.4−4.5).
In summary, the simulations indicate that the large estimates of the ancestral
effective population size provided by IM and MIMAR on real data could reflect structure
in the ancestral population − even structure unrelated to subsequent divergence.
Worryingly, however, the results from both methods could be interpreted incorrectly
as rejecting a model of allopatric speciation. Overall, MIMAR seems to be slightly more
sensitive to model misspecification than IM (Supplemental Tables 4.4−4.5), which is
perhaps not surprising since MIMAR considers only a subset of the information used
by IM when there is no intra-locus recombination, such as in these cases.
Allopatric divergence followed by secondary contact
In some cases, when allopatric species came back into secondary contact recently,
the data could be incorrectly interpreted as reflecting parapatric speciation. This
motivated us to investigate the effect of isolation followed by secondary contact on
estimates provided by MIMAR and IM. To do so, we simulated sets of 10 data sets
under such a model for several combinations of parameter values (see Methods and
Figure 4.1d).
As shown in Figure 4.4, the estimates provided by the two methods become less
reliable and precise than when the data are simulated under the simple isolation model
(case T2 = 0 in Figure 4.4; see also Supplemental Table 4.3). The two methods
tend to under-estimate the ancestral effective population size; in turn, MIMAR tends to
under-estimate the time of divergence while IM tends to slightly over-estimate this
parameter. We also find that MIMAR estimates substantial gene flow rates, in all cases
rejecting the isolation model; while IM also detects gene flow in all cases, the estimates
tend to be closer to our arbitrary cut-off m = 4 × 10^−8 (results not shown). Moreover,
the goodness-of-fit test could usually not detect that the estimated model is incorrect
(Supplemental Tables 4.6−4.7). In this case, MIMAR seems to be more affected than
IM by the misspecification of the model. The posterior distributions estimated by
MIMAR are often not well resolved: the marginal posterior distribution for the split
time is often bimodal and, for other parameters, the distributions are often flat (data
not shown). Similarly, IM has great trouble reaching convergence in this case (see
Methods). These observations likely reflect the difficulty of fitting a simple isolation-
migration model to data simulated under a more complex model. Also in this case
the marginal posterior distributions provided by IM and MIMAR are often markedly
dissimilar.
4.4.3 Effect of violations of the isolation-migration model (Figure
4.1b): Parapatry with gene flow only at an early stage
Thus far, we considered cases where models of allopatry give rise to parameter esti-
mates that are likely to be interpreted as supporting parapatric speciation. But the
reverse may also occur: In some cases of parapatric speciation, early gene flow may
not be detected by IM or MIMAR and thus the data could be incorrectly interpreted
as reflecting allopatric speciation. In addition, the large estimates of the ancestral
effective population size provided by the methods on real data could also result from
such cases, for example if gene flow rates have decreased since the population
split (e.g., Innan and Watanabe, 2006). To assess this possibility, we investigated
the effect of a change in gene flow rates over time, simulating 10 data sets under a
model of isolation with migration only at an early stage, for several combinations of
parameter values (see Methods and Figure 4.1e).
[Figure 4.4: four panels plotted against T2 (0N, 0.8N, 1.6N, 2.4N) for m2 = 4 × 10−7 and m2 = 4 × 10−6, with T2 = 0 shown for comparison. Panels (a) MIMAR and (b) IM show estimated NA / true NA; panels (c) MIMAR and (d) IM show estimated T / true T.]
Figure 4.4: Estimates provided by MIMAR (a and c) and IM (b and d) from data simulated under models of divergence in isolation followed by secondary contact (Figure 4.1d).
Results for six models of isolation followed by secondary contact are shown, with combinations of T2 = 0.25, 0.50 and 0.75·T and m2 = 4 × 10−7 (blue) or 4 × 10−6 (red) (see legend of Figure 4.3 and Methods for details). For comparison, we show the estimates provided by the methods for a randomly chosen subset of 10 data sets simulated under the correct model, i.e., the isolation model (Figure 4.1a, in which T2 = 0). Note that only 49 of 60 IM analyses reached convergence in this case.
IM (but not MIMAR) tends to slightly over-estimate the ancestral effective popu-
lation size in this case (Supplemental Table 4.9). Since IM does not over-estimate
this parameter under the isolation-migration model, these results are likely due to
the presence of gene flow shortly after the split. Moreover, IM, and to a lesser extent
MIMAR, tend not to detect gene flow in this case (using m ≥ 4 × 10−8 as a criterion).
Here again, the two methods often provide markedly dissimilar marginal
posterior distributions for the parameters of the model, with IM apparently slightly
more affected by the misspecification of the model (Figure 4.5 and Supplemental
Tables 4.8−4.9). This result presumably reflects the fact that the summary statistics considered
by MIMAR are less sensitive to the presence of gene flow only at the early stage of
divergence.
We also performed a goodness-of-fit test and found that in most cases the models
estimated by MIMAR or IM fit the data. So, our simple goodness-of-fit test cannot
detect that the data did not conform to the assumption of a constant gene flow rate
since the split.
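One common way to frame a goodness-of-fit test of this kind is as a posterior predictive check. The sketch below is a minimal illustration and not the procedure used here (which is described in Methods): it assumes that values of one summary statistic have already been simulated under parameter draws from the estimated posterior, and it computes a two-sided posterior predictive p-value for the observed value. The function name and the two-sided convention are assumptions made for the example.

```python
from typing import Sequence


def posterior_predictive_pvalue(observed: float, simulated: Sequence[float]) -> float:
    """Two-sided posterior predictive p-value (illustrative convention).

    `simulated` holds values of one summary statistic computed on data sets
    simulated under parameters drawn from the estimated posterior.
    """
    n = len(simulated)
    if n == 0:
        raise ValueError("need at least one simulated value")
    # Tail probabilities of the observed value within the simulated distribution.
    lower = sum(1 for s in simulated if s <= observed) / n
    upper = sum(1 for s in simulated if s >= observed) / n
    # Double the smaller tail and cap the result at 1.
    return min(1.0, 2.0 * min(lower, upper))


if __name__ == "__main__":
    # Hypothetical simulated values of a statistic (not data from this study).
    sims = [3, 5, 4, 6, 5, 7, 4, 5, 6, 5]
    print(posterior_predictive_pvalue(5, sims))   # typical observed value
    print(posterior_predictive_pvalue(12, sims))  # extreme observed value
```

A check of this form only flags a model when the observed statistic falls in the tails of what the fitted model predicts, which is one way to see why a misspecified model can still pass.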
In summary, our results suggest that the large estimates of the ancestral effective
population size provided by IM on real data could reflect changing gene flow rates since
the split. Moreover, IM tends to not detect gene flow in this case, and so the results
would incorrectly be interpreted as consistent with allopatric speciation. In contrast,
MIMAR does not seem to provide upwardly biased estimates of this parameter (at least
for the models we investigated here) and usually detects some gene flow, despite the
fact that the introgression events are old.
[Figure 4.5: four panels plotted against T2 (0.8N, 1.6N, 2.4N) for m = 4 × 10−7 and m = 4 × 10−6, with the cases T2 = 0 (0N) and T2 = T (3.2N) shown for comparison. Panels (a) MIMAR and (b) IM show estimated NA / true NA; panels (c) MIMAR and (d) IM show estimated T / true T.]
Figure 4.5: Estimates provided by MIMAR (a and c) and IM (b and d) when data are simulated under models of isolation with migration only at an early stage (Figure 4.1e).
The results for six models of isolation with migration at an early stage are shown: T2 = 0.25, 0.50 and 0.75·T, and m = 4 × 10−7 (blue) or 4 × 10−6 (red) (see legend of Figure 4.3 and Methods for details). For comparison, we show the estimates provided by the methods for a randomly chosen subset of 10 data sets generated under the correct models, i.e., the isolation-migration model with m = 4 × 10−7 (Figure 4.1b, in which T2 = 0) and the isolation model with m = 0 (Figure 4.1a, in which T2 = T). The methods tend to under-estimate T, but this is not an effect of the model misspecification since this parameter is also under-estimated when the gene flow is constant after the split.
4.4.4 Detecting loci with unusual history
To date, only a handful of genes involved in the reproductive isolation between pairs
of species have been characterized, through time-consuming molecular approaches.
To investigate whether computational approaches such as MIMAR could be used to
help identify candidate regions for further investigation, we assumed that our set
of independently-evolving loci contains one locus closely linked to a reproductive
isolation factor. To model this, we simulated polymorphism data under an isolation-
migration model. We further assumed that one locus out of the set has either experi-
enced no gene flow since the split (model i), or that it has a lower effective population
size than other loci (model ii; see Methods for details). Applying MIMAR to these data
yields estimates similar to those obtained when the data are simulated under a simple
isolation-migration model (i.e., when all 20 loci have the same history; Supplemental
Figure 4.6 in Appendix C). Moreover, the estimated model fits the data at the 19 loci
with the same history, as assessed by our goodness-of-fit test. Although these results
suggest that the inclusion of one locus with a different history has little effect on the
demographic inference for model ii, there appears to be an effect when one locus did
not experience gene flow (model i; see Supplemental Table 4.10).
Next, we wanted to investigate whether one could detect the locus with an unusual
history. We first performed a goodness-of-fit test on the full data sets and found that
the estimated models fit the full data sets well (i.e., of 20 loci, including the one
with an unusual history). The results would thus be interpreted as consistent with
parapatric speciation. Next, we applied a goodness-of-fit test to each locus separately.
Specifically, for each locus, we obtained the posterior predictive p-values for nine
statistics at this locus given the model estimated by MIMAR (see Methods). As shown
in Table 4.2, using the number of shared or fixed polymorphisms as our statistic (i.e.,
Ss and Sf), a locus with no gene flow since the split had a significant PR value at
the 5% level in approximately two-thirds of the cases; the use of other statistics leads
to lower power. To detect a locus with a reduced effective population size, in turn,
measures of diversity (i.e., S1 and π1) appear to afford the most power.
We also investigated whether the test can be improved by combining statistics
(see Methods). We find that if we combine five statistics (S1, π1, Ss, Sf, and FST),
the power becomes quite high (Supplemental Table 4.11) and the false positive rate
remains low.
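The three rates reported for these locus-specific tests (power, false positive rate and false discovery rate) can be computed from a grid of per-locus p-values. The sketch below is a minimal illustration under assumptions: the function name and the layout of the p-value grid are hypothetical, and the example counts (13 significant unusual loci and 3 significant regular loci across 20 data sets of 20 loci) are chosen only because they round to the values reported for Ss under model i in Table 4.2; the actual counts are not given in the text.

```python
from typing import List, Tuple


def locus_test_summary(
    pvalues: List[List[float]], unusual_index: int, alpha: float = 0.05
) -> Tuple[float, float, float]:
    """Return (power, false positive rate, false discovery rate).

    pvalues[d][l] is the p-value of locus l in data set d; in every data set
    the locus at position `unusual_index` is the one with the unusual history.
    """
    true_pos = false_pos = n_unusual = n_regular = 0
    for dataset in pvalues:
        for locus, p in enumerate(dataset):
            significant = p < alpha
            if locus == unusual_index:
                n_unusual += 1
                true_pos += int(significant)
            else:
                n_regular += 1
                false_pos += int(significant)
    power = true_pos / n_unusual          # unusual loci that are significant
    fpr = false_pos / n_regular           # regular loci that are significant
    n_sig = true_pos + false_pos
    # Significant loci that are not unusual; undefined if nothing is significant.
    fdr = false_pos / n_sig if n_sig else float("nan")
    return power, fpr, fdr


if __name__ == "__main__":
    n_sets, n_loci = 20, 20
    pvals = [[0.5] * n_loci for _ in range(n_sets)]
    for d in range(13):   # hypothetical: unusual locus significant in 13 data sets
        pvals[d][0] = 0.01
    for d in range(3):    # hypothetical: 3 regular loci significant overall
        pvals[d][1] = 0.01
    print(locus_test_summary(pvals, unusual_index=0))
```

Because each data set contains 19 regular loci for one unusual locus, even a low false positive rate translates into a sizeable false discovery rate.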
Thus, this simple approach seems helpful as a first step in detecting candidate
reproductive isolation loci between species pairs. We note that this method is a
simple quantification of the approach used informally in previous papers (Won and
Hey, 2005; Geraldes et al., 2006; Niemiller et al., 2008). However, we do not put
weight on the specific estimates of power or false discovery rates of the tests because
our models of loci with a reduced effective gene flow rate are highly simplified cartoons
of reproductive isolation factors. One would need much more realistic models to gain
an accurate sense of the performance of our approach, but these results suggest that
extensions of this method are worth investigating.
Statistic                                        S1           S2           S1/S2        Ss           Sf
Model                                            i     ii     i     ii     i     ii     i     ii     i     ii
Fraction of unusual loci that are significant    0.00  0.75   0.00  0.00   0.05  0.60   0.65  0.20   0.70  0.00
Fraction of regular loci that are significant    0.02  0.01   0.02  0.02   0.05  0.03   0.01  0.01   0.01  0.01
Fraction of significant loci not unusual         1.00  0.12   1.00  1.00   0.95  0.48   0.19  0.56   0.22  1.00

Statistic                                        FST          π1           π2           D1           D2
Model                                            i     ii     i     ii     i     ii     i     ii     i     ii
Fraction of unusual loci that are significant    0.40  0.20   0.15  0.85   0.05  0.10   0.05  0.05   0.05  0.05
Fraction of regular loci that are significant    0.03  0.07   0.02  0.03   0.04  0.05   0.04  0.06   0.04  0.04
Fraction of significant loci not unusual         0.58  0.88   0.73  0.43   0.94  0.90   0.93  0.96   0.94  0.94

Table 4.2: Results of the locus-specific goodness-of-fit tests.
Results are for 20 data sets of 20 loci when the data at 19 loci were simulated under a simple isolation-migration model (with constant rate m = 4 × 10−7 since the split), while the data at one locus were generated with an unusual history, either under a model with no gene flow since the split (model i), or with 1/10th the effective population size of other loci in population 1 (model ii). We report for each statistic the fraction of unusual loci that are significant (this corresponds to the power of the test), the fraction of other loci that are significant (this corresponds to the false positive rate) and the fraction of significant loci that are not unusual (this corresponds to the false discovery rate). See Methods for details.
4.5 Discussion
We showed previously that MIMAR provides reliable estimates of the parameters of the
isolation-migration model and has reasonable power to distinguish between the isolation and
isolation-migration models (Becquet and Przeworski, 2007). Here, we confirm this
observation for MIMAR and show similar results for IM (see also Hey and Nielsen, 2004).
However, real data are unlikely to conform to the numerous assumptions of these
simple models of divergence. To investigate the robustness of parameter estimates
when the model assumptions are violated, we simulated data for cases with popula-
tion structure in the ancestral population or where gene flow rates change over time,
either preceding or following a period of isolation.
We find that the methods become unreliable, often providing biased estimates of
the parameters of interest, such as the time of divergence and the ancestral effec-
tive population size. In particular, IM tends to over-estimate the ancestral effective
population size when there is gene flow only at an early stage of divergence or geo-
graphic structure in the ancestral population. The reason for this can be understood
as follows: large estimates of the effective population sizes reflect a low rate of genetic
drift or, viewed another way, longer and more variable coalescence times. Under
assumptions of panmixia and constant population size, estimates of the parameter N
thus have a reasonably straightforward interpretation. However, many factors can
lengthen and broaden the distribution of coalescence times, notably geographic structure,
thereby inflating estimates of N. In fact, in these cases, the parameter N may no longer
have a clear meaning.
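To make the effect of structure on coalescence times concrete, the sketch below uses the standard symmetric two-deme island model. This is an assumption chosen for illustration and is not the exact structured ancestral model simulated in this chapter. For two demes of N diploid individuals each, with each lineage migrating to the other deme at rate m per generation, a first-step analysis gives an expected pairwise coalescence time of 4N generations for two lineages sampled from the same deme and 4N + 1/(2m) generations for two lineages sampled from different demes, compared with 2N generations in a single panmictic population of size N. The function and key names are hypothetical.

```python
def expected_coalescence_times(N: float, m: float) -> dict:
    """Expected pairwise coalescence times in generations.

    Symmetric two-deme island model (illustrative assumption):
    N is the diploid effective size of each deme and m is the
    per-lineage, per-generation (backward) migration rate.
    """
    if N <= 0 or m <= 0:
        raise ValueError("N and m must be positive")
    within = 4.0 * N                        # both lineages from the same deme
    between = 4.0 * N + 1.0 / (2.0 * m)     # lineages from different demes
    return {
        "panmictic_size_N": 2.0 * N,        # single unstructured population of size N
        "within_deme": within,
        "between_demes": between,
        # Size of a panmictic population with the same mean between-deme time.
        "apparent_N_between": between / 2.0,
    }


if __name__ == "__main__":
    # Values of N and m in the range used in this chapter's simulations.
    for name, value in expected_coalescence_times(N=625_000, m=4e-7).items():
        print(f"{name}: {value:.4g}")
```

With lower migration rates the between-deme term 1/(2m) grows without bound, so a method that assumes panmixia can only account for the long coalescence times by inferring a large N.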
We note that the applications to real data tend to yield very large estimates
of the ancestral effective population size (e.g., Hey et al., 2004; Hey, 2005; Won
and Hey, 2005; Thalmann et al., 2006; Becquet and Przeworski, 2007), while in our
simulations, we find estimates that are at most 2-fold higher than the truth. Thus,
while the simulations presented here point to features that may be salient in the
interpretation of MIMAR and IM, they seem, somewhat unsurprisingly, to be missing
additional aspects of real data. Alternatively, the parameter values we considered in
this simulation study may simply not have as large an effect on this parameter.
A second problem highlighted by this simulation study is that MIMAR and IM tend
to detect some gene flow, even when the data are simulated under a model of allopatric
speciation (Figures 4.1c−d). Thus, application of these computational methods may
lead researchers to incorrectly conclude that the populations have evolved in
parapatry when in fact speciation initiated in the absence of gene flow. Indeed, recent
reviews have highlighted the evidence of gene flow found by IM as evidence that spe-
ciation often initiates in the presence of gene flow (Hey, 2006b; Niemiller et al., 2008).
The latter interpretation is consistent with models of parapatric or sympatric specia-
tion and the idea that natural selection plays a major role in speciation (Orr, 2005).
While it may well be correct, the present study suggests that the computational evi-
dence per se is unreliable. The problem is that the signal relied on by computational
methods (ours as well as others) is whether a null model of allopatric speciation can
be rejected based on excess variation in divergence time among loci. This excess is
judged relative to the isolation model, i.e., with panmixia in the ancestral population,
among other features. Any departures from the null model that increase the variance
in divergence times can lead to spurious rejections of allopatry.
To try to address this problem, we used a simple goodness-of-fit test to assess
whether the estimated model does not fit the data (i.e., if the gene flow varied over
time). However, our goodness-of-fit test is often unable to detect that assumptions of
the model are violated. Better goodness-of-fit tests can undoubtedly be constructed.
For now, an alternative, when possible (i.e., when there is little intra-locus recombina-
tion in the data), may be to apply both IM and MIMAR to a data set, as our simulations
suggest that the methods provide incongruent estimates of the parameters when the
data violate assumptions of the model, but tend to yield more similar results un-
der the correct model. Using both methods may therefore help gauge whether the
estimates are readily interpretable.
Another approach would be to obtain estimates of the times of introgression
events. IM has recently been extended to allow the estimation of locus-specific gene
flow rates (Won and Hey, 2005), a feature that can potentially be used to identify loci
with restricted effective gene flow rates (i.e., putative reproductive isolation factors)
and to date introgression events (Won and Hey, 2005; Geraldes et al., 2006; Niemiller
et al., 2008). We attempted to use this feature on data where one locus out of 20
did not experience gene flow (model i), but IM did not estimate an unusually low
gene flow rate at the locus with an unusual history for the data sets we investigated
(results not shown). More generally, variation data at a locus contain only so much
information and it remains unclear if the estimates of locus-specific gene flow rates
and their timing are reliable (Becquet and Przeworski, 2007).
In summary, our results show that using computational methods to learn about
speciation mechanisms may be misleading in ways that are not always detectable
by simple goodness-of-fit tests. What are needed are methods that use more of the
polymorphism data than does MIMAR, but that also allow for intra-locus recombination.
Also needed are more realistic models, allowing for varying migration rates over time
and across the genome, as well as demographic complications such as those already
considered by IM.
In cases where the isolation-migration model is found to be appropriate,
computational approaches may help to pinpoint candidate loci closely linked to reproductive
isolation factors. As a first step in this direction, we proposed a goodness-of-fit test
to detect loci with unusual evolutionary histories. The results from simulated data
are encouraging. A positive control of sorts might be to collect polymorphism data
from reproductive isolation factors as well as a series of unlinked regions. By apply-
ing our goodness-of-fit approach to the data, one could then verify that such loci are
indeed detectable by this approach.
4.6 Acknowledgments
We thank G. Coop, J. Wall and Chung-I Wu for helpful discussions and XXX for
comments on the manuscript.
4.7 Appendix A: Supplemental Materials
Command lines to generate simulations under the isolation model and
violations
To generate the data sets under the isolation model, we used a modification of the pro-
gram ms (Hudson, 2002), called MIMARsim (available at http://pps-spud.uchicago.
edu/~cbecquet/download.html, and see documentation therein for details) with the
following command line:
./mimarsim 20 -lf input20loci -u 2e-9 -w .05 -t .005 -ej 2e6 -o
input4IM > input4MIMAR
The flag “-o iminputfilename” is not described in the documentation of MIMARsim,
but produces an output file in the format of the input file required by IM.
Command lines to generate simulations under a model of isolation from a
structured ancestral population (Figure 4.1c in main text). The command
line for the model with T1 = (1/0.75)·T and m1 = 4 × 10−7 was:
./mimarsim 20 -lf input20loci -q 3 -u 2e-9 -w .05 -t .005 -ej 2666700
-M 4e-07 -eh 0.75 -es 0.75 .5 -eM 0.75 0 -o input4IM > input4MIMAR
The flag “-q 3” is not described in the documentation of MIMARsim, but allows the
gene flow rate to be specified using the parameter m instead of M. The command line
for the model with T1 = 4·T and m1 = 4 × 10−6 was:
./mimarsim 20 -lf input20loci -q 3 -u 2e-9 -w .05 -t .005 -ej 8000000
-M 4e-06 -eh 0.25 -es 0.25 .5 -eM 0.25 0 -o input4IM > input4MIMAR
In turn, the command lines for the other models of isolation from a structured
ancestral population were similar, changing the values of the flags “-ej”, “-M”, “-eh”,
“-es” and “-eM”.
Command lines to generate simulations under a model of isolation followed
by secondary contact (Figure 4.1d in main text). The command line for the
model with T2= 0.25·T and m2 = 4× 10−7 was:
./mimarsim 20 -lf input20loci -q 3 -u 2e-9 -w .05 -t .005 -ej 2e6 -M
4e-07 -eM 0.25 0 -o input4IM > input4MIMAR
In turn, the command line for the model with T2= 0.75·T and m2 = 4 × 10−6
was:
./mimarsim 20 -lf input20loci -q 3 -u 2e-9 -w .05 -t .005 -ej 2e6 -M
4e-06 -eM 0.75 0 -o input4IM > input4MIMAR
The command lines for the other models of isolation followed by secondary contact
were similar, changing the values of the flags “-M” and “-eM”.
Command lines to generate simulations under the isolation-migration
model and violations
To generate the data sets under the isolation-migration model, we used the following
command line:
./mimarsim 20 -lf input20loci -q 3 -u 2e-9 -w .05 -t .005 -ej 2e6 -M
4e-07 -o input4IM > input4MIMAR
Command lines to generate simulations under a model of isolation with
migration at an early stage (Figure 4.1e in main text). The command line
for the model with T2= 0.25·T and m2 = 4× 10−7 was:
./mimarsim 20 -lf input20loci -u 2e-9 -w .05 -t .005 -ej 2e6 -eM .25 1
-o input4IM > input4MIMAR
In turn, the command line for the model with T2= 0.75·T and m2 = 4 × 10−6
was:
./mimarsim 20 -lf input20loci -u 2e-9 -w .05 -t .005 -ej 2e6 -eM 0.75
10 -o input4IM > input4MIMAR
The command lines for the other models of isolation with migration at an early
stage were similar, changing the values of the flag “-eM”.
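The numerical values in these command lines can be cross-checked with a little arithmetic. The sketch below assumes, as the values suggest but the text does not state explicitly, that “-t” is the per-site population mutation rate θ = 4Nµ, that “-u” is the per-site mutation rate µ, that “-ej” gives the split time in generations, and that M = 4Nm. The function names are hypothetical.

```python
def effective_size(theta: float, mu: float) -> float:
    """Effective population size N from theta = 4 * N * mu."""
    return theta / (4.0 * mu)


def scaled_migration(N: float, m: float) -> float:
    """Scaled migration rate M = 4 * N * m."""
    return 4.0 * N * m


def time_in_units_of_N(generations: float, N: float) -> float:
    """Express a time given in generations as a multiple of N."""
    return generations / N


if __name__ == "__main__":
    N = effective_size(theta=0.005, mu=2e-9)      # values passed to -t and -u
    T = time_in_units_of_N(2e6, N)                # value passed to -ej
    print(f"N = {N:.0f}")
    print(f"T = {T:.2f} N")
    for m in (4e-7, 4e-6):
        print(f"m = {m:g} -> M = {scaled_migration(N, m):.2f}")
    for fraction in (0.25, 0.50, 0.75):           # times passed to -eM
        print(f"T2 = {fraction:.2f} T = {fraction * T:.2f} N")
```

Under these assumptions N = 625,000 and T = 3.2N, the gene flow rates m = 4 × 10−7 and 4 × 10−6 correspond to M = 1 and M = 10 (the values passed to “-eM” above), and T2 = 0.25, 0.50 and 0.75·T correspond to 0.8N, 1.6N and 2.4N.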
Command lines to generate simulations under an isolation-migration model
with a locus with an unusual history. To generate the data at 19 loci for models
i and ii, we used the following command line:
./mimarsim 19 -lf input19loci -q 3 -u 2e-9 -w .05 -t .005 -ej 2e6 -M
4e-07 -N .005 -n .005 -o input4IM > input4MIMAR
To generate the data at the unusual locus for model i, we used the following
command line (see Methods in main text for details):
./mimarsim 1 -lf input1loci -q 3 -u 2e-9 -w .05 -t .005 -ej 2e6 -N
.005 -n .005 -o input4IM > input4MIMAR
In turn, to generate the data at the unusual locus for model ii, we used the
following command line (see Methods in main text for details):
./mimarsim 1 -lf input1loci -q 3 -u 2e-9 -w .05 -t .0005 -ej 2e6 -M
4e-6 -N .005 -n .005 -o input4IM > input4MIMAR
4.8 Appendix B: Supplemental Tables
Performance of MIMAR and IM under the isolation and
isolation-migration models
                  Isolation         Isolation-migration
                  MIMAR    IM       MIMAR    IM
N1                1.00     1.00     1.00     1.00
N2                1.00     1.00     1.00     1.00
NA                0.95     1.00     0.80     0.80
T                 1.00     1.00     0.50     0.95
m ≥ 4 × 10−8†     0.00     0.00     1.00     1.00
Poor fit          0.15     0.05     0.10     0.05

Table 4.3: Proportion of MIMAR and IM analyses with parameter estimates within two-fold of their true value when the data are simulated under the isolation and isolation-migration models (Figure 4.1a−b in main text).
Results for the estimates of the effective population sizes, N1, N2 and NA, the divergence time, T, and the gene flow rate, m. We report the proportion of 20 analyses in which the estimated model did not fit the data for one of the statistics.
† For m, we report the proportion of 20 analyses in which gene flow was detected, i.e., where the estimate of the gene flow rate m ≥ 4 × 10−8 (see Methods in main text for details).
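The two summaries reported in these supplemental tables can be sketched as follows. This is an illustration under assumptions, not the code used for the analyses: the function names are hypothetical, the bounds of the two-fold criterion are taken to be inclusive, and the example values are made up.

```python
from typing import Sequence


def within_two_fold(estimate: float, truth: float) -> bool:
    """True if the estimate lies between half and twice the (positive) true value."""
    if truth <= 0:
        raise ValueError("the two-fold criterion needs a positive true value")
    return truth / 2.0 <= estimate <= 2.0 * truth


def proportion_within_two_fold(estimates: Sequence[float], truth: float) -> float:
    """Fraction of analyses whose estimate is within two-fold of the truth."""
    return sum(within_two_fold(e, truth) for e in estimates) / len(estimates)


def proportion_detecting_gene_flow(m_estimates: Sequence[float], cutoff: float = 4e-8) -> float:
    """Fraction of analyses in which gene flow is detected, i.e., m >= cutoff."""
    return sum(m >= cutoff for m in m_estimates) / len(m_estimates)


if __name__ == "__main__":
    # Hypothetical estimates, expressed relative to a true value of 1.
    print(proportion_within_two_fold([0.6, 1.0, 1.9, 2.5], truth=1.0))
    # Hypothetical estimates of the gene flow rate m.
    print(proportion_detecting_gene_flow([0.0, 1e-8, 4e-8, 5e-7]))
```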
Effect of violations of the isolation model
Allopatric divergence from a structured ancestral population
                     m1 = 4 × 10−7            m1 = 4 × 10−6
T1 in generations    4.3N    6.4N    12.8N    4.3N    6.4N    12.8N
N1                   1.00    1.00    1.00     n/a     1.00    n/a
N2                   1.00    1.00    1.00     n/a     1.00    n/a
NA                   0.50    0.70    0.10     n/a     0.14    n/a
T                    1.00    0.30    0.20     n/a     0.00    n/a
Poor fit             0.00    0.10    0.30     n/a     0.14    n/a

Table 4.4: Proportion of MIMAR analyses with parameter estimates within two-fold of their true value when data are simulated under models of isolation from a structured ancestral population (Figure 4.1c in main text).
See legend of Supplemental Table 4.3 for details. Note that these are preliminary results, as not all analyses had finished or reached convergence in time for this draft (n/a: not available).
                     m1 = 4 × 10−7            m1 = 4 × 10−6
T1                   4.3N    6.4N    12.8N    4.3N    6.4N    12.8N
N1                   1.00    1.00    1.00     n/a     n/a     n/a
N2                   1.00    1.00    1.00     n/a     n/a     n/a
NA                   0.50    0.57    0.13     n/a     n/a     n/a
T                    1.00    0.29    0.25     n/a     n/a     n/a
Poor fit             0.00    0.14    0.33     n/a     n/a     n/a

Table 4.5: Proportion of IM analyses with parameter estimates within two-fold of their true value when data are simulated under models of isolation from a structured ancestral population (Figure 4.1c in main text).
See legend of Supplemental Table 4.3 for details. Note that these are preliminary results, as not all analyses had finished or reached convergence in time for this draft (n/a: not available).
Allopatric divergence followed by secondary contact
             m2 = 4 × 10−7           m2 = 4 × 10−6
T2           0.8N    1.6N    2.4N    0.8N    1.6N    2.4N
N1           1.00    1.00    1.00    1.00    1.00    1.00
N2           1.00    1.00    1.00    0.70    0.90    1.00
NA           0.60    0.70    0.50    0.60    0.90    0.60
T            0.70    0.30    0.70    0.40    0.10    0.20
Poor fit     0.20    0.10    0.10    0.20    0.30    0.00

Table 4.6: Proportion of MIMAR analyses with parameter estimates within two-fold of their true value when data are simulated under models of isolation followed by secondary contact (Figure 4.1d in main text).
See legend of Supplemental Table 4.3 for details.
             m2 = 4 × 10−7           m2 = 4 × 10−6
T2           0.8N    1.6N    2.4N    0.8N    1.6N    2.4N
N1           1.00    1.00    1.00    1.00    1.00    1.00
N2           1.00    1.00    1.00    1.00    1.00    1.00
NA           0.70    0.60    0.80    0.71    0.80    0.43
T            1.00    1.00    1.00    1.00    1.00    0.86
Poor fit     0.20    0.10    0.10    0.10    0.00    0.00

Table 4.7: Proportion of IM analyses with parameter estimates within two-fold of their true value when data are simulated under models of isolation followed by secondary contact (Figure 4.1d in main text).
See legend of Supplemental Table 4.3 for details. Note that only 49 of 60 analyses reached convergence in this case.
Effect of violations of the isolation-migration model: Parapatry with
gene flow only at an early stage
             m = 4 × 10−7            m = 4 × 10−6
T2           0.8N    1.6N    2.4N    0.8N    1.6N    2.4N
N1           1.00    1.00    1.00    1.00    1.00    1.00
N2           1.00    1.00    1.00    1.00    1.00    1.00
NA           0.90    0.90    1.00    1.00    0.90    1.00
T            0.90    1.00    1.00    0.50    1.00    1.00
Poor fit     0.40    0.00    0.00    0.20    0.00    0.00

Table 4.8: Proportion of MIMAR analyses with parameter estimates within two-fold of their true value when data are simulated under models of isolation with migration only at an early stage (Figure 4.1e in main text).
See legend of Supplemental Table 4.3 for details.
             m = 4 × 10−7            m = 4 × 10−6
T2           0.8N    1.6N    2.4N    0.8N    1.6N    2.4N
N1           1.00    1.00    1.00    1.00    1.00    1.00
N2           1.00    1.00    1.00    1.00    1.00    1.00
NA           0.90    1.00    1.00    1.00    1.00    1.00
T            0.60    1.00    1.00    0.00    0.80    1.00
Poor fit     0.10    0.00    0.00    0.00    0.00    0.00

Table 4.9: Proportion of IM analyses with parameter estimates within two-fold of their true value when data are simulated under models of isolation with migration only at an early stage (Figure 4.1e in main text).
See legend of Supplemental Table 4.3 for details.
Detecting loci with unusual history
                  Isolation-migration    Model i        Model ii
N1                1.00                   1.00           1.00
N2                1.00                   1.00           1.00
NA                0.80                   0.75           0.80
T                 0.50                   0.75           0.50
m ≥ 4 × 10−8†     1.00                   1.00           1.00
Poor fit          0.10                   0.05 (0.05)    0.10 (0.25)

Table 4.10: Proportion of MIMAR analyses with parameter estimates within two-fold of their true value when data at a locus with an unusual history are simulated under models i and ii.
Results are for 20 data sets with the data at 19 loci simulated under an isolation-migration model with a constant gene flow rate m = 4 × 10−7 since the split and the data at another locus simulated without gene flow since the split (model i) or with a lower effective population size than the other loci (model ii). For comparison, we also show the results for 20 data sets simulated under a simple isolation-migration model (i.e., when the 20 loci have the same history). For the poor fit, we report the proportion of 20 analyses for which the estimated model did not fit the data for one of the statistics, at the 20 (isolation-migration model) or 19 loci (models i and ii) with a similar history and, in parentheses, for the data including the locus with a different history. See legend of Supplemental Table 4.3 for details.
Combined statistics                              Ss, Sf        Ss, FST       S1, π1        S1, π1, Ss, Sf, FST
Model                                            i      ii     i      ii     i      ii     i      ii
Fraction of unusual loci that are significant    0.75   0.15   0.80   0.40   0.15   0.90   0.75   0.90
Fraction of regular loci that are significant    0.01   0.02   0.02   0.04   0.03   0.02   0.02   0.02
Fraction of significant loci not unusual         0.25   0.73   0.36   0.68   0.80   0.31   0.29   0.28

Table 4.11: Results of the best locus-specific goodness-of-fit tests.
See legend of Table 4.2 and Methods in main text for details.
4.9 Appendix C: Supplemental Figure
[Figure 4.6: two panels comparing the isolation-migration model, model i and model ii. Panel (a) shows estimated NA / true NA; panel (b) shows estimated T / true T.]
Figure 4.6: Estimates provided by MIMAR when the data at one locus with a different history are simulated with models i (blue) and ii (red).
See legend of Figure 4.3 in main text for details. For comparison, we show the results for 20 data sets simulated under a simple isolation-migration model (black).
CHAPTER 5
ESTIMATING THE DEMOGRAPHIC PARAMETERS OF
HUMAN POPULATIONS
5.1 Abstract
In recent years, there has been an extensive effort to gather large data sets of human
genetic diversity. The largest have so far consisted of microsatellite and SNP
genotyping data in human samples from 52 populations. While these data yielded deep
knowledge about the diversity and structure of human populations, they suffered from
various drawbacks that made them less than ideal for learning about human demographic
history. Recently, a data set was generated specifically to study human
demographic history, consisting of resequencing data at 40 unlinked autosomal
and X-linked loci sampled in 90 individuals from six geographically distant human
populations. In this study, we apply our method MIMAR to the autosomal data to
estimate demographic parameters for these six populations. MIMAR provides results
consistent with previous knowledge from other sources: We estimate that the split
between African and non-African populations occurred about 40−70 Kya, while the
Tsumkwe San and two other African populations split 60−80 Kya. Although the
six populations are found far apart from one another today, MIMAR detects extensive
evidence of migration between them, suggesting that it may be problematic to ignore
gene flow in models of human demographic history.
5.2 Introduction
There are several data sets that extensively catalogue human genetic diversity.
For instance, 993 autosomal microsatellite markers were genotyped in 1056 individ-
uals from 52 populations by the HGDP-CEPH (Cann et al., 2002; Rosenberg et al.,
2002, 2005). This data set yielded important insight about the genetic structure and
the extent of differentiation between human populations. The International HapMap
Consortium, in turn, has genotyped 4 million SNPs in a panel of 270 individuals from
four populations (Consortium, 2005; Frazer et al., 2007). The knowledge accumu-
lated from such data sets has led to powerful selection scans and association studies
(e.g., Voight et al., 2006). However, such studies have some limitations with respect
to answering questions related to human demographic history. For instance, data
sets such as HapMap suffer from an ascertainment bias, which is largely unknown.
In addition, some of the sampled populations are admixed (e.g., Akey et al., 2004;
Consortium, 2005), which can affect the estimates of demographic parameters such
as population growth rates (Ptak and Przeworski, 2002) and therefore complicates
any demographic inferences from such data sets.
Wall et al. (2008) recently published a large resequencing data set from six human
populations. The data were gathered specifically in order to learn about human
demography: The populations were chosen to avoid cryptic population structure and
the 20 autosomal and 20 X-linked loci were chosen to be intergenic and far from
genes in an attempt to avoid the confounding effect of natural selection (e.g., Voight
et al., 2005; Barreiro et al., 2008). In this study, we apply MIMAR to the data from
the autosomal loci to estimate human demographic parameters.
5.3 Methods
5.3.1 Raw data description
The data set consists of 20 autosomal regions of about 20 kb located at least 100
kb away from genic regions. Three discrete segments (locus trio) of ∼4−6 kb were
sequenced to span most of the distance of each 20 kb region (Figure 5.1). These
regions were chosen in areas of the human genome with medium or high recombination,
making MIMAR a better method to apply to these data (since, unlike IM, it allows for
intra-locus recombination; Hey and Nielsen, 2004). The 20 locus trios were
resequenced in three Sub-Saharan African populations (17 Biaka Pygmies, 18 Man-
denka and 15 Tsumkwe San), a European population (16 French Basque), an Asian
population (16 Han Chinese), and an Oceanian population (16 Nan Melanesian).
[Figure 5.1: cartoon of a region of ~20 kb containing three loci of 4−6 kb each.]
Figure 5.1: Raw data description.
The cartoon shows a region of about 20 kb and the locus trio (black boxes) resequenced for this region.
5.3.2 Data processing
Although MIMAR allows for intra-locus recombination, it assumes that the loci are
independent, and cannot easily incorporate linkage disequilibrium (LD) information
among segments. Thus, without modifying the program, we could not use the data
for all three loci of each region. Instead, we selected one or two loci per region and
maximized the information as follows: We tested for significant pairwise LD between
the loci within a region using DnaSP (Rozas et al., 2003) in the Nan Melanesian,
known to have the highest LD levels of the 52 populations considered in Conrad et al.
(2006), and French Basque, for which there were more data (i.e., more chromosomes).
We considered the LD between all pairs of polymorphic sites in the two most distant
loci: If fewer than 5% of the comparisons in the Nan Melanesian and fewer than 10%
of the comparisons in the French Basque showed significant LD at the 5% level (true
for 6 regions), the two most distant loci were assumed to be effectively independent,
so we retained both. The analyses below are for 26 loci: the two most distant loci
for six regions, and the longest locus for the remaining 14 regions. The samples from
related individuals as well as samples with missing data on the full length of a locus
were removed from the data set (Tables 5.4−5.8). The remaining missing data were
inferred with fastPHASE, using known haplotypes, incorporating the subpopulation
labels, turning off the sampling of haplotypes and using the default values for other
parameters (Scheet and Stephens, 2006). On those augmented data, we calculated
FST, π (Nei and Li, 1979) and Tajima’s D (Tajima, 1989) for each population,
as well as the four summaries of polymorphism required by MIMAR: the polymorphisms
specific to the first sample (S1), specific to the second sample (S2), shared between
the two samples (S3) and fixed between the two samples (S4) (Becquet and Przeworski, 2007).
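The four summaries can be illustrated with a minimal sketch. It assumes aligned haplotypes of equal length, no missing data and at most two alleles per site; the function name `mimar_summaries` and the toy sequences are hypothetical and are not taken from MIMAR itself.

```python
from typing import List, Tuple


def mimar_summaries(sample1: List[str], sample2: List[str]) -> Tuple[int, int, int, int]:
    """Count (S1, S2, S3, S4) at one locus from two samples of aligned haplotypes.

    S1: sites polymorphic only in sample 1
    S2: sites polymorphic only in sample 2
    S3: sites polymorphic in both samples (shared)
    S4: sites fixed for different alleles in the two samples
    Assumes biallelic sites and no missing data (illustrative assumption).
    """
    if not sample1 or not sample2:
        raise ValueError("both samples must contain at least one haplotype")
    length = len(sample1[0])
    if any(len(h) != length for h in sample1 + sample2):
        raise ValueError("all haplotypes must have the same length")

    s1 = s2 = s3 = s4 = 0
    for site in range(length):
        alleles1 = {h[site] for h in sample1}
        alleles2 = {h[site] for h in sample2}
        poly1 = len(alleles1) > 1
        poly2 = len(alleles2) > 1
        if poly1 and poly2:
            s3 += 1          # segregating in both samples
        elif poly1:
            s1 += 1          # segregating in sample 1 only
        elif poly2:
            s2 += 1          # segregating in sample 2 only
        elif alleles1 != alleles2:
            s4 += 1          # fixed difference between the samples
    return s1, s2, s3, s4


if __name__ == "__main__":
    pop1 = ["AAGTC", "ACGAC"]  # toy data, not from the study
    pop2 = ["ACGTG", "ACTAG"]
    print(mimar_summaries(pop1, pop2))
```

In the toy data, each of the last four sites falls into a different category, so each summary is incremented once.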
5.3.3 Analyses
We applied MIMAR to the data sets to estimate the parameters of the isolation-
migration models for the 15 pairs of human populations. MIMAR provides estimates
of the population mutation rates of the ancestral and descendant populations (e.g., for
population one, θ1 = 4N1µ, where N1 is the effective population size of population one and µ is
the per base pair per generation mutation rate), the split time in generations, T,
and the symmetrical number of migrants exchanged each generation by the populations,
M = 4N1m, where m is the fraction of generational migrants (for details,
see Becquet and Przeworski, 2007). Here, we report the estimates of the effective
population size, the split time in years (assuming 25 years per generation, e.g., Voight
et al., 2005, and µ = 2 × 10^−8, Nachman and Crowell, 2000) and the gene flow
rate parameter m. We performed a goodness-of-fit test that rejected the null
hypothesis of homogeneous mutation rate across the loci (Becquet and Przeworski,
2007). So, for each locus, we fixed the scalar for mutation rate variation to the
ratio of observed divided by expected divergence. We obtained the average recom-
bination rates estimated by Myers et al. (2005) for one or more segments of the
genome that included a specific locus. In the analyses with MIMAR, we fixed the re-
combination scalar for the locus to this value (see MIMAR documentation available
at http://pps-spud.uchicago.edu/~cbecquet/MIMARdoc.pdf). We ran MIMAR for
5 × 10^6 burn-in steps and 9.5 × 10^7 recorded steps, with the variances for the kernel
distributions set to one 50th of the ranges of the prior distributions. The prior
distributions were uniform and their widths correspond to the x-axes of Supplemental
Figures 5.3−5.17 (Appendix A).
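The rescaling of the MIMAR parameters into the reported quantities follows directly from the definitions above (θ1 = 4N1µ, M = 4N1m, 25 years per generation). The sketch below is only an illustration of that arithmetic: the function names and the input values are hypothetical, and only µ and the generation time come from the text.

```python
MU = 2e-8            # per base pair, per generation mutation rate assumed in the text
YEARS_PER_GEN = 25   # generation time assumed in the text


def effective_size(theta_per_bp: float, mu: float = MU) -> float:
    """Effective population size N from theta = 4 * N * mu (theta per base pair)."""
    return theta_per_bp / (4.0 * mu)


def split_time_years(t_generations: float, years_per_gen: float = YEARS_PER_GEN) -> float:
    """Split time in years from a split time expressed in generations."""
    return t_generations * years_per_gen


def migrant_fraction(big_m: float, n1: float) -> float:
    """Fraction of generational migrants m from M = 4 * N1 * m."""
    return big_m / (4.0 * n1)


if __name__ == "__main__":
    # Illustrative inputs only (not estimates from this study).
    n1 = effective_size(8e-4)
    print(round(n1), split_time_years(2800), migrant_fraction(8.0, n1))
```

With these illustrative inputs, θ1 = 8 × 10^−4 per base pair corresponds to an effective size of about 10,000, a split 2,800 generations ago to 70,000 years, and M = 8 to m of about 2 × 10^−4.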
5.3.4 Goodness-of-fit test
To examine whether the isolation-migration model is an appropriate description of
the history of the human populations, we simulated 10,000 data sets for parameters
sampled from the posterior distributions estimated by MIMAR, and compared the
simulated data to the actual data (for details, see Becquet and Przeworski, 2007).
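One simple way to summarize such a comparison is a tail probability of the observed statistic within the simulated distribution. The sketch below is an assumption about how the comparison could be scored, not a description of the procedure in Becquet and Przeworski (2007); the function name and the two-sided convention are hypothetical choices.

```python
from typing import Sequence


def predictive_tail_probability(simulated: Sequence[float], observed: float) -> float:
    """Two-sided tail probability of an observed statistic among simulated values.

    Values near 0 indicate that the observed statistic lies in the tails of the
    distribution simulated under the estimated model, i.e., a poor fit.
    """
    n = len(simulated)
    if n == 0:
        raise ValueError("at least one simulated value is required")
    lower = sum(1 for x in simulated if x <= observed) / n
    upper = sum(1 for x in simulated if x >= observed) / n
    return min(1.0, 2.0 * min(lower, upper))


if __name__ == "__main__":
    sims = [float(i) for i in range(1, 11)]  # toy "simulated" statistics
    print(predictive_tail_probability(sims, 5.5))
    print(predictive_tail_probability(sims, 100.0))
```

In practice, the simulated values would be, for example, the mean FST over the 26 loci in each of the 10,000 simulated data sets, and the observed value the same statistic in the actual data.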
Encouragingly, in most cases, the isolation-migration model appears to provide
a reasonable fit to the sum over the 26 loci of the four statistics used by MIMAR,
as well as to the mean over the 26 loci of FST, and of π (Nei and Li, 1979) and
Tajima’s D (Tajima, 1989) calculated for both populations in a data set (results not
shown). However, there are some exceptions: For the analyses of Nan Melanesian
and Mandenka (Supplemental Figure 5.18a), French Basque and Nan Melanesian
(Supplemental Figure 5.18c) and Han Chinese and Nan Melanesian (Supplemental
Figure 5.18d), the estimated model leads to lower FST than in the data. These
results suggest that in some of these cases (e.g., for the non-African populations),
an isolation-by-distance model may describe the data better. Alternatively, since
these three cases involve the Nan Melanesians, these results may reflect the small amount
of data available for this population: only 18 chromosomes were used in the analyses
(as opposed to 28 or more in the other samples, Tables 5.4−5.8), resulting in few
Melanesian-specific polymorphisms (since rare alleles are less likely to be observed).
This tends to inflate the ratio of shared to unique polymorphisms and may bias
parameter estimates towards models leading to low FST values. For the analyses of
French Basque and Han Chinese (Supplemental Figure 5.18b) and French Basque and
Nan Melanesian (Supplemental Figure 5.18c), the estimated model tends to provide
lower Tajima’s D values than observed for the French Basque. This could be due to
the demographic history of the French Basque, if this population experienced strong
bottlenecks in the recent past (e.g., Alonso et al., 1998).
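For reference, the Tajima’s D values compared above follow the standard definition of Tajima (1989), which contrasts the mean number of pairwise differences with the number of segregating sites. The sketch below assumes aligned haplotypes without missing data; the function name and toy sequences are hypothetical, and this is not the code used in the analyses.

```python
from itertools import combinations
from math import sqrt
from typing import List


def tajimas_d(haplotypes: List[str]) -> float:
    """Tajima's D for one sample of aligned haplotypes (no missing data assumed)."""
    n = len(haplotypes)
    if n < 4:
        raise ValueError("at least four haplotypes are required")
    length = len(haplotypes[0])
    if any(len(h) != length for h in haplotypes):
        raise ValueError("all haplotypes must have the same length")

    # Number of segregating sites.
    seg = sum(1 for i in range(length) if len({h[i] for h in haplotypes}) > 1)
    if seg == 0:
        raise ValueError("D is undefined without segregating sites")

    # Mean number of pairwise differences.
    pairs = list(combinations(haplotypes, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)

    # Constants from Tajima (1989).
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    return (pi - seg / a1) / sqrt(e1 * seg + e2 * seg * (seg - 1))


if __name__ == "__main__":
    print(tajimas_d(["1000", "0100", "0010", "0001"]))  # only singletons
    print(tajimas_d(["1100", "1100", "0011", "0011"]))  # intermediate frequencies
```

An excess of rare variants (first toy sample) gives a negative D, whereas an excess of intermediate-frequency variants (second toy sample), as expected shortly after a bottleneck, gives a positive D.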
5.4 Results
The smoothed marginal posterior distributions estimated by MIMAR for pairs of
populations and the estimates of the joint posterior distributions for combinations of gene
flow rate, split time in years and ancestral population size are shown in Supplemental
Figures 5.3−5.17. The mode and the central 95% interval of the marginal posterior
distributions for the parameters of the isolation-migration model are reported in
Tables 5.1−5.3. The ranges of joint parameters with the highest probability density
are reported in Tables 5.4−5.8. For some pairs of populations, the marginal posterior
distributions were flat, suggesting that the data were not sufficiently informative
(e.g., Supplemental Figure 5.3a). The estimates of the parameters from unconvincing
posterior distributions are reported in italics in the tables and should be taken with
caution.
5.4.1 Split between African populations
The Tsumkwe San are estimated to have split from the common ancestor of the
two other African populations ∼70−80 thousand years ago (Kya). In contrast, the
Biaka Pygmies and Mandenka are estimated to have split more recently, ∼40 Kya (see
Table 5.3). We find extensive evidence of gene flow between the three pairs of African
populations, of the order of 10−20 migrants each generation. However, the estimates
of this parameter may not be reliable due to the wide ranges of values that fit the data
(see Supplemental Figures 5.3−5.5). The ancestral effective population size estimates
for these populations are ∼15,500 and are consistent across the analyses (Table 5.2).
In turn, the estimates of the effective population sizes vary with the analysis.
When estimated in analyses with other African populations, the estimates of the
effective population sizes were >13,000 for the Biaka Pygmies, ∼10,000 for Mandenka
and ∼20,000 for Tsumkwe San (Table 5.1). In contrast, when estimated in
analyses with non-African populations, these estimates seem less precise and become
>13,000, >14,000 and >11,000, respectively (see Table 5.1 and Supplemental Figures
5.6−5.14). MIMAR may be less reliable for the data of African and non-African
population pairs because it ignores the very different demographic histories of these
two groups (e.g., growth or constant effective population size in Africa vs. bottleneck
and recovery after the out-of-Africa migration; Schaffner et al., 2005). In any case,
these estimates between 10,000 and 20,000 are consistent with previous reports (e.g.,
Maynard Smith and Haigh, 1974; Schaffner et al., 2005).
Table 5.1: Estimates of the descendant effective population sizes. The modes (a) and 95% confidence intervals (b) of the estimated marginal posterior distributions are indicated. The effective population sizes estimated for the populations in rows are shown for the analyses with each of the five other populations in columns. The numbers in italics show estimates that are unreliable because, e.g., of flat posterior distributions (see Supplemental Figures 5.3−5.17). The diagonal in bold shows the means over five analyses of the effective population sizes of the six human populations.

Population      Value      Biaka       Mandenka    Tsumkwe     French      Han         Nan
                           Pygmies                 San         Basque      Chinese     Melanesian
Biaka Pygmies   mode (a)   32,400      49,100      15,700      35,400      25,000      36,700
                lower (b)  13,500      13,400      11,200      12,800      14,300      15,900
                upper (b)  47,300      48,400      43,600      48,200      48,000      48,300
Mandenka        mode (a)   10,100      20,000      10,200      17,600      30,500      31,500
                lower (b)  7,920       12,000      7,410       13,700      14,500      16,300
                upper (b)  36,600      40,700      22,000      48,200      48,200      48,400
Tsumkwe San     mode (a)   16,300      22,500      19,500      20,000      20,100      18,800
                lower (b)  11,800      12,600      12,000      11,400      12,100      12,300
                upper (b)  47,500      47,600      47,400      47,700      47,200      47,200
French Basque   mode (a)   2,730       3,720       2,980       3,570       3,720       4,710
                lower (b)  1,860       2,680       2,060       2,520       2,590       3,390
                upper (b)  4,440       6,100       4,770       19,900      39,300      45,100
Han Chinese     mode (a)   4,710       5,700       4,220       7,680       12,700      41,000
                lower (b)  3,360       4,280       3,130       5,810       4,620       6,490
                upper (b)  7,170       9,660       6,530       47,200      23,700      48,000
Nan Melanesian  mode (a)   3,910       4,220       3,290       6,200       4,810       4,480
                lower (b)  2,410       2,990       2,220       5,380       3,930       3,380
                upper (b)  6,100       7,720       5,450       47,400      47,000      22,700
Table 5.2: Estimates of the ancestral effective population sizes. See legend of Table 5.1 for details.

Population      Value      Biaka       Mandenka    Tsumkwe     French      Han
                           Pygmies                 San         Basque      Chinese
Mandenka        mode       15,700      −
                lower      12,100
                upper      19,400
Tsumkwe San     mode       15,700      15,600      −
                lower      11,100      8,240
                upper      21,000      20,800
French Basque   mode       14,600      10,700      17,100      −
                lower      10,100      5,710       11,100
                upper      23,000      17,400      23,200
Han Chinese     mode       14,100      12,600      15,600      10,200      −
                lower      7,870       4,820       10,500      6,490
                upper      19,500      16,000      21,700      14,600
Nan Melanesian  mode       13,800      10,700      16,300      9,670       10,200
                lower      7,800       4,800       10,400      6,360       8,320
                upper      19,200      15,600      22,300      12,200      12,400
Table 5.3: Estimates of the split time (lower half) and gene flow rate (upper half) for each population pair. See legend of Table 5.1 for details.

Population      Value   Biaka        Mandenka     Tsumkwe      French       Han          Nan
                        Pygmies                   San          Basque       Chinese      Melanesian
Biaka Pygmies   mode    −            1.91×10^−4   8.14×10^−5   2.30×10^−4   1.64×10^−4   1.62×10^−4
                lower                1.21×10^−7   1.68×10^−7   7.25×10^−6   1.26×10^−6   3.10×10^−6
                upper                2.21×10^−4   2.18×10^−4   3.19×10^−4   2.49×10^−4   2.68×10^−4
Mandenka        mode    40,700       −            2.13×10^−4   2.95×10^−4   2.43×10^−4   2.69×10^−4
                lower   30,300                    4.35×10^−7   4.78×10^−6   1.01×10^−6   4.46×10^−6
                upper   387,000                   3.84×10^−4   4.05×10^−4   3.79×10^−4   4.64×10^−4
Tsumkwe San     mode    78,200       67,600       −            1.39×10^−4   7.12×10^−5   7.48×10^−5
                lower   57,300       52,900                    2.08×10^−6   1.20×10^−6   1.82×10^−6
                upper   392,000      469,000                   1.90×10^−4   1.78×10^−4   1.72×10^−4
French Basque   mode    42,600       47,600       68,000       −            9.91×10^−4   7.20×10^−4
                lower   39,800       46,400       51,200                    3.31×10^−7   1.86×10^−7
                upper   482,000      480,000      476,000                   1.40×10^−3   1.06×10^−3
Han Chinese     mode    67,600       62,600       72,600       12,600       −            1.03×10^−4
                lower   54,100       51,400       59,600       8,350                     1.22×10^−7
                upper   466,000      460,000      469,000      454,000                   4.19×10^−4
Nan Melanesian  mode    65,700       57,600       78,200       17,600       10,600       −
                lower   53,500       49,800       57,200       8,470        2,960
                upper   475,000      478,000      473,000      429,000      203,000
Table 5.4: Estimates from joint posterior distributions for the African populations. The ranges of the estimates of m, T* and NA with the highest probability density in the joint posterior distributions estimated by MIMAR are reported. The numbers in italic should be taken with caution (see text). (a) n1 and n2 are the number of chromosomes in the first and second population of the analysis, respectively (the sample size varies because of missing data, see Methods for details). (b) m corresponds to the fraction of a population that is replaced by migrants from the other population each generation. (c) T* is the estimate of the time since the populations split in years. (d) NA is the estimate of the ancestral effective population size.

Analysis                        n1 (a)  n2      Joint   m (b)                      T* (c)            NA (d)
Biaka Pygmies × Mandenka        28−30   30      m×T*    2.00×10^−7 − 3.20×10^−7    25,000−50,000
                                                m×NA    1.30×10^−4 − 2.00×10^−4                      13,000−15,000
                                                T×NA                               25,000−50,000     15,000−18,000
Biaka Pygmies × Tsumkwe San     28−30   18−28   m×T*    3.80×10^−5 − 6.00×10^−5    75,000−100,000
                                                m×NA    9.70×10^−5 − 1.60×10^−4                      13,000−15,000
                                                T×NA                               75,000−100,000    15,000−18,000
Mandenka × Tsumkwe San          30      18−28   m×T*    6.20×10^−7 − 1.00×10^−6    50,000−75,000
                                                m×NA    2.00×10^−4 − 3.30×10^−4                      10,000−13,000
                                                T×NA                               50,000−75,000     15,000−18,000
5.4.2 Split between African and non-African populations
The estimates of the split times of Han Chinese and Nan Melanesian with the Biaka
Pygmies and Mandenka range from 60−70 Kya; in turn, the split times with the
Tsumkwe San range from 70−80 Kya (Supplemental Figures 5.6−5.14). In contrast,
the split times estimated between the French Basque and Biaka Pygmies and Man-
denka range from 40−50 Kya; with the Tsumkwe San, they are ∼70 Kya (Table 5.3).
These results likely reflect different migratory routes leading to the Asian and Euro-
pean populations and are consistent with the results of the previous section, namely
the finding that Tsumkwe San split first from the rest of Africa ∼70 Kya.
The estimates of the ancestral effective population size of African and non-African
populations are generally consistent and range from 8,000−18,000 (Table 5.2). We
also note that there is some evidence of gene flow between all pairs of populations,
of the order of 2−10 migrants per generation between Biaka Pygmies and Mandenka
and the non-African populations and ∼1 migrant per generation between Tsumkwe
San and non-African populations (Table 5.3). As mentioned above, this parameter is
difficult to estimate and it is not clear whether these estimates reflect ongoing gene
flow, range expansion, or more recent migratory events.
Table 5.5: Estimates from joint posterior distributions between the Biaka Pygmies and non-African populations. See legend of Table 5.4 for details.

Analysis                        n1      n2      Joint   m                          T*                NA
French Basque × Biaka Pygmies   32      28−30   m×T*    2.40×10^−4 − 3.50×10^−4    480,000−500,000
                                                m×NA    1.60×10^−4 − 2.40×10^−4                      13,000−15,000
                                                T×NA                               25,000−50,000     15,000−18,000
Han Chinese × Biaka Pygmies     32      28−30   m×T*    4.00×10^−5 − 6.00×10^−5    75,000−100,000
                                                m×NA    1.30×10^−4 − 2.00×10^−4                      10,000−13,000
                                                T×NA                               50,000−75,000     15,000−18,000
Nan Melanesian × Biaka Pygmies  18      28−30   m×T*    2.30×10^−4 − 3.50×10^−4    450,000−480,000
                                                m×NA    1.50×10^−4 − 2.30×10^−4                      10,000−13,000
                                                T×NA                               50,000−75,000     15,000−18,000
Table 5.6: Estimates from joint posterior distributions between the Mandenka and non-African populations. See legend of Table 5.4 for details.

Analysis                        n1      n2      Joint   m                          T*                NA
French Basque × Mandenka        32      30      m×T*    3.90×10^−4 − 5.90×10^−4    450,000−480,000
                                                m×NA    2.60×10^−4 − 3.90×10^−4                      7,900−10,000
                                                T×NA                               50,000−75,000     13,000−15,000
Han Chinese × Mandenka          32      30      m×T*    6.50×10^−5 − 1.00×10^−4    75,000−100,000
                                                m×NA    1.60×10^−4 − 2.50×10^−4                      7,900−10,000
                                                T×NA                               50,000−75,000     13,000−15,000
Nan Melanesian × Mandenka       18      30      m×T*    3.30×10^−4 − 5.20×10^−4    480,000−500,000
                                                m×NA    2.20×10^−4 − 3.30×10^−4                      7,900−10,000
                                                T×NA                               50,000−75,000     13,000−15,000
Table 5.7: Estimates from joint posterior distributions between the Tsumkwe San and non-African populations. See legend of Table 5.4 for details.

Analysis                        n1      n2      Joint   m                          T*                NA
French Basque × Tsumkwe San     32      18−28   m×T*    1.70×10^−4 − 2.50×10^−4    480,000−500,000
                                                m×NA    1.20×10^−4 − 1.70×10^−4                      13,000−15,000
                                                T×NA                               50,000−75,000     15,000−18,000
Han Chinese × Tsumkwe San       32      18−28   m×T*    3.70×10^−5 − 5.40×10^−5    100,000−130,000
                                                m×NA    8.00×10^−5 − 1.20×10^−4                      13,000−15,000
                                                T×NA                               75,000−100,000    15,000−18,000
Nan Melanesian × Tsumkwe San    18      18−28   m×T*    1.10×10^−4 − 1.60×10^−4    450,000−480,000
                                                m×NA    7.50×10^−5 − 1.10×10^−4                      13,000−15,000
                                                T×NA                               75,000−100,000    15,000−18,000
5.4.3 Split between non-African populations
French Basque are estimated to have split from the ancestral population of Han
Chinese and Nan Melanesian 13−18 Kya, and Han Chinese and Nan Melanesian
to have split ∼10 Kya (Table 5.3). The estimates of the ancestral population sizes
are ∼10,000 for the three non-African population pairs (Table 5.2). However, these
estimates may be misleading because, although there is evidence of gene flow in the
three analyses of the order of 5−15 migrants per generation, these estimates are
highly unreliable and T, NA and m tend to be correlated (Supplemental Figures
5.15−5.17). Moreover, in those three cases, the estimated models do not fit the data
well (see Methods and Supplemental Figures 5.18b−d).
The estimates of the effective population size for Han Chinese range from 4,200 to 7,700.
These estimates are generally consistent across analyses, with the exception of the
analysis with Nan Melanesian, for which this parameter is not well estimated (Supplemental
Figure 5.17a), perhaps because the different bottlenecks that are known to
have affected these populations are not included in the model or because a model of
isolation-by-distance is more appropriate in these cases (e.g., see Methods and Supplemental
Figure 5.18). The estimates of the effective population size for French Basque
and Nan Melanesian range 2,700−4,700 and 2,000−6,200, respectively, but these estimates
are usually smaller in the analyses with African populations than with the
non-African populations (Table 5.1), which may again reflect the demographic histories
of these populations. We note that these estimates are generally consistent with
those found in other studies (e.g., Schaffner et al., 2005).
Table 5.8: Estimates from joint posterior distributions for the non-African populations. See legend of Table 5.4 for details.

Analysis                        n1      n2      Joint   m                          T*                NA
French Basque × Han Chinese     32      32      m×T*    4.50×10^−5 − 7.80×10^−5    50−25,000
                                                m×NA    7.50×10^−4 − 1.30×10^−3                      7,900−10,000
                                                T×NA                               50−25,000         7,900−10,000
French Basque × Nan Melanesian  32      18      m×T*    4.90×10^−5 − 8.60×10^−5    50−25,000
                                                m×NA    4.90×10^−5 − 8.60×10^−5                      7,900−10,000
                                                T×NA                               50−25,000         7,900−10,000
Han Chinese × Nan Melanesian    32      18      m×T*    4.00×10^−5 − 7.00×10^−5    50−25,000
                                                m×NA    7.00×10^−5 − 1.20×10^−4                      7,900−10,000
                                                T×NA                               50−25,000         7,900−10,000
5.5 Conclusions
Figure 5.2 summarizes the broad-scale history of the six human populations estimated
by MIMAR. This model is consistent with previous observations. Specifically, it was
thought that the Tsumkwe San are an ancient population (e.g., Passarino et al., 1998)
and MIMAR estimates that they split ∼70 Kya from other African groups. MIMAR also
estimates the split between African and Asian populations at 60−70 Kya and between
African and European populations at ∼40 Kya, which fits well with previous estimates
for the waves of migration out of Africa of anatomically modern humans (e.g.,
Cavalli-Sforza and Feldman, 2003). While the split times and other demographic parameters
between the non-African populations are not well estimated by MIMAR, these
estimates are roughly consistent with previous reports (Voight et al., 2005; Keinan
et al., 2007). This said, the estimated split times are likely under-estimates, as we
recently showed that MIMAR tends to provide under-estimates of this parameter in
the presence of migration (Becquet and Przeworski, 2008). In addition, the difficulty
in estimating the parameters of the isolation-migration model for the non-African
population pairs likely reflects the violation of numerous assumptions of the model,
e.g., panmixia and constant effective population sizes. Alternatively, an
isolation-migration model may be inappropriate in this case: the estimated models between
the non-African populations lead to lower FST than in the data, suggesting that a
different model, perhaps an isolation-by-distance model, may describe the data
better (see Methods and Supplemental Figure 5.18).
Interestingly, MIMAR found evidence of gene flow between all pairs of populations,
regardless of how differentiated or geographically distant they were. This is notable
because commonly used models of human demographic history usually do not include
gene flow between populations after the split (e.g., Schaffner et al., 2005). The results
of this study suggest that it may be problematic to ignore migration when dealing
with data from human populations.
[Figure labels: Tsumkwe San, Biaka Pygmies, Mandenka, Non-African populations; split between Tsumkwe San and the other African populations 70−80 Kya; split between African and non-African populations 40−70 Kya; time axis from Past to Present]
Figure 5.2: Cartoon summarizing pairwise models estimated by MIMAR for six human populations. The main split events are indicated. The colored lineages represent the history of a population considered in this study. The three non-African populations are represented by a single lineage as their split times were poorly estimated (see Results). The grey arrows represent gene flow between the populations as they diverge.
The effective population size estimates provided by MIMAR, when well-estimated,
are also consistent with previous estimates for human populations (e.g., Maynard Smith
and Haigh, 1974; Schaffner et al., 2005): ∼10,000 for the African populations and for
the ancestral populations of the non-African populations, ∼5,000 or less for all the
non-African populations, and ∼15,000−20,000 for the ancestors of the African populations.
We noted that in many analyses, MIMAR did not provide reliable estimates of the effective
population sizes in the descendant populations (Table 5.1). Human populations
are certainly not randomly mating and their effective population sizes have changed
often, as populations experienced bottlenecks during range expansions, e.g., during
the out-of-Africa migration of anatomically modern humans, and recent population
growth, at least since the emergence of agriculture. Since the data considered here
undoubtedly violate the assumptions of panmixia and constant effective population
sizes of the isolation-migration model, the estimates of effective population sizes need
to be considered with caution. The results may also be affected by the choice of data
processing we used here. For instance, if there were systematic errors in the inferred
data, this could have introduced a bias. However, the alternative of removing the
missing data would bias the ratios Si/L, where L is the number of base pairs at a
locus, i ∈ {1, 2, 3, 4} and Si is the number of polymorphisms of type i (see Methods), thus
leading to a different bias in the estimates of the parameters of interest.
We note that MIMAR is limited to estimating demographic parameters between
pairs of populations. We therefore presented the results for the pairwise comparisons
between six human populations, although their histories are not independent (e.g.,
Ramachandran et al., 2005; Schaffner et al., 2005). Estimating the demographic
parameters for a divergence model for all six populations together would likely be
more appropriate and provide more reliable estimates. In particular, some of the
parameter estimates provided by MIMAR may be unreliable if there is gene flow between
a pair and other, unsampled populations: We observed for instance that estimates
obtained for a population across pairwise comparisons are quite variable. Therefore,
our interpretation of the history of the six human populations from the pairwise
comparisons needs to be taken with caution (Figure 5.2). It is unclear how the
parameter estimates are affected by unsampled populations, and a simulation study
would undoubtedly help to determine how such populations affect the estimates provided by
MIMAR.
Some of the difficulties discussed above in estimating demographic parameters for
the human populations may be resolved by a larger data set. We plan to apply MIMAR
to the X-linked fraction of the data and also to modify the method to incorporate
sex-specific migration rates to allow its application to the full data set of 40 loci
(Wall et al., 2008). If the results are consistent with those presented here, the fact
that some parameters are not well estimated may simply reflect the complexities
of the human population histories, i.e., the demographic histories of some of these
populations are not well approximated by the simple isolation-migration model
considered by MIMAR. In contrast, if the results vary dramatically depending on the data
(i.e., 20 X-linked versus 20 autosomal loci), this could point to different histories of
the X chromosomes and autosomes due, for example, to sex-specific migration rates.
Finally, in this study, we fit isolation-migration models to populations that are
not totally geographically isolated from each other, as attested by the substantial
evidence for gene flow detected by MIMAR and the independent knowledge that many
unsampled populations that regularly exchange migrants lie between them (Rosenberg
et al., 2002; Ramachandran et al., 2005). However, since the six human populations that we
investigated here were chosen specifically because they are geographically far
apart from each other (Wall et al., 2008), their histories may be well approximated
by an isolation-migration model (with some exceptions perhaps, see Methods and
Supplemental Figure 5.18). For this reason, we did not explicitly investigate whether
other models, e.g., an isolation-by-distance model, would fit the data better than the
isolation-migration models estimated by MIMAR. However, if pairs of populations were
sampled along the gradient separating the six populations presented in this study,
their demographic histories might not be well approximated by the simple
isolation-migration model and MIMAR might provide unreliable estimates. In this case, the fit of
various models should be investigated as, e.g., isolation-by-distance models may be
better approximations than isolation-migration models (e.g., Ramachandran et al.,
2005).
5.6 Acknowledgments
We thank Jeffrey Wall for the useful discussions and for providing us with the data.
5.7 Appendix A: Supplemental Figures
Estimated isolation-migration models for the 15 pairs of human
populations
[Figure panels: (a) Population size (Biaka Pygmies, Mandenka, Ancestral population); (b) Migration; (c) Time; (d) T and m; (e) NA and m; (f) NA and T]
Figure 5.3: Estimated model from the Biaka Pygmies and Mandenka population data. a−c) Smoothed marginal posterior distributions estimated by MIMAR, when the first population is used as the reference (Biaka Pygmies in this case). d−f) Joint posterior distributions of the gene flow rate and split time (d), of the gene flow rate and the ancestral effective population size (e) and of the split time and the ancestral effective population size (f).
[Figure panels: (a) Population size (Biaka Pygmies, Tsumkwe San, Ancestral population); (b) Migration; (c) Time; (d) T and m; (e) NA and m; (f) NA and T]
Figure 5.4: Estimated model from the Biaka Pygmies and Tsumkwe San population data. For details, see legend of Supplemental Figure 5.3.
[Figure panels: (a) Population size (Mandenka, Tsumkwe San, Ancestral population); (b) Migration; (c) Time; (d) T and m; (e) NA and m; (f) NA and T]
Figure 5.5: Estimated model from the Mandenka and Tsumkwe San population data. For details, see legend of Supplemental Figure 5.3.
[Figure panels: (a) Population size (French Basque, Biaka Pygmies, Ancestral population); (b) Migration; (c) Time; (d) T and m; (e) NA and m; (f) NA and T]
Figure 5.6: Estimated model from the French Basque and Biaka Pygmies population data. For details, see legend of Supplemental Figure 5.3.
[Figure panels: (a) Population size (Han Chinese, Biaka Pygmies, Ancestral population); (b) Migration; (c) Time; (d) T and m; (e) NA and m; (f) NA and T]
Figure 5.7: Estimated model from the Han Chinese and Biaka Pygmies population data. For details, see legend of Supplemental Figure 5.3.
[Figure panels: (a) Population size (Nan Melanesian, Biaka Pygmies, Ancestral population); (b) Migration; (c) Time; (d) T and m; (e) NA and m; (f) NA and T]
Figure 5.8: Estimated model from the Nan Melanesian and Biaka Pygmies population data. For details, see legend of Supplemental Figure 5.3.
[Figure panels: (a) Population size (French Basque, Mandenka, Ancestral population); (b) Migration; (c) Time; (d) T and m; (e) NA and m; (f) NA and T]
Figure 5.9: Estimated model from the French Basque and Mandenka population data. For details, see legend of Supplemental Figure 5.3.
[Figure panels: (a) Population size (Han Chinese, Mandenka, Ancestral population); (b) Migration; (c) Time; (d) T and m; (e) NA and m; (f) NA and T]
Figure 5.10: Estimated model from the Han Chinese and Mandenka population data. For details, see legend of Supplemental Figure 5.3.
[Figure panels: (a) Population size (Nan Melanesian, Mandenka, Ancestral population); (b) Migration; (c) Time; (d) T and m; (e) NA and m; (f) NA and T]
Figure 5.11: Estimated model from the Nan Melanesian and Mandenka population data. For details, see legend of Supplemental Figure 5.3.
[Figure panels: (a) Population size (French Basque, Tsumkwe San, Ancestral population); (b) Migration; (c) Time; (d) T and m; (e) NA and m; (f) NA and T]
Figure 5.12: Estimated model from the French Basque and Tsumkwe San population data. For details, see legend of Supplemental Figure 5.3.
[Figure panels: (a) Population size (Han Chinese, Tsumkwe San, Ancestral population); (b) Migration; (c) Time; (d) T and m; (e) NA and m; (f) NA and T]
Figure 5.13: Estimated model from the Han Chinese and Tsumkwe San population data. For details, see legend of Supplemental Figure 5.3.
[Figure 5.14 plots, panels (a)-(f): (a) population size (Nan Melanesian, Tsumkwe San, ancestral population); (b) migration; (c) time; (d) axes T and m; (e) axes NA and m; (f) axes NA and T.]
Figure 5.14: Estimated model from the Nan Melanesian and Tsumkwe San population data. For details, see legend of Supplemental Figure 5.3.
[Figure 5.15 plots, panels (a)-(f): (a) population size (French Basque, Han Chinese, ancestral population); (b) migration; (c) time; (d) axes T and m; (e) axes NA and m; (f) axes NA and T.]
Figure 5.15: Estimated model from the French Basque and Han Chinese population data. For details, see legend of Supplemental Figure 5.3.
[Figure 5.16 plots, panels (a)-(f): (a) population size (French Basque, Nan Melanesian, ancestral population); (b) migration; (c) time; (d) axes T and m; (e) axes NA and m; (f) axes NA and T.]
Figure 5.16: Estimated model from the French Basque and Nan Melanesian population data. For details, see legend of Supplemental Figure 5.3.
[Figure 5.17 plots, panels (a)-(f): (a) population size (Han Chinese, Nan Melanesian, ancestral population); (b) migration; (c) time; (d) axes T and m; (e) axes NA and m; (f) axes NA and T.]
Figure 5.17: Estimated model from the Han Chinese and Nan Melanesian population data. For details, see legend of Supplemental Figure 5.3.
Estimated models with a poor fit to the data
[Figure 5.18(a) plots: distributions of the S statistics (S1, Nan Melanesian; S2, Mandenka; S3, shared; S4, fixed), FST, π (for each population) and Tajima’s D (for each population).]
Figure 5.18: Isolation-migration models estimated by MIMAR with a poor fit to the human population data. We show, for a pair of populations, the distributions of the sum over the 26 loci of the four statistics used by MIMAR (the polymorphisms specific to the first and second sample and shared and fixed between the two samples, S1, S2, S3 and S4, respectively), as well as the mean over the 26 loci of FST and π and Tajima’s D calculated for each population (see Methods for details). Results are for a) Nan Melanesian (blue) and Mandenka (red), b) French Basque and Han Chinese, c) French Basque and Nan Melanesian and d) Han Chinese and Nan Melanesian. The vertical lines correspond to the values of the statistics in the human data sets. In this case, the model does not seem to fit the data for the mean FST.
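The four S statistics named in this legend can be illustrated with a short sketch. This is not MIMAR's code; it is a minimal example under stated assumptions: each sample is a list of aligned, equal-length haplotype strings for one locus, every site carries at most two alleles, and the function name `count_s_statistics` and the input layout are hypothetical choices made for illustration only.

```python
from typing import List, Tuple


def count_s_statistics(sample1: List[str], sample2: List[str]) -> Tuple[int, int, int, int]:
    """Return (S1, S2, S3, S4) for two aligned haplotype samples at one locus.

    S1: sites polymorphic only in sample 1
    S2: sites polymorphic only in sample 2
    S3: sites polymorphic in both samples (shared)
    S4: sites monomorphic in each sample but for different alleles (fixed)

    Hypothetical helper: assumes biallelic sites and aligned haplotypes.
    """
    if not sample1 or not sample2:
        raise ValueError("both samples must contain at least one haplotype")
    length = len(sample1[0])
    if any(len(h) != length for h in sample1 + sample2):
        raise ValueError("all haplotypes must have the same length")

    s1 = s2 = s3 = s4 = 0
    for site in range(length):
        alleles1 = {h[site] for h in sample1}
        alleles2 = {h[site] for h in sample2}
        poly1 = len(alleles1) > 1
        poly2 = len(alleles2) > 1
        if poly1 and poly2:
            s3 += 1          # segregating in both samples
        elif poly1:
            s1 += 1          # segregating in sample 1 only
        elif poly2:
            s2 += 1          # segregating in sample 2 only
        elif alleles1 != alleles2:
            s4 += 1          # fixed difference between the samples
    return s1, s2, s3, s4


# Toy locus with five sites: one fixed difference, one site specific to each
# sample, one shared polymorphism and one invariant site.
toy_sample1 = ["01010", "00010", "00110"]
toy_sample2 = ["10010", "10100"]
print(count_s_statistics(toy_sample1, toy_sample2))
```

Summing the four counts over loci would give totals of the kind plotted in the S statistics panels; how the analysis itself handles missing data or sites with more than two alleles is not covered by this sketch.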
[Figure 5.18(b) plots: distributions of the S statistics (S1, French Basque; S2, Han Chinese; S3, shared; S4, fixed), FST, π (for each population) and Tajima’s D (for each population).]
Figure 5.18 (continued): Results for French Basque (blue) and Han Chinese (red). In this case, the model does not seem to fit the data for the mean Tajima’s D in the French Basque.
[Figure 5.18(c) plots: distributions of the S statistics (S1, French Basque; S2, Nan Melanesian; S3, shared; S4, fixed), FST, π (for each population) and Tajima’s D (for each population).]
Figure 5.18 (continued): Results for French Basque (blue) and Nan Melanesian (red). In this case, the model does not seem to fit the data for the mean FST and the mean Tajima’s D in the French Basque.
[Figure 5.18(d) plots: distributions of the S statistics (S1, Han Chinese; S2, Nan Melanesian; S3, shared; S4, fixed), FST, π (for each population) and Tajima’s D (for each population).]
Figure 5.18 (continued): Results for Han Chinese (blue) and Nan Melanesian (red). In this case, the model does not seem to fit the data for the mean FST.
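The comparison drawn in the Figure 5.18 legends, an observed statistic (vertical line) set against its distribution under the estimated model, can be summarized numerically. The sketch below is one possible way to do so and is an assumption, not the procedure used in this thesis: it computes an empirical two-sided tail probability from a list of simulated values. The function name `tail_probability` and the toy numbers are hypothetical.

```python
from typing import Sequence


def tail_probability(simulated: Sequence[float], observed: float) -> float:
    """Empirical two-sided tail probability of `observed` among `simulated`.

    Hypothetical helper: doubles the smaller of the two one-sided
    proportions and caps the result at 1.
    """
    n = len(simulated)
    if n == 0:
        raise ValueError("need at least one simulated value")
    lower = sum(1 for x in simulated if x <= observed) / n
    upper = sum(1 for x in simulated if x >= observed) / n
    return min(1.0, 2.0 * min(lower, upper))


# Toy values standing in for a simulated distribution of mean FST.
toy_simulated_fst = [0.10, 0.12, 0.11, 0.13, 0.09, 0.10, 0.12, 0.11, 0.10, 0.11]
print(tail_probability(toy_simulated_fst, 0.20))  # observed value outside the simulated range
print(tail_probability(toy_simulated_fst, 0.11))  # observed value near the centre
```

A small value flags a statistic for which the estimated model reproduces the data poorly, as the legends note for the mean FST; the threshold for "poor" is left to the reader.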
CHAPTER 6
CONCLUSIONS
This collection of projects describes computational approaches and their uses in
learning about speciation and, more generally, about the demographic history of
populations. Despite the ambivalent conclusions of Chapter 4 about the reliability of
such methods in learning about modes of speciation, I remain confident that population
genetic approaches can help us learn about the processes of speciation, provided that
one is appropriately cautious when defining the underlying models and interpreting
the results. In this regard, the new generations of polymorphism data, such as full
genome short-read resequencing, will likely help resolve some of the problems met
with the available polymorphism data, since they carry orders of magnitude more
information on genetic variation. Of course, the size of the new data sets will give rise
to new and challenging issues, and I look forward to investigating those and
developing computational approaches to help extract useful information and answer
unresolved questions in evolutionary biology.
Here, I would like to point out that the question of whether speciation can be
initiated in the presence of gene flow is directly relevant to our own history, as attested by
the highly debated hypothesis that early hominids and chimpanzees hybridized after
they split (Patterson et al., 2006b; Barton, 2006; Wakeley, 2008). I focused part of my
PhD projects on great apes with this ultimate goal in mind: since the great apes are
our closest living evolutionary relatives, they are the best models to help us
understand the evolution of human biology and culture. Unraveling the mechanisms of
great ape species and subspecies formation is thus a crucial step in understanding
how anatomically modern humans evolved.
Finally, I would like to draw the readers’ attention to the fact that all great ape
species and subspecies are highly endangered (e.g.,
http://www.greatapetrust.org/save/statistics.php and citations therein). The
situation is disastrous: we know so little, and so much remains to be discovered about
non-human great apes, that if any one subspecies were to disappear, the human species
would deprive itself of a wealth of knowledge about its origin. I hope that the method
and results presented in this thesis will help the conservation biology community
preserve the biodiversity of our (still) green planet and specifically the rapidly
disappearing species and subspecies of great apes.
REFERENCES
Aagaard, S. M., Sastad, S. M., Greilhuber, J., and Moen, A. (2005). A secondary hybrid zone between diploid Dactylorhiza incarnata ssp. cruenta and allotetraploid D. lapponica (Orchidaceae). Heredity, 94(5):488–96.
Akey, J. M., Eberle, M. A., Rieder, M. J., Carlson, C. S., Shriver, M. D., Nickerson, D. A., and Kruglyak, L. (2004). Population history and natural selection shape patterns of genetic variation in 132 genes. PLoS Biol., 2(10):e286.
Albrecht, G. H. and Miller, J. M. A. (1993). Geographic variation in primates. A review with implications for interpreting fossils. In Kimbel, W. H. and Mar, L. B., editors, Species, species concepts, and primate evolution, pages 123–161. New York: Plenum Press.
Alonso, S., Fernandez-Fernandez, I., Castro, A., and De Pancorbo, M. M. (1998). Genetic characterization of APOB and D17S5 AFLP loci in a sample from the Basque Country (northern Spain). Hum. Biol., 70(3):491–505.
Altschul, S. F., Gish, W., Miller, W., Meyers, E. W., and Lipman, D. J. (1990). Basic local alignment search tool. J. Mol. Biol., 215(3):403–410.
Andolfatto, P. and Przeworski, M. (2000). A genome-wide departure from the standard neutral model in natural populations of Drosophila. Genetics, 156(1):257–268.
Andolfatto, P. and Wall, J. D. (2003). Linkage disequilibrium patterns across a recombination gradient in African Drosophila melanogaster. Genetics, 165(3):1289–1305.
Austin, J. D., Lougheed, S. C., and Boag, P. T. (2004). Controlling for the effects of history and nonequilibrium conditions in gene flow estimates in northern bullfrog (Rana catesbeiana) populations. Genetics, 168(3):1491–1506.
Bachtrog, D., Thornton, K. R., Clark, A. G., and Andolfatto, P. (2006). Extensive introgression of mitochondrial DNA relative to nuclear genes in the Drosophila yakuba species group. Evolution Int. J. Org. Evolution, 60(2):292–302.
Barbash, D. A., Siino, D. F., Tarone, A. M., and Roote, J. (2003). A rapidly evolving MYB-related protein causes species isolation in Drosophila. Proc. Natl. Acad. Sci., 100(9):5302–5307.
Barluenga, M., Stolting, K. N., Salzburger, W., Muschick, M., and Meyer, A. (2006). Sympatric speciation in Nicaraguan crater lake cichlid fish. Nature, 439(7077):719–723.
Barreiro, L. B., Laval, G., Quach, H., Patin, E., and Quintana-Murci, L. (2008). Natural selection has driven population differentiation in modern humans. Nat. Genet., 40(3):340–345.
Barton, N. H. (2001). The role of hybridization in evolution. Mol. Ecol., 10(3):551–568.
Barton, N. H. (2006). Evolutionary biology: How did the human species form? Curr. Biol., 16(16):R647–R650.
Barton, N. H. and Bengtsson, B. O. (1986). The barrier to genetic exchange between hybridising populations. Heredity, 57(Pt 3):357–376.
Beadle, L. C. (1981). The inland waters of tropical Africa: An introduction to tropical limnology. Longman Group Ltd. London, 2nd edition.
Beaumont, M. A. (2003). Estimation of population growth or decline in genetically monitored populations. Genetics, 164(3):1139–1160.
Beaumont, M. A., Zhang, W., and Balding, D. J. (2002). Approximate Bayesian computation in population genetics. Genetics, 162(4):2025–2035.
Becquet, C., Patterson, N., Stone, A., Przeworski, M., and Reich, D. (2007). Genomic analysis of chimpanzee population structure. PLoS Genet., 3(4):e66.
Becquet, C. and Przeworski, M. (2007). A new approach to estimate parameters of speciation models with application to apes. Genome Res., 17(10):1505–1519.
Becquet, C. and Przeworski, M. (2008). Can we learn about modes of speciation by computational approaches? Evolution Int. J. Org. Evolution. In prep.
Bull, V., Beltran, M., Jiggins, C. D., McMillan, W. O., Bermingham, E., and Mallet, J. (2006). Polyphyly and gene flow between non-sibling Heliconius species. BMC Biol., 4:11.
Cann, H. M., de Toma, C., Cazes, L., Legrand, M.-F., Morel, V., Piouffre, L., Bodmer, J., Bodmer, W. F., Bonne-Tamir, B., Cambon-Thomsen, A., Chen, Z., Chu, J., Carcassi, C., Contu, L., Du, R., Excoffier, L., Friedlaender, J. S., Groot, H., Gurwitz, D., Herrera, R. J., Huang, X., Kidd, J., Kidd, K. K., Langaney, A., Lin, A. A., Mehdi, S. Q., Parham, P., Piazza, A., Pistillo, M. P., Qian, Y., Shu, Q., Xu, J., Zhu, S., Weber, J. L., Greely, H. T., Feldman, M. W., Thomas, G., Dausset, J., and Cavalli-Sforza, L. L. (2002). A human genome diversity cell line panel. Science, 296(5566):261b–262.
Carling, M. D. and Brumfield, R. T. (2008). Integrating phylogenetic and population genetic analyses of multiple loci to test species divergence hypotheses in Passerina buntings. Genetics, 178(1):363–377.
Carothers, A. D., I., R., Kolcic, I., Polasek, O., Hayward, C., Wright, A. F., Campbell, H., Teague, P., Hastie, N., and Weber, J. (2006). Estimating human inbreeding coefficients: Comparison of genealogical and marker heterozygosity approaches. Ann. Hum. Genet., 70:666–676.
Cavalli-Sforza, L., Menozzi, P., and Piazza, A. (1994). The history and geography of human genes. Princeton (N. J.): Princeton University Press.
Cavalli-Sforza, L. L. and Feldman, M. W. (2003). The application of molecular genetic approaches to the study of human evolution. Nat. Genet., 33:266–275.
Clark, A. G. (1997). Neutral behavior of shared polymorphism. Proc. Natl. Acad. Sci., 94(15):7730–7734.
Conrad, D. F., Jakobsson, M., Coop, G., Wen, X., Wall, J. D., Rosenberg, N. A., and Pritchard, J. K. (2006). A worldwide survey of haplotype variation and linkage disequilibrium in the human genome. Nat. Genet., 38(11):1251–1260.
Consortium, T. I. H. (2005). A haplotype map of the human genome. Nature, 437(7063):1299–1320.
Coop, G. and Przeworski, M. (2007). An evolutionary view of human recombination. Nat. Rev. Genet., 8(1):23–24.
Coyne, J. A. and Orr, H. A. (1989). Patterns of speciation in Drosophila. Evolution Int. J. Org. Evolution, 43(2):362–381.
Coyne, J. A. and Orr, H. A. (1998). The evolutionary genetics of speciation. Philos. Trans. R. Soc. Lond., B, Biol. Sci., 353:287–305.
Coyne, J. A. and Orr, H. A. (2004a). Allopatric and parapatric speciation. In Speciation, chapter 3, pages 83–123. Sinauer Associates, Inc., Sunderland, MA.
Coyne, J. A. and Orr, H. A. (2004b). Ecological isolation. In Speciation, pages 179–210. Sinauer Associates, Inc., Sunderland, MA.
Coyne, J. A. and Orr, H. A. (2004c). Genic incompatibilities. In Speciation, pages 267–276. Sinauer Associates, Inc., Sunderland, MA.
Coyne, J. A. and Orr, H. A. (2004d). Speciation. Sinauer Associates, Inc., Sunderland, MA.
D’Andrade, R. and Morin, P. A. (1996). Chimpanzee and human mitochondrial DNA: A principal components and individual-by-site analysis. Am. Anthropologist, 98(2):352–370.
de Waal, F. B. M. (1997). Bonobo: The Forgotten Ape. University of California Press.
Degnan, J. H. and Rosenberg, N. A. (2006). Discordance of species trees with their most likely gene trees. PLoS Genet., 2(5):e68.
Dieckmann, U. and Doebeli, M. (1999). On the origin of species by sympatric speciation. Nature, 400(6742):354–357.
Dixon, C. J., Schonswetter, P., and Schneeweiss, G. M. (2007). Traces of ancient range shifts in a mountain plant group (Androsace halleri complex, Primulaceae). Mol. Ecol., 16(18):3890–3901.
Dobzhansky, T. (1937). Genetics and the Origin of Species. New York, Columbia Univ. Press.
Dolman, G. and Moritz, C. (2006). A multilocus perspective on refugial isolation and divergence in rainforest skinks (Carlia). Evolution Int. J. Org. Evolution, 60(3):573–82.
Emelianov, I., Marec, F., and Mallet, J. (2004). Genomic evidence for divergence with gene flow in host races of the larch budmoth. Proc. R. Soc. Lond., B, Biol. Sci., 271(1534):97–105.
Excoffier, L., Novembre, J., and Schneider, S. (2000). Computer note. SIMCOAL: A general coalescent program for the simulation of molecular data in interconnected populations with arbitrary demography. J. Hered., 91(6):506–509.
Fischer, A., Pollack, J., Thalmann, O., Nickel, B., and Paabo, S. (2006). Demographic history and genetic differentiation in apes. Curr. Biol., 16(11):1133–1138.
Fischer, A., Wiebe, V., Paabo, S., and Przeworski, M. (2004). Evidence for a complex demographic history of chimpanzees. Mol. Biol. Evol., 21(5):799–808.
Fossella, J., Samant, S. A., Silver, L. M., King, S. M., Vaughan, K. T., Olds-Clarke, P., Johnson, K. A., Mikami, A., Vallee, R. B., and Pilder, S. H. (2000). An axonemal dynein at the hybrid sterility 6 locus: implications for t haplotype-specific male sterility and the evolution of species barriers. Mamm. Genome, 11(1):8–15.
Frazer, K. A., Ballinger, D. G., Cox, D. R., Hinds, D. A., Stuve, L. L., Gibbs, R. A., Belmont, J. W., Boudreau, A., Hardenbol, P., Leal, S. M., Pasternak, S., Wheeler, D. A., Willis, T. D., Yu, F., Yang, H., Zeng, C., Gao, Y., Hu, H., Hu, W., Li, C., Lin, W., Liu, S., Pan, H., Tang, X., Wang, J., Wang, W., Yu, J., Zhang, B., Zhang, Q., Zhao, H., Zhao, H., Zhou, J., Gabriel, S. B., Barry, R., Blumenstiel, B., Camargo, A., Defelice, M., Faggart, M., Goyette, M., Gupta, S., Moore, J., Nguyen, H., Onofrio, R. C., Parkin, M., Roy, J., Stahl, E., Winchester, E., Ziaugra, L., Altshuler, D., Shen, Y., Yao, Z., Huang, W., Chu, X., He, Y., Jin, L., Liu, Y., Shen, Y., Sun, W., Wang, H., Wang, Y., Wang, Y., Xiong, X., Xu, L., Waye, M. M. Y., Tsui, S. K. W., Xue, H., Wong, J. T.-F., Galver, L. M., Fan, J.-B., Gunderson, K., Murray, S. S., Oliphant, A. R., Chee, M. S., Montpetit, A., Chagnon, F., Ferretti, V., Leboeuf, M., Olivier, J.-F., Phillips, M. S., Roumy, S., Sallee, C., Verner, A., Hudson, T. J., Kwok, P.-Y., Cai, D., Koboldt, D. C., Miller, R. D., Pawlikowska, L., Taillon-Miller, P., Xiao, M., Tsui, L.-C., Mak, W., Song, Y. Q., Tam, P. K. H., Nakamura, Y., Kawaguchi, T., Kitamoto, T., Morizono, T., Nagashima, A., Ohnishi, Y., Sekine, A., Tanaka, T., Tsunoda, T., Deloukas, P., Bird, C. P., Delgado, M., Dermitzakis, E. T., Gwilliam, R., Hunt, S., Morrison, J., Powell, D., Stranger, B. E., Whittaker, P., Bentley, D. R., Daly, M. J., de Bakker, P. I. W., Barrett, J., Chretien, Y. R., Maller, J., McCarroll, S., Patterson, N., Pe’er, I., Price, A., Purcell, S., Richter, D. J., Sabeti, P., Saxena, R., Schaffner, S. F., Sham, P. C., Varilly, P., Altshuler, D., Stein, L. D., Krishnan, L., Smith, A. V., Tello-Ruiz, M. K., Thorisson, G. A., Chakravarti, A., Chen, P. E., Cutler, D. J., Kashuk, C. S., Lin, S., Abecasis, G. R., Guan, W., Li, Y., Munro, H. M., Qin, Z. S., Thomas, D. J., McVean, G., Auton, A., Bottolo, L., Cardin, N., Eyheramendy, S., Freeman, C., Marchini, J., Myers, S., Spencer, C., Stephens, M., Donnelly, P., Cardon, L. R., Clarke, G., Evans, D. M., Morris, A. P., Weir, B. S., Tsunoda, T., Mullikin, J. C., Sherry, S. T., Feolo, M., Skol, A., Zhang, H., Zeng, C., Zhao, H., Matsuda, I., Fukushima, Y., Macer, D. R., Suda, E., Rotimi, C. N., Adebamowo, C. A., Ajayi, I., Aniagwu, T., Marshall, P. A., Nkwodimmah, C., Royal, C. D. M., Leppert, M. F., Dixon, M., Peiffer, A., Qiu, R., Kent, A., Kato, K., Niikawa, N., Adewole, I. F., Knoppers, B. M., Foster, M. W., Clayton, E. W., Watkin, J., Gibbs, R. A., Belmont, J. W., Muzny, D., Nazareth, L., Sodergren, E., Weinstock, G. M., Wheeler, D. A., Yakub, I., Gabriel, S. B., Onofrio, R. C., Richter, D. J., Ziaugra, L., Birren, B. W., Daly, M. J., Altshuler, D., Wilson, R. K., Fulton, L. L., Rogers, J., Burton, J., Carter, N. P., Clee, C. M., Griffiths, M., Jones, M. C., McLay, K., Plumb, R. W., Ross, M. T., Sims, S. K., Willey, D. L., Chen, Z., Han, H., Kang, L., Godbout, M., Wallenburg, J. C., L’Archeveque, P., Bellemare, G., Saeki, K., Wang, H., An, D., Fu, H., Li, Q., Wang, Z., Wang, R., Holden, A. L., Brooks, L. D., McEwen, J. E., Guyer, M. S., Wang, V. O., Peterson, J. L., Shi, M., Spiegel, J., Sung, L. M., Zacharia, L. F., Collins, F. S., Kennedy, K., Jamieson, R., and Stewart, J. (2007). A second generation human haplotype map of over 3.1 million SNPs. Nature, 449(7164):851–861.
Frisse, L., Hudson, R. R., Bartoszewicz, A., Wall, J. D., Donfack, J., and Di Rienzo, A. (2001). Gene conversion and different population histories may explain the contrast between polymorphism and linkage disequilibrium levels. Am. J. Hum. Genet., 69(4):831–843.
Fritz, U., Ayaz, D., Buschbom, J., Kami, H. G., Mazanaeva, L. F., Aloufi, A. A., Auer, M., Rifai, L., Silic, T., and Hundsdorfer, A. K. (2007). Go east: phylogeographies of Mauremys caspica and M. rivulata - discordance of morphology, mitochondrial and nuclear genomic markers and rare hybridization. J. Evol. Biol., page e660.
Gage, T. B. (1998). The comparative demography of primates: with some comments on the evolution of life histories. Annu. Rev. Anthropol., 27:197–221.
Gagneux, P. (2002). The genus Pan: population genetics of an endangered outgroup. Trends Genet., 18(7):327–330.
Galtier, N., Depaulis, F., and Barton, N. H. (2000). Detecting bottlenecks and selective sweeps from DNA sequence polymorphism. Genetics, 155(2):981–987.
Geraldes, A., Ferrand, N., and Nachman, M. W. (2006). Contrasting patterns of introgression at X-linked loci across the hybrid zone between subspecies of the European rabbit (Oryctolagus cuniculus). Genetics, 173(2):919–933.
Ghebranious, N., Vaske, D., Yu, A., Zhao, C., Marth, G., and Weber, J. (2003). STRP screening sets for the human genome at 5 cM density. BMC Genomics, 24:6.
Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (1996). Implementation. In Markov Chain Monte Carlo In Practice, chapter 1.4, pages 8–19. Chapman and Hall/CRC, Boca Raton, Florida.
Goebel, T. (2007). Anthropology: The missing years for modern humans. Science, 315(5809):194–196.
Goldstein, D. B. and Pollock, D. D. (1997). Launching microsatellites: A review of mutation processes and methods of phylogenetic inference. J. Hered., 88(5):335–342.
Gonder, M. K., Disotell, T. R., and Oates, J. F. (2006). New genetic evidence on the evolution of chimpanzee populations and implications for taxonomy. Int. J. Primatol., 27:1103–1127.
Gonder, M. K., Oates, J. F., Disotell, T. R., Forstner, M. R. J., Morales, J. C., and Melnick, D. J. (1997). A new West African chimpanzee subspecies? Nature, 388(6640):337–337.
Goodman, S. J., Barton, N. H., Swanson, G., Abernethy, K., and Pemberton, J. M. (1999). Introgression through rare hybridization: A genetic study of a hybrid zone between red and sika deer (Genus Cervus) in Argyll, Scotland. Genetics, 152(1):355–371.
Greene-Till, R., Zhao, Y., and Hardies, S. C. (2000). Gene flow of unique sequences between Mus musculus domesticus and Mus spretus. Mamm. Genome, 11(3):225–230.
Groves, C. P. (1970). Population systematics of the gorilla. J. Zool. Lond., 161:287–300.
Groves, C. P. (1971). Pongo pygmaeus. Mamm. Species, 4(4):1–6.
Groves, C. P. (2001). Primate taxonomy. Washington (D. C.): Smithsonian Institution Press.
Grubb, P., Butynski, T. M., Oates, J. F., Bearder, S. K., Disotell, T. R., Groves, C. P., and Struhsaker, T. T. (2003). Assessment of the diversity of African primates. Int. J. Primatol., 24(6):1301–1357.
Hall, J. P. (2005). Montane speciation patterns in Ithomiola butterflies (Lepidoptera: Riodinidae): are they consistently moving up in the world? Proc. Biol. Sci., 272(1580):2457–2466.
Hastings, W. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97–109.
Hey, J. (1991). The structure of genealogies and the distribution of fixed differences between DNA sequence samples from natural populations. Genetics, 128(4):831–840.
Hey, J. (2005). On the number of New World founders: A population genetic portrait of the peopling of the Americas. PLoS Biol., 3(6):e193.
Hey, J. (2006a). On the failure of modern species concepts. Trends Ecol. Evol., 21(8):447–450.
Hey, J. (2006b). Recent advances in assessing gene flow between diverging populations and species. Curr. Opin. Genet. Dev., 16(6):592–596.
Hey, J. and Nielsen, R. (2004). Multilocus methods for estimating population sizes, migration rates and divergence time, with applications to the divergence of Drosophila pseudoobscura and D. persimilis. Genetics, 167(2):747–760.
Hey, J., Won, Y. J., Sivasundar, A., Nielsen, R., and Markert, J. A. (2004). Using nuclear haplotypes with microsatellites to study gene flow between recently separated cichlid species. Mol. Ecol., 13(4):909–919.
Hill, W. C. O. (1969). The nomenclature, taxonomy and distribution of chimpanzees. In Bourne, G. H., editor, The chimpanzee; a series of volumes on the chimpanzee, volume 1, pages 22–49. Basel, New York: S. Karger A. G.
Hobolth, A., Christensen, O. F., Mailund, T., and Schierup, M. H. (2006). Genomic relationships and speciation times of human, chimpanzee and gorilla inferred from a coalescent Hidden Markov Model. PLoS Genet., preprint(2006):e7.eor.
Hosono, S., Faruqi, A., Dean, F., Du, Y., Sun, Z., Wu, X., Du, J., Kingsmore, S., Egholm, M., and Lasken, R. (2003). Unbiased whole genome amplification directly from clinical samples. Genome Res., 13:954–964.
Hudson, R. R. (1983). Properties of a neutral allele model with intragenic recombination. Theor. Popul. Biol., 23(2):183–201.
Hudson, R. R. (2001). Two-locus sampling distributions and their application. Genetics, 159(4):1805–1817.
Hudson, R. R. (2002). Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics, 18(2):337–338.
Hudson, R. R. and Coyne, J. A. (2002). Mathematical consequences of the genealogical species concept. Evolution Int. J. Org. Evolution, 56(8):1557–1565.
Hudson, R. R. and Kaplan, N. L. (1985). Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics, 111(1):147–164.
Hudson, R. R., Slatkin, M., and Maddison, W. P. (1992). Estimation of levels of gene flow from DNA sequence data. Genetics, 132(2):583–589.
Hughes, P. D., Woodward, J. C., and Gibbard, P. L. (2006). Quaternary glacial history of the Mediterranean mountains. Progr. Phys. Geogr., 30(3):334–364.
Innan, H. and Watanabe, H. (2006). The effect of gene flow on the coalescent time in the human-chimpanzee ancestral population. Mol. Biol. Evol., 23(5):1040–1047.
Kaessmann, H., Wiebe, V., and Paabo, S. (1999). Extensive nuclear DNA sequence diversity among chimpanzees. Science, 286(5442):1159–1162.
Kaessmann, H., Wiebe, V., Weiss, G., and Paabo, S. (2001). Great ape DNA sequences reveal a reduced diversity and an expansion in humans. Nat. Genet., 27(2):155–156.
Kallman, K. D. and Kazianis, S. (2006). The genus Xiphophorus in Mexico and Central America. Zebrafish, 3(3):271–285.
Keinan, A., Mullikin, J. C., Patterson, N., and Reich, D. (2007). Measurement of the human allele frequency spectrum demonstrates greater genetic drift in East Asians than in Europeans. Nat. Genet., 39(10):1251–1255.
Kimura, M. (1969). The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics, 61(4):893–903.
Kittles, R. A. and Weiss, K. M. (2003). Race, ancestry and genes: Implications for defining disease risk. Annu. Rev. Genom. Hum. Genet., 4:33–67.
Kliman, R. M., Andolfatto, P., Coyne, J. A., Depaulis, F., Kreitman, M., Berry, A. J., McCarter, J., Wakeley, J., and Hey, J. (2000). The population genetics of the origin and divergence of the Drosophila simulans complex species. Genetics, 156(4):1913–1931.
Knaden, M., Tinaut, A., Cerda, X., Wehner, S., and Wehner, R. (2005). Phylogeny of three parapatric species of desert ants, Cataglyphis bicolor, C. viatica, and C. savignyi: a comparison of mitochondrial DNA, nuclear DNA, and morphological data. Zoology (Jena), 108(2):169–177.
Kong, A., Gudbjartsson, D. F., Sainz, J., Jonsdottir, G. M., Gudjonsson, S. A., Richardsson, B., Sigurdardottir, S., Barnard, J., Hallbeck, B., Masson, G., Shlien, A., Palsson, S. T., Frigge, M. L., Thorgeirsson, T. E., Gulcher, J. R., and Stefansson, K. (2002). A high-resolution recombination map of the human genome. Nat. Genet., 31(3):241–247.
Kronforst, M. R., Young, L. G., Blume, L. M., and Gilbert, L. E. (2006). Multilocus analyses of admixture and introgression among hybridizing Heliconius butterflies. Evolution Int. J. Org. Evolution, 60(6):1254–68.
Kumar, S., Tamura, K., Jakobsen, I. B., and Nei, M. (2001). MEGA2: molecular evolutionary genetics analysis software. Bioinformatics, 17(12):1244–1245.
Leman, S. C., Chen, Y., Stajich, J. E., Noor, M. A. F., and Uyenoyama, M. K. (2005). Likelihoods from summary statistics: Recent divergence between species. Genetics, 171(3):1419–1436.
Llopart, A., Elwyn, S., Lachaise, D., and Coyne, J. A. (2002). Genetics of a difference in pigmentation between Drosophila yakuba and Drosophila santomea. Evolution Int. J. Org. Evolution, 56(11):2262–2277.
Llopart, A., Lachaise, D., and Coyne, J. A. (2005). Multilocus analysis of introgression between two sympatric sister species of Drosophila: Drosophila yakuba and D. santomea. Genetics, 171(1):197–210.
Lockwood, J. R., Roeder, K., and Devlin, B. (2001). A Bayesian hierarchical model for allele frequencies. Genet. Epidemiol., 20:13–33.
Machado, C. A., Kliman, R. M., Markert, J. A., and Hey, J. (2002). Inferring the history of speciation from multilocus DNA sequence data: The case of Drosophila pseudoobscura and close relatives. Mol. Biol. Evol., 19(4):472–488.
Mallet, J. (2005). Hybridization as an invasion of the genome. Trends Ecol. Evol., 20(5):229–237.
Maynard Smith, J. and Haigh, J. (1974). The hitch-hiking effect of a favourable gene. Genet. Res., 23(1):23–35.
Mayr, E. (1963). Animal Species and Evolution. The Belknap Press, Cambridge, MA.
Mayr, E. (2001). Wu’s genic view of speciation. J. Evol. Biol., 14(6):866–867.
Mazzoni, C., Araki, A., Ferreira, G., Azevedo, R., Barbujani, G., and Peixoto, A. (2008). Multilocus analysis of introgression between two sand fly vectors of leishmaniasis. BMC Evol. Biol., 8(1):141.
McBrearty, S. and Jablonski, N. G. (2005). First fossil chimpanzee. Nature, 437(7055):105–108.
McVean, G. A. T., Myers, S. R., Hunt, S., Deloukas, P., Bentley, D. R., and Donnelly, P. (2004). The fine-scale structure of recombination rate variation in the human genome. Science, 304(5670):581–584.
Meng, X. L. (1994). Posterior predictive p-values. Ann. Stat., 22(3):1142–1160.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). Equation of state calculation by fast computing machines. J. Chem. Phys., 21:1087–1092.
Miller, S. R., Purugganan, M. D., and Curtis, S. E. (2006). Molecular population genetics and phenotypic diversification of two populations of the thermophilic cyanobacterium Mastigocladus laminosus. Appl. Environ. Microbiol., 72(4):2793–2800.
Morin, P. A., Moore, J. J., Chakraborty, R., Jin, L., Goodall, J., and Woodruff, D. S. (1994). Kin selection, social structure, gene flow, and the evolution of chimpanzees. Science, 265(5176):1193–1201.
Muir, C., Galdikas, B., and Andrew, T. (2000). mtDNA sequence diversity oforangutans from the islands of Borneo and Sumatra. J. Mol. Evol., 51(5):471–480.
Muller, H. J. (1940). Bearing of the Drosophila work on systematics. In Huxley,J., editor, In The New Systematics, pages 185–268. Clarendon Press, Oxford, UK,clarendon, oxford edition.
Myers, S., Bottolo, L., Freeman, C., McVean, G., and Donnelly, P. (2005). A Fine-Scale Map of Recombination Rates and Hotspots Across the Human Genome. Sci-ence, 310(5746):321–324.
Myers Thompson, J. A. (2003). A model of the biogeographical journey from Proto-pan to Pan paniscus. Primates, 44(2):191–197.
Nachman, M. W. and Crowell, S. L. (2000). Estimate of the mutation rate pernucleotide in humans. Genetics, 156(1):297–304.
Navarro, A. and Barton, N. H. (2003). Accumulating postzygotic isolation genesin parapatry: a new twist on chromosomal speciation. Evolution Int. J. Org.Evolution, 57(3):447–459.
Nei, M. and Li, W. H. (1979). Mathematical model for studying genetic variation interms of restriction endonucleases. Proc. Natl. Acad. Sci., 76(10):5269–5273.
Nielsen, R. and Signorovitch, J. (2003). Correcting for ascertainment biases when an-alyzing SNP data: applications to the estimation of linkage disequilibrium. Theor.Popul. Biol., 63(3):245–255.
Nielsen, R. and Wakeley, J. (2001). Distinguishing migration from isolation: AMarkov Chain Monte Carlo approach. Genetics, 158(2):885–896.
Niemiller, M. L., Fitzpatrick, B. M., and Miller, B. T. (2008). Recent divergence withgene flow in Tennessee cave salamanders (Plethodontidae: Gyrinophilus) inferredfrom gene genealogies. Molecular Ecology, 17(9):2258–2275.
Nordborg, M. and Tavare, S. (2002). Linkage disequilibrium: what history has to tell us. Trends Genet., 18(2):83–90.
Nosil, P. (2008). Speciation with gene flow could be common. Molecular Ecology, 17(9):2103–2106.
Orr, H. A. (2001). Some doubts about (yet another) view of species. J. Evol. Biol., 14(6):870–871.
Orr, H. A. (2005). The genetic basis of reproductive isolation: Insights from Drosophila. Proc. Natl. Acad. Sci., 102(suppl 1):6522–6526.
Orr, H. A., Masly, J. P., and Presgraves, D. C. (2004). Speciation genes. Curr. Opin. Genet. Dev., 14(6):675–679.
Passarino, G., Semino, O., Quintana-Murci, L., Excoffier, L., Hammer, M., and Santachiara-Benerecetti, A. S. (1998). Different genetic components in the Ethiopian population, identified by mtDNA and Y-chromosome polymorphisms. Am. J. Hum. Genet., 62(2):420–434.
Patterson, N., Price, A. L., and Reich, D. (2006a). Population structure and eigenanalysis. PLoS Genet., 2:e190.
Patterson, N., Richter, D. J., Gnerre, S., Lander, E. S., and Reich, D. (2006b). Genetic evidence for complex speciation of humans and chimpanzees. Nature, 441(7097):1103–1108.
Petren, K., Grant, P. R., Grant, B. R., and Keller, L. F. (2005). Comparative landscape genetics and the adaptive radiation of Darwin’s finches: the role of peripheral isolation. Mol. Ecol., 14(10):2943–2957.
Pollard, D. A., Iyer, V. N., Moses, A. M., and Eisen, M. B. (2006). Widespread discordance of gene trees with species tree in Drosophila: Evidence for incomplete lineage sorting. PLoS Genet., 2(10):e173.
Presgraves, D. C., Balagopalan, L., Abmayr, S. M., and Orr, H. A. (2003). Adaptive evolution drives divergence of a hybrid inviability gene between two species of Drosophila. Nature, 423(6941):715–719.
Price, T. (2007). Speciation in Birds. Roberts and Company Publishers.
Pritchard, J. K., Seielstad, M. T., Perez-Lezaun, A., and Feldman, M. W. (1999). Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Mol. Biol. Evol., 16(12):1791–1798.
Pritchard, J. K., Stephens, M., and Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics, 155(2):945–959.
Przeworski, M. (2003). Estimating the time since the fixation of a beneficial allele. Genetics, 164(4):1667–1676.
Ptak, S. E. and Przeworski, M. (2002). Evidence for population growth in humans is confounded by fine-scale population structure. Trends Genet., 18(11):559–563.
Putnam, A. S., Scriber, M., and Andolfatto, P. (2007). Discordant divergence times among Z chromosome regions between two ecologically distinct swallowtail butterfly species. Evolution Int. J. Org. Evolution, 61(4):912–927.
Ramachandran, S., Deshpande, O., Roseman, C. C., Rosenberg, N. A., Feldman, M. W., and Cavalli-Sforza, L. L. (2005). Support from the relationship of genetic and geographic distance in human populations for a serial founder effect originating in Africa. Proc. Natl. Acad. Sci., 102(44):15942–15947.
Reinartz, G. E., Karron, J. D., Phillips, R. B., and Weber, J. L. (2000). Patterns of microsatellite polymorphism in the range-restricted bonobo (Pan paniscus): Considerations for interspecific comparison with chimpanzees (P. troglodytes). Mol. Ecol., 9:315–328.
Rosenberg, N. A., Li, L. M., Ward, R., and Pritchard, J. K. (2003). Informativeness of genetic markers for inference of ancestry. Am. J. Hum. Genet., 73(6):1402–1422.
Rosenberg, N. A., Mahajan, S., Ramachandran, S., Zhao, C., Pritchard, J. K., and Feldman, M. W. (2005). Clines, clusters, and the effect of study design on the inference of human population structure. PLoS Genet., 1(6):e70.
Rosenberg, N. A., Pritchard, J. K., Weber, J. L., Cann, H. M., Kidd, K. K., Zhivotovsky, L. A., and Feldman, M. W. (2002). Genetic structure of human populations. Science, 298(5602):2381–2385.
Rozas, J., Sanchez-DelBarrio, J. C., Messeguer, X., and Rozas, R. (2003). DnaSP, DNA polymorphism analyses by the coalescent and other methods. Bioinformatics, 19(18):2496–2497.
Savolainen, V., Anstett, M.-C., Lexer, C., Hutton, I., Clarkson, J. J., Norup, M. V., Powell, M. P., Springate, D., Salamin, N., and Baker, W. J. (2006). Sympatric speciation in palms on an oceanic island. Nature, advanced online publication.
Sawamura, K., Watanabe, T., and Yamamoto, M. (1993). Hybrid lethal systems in the Drosophila melanogaster species complex. Genetica, 88(2-3):175–185.
Schaffner, S. F., Foo, C., Gabriel, S., Reich, D., Daly, M. J., and Altshuler, D. (2005). Calibrating a coalescent simulation of human genome sequence variation. Genome Res., 15(11):1576–1583.
Scheet, P. and Stephens, M. (2006). A fast and flexible statistical model for large-scale population genotype data: Applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet., 78(4):629–644.
Schneider, S., Roessli, D., and Excoffier, L. (2000). ARLEQUIN: A software for population genetics data analysis, version 2000 [Computer Program]. Geneva: Department of Anthropology, University of Geneva.
Shea, B. T., Leigh, S. R., and Groves, C. P. (1993). Multivariate craniometric variation in chimpanzees: implications for species identification in paleoanthropology. In Kimbel, W. and Martin, L., editors, Species, species concepts, and primate evolution, pages 265–296. New York: Plenum Press.
Slatkin, M. (1995). A measure of population subdivision based on microsatellite allele frequencies. Genetics, 139(1):457–462.
Smith, R. J. and Pilbeam, D. R. (1980). Evolution of the orang-utan. Nature, 284(5755):447–448.
Stadler, T., Arunyawat, U., and Stephan, W. (2008). Population genetics of speciation in two closely related wild tomatoes (Solanum section Lycopersicon). Genetics, 178(1):339–350.
Stephens, M., Smith, N. J., and Donnelly, P. (2001). A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet., 68(4):978–989.
Stone, A. C., Griffiths, R. C., Zegura, S. L., and Hammer, M. F. (2002). High levels of Y-chromosome nucleotide diversity in the genus Pan. Proc. Natl. Acad. Sci., 99(1):43–48.
Stone, A. C. and Verrelli, B. C. (2006). Focusing on comparative ape population genetics in the post-genomic age. Curr. Opin. Genet. Dev., 16(6):586–591.
Tajima, F. (1989). Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics, 123(3):585–595.
Takahata, N. and Satta, Y. (2002). Pre-speciation coalescence and the effective size of ancestral populations. In Slatkin, M. and Veuille, M., editors, Modern Developments in Theoretical Population Genetics, pages 52–71. Oxford University Press, Oxford.
Tavare, S., Balding, D. J., Griffiths, R. C., and Donnelly, P. (1997). Inferring coalescence times from DNA sequence data. Genetics, 145(2):505–518.
Thalmann, O., Fischer, A., Lankester, F., Paabo, S., and Vigilant, L. (2006). The complex evolutionary history of gorillas: Insights from genomic data. Mol. Biol. Evol., 24(1):146–158.
The Chimpanzee Sequencing and Analysis Consortium (2005). Initial sequence of the chimpanzee genome and comparison with the human genome. Nature, 437(7055):69–87.
Thompson, J. D., Higgins, D. G., and Gibson, T. J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucl. Acids Res., 22(22):4673–4680.
Ting, C.-T., Tsaur, S.-C., Wu, M.-L., and Wu, C.-I. (1998). A rapidly evolving homeobox at the site of a hybrid sterility gene. Science, 282(5393):1501–1504.
Turner, T. L., Hahn, M. W., and Nuzhdin, S. V. (2005). Genomic islands of speciation in Anopheles gambiae. PLoS Biol., 3(9):e285.
Valdes, A. M., Slatkin, M., and Freimer, N. B. (1993). Allele frequencies at microsatellite loci: The stepwise mutation model revisited. Genetics, 133(3):737–749.
Voight, B. F., Adams, A. M., Frisse, L. A., Qian, Y., Hudson, R. R., and Di Rienzo, A. (2005). Interrogating multiple aspects of variation in a full resequencing data set to infer human population size changes. Proc. Natl. Acad. Sci., 102(51):18508–18513.
Voight, B. F., Kudaravalli, S., Wen, X., and Pritchard, J. K. (2006). A map of recent positive selection in the human genome. PLoS Biol., 4(3):e72.
Wakeley, J. (1996). Distinguishing migration from isolation using the variance of pairwise differences. Theor. Popul. Biol., 49(3):369–386.
Wakeley, J. (2008). Complex speciation of humans and chimpanzees. Nature, 452(7184):E3–E4; discussion E4.
Wakeley, J. and Hey, J. (1997). Estimating ancestral population parameters. Genetics, 145(3):847–855.
Wall, J. D. (2000). Detecting ancient admixture in humans using sequence polymorphism data. Genetics, 154(3):1271–1279.
Wall, J. D. (2003). Estimating ancestral population sizes and divergence times. Genetics, 163(1):395–404.
Wall, J. D., Cox, M. P., Mendez, F. L., Woerner, A., Severson, T., and Hammer, M. F. (2008). A novel DNA sequence database for analyzing human demographic history. Genome Research, page gr.075630.107.
Wang, R. L., Stec, A., Hey, J., Lukens, L., and Doebley, J. (1999). The limits of selection during maize domestication. Nature, 398(6724):236–239.
Wittbrodt, J., Adam, D., Malitschek, B., Maueler, W., Raulf, F., Telling, A., Robertson, S. M., and Schartl, M. (1989). Novel putative receptor tyrosine kinase encoded by the melanoma-inducing Tu locus in Xiphophorus. Nature, 341(6241):415–421.
Won, Y.-J. and Hey, J. (2005). Divergence population genetics of chimpanzees. Mol. Biol. Evol., 22(2):297–307.
Won, Y. J., Sivasundar, A., Wang, Y., and Hey, J. (2005). On the origin of Lake Malawi cichlid species: A population genetic analysis of divergence. Proc. Natl. Acad. Sci., 102(suppl 1):6581–6586.
Wu, C.-I. (2001a). Genes and speciation. J. Evol. Biol., 14(6):889–891.
Wu, C.-I. (2001b). The genic view of the process of speciation. J. Evol. Biol., 14(6):851–865.
Wu, C.-I. and Ting, C.-T. (2004). Genes and speciation. Nat. Rev. Genet., 5(2):114–122.
Yu, N., Fu, Y. X., and Li, W. H. (2002). DNA polymorphism in a worldwide sample of human X chromosomes. Mol. Biol. Evol., 19(12):2131–2141.
Yu, N., Jensen-Seaman, M. I., Chemnick, L., Kidd, J. R., Deinard, A. S., Ryder, O., Kidd, K. K., and Li, W. H. (2003). Low nucleotide diversity in chimpanzees and bonobos. Genetics, 164(4):1511–1518.
Yu, N., Jensen-Seaman, M. I., Chemnick, L., Ryder, O., and Li, W. H. (2004). Nucleotide diversity in gorillas. Genetics, 166(3):1375–1383.
Zhang, Y., Ryder, O. A., and Zhang, Y. (2001). Genetic divergence of orangutan subspecies (Pongo pygmaeus). J. Mol. Evol., 52(6):516–526.
Zhi, L., Karesh, W. B., Janczewski, D. N., Frazier-Taylor, H., Sajuthi, D., Gombek, F., Andau, M., Martenson, J. S., and O’Brien, S. J. (1996). Genomic differentiation among natural populations of orang-utan (Pongo pygmaeus). Curr. Biol., 6(10):1326–1336.
Zhivotovsky, L. A., Rosenberg, N. A., and Feldman, M. W. (2003). Features of evolution and expansion of modern humans, inferred from genomewide microsatellite markers. Am. J. Hum. Genet., 72(5):1171–1186.