THE UNIVERSITY OF CHICAGO

POPULATION GENETIC APPROACHES TO THE STUDY OF SPECIATION

A DISSERTATION SUBMITTED TO

THE FACULTY OF THE DIVISION OF THE BIOLOGICAL SCIENCES AND

THE PRITZKER SCHOOL OF MEDICINE

IN CANDIDACY FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

DEPARTMENT OF HUMAN GENETICS

BY

CELINE BECQUET

CHICAGO, ILLINOIS

AUGUST 2008

The origin of isolating mechanisms is . . . a problem of fundamental importance,

and the paucity of our knowledge on this subject is felt as a glaring defect in the

whole doctrine of evolution.

−Theodosius Dobzhansky and Pius Charles Koller (1938)

Nothing in biology makes sense except in the light of evolution.

−Theodosius Dobzhansky (1973)

Thank God for Evolution!

−Michael Dowd (2007)

ABSTRACT

The mechanisms of speciation still escape our full understanding despite over a cen-

tury of research. Here, I take a population genetic approach to learn about the

processes by which populations diverge and eventually become distinct species, illus-

trating its application in fields varying from conservation biology to human evolution.

As a background to the dissertation, I present in Chapter 1 the state of research in

these fields at the time I started my PhD and provide brief introductions to the

projects described in the thesis. In Chapter 2, I use extensive variation data from

multiple loci to unravel the complex relationships and evolutionary history of chim-

panzee species and subspecies. In Chapter 3, I introduce a computational approach,

MIMAR, which uses summary statistics of polymorphism at multiple independently-

evolving loci and allows for intra-locus recombination to estimate the parameters of

simple speciation models. I apply this method to data from the species and subspecies

of chimpanzee, refining and completing the results of Chapter 2, as well as from the

other non-human ape species and populations. In Chapter 4, I study the behavior

of my and related computational methods when data are generated under realistic

deviations from the simple models considered to date. This work underscores the

limitation of computational approaches in resolving the question of whether the early

stages of speciation can occur in the presence of gene flow. Finally, in Chapter 5, I use

my method to learn more about the demographic history of six human populations.

ACKNOWLEDGEMENTS

I would not be at this stage of my scientific development without the guidance of

my adviser Molly Przeworski. Molly has provided me with the opportunity to work

on exciting and stimulating scientific projects, leading to my actual (even though

admittedly humble) contributions to the population genetics and evolutionary biology

communities. She has also given me countless comments and valuable advice leading

to the noticeable improvement (I hope) of my writing and presentation style to a

level that will allow me to grow into a, if not worthy, at least acceptable, scientist.

She has been a patient mentor and I hope to remember and apply during my future

experiences in the academic world everything she has taught me. Importantly, she

has been at times a remarkably understanding colleague and friend; I cannot thank

her enough for such things.

I thank all the teachers of Brown University and University of Chicago who put

up with me during the early years of my PhD. Thank you to Carol Ober, whose care,

scientific input and human understanding smoothed the emotionally rough transfer

from Brown to U of C. Thank you to all the members of the Przeworski, Pritchard

and Stephens labs (and friends of PPS, in particular the Gilad lab) for the discussions,

social framework and dynamic and friendly work environment they provided me these

last three years. In this group, I need to single out my wonderful office-mate Graham

Coop, who has put up with my flavorful swearing, aromatic food, dumb questions,

erratic moods and so much more for so long. Despite all this, Graham has always been

an extremely helpful, competent, interested, sympathetic, understanding and caring

colleague and friend. Thank you to Kevin Bullaughey for his availability in providing

technical support and helpful discussions. Thanks to Dick Hudson for agreeing to

sit as an ad hoc thesis committee member during my thesis defense, for his friendly and

helpful discussions, and for his widely used program ms, which has been the basis of my

work since I was a baby population geneticist. I thank the members of my thesis

committee for their helpful comments and trustful support regarding my abilities in

completing this PhD: Anna Di Rienzo, Jonathan Pritchard, Matthew Stephens, and

Chung-I Wu.

Thank you to all my friends from Brown, and specifically to Carol and Walter

Casper who welcomed me as one of their own, and helped a lot in making the transition

from Europe to the US painless. Thank you to all my friends from U of C (most

of them colleagues in Human Genetics, Ecology and Evolution or in the Biological

Sciences Division) and from the Hyde Park community for the social life, relaxing

outings and more generally well needed human warmth. Thank you to Margarida

Cardoso-Moreira and Joanna Kelley for their valuable comments on earlier versions

of this dissertation and Molly again, who must have read this thesis a million times.

Thanks Lucky for the much needed and well-appreciated distractions and moral sup-

port during my thesis writing.

TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGEMENTS
LIST OF FIGURES
LIST OF TABLES

1 INTRODUCTION

2 GENETIC STRUCTURE OF CHIMPANZEE POPULATIONS
  2.1 Abstract
  2.2 Author summary
  2.3 Introduction
  2.4 Results/Discussion
    2.4.1 Cluster analysis
    2.4.2 Principal components analysis
    2.4.3 Testing for additional populations
    2.4.4 Evidence for inbreeding
    2.4.5 First and second generation hybrids
    2.4.6 Allele frequency differentiation
    2.4.7 Central and Eastern chimpanzees are most closely related in time
    2.4.8 Population separation times
  2.5 Conclusions
  2.6 Materials and Methods
  2.7 Acknowledgments
  2.8 Appendix A: Testing for inbreeding

3 A NEW APPROACH TO ESTIMATE PARAMETERS OF SPECIATION MODELS, WITH APPLICATION TO APES
  3.1 Abstract
  3.2 Introduction
  3.3 Results
    3.3.1 Performance of MIMAR under the allopatric model
    3.3.2 Comparison to IM for the case of no recombination
    3.3.3 Assessing the evidence for gene flow
    3.3.4 Sensitivity to intra-locus recombination rates
    3.3.5 Application to ape data
  3.4 Discussion
    3.4.1 Advantages and limitations of MIMAR
    3.4.2 Analyses of ape polymorphism data
  3.5 Methods
    3.5.1 Model
    3.5.2 Data summaries
    3.5.3 Estimation method
    3.5.4 Simulated data and performance analyses
    3.5.5 Analysis of ape polymorphism data
  3.6 Acknowledgments
  3.7 Appendix A: Supplemental Materials
  3.8 Appendix B: Supplemental Figures

4 CAN WE LEARN ABOUT MODES OF SPECIATION BY COMPUTATIONAL APPROACHES?
  4.1 Abstract
  4.2 Introduction
  4.3 Method
    4.3.1 The isolation model and violations
    4.3.2 The isolation-migration model and violations
    4.3.3 Estimating the parameters of the isolation-migration model
    4.3.4 Goodness-of-fit test
  4.4 Results
    4.4.1 Performance of MIMAR and IM under the isolation and isolation-migration models
    4.4.2 Effect of violations of the isolation model (Figure 4.1a)
    4.4.3 Effect of violations of the isolation-migration model (Figure 4.1b): Parapatry with gene flow only at an early stage
    4.4.4 Detecting loci with unusual history
  4.5 Discussion
  4.6 Acknowledgments
  4.7 Appendix A: Supplemental Materials
  4.8 Appendix B: Supplemental Tables
  4.9 Appendix C: Supplemental Figure

5 ESTIMATING THE DEMOGRAPHIC PARAMETERS OF HUMAN POPULATIONS
  5.1 Abstract
  5.2 Introduction
  5.3 Methods
    5.3.1 Raw data description
    5.3.2 Data processing
    5.3.3 Analyses
    5.3.4 Goodness-of-fit test
  5.4 Results
    5.4.1 Split between African populations
    5.4.2 Split between African and non-African populations
    5.4.3 Split between non-African populations
  5.5 Conclusions
  5.6 Acknowledgments
  5.7 Appendix A: Supplemental Figures

6 CONCLUSIONS

REFERENCES

LIST OF FIGURES

2.1 STRUCTURE analysis, blinded to population labels, recapitulates the reported population structure of the chimpanzees.
2.2 PCA, without using population labels, divides the 84 chimpanzees into four apparently discontinuous populations of Western, Central, Eastern, and bonobo.
2.3 The significant fourth eigenvector from the analysis of all 84 chimpanzees is correlated to the first eigenvector from analysis of Western chimpanzees only.
2.4 Goodness-of-fit of the SMM.
2.5 Distribution of mean heterozygosity within individuals (Hw, orange).
2.6 Distribution of mean squared difference in repeat units within individuals (Rw, orange).

3.1 The “isolation-migration” model.
3.2 Performance of MIMAR (x-axis) and IM (y-axis).
3.3 Performance of MIMAR in the presence of gene flow.
3.4 Sensitivity of MIMAR to intra-locus recombination.
3.5 Smoothed marginal posterior distributions estimated by MIMAR from bonobo and common chimpanzee polymorphism data.
3.6 Smoothed marginal posterior distributions estimated by MIMAR from the common chimpanzee subpopulations polymorphism data.
3.7 Smoothed marginal posterior distributions estimated by MIMAR from the gorilla (a) and orangutan (b) subspecies polymorphism data.
3.8 Smoothed posterior distributions estimated by MIMAR from simulated data sets.
3.9 Smoothed posterior distributions estimated by IM (black) and MIMAR (grey) from simulated data sets.
3.10 Goodness-of-fit of the isolation-migration model for the ape species and subspecies data.

4.1 Simple models of speciation.
4.2 Estimates provided by MIMAR and IM from data simulated under the isolation (Figure 4.1a) and isolation-migration models (Figure 4.1b).
4.3 Estimates provided by MIMAR (a and c) and IM (b and d) from data simulated under models of isolation from a structured ancestral population (Figure 4.1c).
4.4 Estimates provided by MIMAR (a and c) and IM (b and d) from data simulated under models of divergence in isolation followed by secondary contact (Figure 4.1d).
4.5 Estimates provided by MIMAR (a and c) and IM (b and d) when data are simulated under models of isolation with migration only at an early stage (Figure 4.1e).
4.6 Estimates provided by MIMAR when the data at one locus with a different history are simulated with models i (blue) and ii (red).

5.1 Raw data description.
5.2 Cartoon summarizing pairwise models estimated by MIMAR for six human populations.
5.3 Estimated model from the Biaka Pygmies and Mandenka population data.
5.4 Estimated model from the Biaka Pygmies and Tsumkwe San population data.
5.5 Estimated model from the Mandenka and Tsumkwe San population data.
5.6 Estimated model from the French Basque and Biaka Pygmies population data.
5.7 Estimated model from the Han Chinese and Biaka Pygmies population data.
5.8 Estimated model from the Nan Melanesian and Biaka Pygmies population data.
5.9 Estimated model from the French Basque and Mandenka population data.
5.10 Estimated model from the Han Chinese and Mandenka population data.
5.11 Estimated model from the Nan Melanesian and Mandenka population data.
5.12 Estimated model from the French Basque and Tsumkwe San population data.
5.13 Estimated model from the Han Chinese and Tsumkwe San population data.
5.14 Estimated model from the Nan Melanesian and Tsumkwe San population data.
5.15 Estimated model from the French Basque and Han Chinese population data.
5.16 Estimated model from the French Basque and Nan Melanesian population data.
5.17 Estimated model from the Han Chinese and Nan Melanesian population data.
5.18 Isolation-migration models estimated by MIMAR with a poor fit to the human population data.

LIST OF TABLES

2.1 Details of the 84 samples in this study.
2.2 Individuals with >5% ancestry from more than one cluster.
2.3 Information on the top 30 microsatellites used for this study ranked by informativeness.
2.4 Genetic differentiation among populations.
2.5 Eastern and Central chimpanzees are phylogenetically most closely related.
2.6 Estimates of divergence time from ASD.
2.7 Mean heterozygosity within individuals from each population.
2.8 Mean squared difference in repeat units within individuals in each population.

3.1 Performance of MIMAR and IM.
3.2 Performance of MIMAR when detecting gene flow.
3.3 Results for chimpanzee species.
3.4 Results for chimpanzee subspecies.
3.5 Results for gorilla and orangutan subspecies.

4.1 Proportion of analyses in which non-zero gene flow was detected.
4.2 Results of the locus-specific goodness-of-fit tests.
4.3 Proportion of MIMAR and IM analyses with parameter estimates within two fold of their true value when the data are simulated under the isolation and isolation-migration models (Figure 4.1a−b in main text).
4.4 Proportion of MIMAR analyses with parameter estimates within two fold of their true value when data are simulated under models of isolation from a structured ancestral population (Figure 4.1c in main text).
4.5 Proportion of IM analyses with parameter estimates within two fold of their true value when data are simulated under models of isolation from a structured ancestral population (Figure 4.1c in main text).
4.6 Proportion of MIMAR analyses with parameter estimates within two fold of their true value when data are simulated under models of isolation followed by secondary contact (Figure 4.1d in main text).
4.7 Proportion of IM analyses with parameter estimates within two fold of their true value when data are simulated under models of isolation followed by secondary contact (Figure 4.1d in main text).
4.8 Proportion of MIMAR analyses with parameter estimates within two fold of their true value when data are simulated under models of isolation with migration only at an early stage (Figure 4.1e in main text).
4.9 Proportion of IM analyses with parameter estimates within two fold of their true value when data are simulated under models of isolation with migration only at an early stage (Figure 4.1e in main text).
4.10 Proportion of MIMAR analyses with parameter estimates within two fold of their true value when data at a locus with an unusual history are simulated under models i and ii.
4.11 Results of the best locus-specific goodness-of-fit tests.

5.1 Estimates of the descendant effective population sizes.
5.2 Estimates of the ancestral effective population sizes.
5.3 Estimates of the split time (lower half) and gene flow rate (upper half) for each population pair.
5.4 Estimates from joint posterior distributions for the African populations.
5.5 Estimates from joint posterior distributions between the Biaka Pygmies and non-African populations.
5.6 Estimates from joint posterior distributions between the Mandenka and non-African populations.
5.7 Estimates from joint posterior distributions between the Tsumkwe San and non-African populations.
5.8 Estimates from joint posterior distributions for the non-African populations.

CHAPTER 1

INTRODUCTION

The biological species concept, “groups of interbreeding natural populations that

are reproductively isolated from other such groups”, introduced by Mayr (1963) is

now widely accepted. However, this definition is not always applicable, and

controversy remains about whether certain groups of individuals are species,

subspecies or just different populations of the same species. In particular, when

groups of individuals are geographically isolated, it is not always clear whether they

fit the definition of species (Price, 2007), i.e., whether they are reproductively iso-

lated and thus would not merge into a homogeneous group upon secondary contact

(Coyne and Orr, 2004d). Whether they can interbreed successfully can sometimes

be evaluated if there is a recent hybrid zone between the two species. For instance,

human disturbance may bring into contact what turn out to be “true” species. Since

reproductive isolation will lead to reduced hybrid fitness, the hybrid zone thus created

should remain stable over time or quickly disappear if reinforcement increases

the reproductive isolation (Coyne and Orr, 1989). In other cases, human disturbance

can bring into contact groups in the process of speciating and not fully reproductively

isolated, thus threatening biodiversity if the genetic differentiation becomes lost upon

the merging of the nascent species. Laboratory crosses can also be performed to

measure the fitness of the hybrids between species and study the processes of repro-

ductive isolation. Although molecular research has yielded important insight on the

processes of speciation (Orr, 2005), the reproductive isolation of some species may be

impossible to study in laboratory conditions. Notable examples are cases of ecological

speciation, when the ecology cannot be modeled in the laboratory (Coyne and Orr,

2004b) or cases when research is prevented because of ethical and practical reasons,

as in the case of the great apes.

The great apes are an example of controversial species and subspecies. They are

classified into five accepted species: human, gorilla, orangutan, as well as two species

of chimpanzee: the common chimpanzee (Pan troglodytes) and its sister species,

the pygmy chimpanzee or bonobo (P. paniscus), which occur on either side of the

Congo River. This body of water has never experienced a major drought since its

formation 1.5−3.5 million years ago (Mya) (Beadle, 1981; Myers Thompson, 2003),

so that − given that apes cannot swim − it represents a complete barrier to gene

flow. But should chimpanzee and bonobo be considered different species? There

have been reports of hybrids born in captivity and there is no clear evidence that the

hybrids have reduced fitness (de Waal, 1997), as would be expected if they were true

species. In addition, the non-human great ape species are often subdivided further

into groups that are defined, depending on the classification, as species, subspecies

or populations. For example, the common chimpanzee species is usually described

with three subspecies labels: the Western (P. t. verus), Central (P. t. troglodytes),

and Eastern (P. t. schweinfurthii) chimpanzee (Hill, 1969). Until my and others'

contributions to the field (Becquet et al., 2007), this classification was mostly based

on their separation by main bodies of water and other geographical barriers, with

little evidence from morphological, behavioral or even genetic studies. This question

of what constitutes a species as opposed to a locally adapted population is not only

semantic but is central to conservation biology and the attempt to focus conservation

efforts on endangered species such as the chimpanzees and all other non-human great

apes.

Linked to the question of whether some groups of individuals are true species are

unresolved and highly debated questions in evolutionary biology, such as how many

loci are involved in the emergence of reproductive isolation between nascent species

and whether speciation can be initiated in the presence of gene flow (Wu, 2001b; Mayr,

2001; Orr, 2001; Wu, 2001a). Until recently, allopatric speciation was widely accepted

as the only or at least most common mode of speciation. Allopatry assumes that

divergence is initiated by and requires a phase of total geographical isolation between

the emerging species. As there is no gene flow during the early stage of speciation

under this model, the genomes are assumed to diverge homogeneously and thus many

genes may become involved in reproductive isolation as a by-product of divergence.

In contrast, the recently proposed “genic view of speciation” assumes that only a

few genes are required for species to become reproductively isolated (Wu, 2001b), as

suggested by the functions of the examples of reproductive isolation factors that have

been characterized to date (Wu, 2001a; Orr et al., 2004). This view allows for a pair of

species to experience some gene flow at the early stage of their divergence, as in the

parapatric model of speciation, in which two species diverge while occupying adjacent

geographical areas. In this model of species formation, natural selection is the force

that actively leads to the accumulation of reproductive isolation factors, a hypothesis

that has recently received support from the observation that the evolution of the few

characterized reproductive isolation factors was driven by positive Darwinian selection

(e.g., Orr, 2005).

Whether the early stages of speciation can occur with gene flow remains contro-

versial. Few clear-cut and undisputed cases of parapatric speciation have been docu-

mented, in part because when gene flow is detected, it is difficult to rule out a model

of allopatry followed by secondary contact (Coyne and Orr, 2004a). A large portion of

my PhD research focused on developing and using computational approaches to help

understand speciation mechanisms in an attempt to answer some of these enduring

questions in evolutionary biology.

When I started my PhD, I was interested in deciphering the population genetics

and demographic history of ape species and populations (Chapters 2 and 3). At

the time, the largest data set available for common chimpanzee was composed of

about 50 short loci sampled for genetic variation in fewer than 20 individuals of the

three commonly defined subspecies of chimpanzee (Yu et al., 2003). In contrast, the

data available for human at the time included hundreds of highly variable markers

genotyped in a thousand individuals from 52 populations (Rosenberg et al., 2002).

Thus, while the extent of population differentiation and structure was starting to

be well characterized in humans, the same was far from true in our closest living

evolutionary relatives, the chimpanzees.

The second Chapter of this dissertation is an article published in PLoS Genetics

(Becquet et al., 2007), which describes the largest data set collected to date in chim-

panzees: Approximately 300 microsatellite loci genotyped in over 80 chimpanzees and

bonobos. We collected the data specifically to inquire about the Genetic Structure of

Chimpanzee Populations. This work was supervised by Molly Przeworski and David

Reich and was done in collaboration with Nick Patterson and Anne Stone. I per-

formed most of the data analyses and wrote sections of the manuscript describing the

methods and results. Specifically, I applied the program STRUCTURE (Pritchard

et al., 2000) to these data and showed that the subspecies labels of chimpanzee cor-

respond to well defined genetic populations, with little evidence of recent admixture

between them in the wild. I also used simple statistical approaches to gain insights

about the demographic history of the chimpanzee populations and to estimate the

divergence times between the chimpanzee subspecies and species. The data suggested

that the Eastern and Central populations are more closely related than they are to

Western chimpanzee. However, these results are only qualitative because microsatel-

lite data provide unreliable estimates of demographic parameters such as divergence

times and effective population sizes owing to their complex, partly unknown − and

thus difficult to model − mutational mechanisms.

Demographic parameters are more reliably estimated when computational ap-

proaches are applied to nucleotide polymorphism data sampled at multiple, independ-

ently-evolving and ideally neutral loci. At the time I started my PhD, several such

computational methods had been developed to estimate the parameters of simple di-

vergence models in an attempt to learn about speciation mechanisms (Wakeley and

Hey, 1997; Kliman et al., 2000; Nielsen and Wakeley, 2001; Hey and Nielsen, 2004;

Leman et al., 2005; Putnam et al., 2007). These methods used genetic variation across

loci, usually polymorphism data sampled in a pair of closely related populations or

recently diverged species, and fit an isolation model (a simplification of allopatry, e.g.,

Wakeley and Hey, 1997) or an isolation-migration model (a simplification of parap-

atry, e.g., Hey and Nielsen, 2004) to the data. However, all the available methods

had important drawbacks and limitations, ranging from their use of a single locus to

ignoring gene flow between populations since their split and/or not allowing for intra-

locus recombination, all of which can lead to biased estimates of the demographic

parameters of interest (Takahata and Satta, 2002; Hey and Nielsen, 2004).
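To make the isolation model concrete, the sketch below computes the expected time for two sampled lineages to find a common ancestor when two populations split from a panmictic ancestor and then exchange no migrants. It relies on the standard continuous-time coalescent approximation for diploid populations and is not taken from any of the methods cited above. The class name, the function names and the numerical values are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class IsolationModel:
    """Hypothetical container for the parameters of a two-population isolation model."""
    n1: float       # effective size of descendant population 1 (diploid individuals)
    n2: float       # effective size of descendant population 2
    n_anc: float    # effective size of the ancestral population
    t_split: float  # split time, in generations before present


def expected_coalescence_between(model):
    # Without gene flow, lineages sampled from different populations cannot
    # coalesce more recently than the split; once in the ancestral population
    # they wait 2 * n_anc generations on average.
    return model.t_split + 2.0 * model.n_anc


def expected_coalescence_within(model, n_desc):
    # Probability that two lineages from the same descendant population have
    # not coalesced by the time they reach the split (going backward in time).
    p_survive = math.exp(-model.t_split / (2.0 * n_desc))
    # Expected time spent in the descendant population (2 * n_desc * (1 - p)
    # - t_split * p) plus, with probability p, t_split + 2 * n_anc; the
    # t_split terms cancel.
    return 2.0 * n_desc * (1.0 - p_survive) + 2.0 * model.n_anc * p_survive


# Illustrative values only; these are not estimates from this dissertation.
model = IsolationModel(n1=10_000, n2=20_000, n_anc=10_000, t_split=40_000)
print(expected_coalescence_between(model))
print(expected_coalescence_within(model, model.n1))
print(expected_coalescence_within(model, model.n2))
```

Comparing the within- and between-population expectations shows why polymorphism data carry information about the split time and the ancestral size. Adding gene flow lets between-population lineages coalesce more recently than the split, which is what the isolation-migration model is meant to capture.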

Chapter 3 is an article published in Genome Research (Becquet and Przeworski,

2007), which describes a new approach to estimate parameters of speciation models,

with application to apes. The purpose of this study was to develop a computational

method that would not have the limitations of previous approaches. I developed

the program MIMAR (Markov Chain Monte Carlo estimation of the Isolation-Migration

model Allowing for Recombination) to estimate the parameters of the

isolation-migration model (Hey and Nielsen, 2004): the effective population sizes

of two recently diverged populations and of their common ancestor, their divergence

time and the constant rate of gene flow since they split. MIMAR considers summaries

of polymorphism data sampled in two populations adapted from statistics known

to contain information about the parameters of interest (Wakeley and Hey, 1997).

Importantly, in contrast to other approaches that can also use data from multiple

independently-evolving loci, MIMAR allows for intra-locus recombination.
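As a hedged illustration of the kind of summaries involved, the sketch below sorts the sites of one locus into the four categories of Wakeley and Hey (1997): polymorphisms private to each sample, polymorphisms shared by both samples, and fixed differences. The function names and the input format (haplotypes written as strings of 0/1 alleles) are assumptions made for illustration only. This is not MIMAR's code, and the statistics MIMAR actually uses are adapted versions described in Becquet and Przeworski (2007).

```python
from collections import Counter


def classify_site(pop1_alleles, pop2_alleles):
    """Assign one biallelic site to a Wakeley-Hey-style category."""
    set1, set2 = set(pop1_alleles), set(pop2_alleles)
    polymorphic1, polymorphic2 = len(set1) > 1, len(set2) > 1
    if polymorphic1 and polymorphic2:
        return "shared"        # segregating in both samples
    if polymorphic1:
        return "private_pop1"  # segregating only in sample 1
    if polymorphic2:
        return "private_pop2"  # segregating only in sample 2
    if set1 != set2:
        return "fixed"         # each sample fixed for a different allele
    return "monomorphic"


def summarize_locus(pop1_haplotypes, pop2_haplotypes):
    """Count sites per category; each haplotype is a string of alleles."""
    counts = Counter()
    # zip(*haplotypes) walks the alignment column by column (site by site).
    for column1, column2 in zip(zip(*pop1_haplotypes), zip(*pop2_haplotypes)):
        counts[classify_site(column1, column2)] += 1
    return counts
```

Calling `summarize_locus` once per locus gives one set of counts per locus, which is the level at which multi-locus methods of this kind typically combine information.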

I carried out a simulation study and encouragingly found that when there is no

intra-locus recombination the method performs similarly to IM, an approach that uses

the full polymorphism data from non-recombining loci (Hey and Nielsen, 2004). I also

confirmed the tendency of computational methods to provide biased estimates of the

parameters of interest when intra-locus recombination is ignored. I further illustrated

the potential of MIMAR by applying it to data from the species and subspecies of great

apes. The results suggested that the isolation-migration model provides a reasonable

approximation to the demographic histories of the great ape populations and

species. In accordance with the results of Chapter 2, I found that the Western chimpanzee

diverged approximately 440 thousand years ago (Kya), before the split between Central

and Eastern chimpanzees. However, while the study presented in Chapter 2 did

not provide evidence of recent migration between the chimpanzee subspecies, MIMAR

suggested that these populations have experienced some historical gene flow. I fur-

ther refined previous estimates that bonobo and chimpanzee diverged in allopatry

∼850 Kya (Won and Hey, 2005; Fischer et al., 2006), and that Western and Eastern

gorilla subspecies split ∼90 Kya with subsequent gene flow (Fischer et al., 2006; Thal-

mann et al., 2006). Finally, I provided the first population parameter estimates for

the Bornean and Sumatran orangutan subspecies, which appear to have experienced

gene flow since they split over 250 Kya.

In Chapter 3, I also observe that when computational methods such as MIMAR and

IM are applied to real data, they often provide much larger estimates of the ancestral

effective population size than of the descendant population sizes. Chapter 4 is a

manuscript in preparation for Evolution, entitled: Can we learn about modes of speci-

ation by computational approaches? In this study, I inquired whether the observations

from Chapter 3 may be due to violated assumptions of the model considered by the

methods, specifically the assumptions of panmixia in the ancestral population and

constant gene flow since the split. More generally, I conducted a simulation study to

assess the reliability of MIMAR and IM when a realistic complication is ignored in the

model. I found that when there was population structure in the ancestral population

or gene flow only at the early stage of divergence, IM tends to provide estimates of

the ancestral effective population size that are biased upwards. In turn, when there

is structure in the ancestral population or a phase of isolation followed by a recent

secondary contact, both methods detect some gene flow, potentially lending spurious

support for parapatric speciation. I further introduce a goodness-of-fit test that could

potentially detect when the estimated model is inappropriate, but unfortunately, this

simple test often suggests that the estimated model fits the data even when it is incorrect.

Taken together, these results suggest that the computational approaches

available to date may be of limited use in distinguishing parapatry from allopatry.

However, for cases where a parapatric model of speciation is appropriate, I introduce

a simple locus-specific goodness-of-fit test that can help identify loci linked to candi-

date reproductive isolation factors.

Despite the limitations of available computational approaches, they can still en-

lighten us about the demographic history of recently diverged populations, including

our own species (e.g., Hey, 2005). As an example, in Chapter 5, I present a study

aimed at Estimating the demographic parameters of human populations. This Chapter

is an unpublished analysis, which is the first step of a collaboration with Jeff Wall at

UCSF, who provided the data. In this study, I applied MIMAR to data from six human

populations and found estimates roughly consistent with what was known about hu-

man demography (e.g., Cavalli-Sforza and Feldman, 2003). In particular, I estimate

the divergence time between African and non-African populations at ∼40−60 Kya.

MIMAR also detected evidence of extensive migration between populations, even those that are

geographically distant, suggesting that gene flow should not be ignored in models

of human demographic history. In light of the results from Chapter 4, it is not clear

whether these results reflect ongoing gene flow, range expansion, or recent migration.

The following chapters describe the research projects that compose my PhD. Un-

fortunately, these chapters do not describe the pleasure that I had and the knowledge

that I gained in the completion of these projects. I feel very fortunate to have worked

with remarkable scientists on exciting and relevant questions about the demographic

histories of the great apes, including humans (Chapters 2 and 5). The development of

the program MIMAR represents the bulk of my PhD and is my main contribution to

the scientific community (Chapter 3). Despite the limitations of this − and similar

− methods that I explore in Chapter 4, MIMAR remains the only method available

that estimates parameters of isolation-migration models and allows for intra-locus

recombination. MIMAR might remain useful for some time in fields of study for which

there are no fully sequenced and annotated genomes. It has been especially satisfying

and rewarding to see MIMAR used increasingly to answer new population genetics and

speciation questions and to discuss with users ways to improve the method. I hope

the reader will enjoy this dissertation as much as I took pleasure in the completion

of my PhD.

CHAPTER 2

GENETIC STRUCTURE OF CHIMPANZEE

POPULATIONS

2.1 Abstract

Little is known about the history and population structure of our closest living rela-

tives, the chimpanzees, in part because of an extremely poor fossil record. To address

this, we report the largest genetic study of the chimpanzees to date, examining 310

microsatellites in 84 common chimpanzees and bonobos. We infer three common

chimpanzee populations, which correspond to the previously defined labels of "Western,"

"Central," and "Eastern," and find little evidence of gene flow between them.

There is tentative evidence for structure within Western chimpanzees, but we do

not detect distinct additional populations. The data also provide historical insights,

demonstrating that the Western chimpanzee population diverged first, and that the

Eastern and Central populations are more closely related in time.

2.2 Author summary

Common chimpanzees have been traditionally classified into three populations: West-

ern, Central, and Eastern. While the morphological or behavioral differences are very

small, genetic studies of mitochondrial DNA and the Y chromosome have supported

the geography-based designations. To obtain a crisp picture of chimpanzee popula-

tion structure, we gather far more data than previously available: 310 microsatellite

markers genotyped in 78 common chimpanzees and six bonobos, allowing a high reso-

lution genetic analysis of chimpanzee population structure analogous to recent studies

that have elucidated human structure. We show that the traditional chimpanzee pop-

ulation designations − Western, Central, and Eastern − accurately label groups of

individuals that can be defined from the genetic data without any prior knowledge

about where the samples were collected. The populations appear to be discontinuous,

and we find little evidence for gradients of variation reflecting hybridization among

chimpanzee populations. Regarding chimpanzee history, we demonstrate that Central

and Eastern chimpanzees are more closely related to each other in time than either

is to Western chimpanzees.

2.3 Introduction

Standard taxonomies recognize two species of chimpanzees: bonobos (Pan paniscus)

and common chimpanzees (P. troglodytes), whose current ranges in Africa do not over-

lap. Common chimpanzees have been classified further into three populations or sub-

species based on their separation by geographic barriers (generally rivers): Western

(P. troglodytes verus), Central (P. t. troglodytes), and Eastern (P. t. schweinfurthii)

(Hill, 1969; Groves, 2001). While there are no or only slight morphological or behav-

ioral differences among the common chimpanzees (Albrecht and Miller, 1993; Shea

et al., 1993; Fischer et al., 2006), genetic studies of mitochondrial DNA (mtDNA)

(Morin et al., 1994; D’Andrade and Morin, 1996) and the Y chromosome (Stone

et al., 2002) have supported the geography-based population designations (Morin

et al., 1994; Stone et al., 2002), and mtDNA studies have led to the proposal of a

fourth common chimpanzee subspecies, P. t. vellorosus, around the Sanaga river in

Cameroon (Gonder et al., 1997, 2006). However, studies of single loci provide at best

partial information about history and population subdivision (Hudson and Coyne,

2002); for example, analyses of X and Y chromosome datasets (Kaessmann et al.,

1999) suggest that genetic diversity is highest in Central and lowest in Western chim-

panzees, while mtDNA suggests a different pattern (Stone et al., 2002). Resequenc-

ing and microsatellite-based datasets have also provided inconsistent evidence about

whether Eastern chimpanzees are more diverse than bonobos (Fischer et al., 2006;

Reinartz et al., 2000). To obtain a clear picture of chimpanzee population structure,

a large number of independently-evolving regions should be studied simultaneously.

The most comprehensive study of chimpanzees to date − including multiple loci

and samples from Western, Central, and Eastern chimpanzees and bonobos − found

few fixed genetic differences among chimpanzee populations and estimated autosomal

F_ST values between populations of 0.09−0.32, overlapping the range of differentiation

seen in humans. Fischer et al. (2006) argued from these results that there are no

chimpanzee subspecies and suggested instead that chimpanzee variation might be

characterized by continuous gradients of gene frequencies, with ongoing gene flow

across groups. This and the other multi-locus datasets that have been published to

date (Yu et al., 2003; Reinartz et al., 2000; Fischer et al., 2004) are small compared

with recent genetic assessments of human structure (Rosenberg et al., 2002), however,

and have not yet provided a clear picture. For example, mtDNA and Y chromosome

data have been interpreted as showing discontinuity among chimpanzee populations

(Morin et al., 1994; D’Andrade and Morin, 1996; Stone et al., 2002; Gonder et al.,

1997, 2006), potentially at odds with the model proposed by Fischer et al. (2006).

An accurate picture of chimpanzee population structure is also crucial for under-

standing their history. For example, Won and Hey (2005) estimated that common

chimpanzees and bonobos split ∼0.9 million years ago (Mya), and Western and Cen-

tral chimpanzees split ∼0.42 Mya, with low levels of migration from Western to Cen-

tral since that time. This analysis, which assumed that the populations split from a

common ancestor, would need to be reevaluated if the data were better described by

a model of stable isolation-by-distance (Fischer et al., 2006).

To clarify chimpanzee population structure, we gathered an order-of-magnitude

larger dataset than has previously been available. This allowed us to test whether

genetic data alone can be used to assign chimpanzees to the categories of Western,

Central, and Eastern chimpanzees, whether there is evidence for substantial admix-

ture between groups, and whether there is unrecognized substructure among the

chimpanzees (Gagneux, 2002).

2.4 Results/Discussion

We analyzed data from 310 polymorphic microsatellites in 84 individuals: 78 com-

mon chimpanzees and six bonobos. These samples were chosen to include multiple

representatives of each putative population. Of the common chimpanzees, 41 were

reported as Western, 16 as Central, seven as Eastern, three as hybrids, and 11 did not

have a reported subpopulation (Table 2.1). This dataset was designed to include a

similar number of genetic markers (and in fact included many of the same markers) as

the dataset analyzed by Rosenberg et al. (2002) to elucidate human population struc-

ture. Because of high mutation rates, microsatellite alleles often have arisen multiple

times, and hence it is difficult to resolve the genealogy at any locus. A benefit of the

high mutation rate, however, is that microsatellites provide more information about

recent historical events per locus compared with resequencing data (Rosenberg et al.,

2003).

Table 2.1: Details of the 84 samples in this study. F, female; M, male; IPBIR, Integrated Primate Biomaterials and Information Resource; ISIS, International Species Identification System; Pt, P. troglodytes.

ID | Other Identifier(s) | Sex | Sample Source | Reported Category | After Genetic Analysis | Reported Birthplace | Classification Based on mtDNA/Y Chromosome Genotype
1 | Amelie | f | Leipzig | Central | Central | Haut-Ogooue |
2 | Chiquita | f | Leipzig | Central | Central | Haut-Ogooue |
3 | Berthe | f | Leipzig | Central | Central | Captive born |
4 | Bakoumba | m | Leipzig | Central | Central | Haut-Ogooue | Y, Central
5 | Noemie | f | Leipzig | Central | Central | Estuaire |
6 | Clara | f | Leipzig | Central | Central | Gabon |
7 | Minkebe | m | Leipzig | Central | Central | Captive born |
8 | Masuku | f | Leipzig | Central | Central | Captive born |
9 | Gemini | f | Leipzig | Central | Central | Estuaire |
10 | Henri | m | Leipzig | Central | Central | Nyanga | Y, Central
11 | Ivindo | m | Leipzig | Central | Central | Ogooue-Ivindo | Y, Central
12 | Moanda | m | Leipzig | Central | Central | Haut-Ogooue | Y, Central
13 | Lalala | f | Leipzig | Central | Central | Estuaire |
14 | Makata | m | Leipzig | Central | Central | Haut-Ogooue/Ogooue-Ivindo | Y, Central
15 | Makokou | f | Leipzig | Central | Central | Captive born |
16 | Pt 197, stud number 277, IPBIR 496 | m | Arizona | Central | Central | Wild caught, origin unknown | mtDNA, Central
17 | Akila | f | Leipzig | Eastern | Mostly or all Eastern | Burundi |
18 | Alley | f | Leipzig | Eastern | Eastern | Southeast Congo |
19 | Amizero | f | Leipzig | Eastern | Eastern | Burundi |
20 | Annie | f | Leipzig | Eastern | Eastern | Northeast Congo |
21 | Judy | f | Leipzig | Eastern | Eastern | Southeast Congo |
22 | Mimi | f | Leipzig | Eastern | Eastern | Northeast Congo |
23 | Pt 169, ISIS number 3850 | f | Arizona | Eastern | Western/Eastern | Captive born | mtDNA, Eastern
24 | Annaclara | f | Leipzig | Western | Western | Captive born |
25 | Frits | m | Leipzig | Western | Western | Sierra Leone |
26 | Hilko | m | Leipzig | Western | Western | Captive born |
27 | Lisbeth | f | Leipzig | Western | Western | Sierra Leone |
28 | Louise | f | Leipzig | Western | Western | Captive born |
29 | Marco | m | Leipzig | Western | Western | Sierra Leone |
30 | Oscar | m | Leipzig | Western | Western | Captive born |
31 | Regina | f | Leipzig | Western | Western | Sierra Leone |
32 | Socrates | m | Leipzig | Western | Western | Captive born |
33 | Sonja | f | Leipzig | Western | Western | Sierra Leone |
34 | Yoran | m | Leipzig | Western | Western | Captive born |
35 | Yvonne | f | Leipzig | Western | Western | Sierra Leone |
36 | Pt 81, studbook number 380 | f | Arizona | Western | Western | Sierra Leone | mtDNA, Western; Y, Western
37 | Pt 82, studbook number 341 | m | Arizona | Western | Western | Sierra Leone | mtDNA, Western; Y, Western
38 | Pt 83, studbook number 459 | f | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western
39 | Pt 87, ISIS number 1149 | m | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
40 | Pt 88, ISIS number 1144 | m | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
41 | Pt 90, ISIS number 1339 | m | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
42 | Pt 97, ISIS number 2036 | m | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
43 | Pt 98, ISIS number 2724 | m | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
44 | Pt 99, studbook number 561 | m | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
45 | Pt 100, ISIS number 3000 | m | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
46 | Pt 101, ISIS number 3214 | m | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
47 | Pt 102, ISIS number 1068 | m | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
48 | Pt 103, ISIS number 3340 | m | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
49 | Pt 104, ISIS number 3339 | m | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
50 | Pt 105, ISIS number 2435 | m | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
51 | Pt 106, stud number 430, ISIS 2377 | m | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
52 | Pt 107, stud number 142, ISIS 2474 | f | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western
53 | Pt 112, stud number 314 | m | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
54 | Pt 114, ISIS number 2412 | m | Arizona | Western | Western/Central | Wild caught, origin unknown | mtDNA, Nigerian; Y, Western
55 | Pt 115, ISIS number 2738 | m | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
56 | Pt 117, ISIS number 1641 | m | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
57 | Pt 120, ISIS number 2216 | m | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
58 | Pt 121, ISIS number 2549 | m | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
59 | Pt 122, ISIS number 2417 | m | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
60 | Pt 124, ISIS number 2404 | m | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
61 | Pt 125, ISIS number 2554 | m | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
62 | Pt 126, ISIS number 1818 | m | Arizona | Western | Western | Wild caught, origin unknown | mtDNA, Western; Y, Western
63 | Coriell NA03448 | m | Coriell/IPBIR | Western | Western | Captive born | mtDNA, Western; Y, Western
64 | Coriell NA03450 | m | Coriell/IPBIR | Western | Western | Captive born | mtDNA, Western; Y, Western
65 | Marilyne (Coriell NS03612) | f | Coriell/IPBIR | Unreported | Western/Central | Captive born | mtDNA, Western
66 | Kipper (Coriell NS03629) | m | Coriell/IPBIR | Unreported | Western | Captive born | mtDNA, Western; Y, Western
67 | Gay (Coriell NS03639) | f | Coriell/IPBIR | Unreported | Western | Captive born | mtDNA, Nigerian
68 | Juan (Coriell NS03641) | m | Coriell/IPBIR | Unreported | Western/Central | Captive born | Y, Western
69 | Lizzie (Coriell NS03646) | f | Coriell/IPBIR | Unreported | Mostly or all Western | Captive born | mtDNA, Western
70 | Sheena (Coriell NS03650) | f | Coriell/IPBIR | Unreported | Western | Captive born | mtDNA, Western
71 | Jimoh (Coriell NS03657) | m | Coriell/IPBIR | Unreported | Western | Captive born | mtDNA, Western; Y, Western
72 | Alicia (Coriell NS03659) | f | Coriell/IPBIR | Unreported | Western | Captive born |
73 | Garbo (Coriell NS03660) | f | Coriell/IPBIR | Unreported | Western | Captive born |
74 | Tank (Coriell NS03623) | m | Coriell/IPBIR | Unreported | Western | Captive born | mtDNA, Western; Y, Western
75 | Kasey (Coriell NS03656) | f | Coriell/IPBIR | Unreported | Western | Captive born | mtDNA, Western
76 | Pt 13 | m | Arizona | Hybrid | Mostly or all Central | Captive born | mtDNA, Central; Y, Eastern
77 | Pt 113, stud number 721 | m | Arizona | Hybrid | Western/Central | Captive born | mtDNA, Central; Y, Western
78 | Pt 123, stud number 662 | m | Arizona | Hybrid | Mostly Central | Captive born | mtDNA, Nigerian; Y, Central
79 | Ulindi | f | Leipzig | Bonobo | Bonobo | Captive born |
80 | Yasa | f | Leipzig | Bonobo | Bonobo | Captive born |
81 | IPBIR number 092 | f | Coriell/IPBIR | Bonobo | Bonobo | Captive born |
82 | IPBIR number 251 | m | Coriell/IPBIR | Bonobo | Bonobo | Captive born |
83 | IPBIR number 367 | f | Coriell/IPBIR | Bonobo | Bonobo | Captive born |
84 | IPBIR number 661 | m | Coriell/IPBIR | Bonobo | Bonobo | Captive born |

2.4.1 Cluster analysis

To explore the genetic evidence for subdivision among chimpanzees, we first applied

the program STRUCTURE to the dataset (Materials and Methods; Pritchard et al.,

2000). Each STRUCTURE analysis requires a hypothesized number of populations

and assigns individuals to these populations − without using any pre-assigned popula-

tion labels − in a way that minimizes the amount of Hardy-Weinberg disequilibrium

and linkage disequilibrium across widely separated markers. The analysis strongly

supports the division of the samples of common chimpanzees and bonobos into at

least four discontinuous subpopulations. Although the software does not provide a

formal statistical procedure for choosing the number of clusters, Pritchard et al. (2000)

suggest using the model with the highest likelihood. When we ran the software as-

suming models of one to six clusters (averaging results for three random number seeds

for each model), the likelihood of the data for four clusters was higher than for any

other model. The inferred clusters correspond remarkably well to the reported labels

of Western, Central, Eastern, and bonobo, and also agree well with the assignments

based on mtDNA or Y chromosome haplotypes (Figure 2.1; Table 2.1).
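The model-selection step described above (average the likelihood over seeds for each hypothesized number of clusters, then take the highest) can be sketched as follows. This is a minimal illustration, not part of STRUCTURE: the function name is hypothetical, and the log-likelihood values are invented placeholders chosen only so that the example runs. They are not results from this study.

```python
from statistics import mean


def pick_cluster_count(log_likelihoods):
    """Return (best_k, mean log-likelihood per K) for runs keyed by K."""
    averaged = {k: mean(runs) for k, runs in log_likelihoods.items()}
    best_k = max(averaged, key=averaged.get)
    return best_k, averaged


# Invented log-likelihoods for K = 1..6, three seeds each (NOT from this study).
toy_runs = {
    1: [-9100.0, -9102.0, -9101.0],
    2: [-8700.0, -8705.0, -8698.0],
    3: [-8450.0, -8447.0, -8452.0],
    4: [-8300.0, -8304.0, -8299.0],
    5: [-8320.0, -8340.0, -8310.0],
    6: [-8360.0, -8400.0, -8355.0],
}
best_k, averaged = pick_cluster_count(toy_runs)
```

Averaging over seeds before comparing models guards against a single run that happened to converge to a poor or unusually good mode.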

The multi-locus dataset also provides power to identify individuals with multiple

ancestries and to assess their ancestry proportions. This cannot be done reliably

using studies of single loci such as the Y chromosome or mtDNA, because individuals

can in fact be descendants of multiple ancestral populations without carrying DNA

from some of the populations at the locus being studied. The STRUCTURE analysis

identified nine individuals as having more than 5% genetic ancestry from two clusters

(Table 2.2).

Of the individuals identified by STRUCTURE as likely hybrids, seven were born

in captivity, and just two were wild-caught, consistent with what would be expected

if there were low rates of migration between Central and Western chimpanzees in

the wild (Table 2.2; see also Won and Hey, 2005). Interestingly, individual num-

ber 54, one of two wild-caught individuals identified as a hybrid by this analysis,

has an mtDNA haplotype hypothesized to correspond to P. t. vellorosus (Gonder

et al., 1997). The two captive-born chimpanzees with the putative P. t. vellorosus

haplotype, however, have markedly different estimates of ancestry proportions, and

thus there is no evidence from the STRUCTURE analysis that these individuals form

a distinct population: the population ancestry estimates are 45% Central and 55%

Western for number 54; 84% Central and 16% Western for number 78; and 100%

Western for number 67.
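The admixture criterion used here (more than 5% estimated ancestry from two clusters) can be written as a short check. The sketch below applies it to the three ancestry estimates quoted in the preceding paragraph; the function names and the dictionary layout are assumptions made for illustration, not output or code from STRUCTURE.

```python
def clusters_above_threshold(ancestry, threshold=0.05):
    """Names of clusters contributing more than `threshold` of the ancestry."""
    return sorted(name for name, fraction in ancestry.items() if fraction > threshold)


def is_admixed(ancestry, threshold=0.05):
    """True when at least two clusters each exceed the ancestry threshold."""
    return len(clusters_above_threshold(ancestry, threshold)) >= 2


# Ancestry proportions quoted in the text for the three individuals
# carrying the putative P. t. vellorosus mtDNA haplotype.
estimates = {
    54: {"Central": 0.45, "Western": 0.55},
    78: {"Central": 0.84, "Western": 0.16},
    67: {"Western": 1.00},
}
admixed_ids = [sample for sample, ancestry in estimates.items() if is_admixed(ancestry)]
```

Under this criterion numbers 54 and 78 are flagged and number 67 is not, which is the contrast the paragraph draws.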

We also used STRUCTURE to validate a minimal set of markers that could be

useful for classifying chimpanzees in conservation studies (Table S1, Microsatellites

used for this study, found at doi:10.1371/journal.pgen.0030066.st001). The top

30 markers (ranked by informativeness for assigning individuals to populations; see

Rosenberg et al., 2003) provide excellent power for classification (Table 2.3). Of 75

chimpanzees estimated as having 100% ancestry in one group by all markers, we found

that 71 were classified identically by the top 30 markers (by the criterion that at least

90% of the ancestry is assigned to the same group). Of nine individuals identified as

hybrids with all the markers, six were also detected as hybrids with the reduced set.

In addition to quantitative precision, the microsatellite panel also has a qualitative

advantage over single marker studies in classifying chimpanzee hybrids: mtDNA and

Y chromosome analyses cannot detect first generation female hybrids (Table 2.1) or

reliably classify hybrids of the second or higher generation.
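The agreement between the full marker set and the top 30 markers can be expressed as simple proportions. The sketch below encodes the 90% criterion stated above and recomputes the two rates from the counts given in the text; the function name and the example ancestry dictionaries in the usage note are hypothetical.

```python
def classified_identically(full_panel_group, reduced_panel_ancestry, minimum=0.90):
    """True when the reduced panel assigns at least `minimum` ancestry
    to the group inferred from the full panel."""
    return reduced_panel_ancestry.get(full_panel_group, 0.0) >= minimum


# Counts reported in the text above.
unadmixed_total, unadmixed_matched = 75, 71
hybrids_total, hybrids_detected = 9, 6

unadmixed_rate = unadmixed_matched / unadmixed_total  # about 0.947
hybrid_rate = hybrids_detected / hybrids_total        # about 0.667
```

For instance, a hypothetical individual assigned to Western by all markers and given 93% Western ancestry by the reduced panel would count as classified identically, whereas one given 85% would not.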

Figure 2.1: STRUCTURE analysis, blinded to population labels, recapitulates the reported population structure of the chimpanzees. Individuals 76−78 are reported hybrids. Only two individuals with a >5% proportion of ancestry in more than one inferred cluster are wild born: number 54 and number 17. Red, Central; blue, Eastern; green, Western; yellow, bonobo.

Table 2.2: Individuals with >5% ancestry from more than one cluster. All nine individuals in this table are indicated by STRUCTURE to have >5% ancestry from at least two populations. Of these, two are wild born: number 17 and number 54. PCA confirms the mixed ancestry of six individuals (number 23, number 54, number 65, number 68, number 77, and number 78) (compare Figures 2.1 and 2.2). F, female; M, male.

ID | Sex | Reported Status | STRUCTURE Analysis: W (%) | C (%) | E (%) | PCA (Estimate of Percentage Is Qualitative) | Other Genetic Information from mtDNA and Y Chromosome
17 | F | Eastern | | 9 | 91 | All Eastern |
23 | F | Eastern | 49 | 1 | 50 | ∼50% Eastern, ∼50% Western | mtDNA, Eastern
54 | M | Western | 55 | 45 | | ∼50% Western, ∼50% Central | mtDNA, vellorosus
65 | F | Unknown | 39 | 61 | | ∼50% Western, ∼50% Central | mtDNA, Western
68 | M | Unknown | 74 | 26 | | ∼65% Western, ∼35% Central | Y, Western
69 | F | Unknown | 89 | 11 | | All Western | mtDNA, Western
76 | M | Hybrid | | 81 | 19 | All Central | mtDNA, Central; Y, Eastern
77 | M | Hybrid | 50 | 50 | | ∼50% Western, ∼50% Central | mtDNA, Western; Y, Central
78 | M | Hybrid | 15 | 82 | 3 | ∼15% Western, ∼85% Central | mtDNA, vellorosus; Y, Central

Table 2.3: Information on the top 30 microsatellites used for this study ranked by informativeness. Columns: Name; Location on chromosome; Locus or alias; Rep; Physical Map; Informativeness; # alleles; Allele size range.

2.4.2 Principal components analysis

We next carried out principal components analysis (PCA). This approach has been

shown to have similar power to capture population structure as STRUCTURE, but

also provides a formal way of assigning statistical significance to population subdi-

vision (Patterson et al., 2006a). When the PCA is applied to the chimpanzee data,

the results support four discontinuous populations into which almost all chimpanzees

and bonobos can be classified. The first three principal components (eigenvectors) are

all highly statistically significant (p < 10⁻¹²) and nearly perfectly separate Western,

Central, and Eastern chimpanzees, and bonobos (Figure 2.2). Only six chimpanzees

fall visually outside of the clusters, a subset of the nine identified by STRUCTURE as

having at least 5% genetic contribution from more than one population (Table 2.2).

The fourth eigenvector (p = 0.011) is also significant, and the fifth is not significant

(p = 0.44).

The eigenvectors are strongly associated with the population labels. We used

non-parametric analysis (Kruskal-Wallis tests) to explore whether the values of each sample

along the four significant eigenvectors differed significantly among the four

pre-existing population labels. The overall statistic is highly significant (p < 10⁻¹⁰)

for the first three eigenvectors but not significant for the fourth (p = 0.97), indicating

that this eigenvector is capturing population subdivision that is different from the

traditional Western/Central/Eastern/bonobo designations.
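To make the test above concrete, here is a minimal sketch of the Kruskal-Wallis H statistic, computed from the scores of individuals along one eigenvector and their population labels. The function name and the toy inputs are hypothetical and are not taken from the study. The p-value, which would come from comparing H to a chi-square distribution with (number of groups − 1) degrees of freedom, is not computed here.

```python
from collections import defaultdict


def kruskal_wallis_h(values, labels):
    """Kruskal-Wallis H statistic, with the standard correction for ties.

    values: one score per individual (e.g. position along an eigenvector)
    labels: the population label of each individual
    Assumes that the values are not all identical.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        # Find the block of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1

    rank_sums = defaultdict(float)
    counts = defaultdict(int)
    for r, lab in zip(ranks, labels):
        rank_sums[lab] += r
        counts[lab] += 1

    h = 12.0 / (n * (n + 1)) * sum(
        rank_sums[g] ** 2 / counts[g] for g in counts
    ) - 3 * (n + 1)
    return h / (1 - tie_term / (n ** 3 - n))


# Toy example: three groups that are perfectly separated along the axis.
h_separated = kruskal_wallis_h([1, 2, 3, 4, 5, 6], ["a", "a", "b", "b", "c", "c"])
```

A large H indicates that the groups occupy different ranges of the eigenvector. An H near zero, as would be expected for the fourth eigenvector here, indicates that the labels do not explain the ordering of individuals.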

To explore whether the fourth eigenvector might reflect an as-yet-undefined chim-

panzee population, we carried out analyses separately on the Western chimpanzee

(n = 49), Central chimpanzee (n = 16), Eastern chimpanzee (n = 6), and bonobo

(n = 6) samples (including all individuals that were clearly classified by both PCA

and STRUCTURE). Western chimpanzees are the only population with evidence for

Figure 2.2: PCA, without using population labels, divides the 84 chimpanzees into four apparently discontinuous populations of Western, Central, Eastern, and bonobo. Plots of eigenvectors 1 versus 2, and eigenvectors 2 versus 3, show clustering into populations, with the expected assignments for the 75 individuals identified as all of one ancestry by STRUCTURE (solid circles). The nine individuals identified by STRUCTURE as hybrids (open circles) are for the most part identified as hybrids by PCA as well. There are two individuals (red open circles) reported as being of a particular population but that in fact appear to be hybrids: number 23, reported as Eastern but in fact a Western-Eastern hybrid, and number 54, a wild-born individual reported as Western but in fact a Western-Central hybrid.

internal substructure (p = 5.5 × 10⁻⁵). The first eigenvector obtained when Western

chimpanzees are analyzed by themselves correlates strongly with the fourth eigenvector

in the main analysis (Figure 2.3, r² = 0.92; p < 10⁻¹²), indicating that the fourth

eigenvector describes subdivision within Western chimpanzees.

Although the fourth eigenvector seems to be detecting real structure, it does

not mark out discontinuous subpopulations of Western chimpanzees (Figure 2.3).

The failure to reveal the details of the structure is evident not only in the PCA,

but also in an application of STRUCTURE to the Western chimpanzees only, in

which a model of only one cluster is most likely. There is also no pattern to the

classification of Western chimpanzees even when we consider a model of two clusters

Figure 2.3: The significant fourth eigenvector from the analysis of all 84 chimpanzees is correlated to the first eigenvector from analysis of Western chimpanzees only. Axes: fourth eigenvector from analysis of all 84 samples; first eigenvector from Western-only analysis. Here, we present the correlation (r² = 1) for the 49 individuals that are clearly identified as Western chimpanzees by both STRUCTURE and PCA, demonstrating that these eigenvectors are revealing the same population structure.

(unpublished data). The most likely explanation is that there is not enough data

to assign individuals to different ancestries. Understanding of the fourth eigenvector

in the PCA will require more genetic data and better information about geographic

origin. In particular, we note that the only wild-caught Western samples for which

we had geographic information are from one location (Sierra Leone), thus we could

not perform a test for correlation with geography.

2.4.3 Testing for additional populations

Could there be additional population structure among the chimpanzees that we have

not yet detected (Gagneux, 2002)? A particular concern is that our sample size

is limited, decreasing our power to detect further structure especially among non-

Western chimpanzees.

To place an upper bound on further structure especially among the Central chim-

panzees, we considered the possibility that, among the 16 Central chimpanzees, a

subset is from a different population. We performed PCA on 10 Central/6 Eastern,

11 Central/5 Eastern, 12 Central/4 Eastern, 13 Central/3 Eastern and 14 Central/2

Eastern chimpanzees and assessed what fraction of 1,000 random resamplings of Central

and Eastern chimpanzees showed evidence of structure (p < 0.05). This allowed

us to assess power to detect an additional population as diverged as the Eastern

chimpanzees.

The resampling analysis found that 6, 5, 4, 3, and 2 Eastern chimpanzees could be

detected from amidst the Central chimpanzees with 100%, 100%, 99%, 54%, and 7%

probability, respectively. Since the FST between Central and Eastern chimpanzees

is 0.05−0.09 (Table 2.2 and Fischer et al., 2006), this allowed us to place an upper

bound on the undetected structure that might exist among Central chimpanzees given

that we did not detect further structure. If the three samples with the P. t. vellerosus

mtDNA haplotype in our study constitute members of a distinct population, their

differentiation from Central chimpanzees is likely to be FST ≤ 0.09, lower than those

observed between some pairs of human populations (Cavalli-Sforza et al., 1994). An

important caveat is that we have no power to detect population structure for chim-

panzees missed by our sampling (we also have little power if there are fewer than

three individuals from a population). Thus, a more geographically systematic survey,

including more animals from a denser grid in Africa, may detect further structure.

2.4.4 Evidence for inbreeding

To test for inbreeding among the chimpanzees, we examined whether heterozygosity

within individuals was significantly lower than would be expected from random mat-

ing in the population (Materials and Methods). Western and Central chimpanzees

both show evidence for a reduced number of heterozygous genotypes (p < 0.05; Appendix A;

we had too few Eastern and bonobo samples to perform an informative

test). A caveat is that misscoring of heterozygous genotypes, or the presence of

polymorphisms under the primers used for genotyping, could both result in an arti-

factual excess of homozygotes. To follow up this initial evidence of inbreeding among

chimpanzees, further analyses could search for multimegabase contiguous stretches of

homozygosity (Carothers et al., 2006).

2.4.5 First and second generation hybrids

To test for first and second generation hybrids, we calculated the likelihood of the

data under the hypothesis that an individual is an F1 hybrid, compared with the

alternative hypothesis of an older 50%−50% mixture of the ancestral populations.

To test whether the individual is an F2/backcross (a mixture of an F1 with an

unadmixed individual), we compared the likelihood of this model with that of the

alternative hypothesis of an older 75%−25% mixture of the two ancestral populations

(Materials and Methods).

Of the nine putative hybrids identified by STRUCTURE, the F1 hybrid test

identifies captive-born individual number 23 (an approximately 50%−50% Eastern-

Western hybrid by STRUCTURE analysis) as an F1, with a likelihood ratio (LR) of

∼24,000,000:1. The F2/backcross test identifies the captive-born individual number

68 (a 74%−26% Western-Central hybrid by STRUCTURE analysis) as an F2/backcross,

with an LR of ∼37:1 (the evidence is weaker because the signal of an F2/backcross is

more subtle). There are no other hybrids identified by either the F1 or F2/backcross

test, suggesting that the other animals in Table 2.2 could descend from third gener-

ation or older admixture events, or be members of as-yet unidentified populations.

The F1 test produced a particularly intriguing pattern in number 54, a wild-

caught individual with mtDNA that has been hypothesized to be diagnostic of P.

t. vellorosus origin (Gonder et al., 1997, 2006). Individual number 54 is estimated

to be a 55%−45% Western-Central mixture (Table 2.2) and shows an LR of 7:1 in

favor of being an old mixture, compared with the alternative of a first generation

F1 hybrid. However, a careful examination shows that the pattern of variation at

number 54 fits neither the hypothesis of a first generation hybrid nor that of an older mixture.

To demonstrate this, we simulated 100 different Western-Central F1 hybrids and

100 older Western-Central mixtures by random sampling from the population allele

frequencies. Simulated older mixtures always generated an LR of >100,000:1 relative

to the alternative hypothesis of an F1. Simulated F1 hybrids always gave an LR

<1:2. The LR for individual number 54 of 7:1 falls outside of either expectation.

This individual fits neither model, suggesting ancestry from an as-yet undetermined

population.

2.4.6 Allele frequency differentiation

To estimate the degree of allele frequency differentiation between chimpanzee groups,

we computed the RST statistic, a microsatellite-based estimator of FST (Slatkin,

1995; Schneider et al., 2000). RST assumes the stepwise mutation model (SMM),

Location    Eastern        Central        Bonobo
Western     0.31 (0.32)    0.25 (0.29)    0.68 (0.68)
Eastern     −              0.05 (0.09)    0.57 (0.54)
Central     −              −              0.51 (0.49)

Table 2.4: Genetic differentiation among populations. Pairwise RST (versus FST from Fischer et al., 2006) is reported here. RST (a microsatellite-based estimator of FST, Slatkin, 1995) is calculated using the Arlequin software (Schneider et al., 2000) comparing 49 Western, six Eastern, and 16 Central chimpanzees, and six bonobos. Analysis is restricted to autosomal loci with <5% missing data (leaving 220 markers in all cases). For the 49 Western chimpanzees, we used the 48 individuals identified as Western by STRUCTURE plus individual number 54, which was born in the Western range. All RST values are significantly different from zero (p < 0.002), as determined by 10,000 permutations. The values in parentheses are quoted from the SNP-based study of Fischer et al. (2006). Our study has less sampling error but relies on imperfect assumptions about the microsatellite mutation process, and so is more subject to systematic error. The close agreement between the two studies is encouraging.

in which the number of repeats changes by one or two or more units with an equal

probability of increasing or decreasing. A goodness-of-fit test suggests that this simple

model provides a reasonable match to our data (Figure 2.4). Encouragingly, the

RST estimates of population differentiation, obtained based on assuming this model,

are very similar to estimates of FST from a smaller multi-locus dataset based on

resequencing (Table 2.4 and Fischer et al., 2006).

A particularly intriguing feature of the allele frequency differentiation results is

that the allele frequency differentiation between bonobos and Western chimpanzees is

higher than that between bonobos and Central or Eastern chimpanzees (Table 2.4).

This likely reflects greater genetic drift in the Western lineage since divergence, as

has also been suggested by an analysis of resequencing data (Won and Hey, 2005).

Figure 2.4: Goodness-of-fit of the SMM. Panels show simulated versus observed $E(\sigma_i^2)$ for autosomal (a) tetranucleotides, (b) trinucleotides, and (c) dinucleotides.

Distributions of the expected $\sigma_i^2$ over markers, for data sets observed and simulated under the SMM. The distributions of $E(\sigma_i^2)$ were computed separately for (a) 221 tetra-, (b) 62 tri-, and (c) 11 dinucleotides genotyped in the 49 Western samples (shown in blue). In red are the results of 500 simulations for each class of microsatellites. In all three cases, the observed distribution is not significantly different from the expected distribution, as assessed by a permutation test (see Materials and Methods for more details).

2.4.7 Central and Eastern chimpanzees are most closely related in time

The high frequency differentiation of Western chimpanzees compared with other

groups (Table 2.4) is consistent with them having been the first population to di-

verge, but does not prove it. An alternative explanation for the data is that there has

been a smaller effective population size on the Western lineage since their divergence,

resulting in high genetic drift in this population. We therefore applied a formal test

to assess which pair of populations is most closely related.

We approached the problem by testing whether three unrooted phylogenetic trees

are consistent with the data for chimpanzees: (1) Western-Central and bonobo-

Eastern forming clades, (2) Western-Eastern and bonobo-Central forming clades, and

(3) Eastern-Central and bonobo-Western forming clades. If a tree provides a good de-

scription of the history of the population, then the allele frequency differences between

two populations should only reflect changes since they split. For example, the differ-

ence in allele frequency between Central and Eastern chimpanzees should have arisen

entirely since their divergence from a common ancestral population and so should

be uncorrelated to the allele frequency differences between Western chimpanzees and

bonobos.

To implement this idea, we calculated the difference in frequency within clades

for all alleles and then tested for a correlation across clades. When we carry out this

analysis for the first and second hypothesized trees, a correlation is observed, rejecting

these trees at a significance of p = 0.00025 and p = 0.0027, respectively (Table 2.5).

The hypothesized Central-Eastern/bonobo-Western clade is the only one consistent

with the data (p = 0.37). Thus, our analysis does not find any evidence for gene flow

between Western and Central chimpanzees since their initial split, as has previously

been hypothesized (Won and Hey, 2005). If gene flow did occur, it would have had

to be sufficiently low to fall below the threshold of detection with our present data

size and the test we applied. We conclude that Eastern and Central are more closely

related in time than either population is to Western chimpanzees.
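The logic of this tree test can be sketched in a few lines: take the allele frequency difference within each of the two clades and correlate the two sets of differences across alleles. The function names and the toy frequencies below are hypothetical. The sketch uses a plain Pearson correlation and omits the weighted jackknife over markers that the study used to assess significance.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def clade_difference_correlation(freqs, clade1, clade2):
    """Correlate within-clade allele frequency differences across two clades.

    freqs: dict mapping population name -> list of allele frequencies,
           with alleles in the same order for every population
    clade1, clade2: pairs of population names forming the proposed clades
    """
    d1 = [a - b for a, b in zip(freqs[clade1[0]], freqs[clade1[1]])]
    d2 = [a - b for a, b in zip(freqs[clade2[0]], freqs[clade2[1]])]
    return pearson(d1, d2)


# Toy frequencies built so that Central and Eastern share drift on an
# internal branch; these are illustrative values, not data from the study.
toy = {
    "Western": [0.55, 0.45, 0.45, 0.55],
    "bonobo":  [0.45, 0.55, 0.55, 0.45],
    "Central": [0.75, 0.35, 0.65, 0.25],
    "Eastern": [0.65, 0.25, 0.75, 0.35],
}
r_correct = clade_difference_correlation(toy, ("Central", "Eastern"), ("Western", "bonobo"))
r_wrong = clade_difference_correlation(toy, ("Central", "Western"), ("Eastern", "bonobo"))
```

In this toy case the correct partition gives a correlation of essentially zero, while the incorrect partition gives a clearly positive correlation, because the drift shared by Central and Eastern appears in both differences.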

2.4.8 Population separation times

To estimate the times of genetic divergence among chimpanzees, we used the average

squared distance (ASD) statistic (Goldstein and Pollock, 1997). For microsatellites

evolving under the SMM, the time since the most recent common ancestor

(tMRCA) of two chromosomes is expected to be ASD/(2µ), where µ is the

mutation rate per locus per year, averaged across loci. Because allele lengths change

according to a random walk, the ASD between allele lengths in two chromosomes is

expected to increase linearly with time and is thus expected to act like heterozygosity

in sequence comparisons. By averaging ASD over pairs of chromosomes within and

across populations, we can estimate the average tMRCA within and across popula-

tions.
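The calculation described above can be sketched as follows, assuming each chromosome is represented as a list of repeat counts, one per locus. The function names and the toy repeat counts are hypothetical. For a within-population estimate, one would pass pairs of distinct chromosomes rather than comparing a chromosome with itself.

```python
def average_squared_distance(chroms_a, chroms_b):
    """Mean squared difference in repeat count, over loci and chromosome pairs.

    chroms_a, chroms_b: lists of chromosomes; each chromosome is a list of
    repeat counts, with loci in the same order throughout.
    """
    total, count = 0.0, 0
    for a in chroms_a:
        for b in chroms_b:
            for x, y in zip(a, b):
                total += (x - y) ** 2
                count += 1
    return total / count


def tmrca_years(asd, mu_per_year):
    """Expected tMRCA under the SMM: ASD / (2 * mu)."""
    return asd / (2.0 * mu_per_year)


# Toy data: two chromosomes from each of two populations, two loci each.
pop_a = [[10, 12], [11, 12]]
pop_b = [[13, 12], [12, 14]]
asd = average_squared_distance(pop_a, pop_b)
```

Because ASD grows linearly with time under a symmetric stepwise random walk, dividing by 2µ converts it directly into a time, so the absolute value of the estimate depends entirely on the choice of µ.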

The results confirm that genetic diversity (heterozygosity) is least for Western

chimpanzees and bonobos, higher for Eastern chimpanzees, and highest for Central

chimpanzees, consistent with results obtained from a nucleotide resequencing dataset

(Fischer et al., 2006). To estimate the absolute tMRCA within and across populations,

we used two estimates of the microsatellite mutation rate. The first, µ = 6.57 × 10⁻³

per year, relies on a 7 Mya estimated average tMRCA between humans and

chimpanzees and the observation that two Western chimpanzees are ∼14.8 times less

genetically diverged than humans and chimpanzees (Patterson et al., 2006b). We

also obtained a second estimate, 4.71 × 10⁻⁵ per year, based on estimated rates of

microsatellite mutation in humans, and assuming 15 years per generation (Table 2.6).
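The two calibrations can be checked with a short calculation using only quantities stated in the text and in the caption of Table 2.6: the 7 Mya human-chimpanzee tMRCA, the ∼14.8-fold ratio of divergences, the human microsatellite mutation rate of 7.06 × 10⁻⁴ per generation, and the 15-year generation time. The variable names are illustrative.

```python
# Calibration 1: scale the human-chimpanzee tMRCA by the ratio of divergences.
human_chimp_tmrca_mya = 7.0
divergence_ratio = 14.8  # human-chimp divergence / Western-Western divergence
western_tmrca_mya = human_chimp_tmrca_mya / divergence_ratio  # about 0.47 Mya

# Calibration 2: convert a per-generation mutation rate to a per-year rate.
mu_per_generation = 7.06e-4
years_per_generation = 15
mu_per_year = mu_per_generation / years_per_generation  # about 4.71e-5 per year
```

These values match the 0.47 Mya within-Western entry of Table 2.6 and the second mutation rate estimate of 4.71 × 10⁻⁵ per year.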

The absolute values of the time estimates for these tMRCAs should be treated

with caution because of uncertainty about the microsatellite mutation rate process

and the calibrations used to obtain absolute dates. Nevertheless, the tMRCA esti-

mates are consistent with previous results from smaller resequencing based datasets

(Stone et al., 2002). We note that the Central-Eastern, Central-Western, and Central-

Central tMRCAs are all similar, which appears at first to contradict the claim that

the populations split at different times. However, most of the genetic divergence re-

flects ancestral diversity, which is shared among all the chimpanzees (explaining why

the tMRCA estimates are substantially older than estimates of population splitting

times; Reinartz et al., 2000; Won and Hey, 2005). More refined analyses are needed,

such as the allele frequency correlation analysis presented in Table 2.5, or model-

based approaches, to detect the subtle patterns that indicate the splitting order of

the chimpanzee populations.

Clade 1            Clade 2            Correlation between allele frequency    p-value
                                      differences in each clade               (two-sided)
Central-Western    Eastern-bonobo     0.09                                    0.00025
Eastern-Western    Central-bonobo     0.065                                   0.0027
Central-Eastern    Western-bonobo     -0.013                                  0.37

Table 2.5: Eastern and Central chimpanzees are phylogenetically most closely related. There are three possible unrooted trees relating to the four populations. If the clades into which the trees are partitioned correctly capture the population relationships, the allele frequencies should be uncorrelated when comparing clade 1 and clade 2. We observe significant correlation across clades for all phylogenetic trees other than the one in which Central and Eastern chimpanzees cluster. To correct for the non-independence of microsatellite alleles, we calculated significance by a weighted jackknife analysis removing each marker in turn to generate normally distributed Z-scores; these were then converted to p-values.

Time since the most recent       tMRCA in Mya, assuming 7 Mya for     tMRCA in Mya (using microsatellite
common genetic ancestor          human-chimp genetic divergence       mutation rate estimates from
(tMRCA)                          (calibration time) (a)               humans) (a)
Within-Western                   0.47                                 0.71 (0.62−0.81)
Within-Central                   0.85 (0.75−0.98)                     1.29 (1.15−1.45)
Within-Eastern                   0.73 (0.61−0.86)                     1.09 (0.93−1.28)
Within-bonobo                    0.63 (0.53−0.76)                     0.95 (0.81−1.10)
Central-Eastern                  0.89 (0.77−1.03)                     1.35 (1.20−1.53)
Central/Eastern-Western          0.84 (0.73−0.97)                     1.30 (1.16−1.43)
Central/Eastern/Western-bonobo   1.56 (1.29−1.90)                     2.36 (1.97−2.79)

Table 2.6: Estimates of divergence time from ASD. tMRCA represents the average time to the most recent common ancestor of a pair of chromosomes sampled in the same or different populations. It can be substantially older than the split time, as it also reflects differences accumulated in the ancestral population (90% confidence intervals from 10,000 bootstraps).
(a) An absolute tMRCA for Western chimpanzees is obtained by assuming that the human-chimpanzee tMRCA is ∼7 Mya. We calibrate the tMRCA for Western chimpanzees at 0.47 Mya, since human-chimpanzee sequence divergence is estimated to be 14.8 times higher than Western-Western divergence (Patterson et al., 2006b). An alternative estimate of the absolute dates comes from setting the mutation rate for microsatellites for humans to be 7.06 × 10⁻⁴ per generation and assuming 15 years per generation. This is obtained from direct estimates of mutation rates in humans for tetra-, tri-, and dinucleotides (Zhivotovsky et al., 2003), adjusting for the relative proportions of each type of marker in our dataset: 222 tetranucleotide (including marker D12S297, which was observed to have an unusually high mutation rate), 62 trinucleotide, and 11 dinucleotide.

2.5 Conclusions

We have carried out the largest analysis of chimpanzee genetic variation to date, which

shows that the Western, Central, and Eastern chimpanzee subspecies designations

correspond to clusters of individuals with similar allele frequencies that can be defined

from the genetic data without regard to the population labels. Moreover, we find little

evidence for admixture between groups in the wild. Our analysis also provides the

first formal test showing that the Central and Eastern chimpanzee populations are

more closely related to each other in time than either is to Western chimpanzees.

PCA further suggests population structure within Western chimpanzees.

However, more data (more samples, genetic markers, and information about geographic

origin) would be needed to understand this structure. We find no support

for the proposed fourth population of common chimpanzees, P. t. vellerosus. However,

the failure to detect a distinct population cluster for these individuals could simply

reflect a lack of power. Our analysis does allow us to state that even if P. t. vellerosus

is a distinct population, its level of allele differentiation from either Western, Central,

or Eastern chimpanzees is not likely to exceed FST = 0.09.

We finally emphasize that although we attempted to include chimpanzees from as

wide a range of sites in Africa as possible, the geographic sampling of the chimpanzees

that we studied was likely nonrandom. The fact that our study did not include

chimpanzees from some regions − including where chimpanzees are now extinct −

could create the appearance of too much discontinuity (Kittles and Weiss, 2003).

Future studies including chimpanzees across a denser grid of populations within Africa

could, in principle, identify intermediate populations of chimpanzees and demonstrate

more graded patterns of variation.

2.6 Materials and Methods

Data collection. The samples for this study were assembled from four sources:

DNA collections at the Max Planck Institute (Leipzig, Germany), Anne Stone’s lab-

oratory at Arizona State University (Stone et al., 2002), the Coriell Cell Reposi-

tories (Coriell Institute for Medical Research, Camden, New Jersey, United States.

Available: http://locus.umdnj.edu/primates/species_summ.html. Accessed 20

March 2007), and the Integrated Primate Biomaterial and Information Resource

(Coriell Institute for Medical Research, Camden, New Jersey, United States. Avail-

able: http://www.ipbir.org/ipbir_cgi/tax.cgii?mode=8&id=init&lvl=0. Ac-

cessed 20 March 2007). A total of 91 samples were genotyped, and 84 were included

in analysis after filtering (below). We had information about the approximate geo-

graphic origin of 25 wild-caught chimpanzees (Table 2.1). For most remaining sam-

ples, we had a population designation, and sometimes Y chromosome and mtDNA

genotypes (A. Stone, unpublished data). As far as possible, we attempted to use

pedigree information to remove related individuals from the captive chimpanzees.

We assembled all the DNA samples at a single site (the Broad Institute) and car-

ried out whole-genome DNA amplification (WGA) for all samples to generate a quan-

tity sufficient for analysis (Hosono et al., 2003). The WGA samples were shipped to a

company that specializes in genotyping microsatellite markers for human disease gene

mapping studies (PreventionGenetics, http://www.preventiongenetics.com). The

microsatellite markers all contain tandem repeats of two, three, or four nucleotides

that vary in repeat number across individuals. For example, at a single marker, dif-

ferent individuals might have between three and 11 contiguous repeats of a GATA

sequence. The assays used for genotyping were designed for humans. However, we

hypothesized that many of them would work for chimpanzees because of the ∼98.8%

sequence similarity (The Chimpanzee Sequencing and Analysis Consortium, 2005).

Assays were attempted for 470 microsatellites. Most came from the Marshfield

Screening Set 13 (designed for linkage screens in humans; Ghebranious et al., 2003,

and Mammalian Genotyping Service [Internet], 2007. Marshfield (WI): National

Heart, Lung, and Blood. Available: http://research.marshfieldclinic.org/genetics.

Accessed 20 March 2007) and were supplemented with markers from a

separate mapping study. Genotyping quality was assessed by specialists at Preven-

tionGenetics using standard measures of genotyping performance. A score of one to

four was given to each marker (with one being the best and four the worst). Markers

were scored as > 2 because of uncertainty in allele assignment, or an excessive num-

ber of missing genotypes, or an excess in the numbers of homozygotes or non-integer

alleles (defined as alleles with non-integer length differences compared with frequent

alleles). We used the 310 markers that were designated as of highest or second-highest

quality because the two sets produced indistinguishable inferences on our data (un-

published). For all analyses other than the use of STRUCTURE, we considered only

autosomal or pseudoautosomal markers, since these could be treated uniformly. This

resulted in 295 markers; we also excluded two additional pseudoautosomal markers

for the PCA and F1/F2 hybrid analyses. Genotypes for all markers are available in

Dataset S1 (Raw genotype data in a format appropriate for STRUCTURE analysis,

found at doi:10.1371/journal.pgen.0030066.sd001).

A subset of 84 of the 91 genotyped samples were chosen for further study after

removing two due to a high missing data rate, one due to evidence for contamination

(more than two genotypes at many loci), and four due to evidence of genetic related-

ness: two accidental duplicates, and two apparent first degree relatives. For each pair

of related individuals, we dropped the one with the lower success rate. The duplicate

individuals allowed us to assess genotyping quality. For the two individuals studied

in duplicate, 1.18% of genotyping calls differed, suggesting an error rate per genotype

of ∼0.59% (i.e., we estimate that on average approximately two loci were mistyped

per individual).
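The arithmetic behind this estimate can be written out as a rough check. It assumes that a discordant call between duplicates arises when either of the two replicates is wrong, so the per-genotype error rate is about half the discordance rate. It also assumes that the per-individual count refers to the 310 markers retained for analysis, which the text does not state explicitly.

```python
discordance_rate = 0.0118                   # fraction of calls differing between duplicates
error_per_genotype = discordance_rate / 2   # about 0.59%

n_markers = 310                             # assumption: markers retained for analysis
expected_mistyped_per_individual = error_per_genotype * n_markers  # about 1.8 loci
```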

Data analysis. We applied two complementary methods to characterize popula-

tion structure in chimpanzees. First, we used the software STRUCTURE in the

"admixture" mode, so that individuals were allowed to have ancestry from multiple

populations. We used a model of correlated allele frequencies, a "burn-in" of 100,000

Markov Chain Monte Carlo (MCMC) iterations, and 1,000,000 follow-on MCMC

iterations.

We also analyzed the data using a new implementation of PCA (Patterson et al.,

2006a) available online in the EIGENSOFT software package (Boston: Reich Labora-

tory, Harvard Medical School, Department of Genetics. Available:

http://genepath.med.harvard.edu/~reich/Software.htm. Accessed 20 March 2007). Briefly, the

analysis begins with a rectangular matrix, with the 84 rows corresponding to the

individuals, and the columns corresponding to the alleles (the cells give the number

of copies of each allele for each individual: zero, one, or two). To analyze the data,

we perform a singular value decomposition, a procedure that produces eigenvectors

and eigenvalues. The first eigenvector separates the samples in a way that explains

the largest amount of variability, while the second and subsequent ones explain lesser

amounts of variability. Thus, the first eigenvector distinguishes individuals from the

population that is most differentiated, and each subsequent eigenvector separates the

next most differentiated population. Eigenvalues above a significance cut-off repre-

sent significant population structure for the associated eigenvector (Patterson et al.,

2006a).
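As a rough illustration of the procedure just described, the sketch below extracts the first principal component from a small individuals-by-alleles matrix of allele counts. It is not the EIGENSOFT implementation: it only centers each column, it finds the leading eigenvector by power iteration on the individuals-by-individuals inner-product matrix instead of a full singular value decomposition, and it performs no significance test. The function name and the toy matrix are hypothetical.

```python
import math
import random


def leading_pc(genotypes, iterations=200, seed=0):
    """Scores of each individual on the first principal component.

    genotypes: list of rows (individuals); each row lists the number of
    copies (0, 1, or 2) of each allele carried by that individual.
    The sign of the returned vector is arbitrary.
    """
    n, m = len(genotypes), len(genotypes[0])
    # Center each column (allele) on its mean across individuals.
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    # n x n matrix of inner products between individuals.
    gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in centered]
            for r1 in centered]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    rng = random.Random(seed)
    v = [rng.random() - 0.5 for _ in range(n)]
    for _ in range(iterations):
        w = [sum(g * x for g, x in zip(row, v)) for row in gram]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Toy matrix: individuals 0-1 and individuals 2-3 carry different alleles.
toy_genotypes = [
    [2, 2, 0, 0],
    [2, 1, 0, 0],
    [0, 0, 2, 2],
    [0, 0, 1, 2],
]
scores = leading_pc(toy_genotypes)
```

In the toy example the first two individuals receive scores of one sign and the last two receive scores of the opposite sign, which is the sense in which the first eigenvector separates the most differentiated groups.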

To examine evidence for inbreeding in these samples, we used the output of

STRUCTURE to assign individuals to populations, excluding individuals with >5%

ancestry from more than one population and focusing our study on the population

samples with more than ten individuals (treating captive and wild-caught individuals

separately). Thus, we analyzed 48 Western chimpanzees, including 34 wild-caught

and 14 captive-born individuals. We also analyzed 13 wild-caught Central chimpanzees

(Table 2.1). We considered two statistics: $H_w$, the average heterozygosity

within an individual's two chromosomes, and $R_w$, calculated as $R_w = \frac{1}{M}\sum_{m}(m_1 - m_2)^2$, where

$m_1$ and $m_2$ are the numbers of repeat units of the two alleles at marker $m$ within an individual,

and $M$ is the total number of markers considered. We used the average value of $H_w$

and $R_w$ over individuals in a population to test the hypothesis of random mating and

assessed significance by a permutation test. Specifically, we generated 1,000 samples

of n individuals, randomly assigning to each of them two alleles from the pool, then

counted how often $H_w$ or $R_w$ was as small or smaller than observed. $H_w$ was significantly

lower than expected for all samples considered (at the p < 0.05 level), while

$R_w$ was significantly lower in wild-caught Western chimpanzees (Appendix A).
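The permutation test can be sketched as follows, with each individual represented as a list of allele pairs (repeat counts), one pair per marker. The function names and data layout are hypothetical. The sketch also assumes that alleles are reshuffled among individuals without replacement, separately for each marker, which is one reading of "randomly assigning to each of them two alleles from the pool".

```python
import random


def mean_hw_rw(individuals):
    """Population averages of H_w and R_w.

    individuals: list of individuals; each is a list of (a1, a2) repeat
    counts, one pair per marker.
    """
    hw_total = rw_total = 0.0
    for genotype in individuals:
        n_markers = len(genotype)
        hw_total += sum(1 for a, b in genotype if a != b) / n_markers
        rw_total += sum((a - b) ** 2 for a, b in genotype) / n_markers
    n = len(individuals)
    return hw_total / n, rw_total / n


def permutation_p_values(individuals, n_perm=1000, seed=0):
    """Fraction of permuted samples with H_w (or R_w) <= the observed value."""
    rng = random.Random(seed)
    obs_hw, obs_rw = mean_hw_rw(individuals)
    n = len(individuals)
    n_markers = len(individuals[0])
    # Pool all alleles observed at each marker across individuals.
    pools = [[a for ind in individuals for a in ind[m]] for m in range(n_markers)]
    hits_hw = hits_rw = 0
    for _ in range(n_perm):
        shuffled = []
        for pool in pools:
            alleles = pool[:]
            rng.shuffle(alleles)
            shuffled.append(alleles)
        # Deal the shuffled alleles back out, two per individual per marker.
        sample = [[(shuffled[m][2 * i], shuffled[m][2 * i + 1])
                   for m in range(n_markers)] for i in range(n)]
        hw, rw = mean_hw_rw(sample)
        hits_hw += hw <= obs_hw
        hits_rw += rw <= obs_rw
    return hits_hw / n_perm, hits_rw / n_perm
```

A small p-value for $H_w$ means that random pairing of the pooled alleles rarely produces as few heterozygous genotypes as observed, which is the signature of inbreeding tested here.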

For other analyses in which we were interested in studying only individuals iden-

tified unambiguously as being from one population, we excluded the captive-born

individuals defined as putative hybrids by STRUCTURE.

Application of the stepwise mutation model. We examined whether our data

fit the SMM in the 49 individuals that are identified as Western chimpanzees by both

STRUCTURE and PCA. We focused on Western chimpanzees because they have the

largest sample size and hence provide us with the most power to detect a departure

from the SMM. Under the one-step SMM, $\sigma_i^2$, the variance of an allele with i repeats,

is an estimator of 2Nµ, the population mutation parameter (Valdes et al., 1993). For

each marker, we calculated $E(\sigma_i^2)$, the expectation over all alleles of a given marker,

thus obtaining an estimate of 2Nµ for the 221 tetra-, 62 tri-, and 11 dinucleotides

included in other analyses. Averaging this estimate over all markers of the same type

and dividing by N = 10,000 (Won and Hey, 2005), we obtain an estimate of µ for

each type of marker, µ = 3.77 × 10⁻³, 1.91 × 10⁻³, and 2.17 × 10⁻³, respectively.

These estimates are roughly similar to independent estimates (Zhivotovsky et al.,

2003) based on microsatellites in humans: 6.40 × 10⁻⁴, 7.10 × 10⁻⁴, and 1.51 × 10⁻³,

respectively.

To assess the goodness-of-fit of the SMM, we compared the observed distribution

of σ_i^2 to the expected distribution, obtained by using the coalescent simulator SIMCOAL2

(Excoffier et al., 2000). We generated 500 independent replicates for each

type of marker under a standard neutral equilibrium model, with an effective population

size of 10,000 (Won and Hey, 2005), a sample size of 98 chromosomes, and

a mutation rate per generation set to µ. The range constraints for the number of

repeat units were set to be equal to the maximum repeat observed in the sample for

each type of marker. We tested whether there was a significant difference in the dis-

tributions by an asymptotic Wilcoxon rank sum test, carrying out the test separately

for each type of marker. The observed distributions do not differ significantly from

the expectation under the SMM (Figure 2.4).
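For readers unfamiliar with the test, an asymptotic Wilcoxon rank sum test can be sketched in a few lines of Python. This is a generic textbook version (normal approximation, midranks for ties, no continuity correction), not necessarily the exact variant used here; the function name is hypothetical, and `x` and `y` would hold the observed and simulated per-marker variances.

```python
import math


def rank_sum_test(x, y):
    """Two-sided asymptotic Wilcoxon rank sum test; returns (z, p_value)."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # average of ranks i+1 .. j for tied values
        in_x = sum(1 for k in range(i, j) if pooled[k][1] == 0)
        rank_sum_x += midrank * in_x
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    expected = n1 * (n + 1) / 2
    # Variance of the rank sum under the null, corrected for ties.
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return 0.0, 1.0  # all values identical: no evidence of a difference
    z = (rank_sum_x - expected) / math.sqrt(var)
    return z, math.erfc(abs(z) / math.sqrt(2))
```

A p-value above the chosen significance level, as reported here for each type of marker, means that the test does not reject the hypothesis that the two distributions are the same.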

Tests for F1 and F2/backcross hybrids. We calculated a log Bayes factor to

test the hypothesis that a chimpanzee is an F1 hybrid of two known populations

versus the alternative that it is a 50%−50% mixture (i.e., it is an older hybrid).

For a given autosomal marker, one can compute a log-factor under the assumption

that the allele frequencies are known; these log-factors can then be summed across

all markers. In practice, our population samples are small, and so allele frequency

estimates are imprecise. To account for uncertainty in the allele frequencies, we used

a hierarchical Bayesian model for the unknown frequencies, with a Dirichlet prior

distribution for the frequencies (the details of this calculation are similar to those

described by Lockwood and Roeder, 2001). We verified the performance of the test

by simulation (see text).
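The known-frequency version of this calculation can be sketched as follows. The sketch rests on our reading of the two hypotheses, which is an assumption: under the F1 hypothesis one allele is drawn from each population, whereas under the 50%−50% mixture both alleles are drawn independently from the average of the two populations' allele frequencies. All function names are hypothetical, allele frequencies are assumed to be strictly positive, and the hierarchical Dirichlet treatment of frequency uncertainty is not included.

```python
import math


def genotype_prob_f1(a, b, p, q):
    """One allele from population 1 (frequencies p), one from population 2 (q)."""
    if a == b:
        return p[a] * q[a]
    return p[a] * q[b] + p[b] * q[a]


def genotype_prob_mixture(a, b, p, q):
    """Both alleles drawn independently from the 50%-50% mixture of p and q."""
    ra = 0.5 * (p[a] + q[a])
    rb = 0.5 * (p[b] + q[b])
    return ra * ra if a == b else 2 * ra * rb


def log_factor(genotypes, freqs1, freqs2):
    """Sum over markers of log P(genotype | F1) - log P(genotype | mixture)."""
    total = 0.0
    for (a, b), p, q in zip(genotypes, freqs1, freqs2):
        total += math.log(genotype_prob_f1(a, b, p, q))
        total -= math.log(genotype_prob_mixture(a, b, p, q))
    return total
```

Positive values favor the F1 hypothesis and negative values favor the older-hybrid hypothesis. Heterozygous genotypes that combine alleles typical of each population contribute most strongly to a positive total.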

2.7 Acknowledgments

We thank the Albuquerque Biological Park, Detroit Zoo, Lincoln Park Zoo, Riverside

Zoo, Sunset Zoo, the Primate Foundation of Arizona, New Iberia Research Cen-

ter, and the Southwest Foundation for Biomedical Research for sharing chimpanzee

samples. We thank Gavin McDonald for sample processing and Jim Weber at Pre-

ventionGenetics for personally ensuring that the genotypes for the chimpanzees were

carefully scored. We thank Jean Wickings, Svante Pääbo, Philip Morin, Anne Fischer,

and four anonymous reviewers, for sharing their samples and/or comments on

earlier versions of the manuscript. We are also grateful to Jennifer Caswell, Graham

Coop, Sridhar Kudaravalli, Daniel Lieberman, Simon Myers, John Novembre, David

Pilbeam, Alkes Price, and Maryellen Ruvolo for useful discussions.

Author contributions. MP and DR conceived and designed the experiments.

CB, MP, and DR performed the experiments and wrote the paper. CB, NP, MP, and

DR analyzed the data. CB, NP, ACS, MP, and DR contributed reagents/materials/analysis

tools. MP and DR co-supervised the work.

Funding. This work was supported by a National Institutes of Health K-01 ca-

reer transition award to NP, an Alfred P. Sloan Fellowship to MP, and a Burroughs

Wellcome Career Development Award to DR. Chimpanzee-sampling and subspecies

identification analyses were supported by a grant from the National Science Founda-

tion (BCS-0073871) to AS.

2.8 Appendix A: Testing for inbreeding

           Wild-caught        Captive-born
Western    0.606 (< 10^-3)    0.611 (< 10^-3)
Central    0.697 (< 10^-3)    −

Table 2.7: Mean heterozygosity within individuals from each population. The samples are 34 wild-caught Western chimpanzees, 14 captive-born Western chimpanzees and 13 wild-caught Central chimpanzees. The p-value for being lower than expected, calculated from 1,000 permutations, is indicated in parentheses (see Methods).

           Wild-caught      Captive-born
Western    61.9 (0.001)     65.5 (0.208)
Central    117.6 (0.186)    −

Table 2.8: Mean squared difference in repeat units within individuals in each population. See details in Supplemental Table 2.7.

[Figure 2.5 panels: (a) Wild-caught Western chimpanzees, (b) Captive-born Western chimpanzees, (c) Wild-caught Central chimpanzees. x-axis: Hw. Legend: Hw, mean observed Hw, mean randomized Hw.]

Figure 2.5: Distribution of mean heterozygosity within individuals (Hw, orange). We carry out a separate analysis for (a) 34 wild-caught, (b) 14 captive-born Western chimpanzees, and (c) 13 wild-caught Central chimpanzees. We assigned individuals to subpopulations according to the results of STRUCTURE, excluding individuals with > 5% ancestry in more than one population. The red vertical lines are the mean Hw in each population and the blue lines are the mean Hw in permutations. The mean Hw is significantly lower (at the p < 0.05 significance level) than expected by the randomized Hw in all populations (see Supplemental Table 2.7).

[Figure 2.6 panels: (a) Wild-caught Western chimpanzees, (b) Captive-born Western chimpanzees, (c) Wild-caught Central chimpanzees. x-axis: Rw. Legend: Rw, mean observed Rw, mean randomized Rw.]

Figure 2.6: Distribution of mean squared difference in repeat units within individuals (Rw, orange). See details in legend of Figure 2.5. The red vertical lines are the mean Rw in each population and the blue lines are the mean Rw from permutation. Mean Rw is significantly lower (p < 0.05) than expected by the randomized Rw in wild-caught Western and Central chimpanzees (see Supplemental Table 2.8).

CHAPTER 3

A NEW APPROACH TO ESTIMATE PARAMETERS OF

SPECIATION MODELS, WITH APPLICATION TO APES

3.1 Abstract

How populations diverge and give rise to distinct species remains a fundamental ques-

tion in evolutionary biology, with important implications for a wide range of fields,

from conservation genetics to human evolution. A promising approach is to estimate

parameters of simple speciation models using polymorphism data from multiple loci.

Existing methods, however, make a number of assumptions that severely limit their

applicability, notably, no gene flow after the populations split and no intra-locus

recombination. To overcome these limitations, we developed a new Markov Chain

Monte Carlo method to estimate parameters of an isolation-migration model. The

approach uses summaries of polymorphism data at multiple loci surveyed in a pair of

diverging populations or closely related species and, importantly, allows for intra-locus

recombination. To illustrate its potential, we applied it to extensive polymorphism

data from populations and species of apes, whose demographic histories are largely

unknown. The isolation-migration model appears to provide a reasonable fit to the

data. It suggests that the two chimpanzee species became reproductively isolated

in allopatry ∼850 Kya, while Western and Central chimpanzee populations split ap-

proximately 440 Kya, but continued to exchange migrants. Similarly, Eastern and

Western gorillas and Sumatran and Bornean orangutans appear to have experienced

gene flow since their splits ∼90 and over 250 Kya, respectively.

3.2 Introduction

Although central to evolutionary biology, the question of how species form remains

largely open. In fact, the very definition of species is a subject of active debate (Hey,

2006a). The most common definition is the “biological” one, in which species are

defined as groups of interbreeding organisms that are reproductively isolated from

other populations. The introduction of this concept over 60 years ago transformed

the study of speciation into a research program to examine the conditions under which

reproductive isolation emerges and to uncover its genetic architecture (Mayr, 1963).

Accumulating evidence suggests that incipient species arise primarily in populations

with restricted gene flow, as alleles (or combinations of interacting alleles) that

contribute to reproductive isolation reach fixation (e.g., Wittbrodt et al., 1989; Sawa-

mura et al., 1993; Ting et al., 1998; Wang et al., 1999; Fossella et al., 2000; Barbash

et al., 2003; Presgraves et al., 2003; Coyne and Orr, 2004a). The speciation process

begins either after two populations become completely isolated from one another (i.e., are

in allopatry) or while they continue to exchange migrants (i.e., are in parapatry).

Under a model of allopatric speciation, the process occurs through the homoge-

neous divergence of the genome. Shortly after the split, the two populations share

alleles due to the persistence of ancestral polymorphism (more so if the ancestral pop-

ulation size is large). Eventually, however, the shared alleles are lost or reach fixation

and the two populations start to accumulate fixed differences, either by genetic drift

or due to differential adaptation (Coyne and Orr, 2004d). Under a simple allopatric

model with no selection, it will take approximately 9−12N generations (where N

is the effective size of the descendant population) for the genealogies of >95% of

loci to be reciprocally monophyletic, and hence for the two populations not to share

alleles that are identical by descent (Hudson and Coyne, 2002). Given these assump-

tions, humans and common chimpanzees should almost never share alleles (as they

are thought to have diverged ∼20−25N generations ago; Wall, 2003; Hobolth et al.,

2006; Patterson et al., 2006b), while bonobos and common chimpanzees are expected

to share alleles at about 50% of loci (since they are estimated to have diverged ∼4N

generations ago, Won and Hey, 2005).

If the incipient species are in parapatry, however, divergence is not believed to oc-

cur homogeneously across the genome, but instead in a number of stages (Wu, 2001b).

First, alleles that cause a decrease in hybrid fertility or viability reach fixation in the

parental populations. The populations may continue to exchange migrants but in the

genomic regions carrying functionally divergent or incompatible alleles, gene flow is

selected against and hence effectively restricted. By contrast, in unlinked (or loosely

linked) genomic regions, alleles can be brought in by migrants with no associated

fitness costs. Thus, at neutral loci, populations share alleles longer than expected

under allopatric speciation. Eventually, reproductive isolation factors accumulate in

sufficient numbers to prevent gene flow throughout the genome, the final stage of

speciation. This model predicts variation in the number of shared alleles and levels of

divergence along the genomes of closely related species. While shared alleles are also

expected under a model of recent allopatric speciation, greater variation is expected

along the genome under parapatry, such that, with enough data, the two scenarios

should be distinguishable.

In these simple speciation models, the salient parameters are the split times, ef-

fective population sizes, and, in the case of parapatry, the gene flow rates. Thus,

learning about these parameters should greatly deepen our understanding of spe-

ciation. This realization has motivated the development of statistical methods to

estimate the parameters from multi-locus patterns of polymorphism in closely related

species.

Existing methods all assume that genetic variation data are available from both

populations, at a number of independently-evolving loci, but differ in their assump-

tions about gene flow and recombination, and in whether they use all the polymor-

phism data or summaries. Loosely, they can be classified into two groups. The first

set assumes an extreme model of allopatry, in which a panmictic (i.e., randomly

mating) ancestral population instantaneously splits into two panmictic descendant

populations, with no subsequent gene flow. In this model, there are four parameters:

the three effective population sizes and the split time. The parameters are estimated

using summaries of the polymorphism data, either by a moment estimator (Wakeley

and Hey, 1997; Kliman et al., 2000), or by maximum likelihood (Leman et al., 2005;

Putnam et al., 2007). While, in its current version, the method of Leman et al. (2005)

can only be applied to one, non-recombining locus, other methods can be applied to

multiple loci, and incorporate recombination (Wakeley and Hey, 1997; Putnam et al.,

2007). They use highly summarized versions of the data, however, at the potential

cost of much information. Moreover, in the presence of gene flow after the split,

their estimates will be biased − the ancestral effective population size will tend to be

over-estimated (Wall, 2003) and the split time under-estimated (Leman et al., 2005).

The second set of methods considers a more general model, often called the

“isolation-migration” model, in which there is gene flow between incipient species

throughout the genome, either at fixed (Hey and Nielsen, 2004) or at locus-specific

rates (Won and Hey, 2005). The parameters are estimated from all the polymor-

phism data at a single locus (Nielsen and Wakeley, 2001) or at multiple loci (Hey and

Nielsen, 2004), using Markov Chain Monte Carlo (MCMC). The Hey and Nielsen

method, henceforth called IM, has been applied to a number of species, from He-

liconius (Bull et al., 2006) to cichlids (Hey et al., 2004; Won et al., 2005). These

applications suggest that speciation often occurs in the presence of some gene flow

(Hey, 2006a).

While IM considers a wide range of models, it assumes that haplotypes are known

and that there is no intra-locus recombination. Although not ideal, the first assump-

tion is not restrictive, as a two-step procedure can be used, in which haplotype phase

is inferred (e.g., using the program PHASE, Stephens et al., 2001), then IM is run on

the phased data. In contrast, the assumption of no recombination is more limiting,

because the method can only be applied to autosomal loci by excluding segments or

haplotypes with evidence for recombination. This practice is likely to bias estimates

of the parameters, as excluding segments with visible recombination will tend to lead

to shorter genealogical histories (Hey and Nielsen, 2004). Moreover, if intra-locus

recombination is not taken into account, a small variance in divergence times across

segments may be confounded with a small ancestral effective population size (Taka-

hata and Satta, 2002). The assumption of no intra-locus recombination represents an

especially severe limitation in species in which the ratio of recombination to mutation

is thought to be high (e.g., Drosophila melanogaster, Andolfatto and Wall, 2003; or

Papilio glaucus, Putnam et al., 2007). In such species, any segment with polymor-

phisms in a sample is likely to have experienced numerous recombination events in

its genealogical history, making recombination hard to ignore (Hudson and Kaplan,

1985; Nordborg and Tavaré, 2002).

To overcome this limitation, we developed a new Bayesian approach to estimate

parameters of an isolation-migration model from recombining loci. We have in mind

data sets similar to the ones most commonly collected to date: short non-coding

sequences distributed throughout the genome. Our approach is to summarize the

polymorphism data at each locus by four statistics known to be sensitive to the

parameters of interest (Wakeley and Hey, 1997; Leman et al., 2005). We then estimate

the posterior probability of the parameters given these summaries using MCMC.

Simulations suggest that, in the absence of recombination, our method performs as

well or almost as well as the full likelihood approach. Moreover, the approach presents

the advantage of being quite flexible in the demographic model that it can consider,

and in allowing for intra-locus recombination.

We illustrate the potential of our method by applying it to multi-locus polymor-

phism data from non-coding loci in chimpanzees, gorillas and orangutans. Very little

is known about the evolutionary history of great apes, in part because of a poor

fossil record. Chimpanzees, the closest living relatives of humans, are classified into

two species, common chimpanzees (Pan troglodytes) and bonobos (P. paniscus), both

found exclusively in Africa. The two chimpanzee species were thought to have di-

verged as a result of the formation of the River Congo 1.5−3.5 million years ago

(Mya) (Beadle, 1981; Myers Thompson, 2003), but recent estimates of their split

time based on genetic data appear to be too recent for this scenario to be plausi-

ble (Fischer et al., 2004; Won and Hey, 2005). Common chimpanzees are usually

subdivided further into three (or sometimes four) “subspecies”, including Eastern

(P. troglodytes schweinfurthii), Central (P. troglodytes troglodytes), and Western (P.

troglodytes verus) (Hill, 1969). The meaning of the term “subspecies” is unclear, at

least to us, but the labels are thought to correspond to the most pronounced popula-

tion structure within the species. This view is supported by a recent analysis of 310

microsatellites, which found three populations within common chimpanzees, which

correspond to the three subspecies, and little evidence of recent gene flow between

them (Becquet et al., 2007).

Gorillas, in turn, are classically subdivided into two subspecies: Western (Gorilla

gorilla) and Eastern gorilla (G. beringei), found in western and central African forest,

respectively (Groves, 1970). Some controversy surrounds this classification; the range

of the two populations does not currently overlap in the wild, but on the basis of

morphological and genetic diversity, it has been proposed that the subspecies should

be elevated to the rank of species (e.g., Grubb et al., 2003). Here, we refer to Western

and Eastern gorillas as subspecies or populations. A recent application of IM to

polymorphism data from the two gorilla populations suggests that they split between

0.08 and 1.6 Mya, and experienced low levels of gene flow since (Thalmann et al.,

2006).

Even less is known about the history of orangutans (Pongo pygmaeus), currently

found only in Indonesia and Malaysia, but whose range is thought to have spanned

much of South East Asia until recently (Smith and Pilbeam, 1980). Some taxonomies

consider Sumatran (P. p. abelii) and Bornean (P. p. pygmaeus) orangutans to be

subspecies (e.g., Groves, 1971), and others to be species (e.g., Zhi et al., 1996). Again,

these populations do not overlap in their range, so that the classification is based on

morphology and behavior, as well as on patterns of genetic diversity. The islands

of Sumatra and Borneo were fully formed 500 thousand years ago (Kya), but were

reconnected by land bridges during the two last glaciations, ∼130−200 Kya and

∼10−100 Kya, respectively (Muir et al., 2000; Hughes et al., 2006). Estimates of

the average time to the most recent common ancestor for both populations based on

mitochondrial DNA (mtDNA) and a small number of microsatellites and autosomal

and X-linked loci are 1.5−2.5 Mya (Zhi et al., 1996; Kaessmann et al., 2001; Zhang

et al., 2001), but to our knowledge, there are no published estimates of the population

split time.

Here, we analyze a compilation of multi-locus polymorphism data recently pub-

lished in the three great ape species (Yu et al., 2003; Fischer et al., 2006; Thalmann

et al., 2006), refining population parameter estimates for chimpanzees and gorillas

and providing the first estimates for orangutans.

3.3 Results

We developed a method that estimates the demographic parameters of an

“isolation-migration” model from recombining loci (Figure 3.1). There are five param-

eters of interest: the population mutation rates for the two descendant populations,

θ1 and θ2, and the ancestral population, θA, the time since the populations split

in generations, T , and the migration rate, m. To estimate these parameters, the

method requires resequencing data from two populations (or closely related species)

at independently-evolving loci, and an outgroup sequence. Briefly, the polymorphism

data for each locus are summarized by the four statistics studied by Wakeley and Hey

(1997), as these carry information about the divergence time and other parameters

of interest (Wakeley and Hey, 1997; Leman et al., 2005). We choose the parame-

ters of the model from prior distributions and for each locus, we generate a set of

genealogies under a model with those parameters. We then estimate the likelihood

by calculating the probability of the data summaries at all the loci given the set of

genealogies and the parameters. Finally, we obtain a sample from the posterior dis-

tribution of the parameters given the data summaries using MCMC (see Methods).

Thus, our method follows a number of Bayesian approaches that use summaries of

the data, but differs in that we update the parameters using MCMC (see Methods

for more details). Hereafter, we refer to our method as MIMAR: MCMC estimation of

the isolation-migration model allowing for recombination.
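The four per-locus summaries can be illustrated with a short sketch. Following Wakeley and Hey (1997), they are the numbers of polymorphisms exclusive to each of the two samples, of polymorphisms shared between the samples, and of fixed differences. The function below is a hypothetical illustration and not part of MIMAR: the function name and the labels `S1`, `S2`, `Ss` and `Sf` are ours, sequences are assumed to be aligned strings of equal length, sites that do not carry exactly two alleles are skipped, and the outgroup sequence is ignored.

```python
def summarize_locus(sample1, sample2):
    """Count exclusive (S1, S2), shared (Ss) and fixed (Sf) sites at one locus."""
    counts = {"S1": 0, "S2": 0, "Ss": 0, "Sf": 0}
    for site in range(len(sample1[0])):
        alleles1 = {seq[site] for seq in sample1}
        alleles2 = {seq[site] for seq in sample2}
        if len(alleles1 | alleles2) != 2:
            continue  # monomorphic, or more than two alleles (skipped in this sketch)
        poly1 = len(alleles1) == 2
        poly2 = len(alleles2) == 2
        if poly1 and poly2:
            counts["Ss"] += 1  # polymorphic in both samples
        elif poly1:
            counts["S1"] += 1  # polymorphic only in sample 1
        elif poly2:
            counts["S2"] += 1  # polymorphic only in sample 2
        else:
            counts["Sf"] += 1  # each sample fixed for a different allele
    return counts
```

Applying the function to each locus in turn yields the table of summaries on which the likelihood is then conditioned.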

Figure 3.1: The “isolation-migration” model. Two populations diverged T generations ago from a common ancestral population. The parameters θ1, θ2 and θA are the population mutation rates per base pair for populations 1, 2 and the ancestral population, respectively. The split time in generations is T and m is the symmetrical migration rate between populations per generation (see Methods for details).

3.3.1 Performance of MIMAR under the allopatric model

In order to assess the performance of our method, we generated 30 simulated data

sets, each consisting of 20 non-recombining loci, with parameter values applicable

to Drosophila species in which related studies have been conducted (Llopart et al.,

2005, see Methods). Supplemental Figure 3.8 (Appendix B) shows the 30 posterior

distribution samples for the four parameters of interest. As can be seen, the posterior

distributions estimated by MIMAR for θ1, θ2 and T are centered around their true

values with relatively little variance, suggesting that the summaries that we use carry

much information to estimate these parameters precisely and accurately. However, for

these parameter values, 20 non-recombining loci do not seem to contain as much information

about the ancestral effective population size, leading to a wider posterior distribution

estimate for θA. This does not appear to be a feature of our statistics, since the use

of IM yields similar results, even though it is based on the full polymorphism data

set (results not shown). As expected, our estimates of θA become more precise with

larger data sets (results not shown).

3.3.2 Comparison to IM for the case of no recombination

              Mean absolute error           Mean of the estimate
                                            divided by the true value
Parameters    MIMAR          IM             MIMAR      IM
θ1            0.0003         0.0002         1.000      0.983
θ2            0.0004         0.0003         1.001∗     0.968∗
θA            0.0027         0.0037         0.927      0.875
T             5.19 × 10^5    5.94 × 10^5    1.004      0.980

Table 3.1: Performance of MIMAR and IM. Precision and accuracy for the four parameters of the allopatric model (using the mode as a point estimate). MIMAR and IM were applied to 30 multi-locus data sets simulated under the allopatric model (see Methods for details). ∗ The biases in θ2 estimates from IM and MIMAR are significantly different at the 5% level, after Bonferroni correction (p = 0.006 using a Wilcoxon signed rank test).

Next, we studied the performance of MIMAR by generating 30 simulated data sets

under the allopatric model for 20 non-recombining loci, but this time drawing the

parameters from prior distributions (see Methods for details); the parameter values

are, as above, applicable to Drosophila species. The results confirm that the estimates

of θ1, θ2 and T are precise and have very little bias, while the estimates of θA are less

precise (Table 3.1 and Figure 3.2).

We analyzed the same simulated data sets with IM to compare the estimates from

MIMAR, which are based on summaries, to a full likelihood approach (since IM does

not allow for recombination, we set the intra-locus recombination rate to 0 when

generating the 30 data sets). We found that the two methods perform similarly well,

in terms of both accuracy and precision (see Table 3.1). For example, the mean

absolute error over the 30 simulated data sets for the estimate of T is 5.19 × 10^5

using MIMAR and 5.94 × 10^5 using IM. Similarly, if we consider the estimate divided

by the true value as a measure of bias, the mean over the 30 data sets is 1.004 for

MIMAR and 0.980 for IM. Similar results were obtained for all parameters, with the

possible exception of the current effective population sizes, for which MIMAR appears

to yield slightly more reliable estimates (see Table 3.1). Moreover, we found that

the two methods have similar coverage: For both, the central 95% interval of the

marginal posterior distribution sample for T included the true value in 29 out of 30

cases; for θA, this occurred in 29 out of 30 cases for IM and 30 out of 30 cases for

MIMAR. We also compared the results of MIMAR and IM on larger simulated data sets

of 100 loci, and found that, in this case, IM outperformed MIMAR. However, with such

large data sets, both methods provided highly accurate and precise estimates (results

not shown).
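The three performance measures used in this comparison (mean absolute error, mean ratio of the estimate to the true value, and coverage of the central 95% interval) can be sketched as follows. The function names are hypothetical, and the interval is computed with a simple nearest-rank rule on the sorted posterior sample, which is an assumption about a detail not specified in the text.

```python
import math


def mean_absolute_error(estimates, truths):
    """Precision: mean of |estimate - true value| over simulated data sets."""
    return sum(abs(e - t) for e, t in zip(estimates, truths)) / len(estimates)


def mean_ratio(estimates, truths):
    """Accuracy: mean of estimate / true value (1 means no bias)."""
    return sum(e / t for e, t in zip(estimates, truths)) / len(estimates)


def central_interval(samples, level=0.95):
    """Central interval of a posterior sample (nearest-rank rule, an assumption)."""
    ordered = sorted(samples)
    tail = (1 - level) / 2
    lower = ordered[math.floor(tail * (len(ordered) - 1))]
    upper = ordered[math.ceil((1 - tail) * (len(ordered) - 1))]
    return lower, upper


def coverage(posterior_samples, truths, level=0.95):
    """Number of data sets whose central interval contains the true value."""
    hits = 0
    for samples, truth in zip(posterior_samples, truths):
        lower, upper = central_interval(samples, level)
        if lower <= truth <= upper:
            hits += 1
    return hits
```

With 30 simulated data sets, a well-calibrated 95% interval is expected to contain the true value in roughly 28 or 29 cases, which is in line with the counts reported above.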

In the comparison, we ran both methods long enough for them to appear to have

reached convergence (Supplemental Figure 3.9). For the same number of iterations

of the MCMC, IM was 2−3 times faster than MIMAR (results not shown).

[Figure 3.2 panels: left, difference between the estimate and the true value for θA (upper) and T (lower); right, estimate divided by the true value for θA (upper) and T (lower). x-axis: MIMAR; y-axis: IM.]

Figure 3.2: Performance of MIMAR (x-axis) and IM (y-axis). Precision and accuracy of the estimates of θA (upper panel) and T (lower panel) obtained from the mode of the marginal posterior distribution samples. Each data point is the result from one of 30 data sets simulated under the allopatric model, with parameter values sampled from prior distributions (see Methods for details). If both methods provided estimates with perfect precision, all the data points in the left panels would be located at (0, 0). If one method provided systematically more precise estimates, the data points would align along its axis: x-axis for MIMAR or y-axis for IM. Similarly, if both methods provided estimates with perfect accuracy, all the data points in the right panels would be at the intersection of the two grey lines (1, 1). If one method provided more precise estimates, the data points would align along the vertical grey line for MIMAR or the horizontal grey line for IM. As can be seen, both methods perform similarly well (see Table 3.1; p-value from Wilcoxon signed rank test >5%, after Bonferroni correction for multiple tests).

3.3.3 Assessing the evidence for gene flow

[Figure 3.3 panels: left, difference between the estimate and the true value for θA, T and M; right, estimate divided by the true value for θA, T and M. x-axis: # data set.]

Figure 3.3: Performance of MIMAR in the presence of gene flow. Precision and accuracy of the estimates of θA (upper panel), T (middle panel) and M (lower panel) obtained from the mode of the marginal posterior distribution samples. Each data point is the result from one of 20 data sets (randomly numbered along the x-axis) simulated under the parapatric model with intra-locus recombination and parameter values sampled from prior distributions. The black lines are the means over the 20 simulated data sets (see Table 3.2 and Methods for details).

To assess our ability to distinguish a model with gene flow from one without, we

generated 20 simulated data sets (each consisting of 40 recombining loci) under both

an allopatric and a parapatric model, with parameter values applicable to apes. In

the parapatric model, we fixed the expected number of migrants, M = 4N1m, to 1,

which corresponds to an average of 11 migration events in the history of the sample.

We applied MIMAR to the 40 data sets, allowing for recombination and sampling the

expected number of migrants from the prior ln(M) ∼ U [−5, 2] (see Methods for

details). When applied to data sets generated under a model with no gene flow,

              Mean absolute error           Mean of the estimate
                                            divided by the true value
Parameters    M > 0          M = 0          M > 0      M = 0
θ1            0.0008         0.0005         1.144      1.153
θ2            0.0008         0.0005         1.092      1.085
θA            0.0003         0.0004         1.000      0.880∗
T             1.81 × 10^4    5.66 × 10^3    0.721∗     0.965
M             1.0436         0.487          1.293      NA

Table 3.2: Performance of MIMAR when detecting gene flow. Precision and accuracy for the five parameters of the isolation-migration model (using the mode as a point estimate). MIMAR was applied to 20 multi-locus data sets simulated under parapatric and allopatric models (see Methods for details). When M = 0, the mean estimate of θA is significantly lower than the true value (p = 0.0003, using a Wilcoxon signed rank test). This can be explained as follows: The prior on M does not include 0 (the true value) so M is necessarily an over-estimate and consequently, θA tends to be under-estimated slightly. This problem is likely to apply to IM as well, since the prior on M is likewise exclusive of 0. When M = 1, the mean estimate of T is significantly lower than the true value (p = 0.005, using a Wilcoxon signed rank test). This can be explained by the fact that, whenever M and/or θA are slightly under-estimated, T tends to be under-estimated (see Figure 3.3). ∗ A significant bias in the estimates at the 5% level, after Bonferroni correction for multiple tests.

MIMAR suggested no migration (using the criterion that the mode of the marginal

posterior distribution, M̂, be <0.1) in 14 out of 20 cases; moreover, in 1 out of the 6

cases in which M̂ ≥ 0.1, most of the posterior probability mass for M was close to 0

(results not shown). For the data sets simulated with gene flow, there was evidence

of migration (i.e., M̂ ≥ 0.1) in 17 out of 20 cases. Other parameter estimates were

generally accurate and precise (see Table 3.2 and Figure 3.3), although the estimates

of θA were slightly under-estimated in data sets generated with M = 0, and the

estimates of T were slightly under-estimated in data sets generated with M = 1 (see

Table 3.2 for possible explanations).
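The prior on the expected number of migrants and the criterion used to call migration can be illustrated with a short sketch. The function names are hypothetical and the value of N1 used in the usage note is purely illustrative. The sketch shows why M = 0 lies outside the prior: with ln(M) ∼ U[−5, 2], every draw falls between e^-5 (about 0.0067) and e^2 (about 7.39).

```python
import math
import random


def sample_m_from_prior(rng, low=-5.0, high=2.0):
    """Draw M such that ln(M) is uniform on [low, high]; M is always > 0."""
    return math.exp(rng.uniform(low, high))


def migration_rate(m_scaled, n1):
    """Per-generation migration rate m implied by M = 4 * N1 * m."""
    return m_scaled / (4 * n1)


def suggests_migration(mode_m, threshold=0.1):
    """Criterion used in the text: posterior mode of M at or above 0.1."""
    return mode_m >= threshold
```

For example, with an illustrative N1 of 10,000, `migration_rate(1.0, 10_000)` gives m = 2.5 × 10^-5 per generation for the simulated value M = 1.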

When we applied either MIMAR or IM to smaller simulated data sets (i.e., 20 loci

and no intra-locus recombination), estimates of the split times and migration rates

provided by both methods were much less reliable (results not shown).

3.3.4 Sensitivity to intra-locus recombination rates

Intra-locus recombination rates are often unknown, or estimated with substantial

error. To assess how this might affect the reliability of MIMAR, we generated 16

data sets under an allopatric model, with parameter values applicable to Drosophila

(see above). Each data set consisted of 10 recombining loci, with the locus-specific

recombination rates chosen from an exponential distribution with mean c/µ = 10.

These data sets were analyzed using MIMAR by fixing all the parameters but T to their

true values, and i) setting the locus-specific recombination rates to their true values,

ii) sampling the recombination rates from the same prior as used when generating the

simulated data and iii) setting the intra-locus recombination rates to 0 (see Methods

for details). The results from i) and ii) were virtually identical, suggesting that error

in the locus-specific recombination rates does not have much effect on the results,

so long as intra-locus recombination is taken into account. In contrast, when we

assumed no recombination in our analysis of recombining loci, the estimates of the

split time were significantly less accurate and precise (see Figure 3.4). These results

highlight the importance of allowing for intra-locus recombination when estimating

demographic parameters.
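The way locus-specific recombination rates are drawn in this experiment can be sketched as follows. The function name is hypothetical; the only facts taken from the text are the exponential distribution and its mean, c/µ = 10. Note that Python's `random.expovariate` takes the rate (one over the mean) as its argument.

```python
import random


def draw_recombination_ratios(n_loci, mean_ratio=10.0, seed=0):
    """Draw one recombination-to-mutation ratio c/mu per locus.

    Ratios are exponentially distributed with the given mean.
    """
    rng = random.Random(seed)
    return [rng.expovariate(1.0 / mean_ratio) for _ in range(n_loci)]
```

Under this reading, the three analyses compared above use, respectively, the drawn values themselves, fresh draws from the same distribution, and a list of zeros.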

[Figure 3.4 panels: difference between the estimate of T and its true value (upper) and estimate of T divided by its true value (lower). x-axis: # data set.]

Figure 3.4: Sensitivity of MIMAR to intra-locus recombination. Precision (upper panel) and accuracy (lower panel) of the estimate of T obtained from the mode of the marginal posterior distribution samples. Each data point is the result from one of 16 data sets (randomly numbered along the x-axis) simulated under the allopatric model with intra-locus recombination and parameter values sampled from prior distributions. These data sets were analyzed by fixing all the parameters but T to their true values, and i) fixing the locus-specific recombination rates to their true value (red squares), ii) sampling the recombination rates from the same prior as used when generating the simulated data (blue discs) and iii) setting the intra-locus recombination rates to 0 (grey triangles). The lines are the means over the 16 simulated data sets (see Methods for details). The precision and bias in the estimates of T from iii) and i) (or ii) are significantly different at the 5% level (p = 6 × 10^-5 and p = 0.002, respectively, using paired Wilcoxon signed rank tests).

63

3.3.5 Application to ape data

We compiled a set of recently published resequencing data in bonobo and common

chimpanzee, gorilla and orangutan populations (Yu et al., 2003; Fischer et al., 2006;

Thalmann et al., 2006). Won and Hey (2005) had previously reported evidence for

intra-locus recombination at some of the loci included in this study, and we found

further evidence of recombination, in spite of low power to do so (given the small sam-

ple sizes). We therefore analyzed these data sets with MIMAR allowing for intragenic

recombination (see Methods). For these analyses, we assumed that the recombination

rate is exponentially distributed across loci, but constant within a locus. This model

seems sensible for the short fragments (∼650 bp on average) that we considered, but

may not be appropriate for longer loci.

Chimpanzee species (Pan paniscus and P. troglodytes) and subspecies (P.

t. verus, P. t. troglodytes and P. t. schweinfurthii)

Figures 3.5−3.6 show the marginal posterior distributions for the parameters of

the model, averaged over two independent runs (see Methods for details).

We considered each pair of populations in turn. Running MIMAR under a model that

allows for gene flow strongly suggests that the bonobo and the common chimpanzee

populations split without subsequent migration (Table 3.3). In contrast, there is

evidence of gene flow since the split of Western, Central and Eastern chimpanzee

populations (Table 3.4, see also Won and Hey, 2005). Figure 3.5 shows the posterior

distribution estimates for the parameters of the model for bonobo and common chim-

panzee populations and Figure 3.6, for Western, Central and Eastern chimpanzee

populations. We note slight support for gene flow between Eastern chimpanzees and

bonobos (see Figure 3.5c), whose ranges are closer together than those of bonobos

and other chimpanzee subspecies. However, more data and more precise geographic

information are needed to evaluate this possibility, especially in light of the relatively

unreliable estimates of migration from small data sets (see simulation results above).

[Figure 3.5: plots not reproduced. For each analysis (a)−(c), three panels: Population size (bonobos, the chimpanzee subspecies and the ancestral population), Migration, and Time.]

Figure 3.5: Smoothed marginal posterior distributions estimated by MIMAR from bonobo and common chimpanzee polymorphism data. See Methods for details. The range of the x-axis corresponds to the support of the prior. The distributions are for the analyses of a) bonobos and Western chimpanzees, b) bonobos and Central chimpanzees and c) bonobos and Eastern chimpanzees.

[Figure 3.6: plots not reproduced. For each analysis (a)−(c), three panels: Population size (the two chimpanzee subspecies and the ancestral population), Migration, and Time.]

Figure 3.6: Smoothed marginal posterior distributions estimated by MIMAR from the common chimpanzee subpopulations polymorphism data. See Methods and legend of Figure 3.5 for details. The distributions are for the analyses of a) Western and Central chimpanzees, b) Western and Eastern chimpanzees and c) Central and Eastern chimpanzees.

Assuming 20 years per generation and a mutation rate of 2 × 10⁻⁸ per base pair per

generation (see Methods), the estimates of the effective population sizes of bonobos

and Western chimpanzees are ∼10,000 in all analyses involving these populations. In

turn, the estimates of split time for bonobos and common chimpanzee populations

range from 790 to 920 Kya, and the estimates of the ancestral effective population

size are ∼30,000. These estimates are consistent with those obtained by Won and

Hey (2005), who applied IM to a smaller data set, which overlaps with ours. The

only exception is that they estimated a smaller ancestral effective population size

than we did, but the confidence intervals overlap slightly. These results confirm that

polymorphism data from bonobos and common chimpanzees are consistent with an

allopatric speciation model and that the divergence between the chimpanzee species

occurred more recently than the estimated formation of the River Congo.

In the analyses of Western and Central chimpanzees and Western and Eastern

chimpanzees, the time estimates range from 280−440 Kya, the ancestral effective

population sizes from 11,000−15,000, and the migration rate, M = 4N1m, from

0.32−0.43 (where N1 is the effective population size of Western chimpanzees). These

results are roughly consistent with those of Won and Hey (2005): using a model that

allowed for asymmetric migration rates, they estimated that M ≃ 0.28 from Western

to Central chimpanzees, but did not find evidence for gene flow in the opposite

direction.

For the analyses of Central and Eastern chimpanzees, the split time estimate is

∼220 Kya, the ancestral effective population size is ∼46,000 and the migration rate,

M = 4N1m, is ∼0.80 (where N1 is the effective population size of Central chim-

panzees). Thus, it appears that the split time for Central and Eastern chimpanzees

is about half that of Western and Central (or Eastern) chimpanzees.

Table 3.3: Results for chimpanzee species.

Analysisᵃ                      Lociᵇ  n1ᶜ      n2                        N1ᵈ      N2       NA       T*ᵉ         Mᶠ
Bonobos × Western chimpanzees  69     18 (16)  20 (12)  Mode             9,790    9,790    33,300   873,000     0.007
                                                        5th percentile   8,360    7,820    25,200   681,000     0.007
                                                        95th percentile  12,000   11,700   44,300   1,070,000   0.031
Bonobos × Central chimpanzees  68     18 (16)  20 (10)  Mode             9,900    21,900   33,800   918,000     0.008
                                                        5th percentile   7,870    18,300   27,300   759,000     0.007
                                                        95th percentile  11,300   27,000   46,800   1,170,000   0.036
Bonobos × Eastern chimpanzees  26     18       20       Mode             11,500   19,900   31,600   785,000     0.062
                                                        5th percentile   9,150    15,300   22,200   616,000     0.001
                                                        95th percentile  15,200   25,600   48,700   1,350,000   0.100

ᵃ Estimates are obtained from two independent runs (see Methods).
ᵇ Number of loci used in the analyses.
ᶜ n1 and n2 are the number of chromosomes in the first and second population of the analysis, respectively (the sample size varies because we pooled loci from multiple studies and because of missing data).
ᵈ NA, N1 and N2 are the estimates of the effective population size for the ancestral, first and second population of the analysis, respectively.
ᵉ T* is the estimate of the time since the populations split in years.
ᶠ M = 4N1m, where N1 is the effective population size of the first population of the analysis.

Table 3.4: Results for chimpanzee subspecies. See legend of Table 3.3 for details.

Analysis                       Loci   n1       n2                        N1       N2       NA       T*          M
Western × Central chimpanzees  68     20 (12)  20 (10)  Mode             9,750    33,000   15,000   439,000     0.315
                                                        5th percentile   7,690    24,200   6,140    325,000     0.097
                                                        95th percentile  12,900   59,700   22,400   1,100,000   0.523
Western × Eastern chimpanzees  26     20       20       Mode             10,800   24,700   11,000   282,000     0.425
                                                        5th percentile   8,040    18,600   2,270    230,000     0.143
                                                        95th percentile  21,100   71,800   21,900   1,210,000   2.622
Central × Eastern chimpanzees  26     20       20       Mode             14,400   8,590    46,000   219,000     0.797
                                                        5th percentile   8,560    5,070    33,500   143,000     0.084
                                                        95th percentile  22,300   12,700   75,100   1,400,000   1.389

While the estimates are generally consistent across pairwise analyses, the effective

population size estimates for Central and Eastern chimpanzees are not. In both anal-

yses of Central chimpanzees and bonobos and of Central and Eastern chimpanzees,

the effective population size of Central chimpanzees is estimated to be 15,000−22,000

(consistent with the results of Won and Hey, 2005). However, a larger population size

estimate is obtained from the analysis of Western and Central chimpanzees. Similarly,

in both analyses of bonobos and Eastern chimpanzees and of Western and Eastern

chimpanzees, estimates of the effective population size of Eastern chimpanzees are

20,000−25,000, while in the analysis of Eastern and Central chimpanzees, the

estimate is smaller. These discrepancies may reflect complex histories of chimpanzee

populations not captured by the model (see the goodness-of-fit test below). For

example, analyses of other data sets suggest that Central chimpanzees may have

experienced a recent population expansion (Fischer et al., 2004; D. Reich, personal

communication).

[Figure 3.7: plots not reproduced. For each analysis (a)−(b), three panels: Population size (the two subspecies and the ancestral population), Migration, and Time.]

Figure 3.7: Smoothed marginal posterior distributions estimated by MIMAR from the gorilla (a) and orangutan (b) subspecies polymorphism data. See Methods and legend of Figure 3.5 for details. a) Distributions for the analysis of Western and Eastern gorillas. The apparent multi-modality of the marginal posterior distribution estimated for the split time was also noted by Thalmann et al. (2006). b) Distributions for the analysis of Sumatran and Bornean orangutans. Note that the posterior distribution for the split time is rather flat, suggesting that the data do not contain enough information about this parameter.

Gorilla subspecies, Western (Gorilla gorilla) and Eastern gorillas (G.

beringei)

Figure 3.7a shows the posterior distributions of the five parameters of the parap-

atric model of speciation. Assuming 15 years per generation and a mutation rate of

2 × 10⁻⁸ per base pair per generation (see Methods), the estimates of the effective

population sizes for Western and Eastern gorillas and their ancestral population are

∼9,000, ∼8,000 and ∼27,000, respectively (see Table 3.5). The divergence time

estimate between Western and Eastern gorilla subspecies is ∼92 Kya and the migration

rate, M = 4N1m, is ∼0.87 (where N1 is the effective population size of Western

gorillas).

To compare our estimates to those recently obtained by Thalmann et al. (2006)

using IM, we considered their mutation rate estimate (1.44 × 10⁻⁸ per base pair per

generation). Our estimates of the effective population sizes of Western gorillas and

ancestral population and the split time are of the same order (∼13,000 vs. 17,500,

∼37,000 vs. 42,000 and 92 vs. 78 Kya), but our estimate of the effective population

size of Eastern gorillas is larger (11,000 vs. 3,000). Whether this discrepancy reflects

differences in the use of summaries vs. the whole data, in allowing vs. ignoring

intra-locus recombination or in the prior distributions is unclear.
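The rescaling behind this comparison follows from N = θ/4µ: for a fixed estimate of θ, replacing the assumed mutation rate µ by µ′ multiplies N by µ/µ′. The sketch below applies this to the posterior modes reported in Table 3.5; the function name is hypothetical and is not part of MIMAR or IM.

```python
def rescale_effective_size(n_old, mu_old, mu_new):
    """Rescale an effective population size to a different mutation rate.

    Because N = theta / (4 * mu), holding theta fixed gives
    N_new = N_old * mu_old / mu_new.
    """
    return n_old * mu_old / mu_new


# Posterior modes from Table 3.5, obtained with mu = 2e-8 per bp per generation.
modes = {
    "Western gorillas": 9130,
    "Eastern gorillas": 8140,
    "Ancestral population": 26400,
}

for name, n in modes.items():
    # Rescale to the mutation rate used by Thalmann et al. (2006).
    rescaled = rescale_effective_size(n, 2e-8, 1.44e-8)
    print(f"{name}: {rescaled:,.0f}")
```

With these inputs the rescaled values are roughly 12,700, 11,300 and 36,700, in line with the ∼13,000, ∼11,000 and ∼37,000 quoted above.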

Table 3.5: Results for gorilla and orangutan subspecies. See legend of Table 3.3 for details.

Analysis                        Loci   n1       n2                        N1       N2       NA        T*          M
Western × Eastern gorillas      15     30       6 (2)    Mode             9,130    8,140    26,400    91,500      0.867
                                                         5th percentile   5,090    3,570    5,990     84,300      0.282
                                                         95th percentile  14,100   18,100   49,100    1,440,000   2.059
Sumatran × Bornean orangutans   19     12       20 (18)  Mode             17,200   10,200   86,900    1,390,000   0.868
                                                         5th percentile   10,200   6,230    52,400    254,000     0.361
                                                         95th percentile  26,600   15,000   143,000   1,900,000   2.235

Orangutan subspecies, Sumatran (Pongo pygmaeus abelii) and Bornean

orangutans (P. p. pygmaeus)

The posterior distributions of the five parameters of the parapatric model of spe-

ciation are shown in Figure 3.7b. Assuming 20 years per generation and a mutation

rate of 2 × 10⁻⁸ per base pair per generation (see Methods), the estimates of the

effective population sizes for Sumatran and Bornean orangutans and their ancestral

population are ∼17,000, ∼10,000 and ∼87,000, respectively (see Table 3.5). The

estimate of the symmetrical migration rate, M = 4N1m, is ∼0.87 (where N1 is the

effective population size of Sumatran orangutans). The data further suggest that the

split time for Sumatran and Bornean orangutan populations is likely to be older than

250 Kya. However, the data (19 loci) do not appear to carry much information about

this parameter (see the posterior distribution estimate in Figure 3.7b), and in par-

ticular, the mode of the posterior distribution, 1.4 Mya, is likely to be an unreliable

estimate of the split time.

Since the islands of Borneo and Sumatra were connected during the last two

glaciations (∼130−200 Kya and ∼10−100 Kya), it is not surprising to find evidence

of gene flow between those two populations. Our results further suggest that the

Sumatran and Bornean orangutan populations diverged before the second to last

Ice Age. To our knowledge, this analysis provides the first estimates of population

parameters for the two orangutan subspecies.

Goodness-of-fit test

To examine whether the isolation-migration model is an appropriate description of

the history of the ape species and subspecies, we generated simulated data sets for pa-

rameters sampled from the posterior distributions estimated by MIMAR, and compared

the simulated data to what is observed for a number of statistics. Encouragingly, the


isolation-migration model appears to provide a reasonable fit to the four statistics

used in the inferences of MIMAR as well as to the mean FST , π and Tajima’s D across

loci (Supplemental Figure 3.10, and see Methods for details). The one exception is

for Central and Eastern chimpanzees (Supplemental Figure 3.10f): there is a poor fit

to FST and to Tajima’s D for the Central chimpanzees (see also Supplemental Figure

3.10b). This suggests either that an isolation-migration model is not appropriate for

these subspecies or that a crucial demographic feature is missing from the model.

Given the proximity of Central and Eastern chimpanzees and their low FST , one

possibility is that, rather than a split model, a model of isolation-by-distance is more

appropriate (Fischer et al., 2006). Interestingly, though, there does not appear to be

substantial gene flow between the Eastern and Central ranges (see the estimates of

the migration rate in this study and Becquet et al., 2007). We also find that, while

the model fits most aspects of the bonobo data quite well, the observed Tajima’s

D is lower than expected (Supplemental Figure 3.10a−c), perhaps reflecting recent

demographic events in bonobos not included in the model.
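One simple way to summarize such a comparison is to ask where the observed value of a statistic falls within the values simulated from the posterior. The sketch below is only an illustration of that idea and does not reproduce the procedure described in Methods; the function name and the numbers are hypothetical, and the simulated values would in practice come from coalescent simulations under parameters drawn from the posterior.

```python
def predictive_tail_probabilities(observed, simulated):
    """Return the fractions of simulated values <= and >= the observed value.

    A very small fraction in either tail indicates that the fitted model
    rarely produces a statistic as extreme as the one observed.
    """
    n = len(simulated)
    lower = sum(1 for s in simulated if s <= observed) / n
    upper = sum(1 for s in simulated if s >= observed) / n
    return lower, upper


# Hypothetical example: an observed mean Tajima's D against five simulated values.
lower, upper = predictive_tail_probabilities(-0.9, [-0.4, -0.1, 0.0, 0.2, 0.3])
print(lower, upper)  # the observed value lies below every simulated value
```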

3.4 Discussion

3.4.1 Advantages and limitations of MIMAR

We have developed a new method to estimate parameters of simple allopatric and

parapatric speciation models. It considers summaries of the polymorphism data from

each locus, rather than the entire data set. Extensive simulations, and comparisons to

IM for the case of no recombination, suggest that the use of these summaries provides

accurate and precise estimates of parameters of interest from data sets comparable

in size to those analyzed to date (see Table 3.1).


The method presents the important advantage of allowing for intra-locus recom-

bination. This feature makes the approach applicable to autosomal data, even in

species where the ratio of recombination to mutation events is high (ρ/θ ≫ 1), such

as in Drosophila (Andolfatto and Wall, 2003) and Papilio (Putnam et al., 2007) and

hence where any segment containing polymorphisms is likely to have experienced re-

combination in its genealogical history. In contrast, when applied to recombining

regions, IM requires one to exclude loci that show evidence of recombination and

assumes that no recombination occurred at the other loci, potentially biasing the

estimates.

In other respects, the model of speciation that we consider is more restrictive than

the one used in IM. Mutation rates for each locus are estimated from divergence data

and then fixed, rather than co-estimated along with other parameters (see Methods).

We set the migration rate, m, to be symmetric between populations, which may be

inappropriate. Finally, we assume that the distribution of coalescent times only varies

across loci due to differences in the mode of inheritance, and therefore that it can be

specified a priori. In contrast, IM allows one to estimate inheritance scalars for each

locus from the data, which may be important if a subset of loci have experienced

recent selection (Hey and Nielsen, 2004). Our model could readily be extended to

allow for these features, notably for asymmetric migration rates (in fact, the MIMAR

program that we make available already allows for this feature). However, the data

from a given locus carry limited information, and it is unclear how many parameters

can reliably be estimated, even using all the information. Indeed, our simulations

suggested that IM and MIMAR estimates of the migration rate from a small data set

can be unreliable even in the absence of these complications (see Results).

In its current implementation, our method is also limited in the type of data


that it can consider, as it is not applicable to surveys of variation that suffer from

ascertainment bias. Moreover, it assumes an infinite site model, so only two alleles

can be present at a given site. As long as the ascertainment bias and mutation model

are known, however, it should be reasonably straight-forward to extend the model to

consider these cases (Nielsen and Signorovitch, 2003). MIMAR is further intended for

use on resequencing data from short, independently-evolving loci, in which there is

little information about how genealogies change along the genome, or viewed another

way, about linkage disequilibrium (McVean et al., 2004), and for which it is reasonable

to assume that recombination rates are uniform. Applying MIMAR to longer stretches

of sequence may require a change in the model of recombination to capture fine-

scale heterogeneity in recombination rates. In that setting, it may also be helpful to

consider summaries of linkage disequilibrium in addition to the four statistics used

here. More generally, our approach could be extended to consider a number of other

aspects of the data. For instance, one could consider the number of singletons in each

population (in addition to the four current statistics), or the joint frequency spectrum

in two population samples.

In addition to improving the inference method, it will also be important to consider

more realistic models of speciation. For example, detailed studies of closely related

species reveal that many apparent cases of parapatry may in fact reflect allopatric

speciation followed by secondary contact (Coyne and Orr, 2004a; Llopart et al., 2005).

One approach to distinguishing between the two scenarios might be to allow migration

between diverging populations to stop at different time points, and estimate which

times are most likely given the polymorphism data. Similarly, for sets of species (or

populations) that split over a short time period, it may be important to consider

more than two species at a time (Wall, 2000; Degnan and Rosenberg, 2006; Pollard


et al., 2006).

Another salient feature, ignored in existing methods, may be population structure

in the ancestral population. Indeed, in many of our analyses of ape data, as well as

in most analyses of the isolation-migration model published to date (e.g., Hey et al.,

2004; Hey, 2005; Won and Hey, 2005; Thalmann et al., 2006), the estimate of the

ancestral effective population size is larger than that of the descendant populations.

Since it seems unlikely that so many populations have shrunk over time, this suggests

that a salient and fairly common demographic feature is being ignored. One possibility

is that the assumption of a panmictic ancestral population is inappropriate. If so, it

may be relevant to consider a model of population structure in which a geographic

barrier becomes stronger over time (e.g., Innan and Watanabe, 2006). In this respect,

an attractive feature of our method is that it is easy to generalize to other demographic

settings (see Methods).

Finally, our approach could also be extended to scan the genome for regions that

contribute to reproductive isolation (Won et al., 2005; Bull et al., 2006; Geraldes et al.,

2006; Miller et al., 2006). Indeed, models of parapatric speciation predict that loci

involved in the formation of species will experience no or little gene flow since the split,

and therefore have more fixed differences and fewer shared alleles than do background

loci. Moreover, theoretical results suggest that, in this setting, and unless selection is

very strong, regions of marked differentiation should be relatively short (Barton and

Bengtsson, 1986). Thus, identifying regions with evidence for decreased gene flow

should be an effective way to find the specific loci that contribute to reproductive

isolation. This idea has been implemented by estimating gene flow for each locus

separately (Won et al., 2005). However, this approach may have limited power to

detect loci with reduced gene flow. An alternative may be to use the goodness-of-fit


test results for individual loci to identify outliers that behave as expected if they

contributed to reproductive isolation.

3.4.2 Analyses of ape polymorphism data

Analyses of genetic polymorphism data from apes can help to characterize the ge-

ographic distribution of variation (e.g., Becquet et al., 2007), shed light on their

demographic history and place the evolutionary history of humans in context (Stone

and Verrelli, 2006). Here, we considered the largest set of polymorphism data to date

for all three species of non-human great apes, and estimated parameters of a simple

isolation-migration model. Using a goodness-of-fit test, we find that this model pro-

vides a reasonable point of departure for analyzing ape data, other than for Eastern

and Central common chimpanzees.

The use of the model suggests that the effective population sizes of the ape

populations range from 8,000 to 33,000, on the same order as estimates for human

populations (10,000−15,000; Frisse et al., 2001; Voight et al., 2005). In contrast, the

subspecies split times appear to be older than those of human populations (Cavalli-

Sforza and Feldman, 2003; Goebel, 2007), ranging from 92−440 Kya.

We find no evidence for gene flow since the split for chimpanzee species (with the

possible exception of Eastern chimpanzees and bonobo), consistent with the results

of Won and Hey (2005), but do detect limited migration (M≤1) for all ape sub-

species. The split time estimate for chimpanzee species is 790−920 Kya, suggesting

that speciation occurred after the formation of the River Congo, 1.5−3.5 Mya. These

estimates do not take into account possible error in the mutation rate per year. But

even if we consider a time to the most recent common ancestor between human and

chimpanzee at the upper limit of what has been estimated so far, 8 Mya, and a gen-


eration time of only 15 years, the central 95th percentile for the split time is 1−2.3

Mya. Moreover, the recent finding of a chimpanzee fossil in Kenya indicates that

common chimpanzees may have occupied a much wider range than inferred on the

basis of their current distribution (McBrearty and Jablonski, 2005). Thus, existing

data support a more recent speciation time for common chimpanzees and bonobos,

which may have occurred outside of their current habitats.

More generally, this application illustrates how the increasing availability of multi-

locus polymorphism data sets, together with development of novel statistical ap-

proaches, can yield insights into speciation, both in apes and in other organisms.

3.5 Methods

3.5.1 Model

We consider a neutral model in which an ancestral population suddenly splits into

two populations, which either diverge in isolation or continue to exchange migrants

(Figure 3.1). We further assume that n1 and n2 chromosomes have been sampled from

two populations and fully resequenced at Y randomly chosen, independently-evolving

loci.

The population model, often called “isolation-migration”, is described by the pop-

ulation split time in generations, T , and three population mutation rates, θ1 = 4N1µ,

θ2 = 4N2µ and θA = 4NAµ (Figure 3.1). Throughout, the subscripts 1, 2 and A refer

to parameters that describe populations 1, 2 and the ancestral population, respec-

tively. Following IM, we assume that there is an independent estimate of the average

mutation rate across loci, µ, which can be used to estimate the effective population

sizes from the population mutation rates (e.g., as N1 = θ1/4µ). In addition, there


is a symmetric migration rate, m, which corresponds to the fraction of a population

that is replaced by migrants from the other population each generation.
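The conversions between the model's scaled parameters and the quantities reported in the Results can be sketched as follows. This is a minimal illustration under the definitions above (θ = 4Nµ, M = 4N1m); the function names and the input values are hypothetical and are not taken from the MIMAR program.

```python
def effective_size(theta_per_bp, mu_per_bp_per_gen):
    """Effective population size from theta = 4 * N * mu."""
    return theta_per_bp / (4.0 * mu_per_bp_per_gen)


def split_time_years(t_generations, years_per_generation):
    """Convert a split time in generations to years."""
    return t_generations * years_per_generation


def scaled_migration(n1, m):
    """Scaled migration rate M = 4 * N1 * m."""
    return 4.0 * n1 * m


mu = 2e-8          # mutation rate per bp per generation, as assumed in the Results
theta1 = 8e-4      # hypothetical per-bp population mutation rate
n1 = effective_size(theta1, mu)
print(n1)                              # about 10,000
print(split_time_years(45000, 20))     # 900,000 years for 45,000 generations
print(scaled_migration(n1, 1e-5))      # about 0.4
```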

The parameters θ1, θ2 and θA are defined per base pair and are chosen from

uniform distributions; the time in generations, T , is also chosen from a uniform

distribution. The prior for the migration rate is on the expected number of individuals

in population 1 replaced by migrants (backward in time), M = 4N1m, where N1 is

obtained from θ1 by dividing by 4µ̂ (where µ̂ is the estimate of µ). Specifically, ln(M) is

chosen from a uniform distribution.
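As an illustration of these priors, the sketch below draws one set of demographic parameters: θ1, θ2, θA and T from uniform distributions, and M from a distribution that is uniform on the log scale. The function name, the lower bounds of zero and all numerical bounds are assumptions made for the example, not values used in this study.

```python
import math
import random


def sample_from_prior(rng, theta_max, t_max, m_min, m_max):
    """Draw (theta1, theta2, thetaA, T, M) from the priors described above.

    Thetas and T are uniform on [0, max]; ln(M) is uniform on
    [ln(m_min), ln(m_max)], so M itself is log-uniform.
    """
    log_m = rng.uniform(math.log(m_min), math.log(m_max))
    return {
        "theta1": rng.uniform(0.0, theta_max),
        "theta2": rng.uniform(0.0, theta_max),
        "thetaA": rng.uniform(0.0, theta_max),
        "T": rng.uniform(0.0, t_max),
        "M": math.exp(log_m),
    }


rng = random.Random(1)  # fixed seed so the example is reproducible
draw = sample_from_prior(rng, theta_max=0.005, t_max=100000, m_min=0.001, m_max=10.0)
print(draw)
```

Sampling ln(M) rather than M places as much prior mass between 0.001 and 0.01 as between 1 and 10, which matters when the question is whether there has been any gene flow at all.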

In addition to the five demographic parameters, there are a number of locus-

specific parameters. We assume that each locus follows the infinite sites mutation

model (Kimura, 1969), then define an inheritance scalar, u, which, for example, is

equal to 1 for autosomal, 3/4 for X-linked and 1/4 for Y- and mtDNA-linked loci.

To allow for mutation rate variation among loci with the same mode of inheritance,

we introduce an additional scalar, v, for each locus. Given this parameterization,

the locus-specific mutation rate in population 1 is given by uvZθ1, where Z is the

length of the locus in base pairs; the locus-specific population mutation rates for other

populations are defined analogously.
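A short numerical sketch of this parameterization follows. The dictionary and function names are hypothetical; the scalars are the example values given above, and the per-bp θ is an arbitrary illustrative number.

```python
# Inheritance scalars u, as listed in the text.
INHERITANCE_SCALAR = {"autosome": 1.0, "X": 0.75, "Y": 0.25, "mtDNA": 0.25}


def locus_theta(theta_per_bp, length_bp, mode, v=1.0):
    """Locus-specific population mutation rate u * v * Z * theta.

    theta_per_bp: per-bp population mutation rate of the population
    length_bp:    locus length Z in base pairs
    mode:         mode of inheritance, selecting the scalar u
    v:            locus-specific mutation rate scalar (1.0 = average rate)
    """
    return INHERITANCE_SCALAR[mode] * v * length_bp * theta_per_bp


# A 650 bp locus with theta1 = 0.001 per bp (hypothetical value).
print(locus_theta(0.001, 650, "autosome"))    # about 0.65
print(locus_theta(0.001, 650, "X"))           # about 0.49
print(locus_theta(0.001, 650, "X", v=1.2))    # same locus, mutating 20% faster than average
```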

The population recombination rate per base pair is defined as ρ = 4N1c, where c

is the per base pair per generation recombination rate. We ignore gene conversion,

treating all recombination as crossovers alone. We also define an inheritance scalar

for recombination, w (w = 0 for the mtDNA and Y, 2/3 for X and 1 for autosomes).

We then consider three options to specify the locus-specific population recombination

rate. First, we can fix ρ across loci, such that the population recombination rate at a

locus is wZρ. Alternatively, if an estimate, ρ̂, of the population recombination rate is

available for each locus, we set the scalar w to the inheritance scalar for recombination

multiplied by ρ̂ to incorporate this knowledge in the estimation. The final option is

to allow rates to vary for each locus, in which case the locus-specific population

recombination rate is r · wZθ1 and we draw the ratio r = ρ/θ1 from an exponential

distribution with mean λ for each locus. Thus, we allow for rate variation among loci,

but assume a constant rate within a locus. This model should be a sensible description

of the rate variation if the loci are short (e.g., 1 kb), as in most data sets collected

to date. The set of locus-specific population recombination rates, (ρ1, · · · , ρY ), is

referred to as P.
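The first and third options can be sketched as below (the second is omitted because it only changes how w is set). The names are hypothetical and the numerical inputs are illustrative; note that Python's `random.expovariate` takes the rate, i.e. the reciprocal of the mean λ.

```python
import random

# Inheritance scalars w for recombination, as listed in the text.
RECOMBINATION_SCALAR = {"autosome": 1.0, "X": 2.0 / 3.0, "Y": 0.0, "mtDNA": 0.0}


def locus_rho_fixed(rho_per_bp, length_bp, mode):
    """Option 1: a single per-bp rho shared by all loci, giving w * Z * rho."""
    return RECOMBINATION_SCALAR[mode] * length_bp * rho_per_bp


def locus_rho_variable(theta1_per_bp, length_bp, mode, mean_ratio, rng):
    """Option 3: draw r = rho / theta1 from an exponential with mean `mean_ratio`,
    then return r * w * Z * theta1 for this locus."""
    r = rng.expovariate(1.0 / mean_ratio)
    return r * RECOMBINATION_SCALAR[mode] * length_bp * theta1_per_bp


rng = random.Random(0)  # fixed seed for reproducibility
print(locus_rho_fixed(0.002, 1000, "X"))
print([locus_rho_variable(0.001, 1000, "autosome", 10.0, rng) for _ in range(3)])
```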

3.5.2 Data summaries

Our goal is to estimate the parameters of the isolation-migration model illustrated in

Figure 3.1. We do so by estimating the posterior distribution π(Θ|D) ∝ p(D|Θ)p(Θ),

where Θ = (θ1, θ2, θA, T,M,P), D is the data and p(Θ) denotes the prior distribu-

tion. Unfortunately, when D is the entire polymorphism data set under our model,

estimating the likelihood of the data given the parameters, p(D|Θ), is computation-

ally extremely intensive, and becomes prohibitive when recombination is included

(Nielsen and Wakeley, 2001; Hey and Nielsen, 2004). In their program IM, Hey

and Nielsen (2004) address this problem by considering the full data set and using

a Markov Chain Monte Carlo (MCMC) approach, but restricting themselves to a

model with no intra-locus recombination (i.e., P = 0). Instead, we focus on a model

with intra-locus recombination, but summarize the polymorphism data from each lo-

cus with the summary statistics described below. To do so, we initially explored an

importance sampling approach, which provided reliable estimates but was inefficient.

We then implemented an MCMC algorithm, which is more efficient than our initial

algorithm when the prior and posterior distributions differ substantially.


To summarize the data, we use the statistics introduced by Wakeley and Hey

(1997) for this type of inference problem: for each locus, we consider the number of

polymorphisms unique to the samples from populations 1 and 2 (S1 and S2 respec-

tively), the number of shared alleles between the two samples (S3), and the number

of fixed alleles in either sample (S4). Previous work has shown that these statistics

contain considerable information about the demographic parameters of the isolation-

migration model (e.g., Clark, 1997; Wakeley and Hey, 1997; Hudson and Coyne,

2002; Leman et al., 2005). In what follows, we refer to the vector of summaries, Sk,

k ∈ [1, 4], for locus y as Dy. In turn, we refer to the set of statistics for the Y loci as

D = (D1, · · · ,DY).

In calculating these statistics, we assume that an outgroup sequence is available

and can be used to determine which allele is derived without error. We note that,

in practice, it may be advisable to use two outgroup sequences to minimize error in

inferring the ancestral state. We assign each polymorphic site to one of the statistics

depending on the frequency of the derived allele in population i, fi. Specifically, if 0 < fi ≤ 1 in each population sample, the allele is shared; if fi = 0 and fj = 1, i ≠ j, the allele is fixed in sample j; and if fi = 0 and 0 < fj < 1, i ≠ j, the allele is specific

to sample j. The statistics are easy to calculate, and do not require determination of

haplotypes.
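The assignment rule above can be sketched in a few lines of Python. This is only an illustration, not the Perl script used for the analyses: the function names and the use of derived-allele counts (k1 out of n1 chromosomes, k2 out of n2) are assumptions, and sites that are monomorphic in the combined sample are rejected because the rule is stated for polymorphic sites only.

```python
from collections import Counter

def classify_site(k1, n1, k2, n2):
    """Assign one polymorphic site to S1, S2, S3 or S4.

    k1, k2: counts of the derived allele in samples 1 and 2 (hypothetical inputs);
    n1, n2: numbers of chromosomes sampled from populations 1 and 2.
    """
    if not (0 <= k1 <= n1 and 0 <= k2 <= n2):
        raise ValueError("derived counts must lie between 0 and the sample size")
    if (k1 == 0 and k2 == 0) or (k1 == n1 and k2 == n2):
        raise ValueError("site is not polymorphic in the combined sample")
    if k1 > 0 and k2 > 0:
        return "S3"                            # derived allele present in both samples
    if k1 == 0:
        return "S4" if k2 == n2 else "S2"      # fixed in, or specific to, sample 2
    return "S4" if k1 == n1 else "S1"          # k2 == 0: fixed in, or specific to, sample 1

def summarize_locus(sites, n1, n2):
    """Return (S1, S2, S3, S4) for a list of (k1, k2) derived-allele counts."""
    counts = Counter(classify_site(k1, n1, k2, n2) for k1, k2 in sites)
    return tuple(counts[name] for name in ("S1", "S2", "S3", "S4"))
```

For example, `summarize_locus([(3, 0), (0, 5), (2, 7), (0, 20), (20, 0)], 20, 20)` counts one site specific to each sample, one shared site and two fixed sites.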

3.5.3 Estimation method

Our goal is to sample from the posterior distribution, π(Θ|D) ∝ p(D|Θ)p(Θ), which

is the likelihood of the data summaries given the parameters times the prior dis-

tributions of the parameters. The parameters are initially chosen from these prior

distributions and subsequently updated using MCMC, which requires information


about the likelihood of the data given the parameters. Very briefly, our strategy is

to estimate the likelihood of the data summaries at all the loci for a chosen set of

parameters by, for each of the Y loci, i) Generating a set of X ancestral recombi-

nation graphs (ARGs) (Hudson, 1983) given the parameters and ii) Calculating the

probability of the data summaries given the set of ARGs. Specifically, we estimate

the likelihood p(D|Θ) as:

$$\prod_{y=1}^{Y} \frac{1}{X} \sum_{x=1}^{X} p(D_y \mid G_{yx}, \Theta)\, p(G_{yx} \mid \Theta), \tag{3.1}$$

where Gyx is the xth ARG at locus y (Hudson, 1983). In other words, we estimate

p(D|Θ) by taking the average of p(Dy|Gyx,Θ) over X ARGs, then taking the product

over loci (since they are assumed to be independent). The term p(Gyx|Θ) is given

by the coalescent, using a modified version of the program ms (Hudson, 2002).
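As a minimal sketch of the averaging step described above, assuming the X ARGs at each locus have already been drawn from the coalescent given Θ and that p(Dy|Gyx,Θ) has been evaluated for each of them, the estimate can be accumulated on the log scale. The function name and the nested-list input are hypothetical and are not part of MIMAR.

```python
import math

def log_likelihood_estimate(probs_per_locus):
    """Monte Carlo estimate of log p(D | Theta).

    probs_per_locus[y][x] holds p(D_y | G_yx, Theta) for the x-th ARG simulated
    at locus y (hypothetical input; the ARGs are assumed to be draws from the
    coalescent given Theta).
    """
    total = 0.0
    for probs in probs_per_locus:
        mean_p = sum(probs) / len(probs)   # average over the X ARGs of one locus
        if mean_p == 0.0:
            return -math.inf               # one incompatible locus makes the product zero
        total += math.log(mean_p)          # product over independent loci, as a sum of logs
    return total
```

Working with logarithms avoids numerical underflow when the product runs over many loci.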

We can readily calculate p(Dy|Gyx,Θ). Given a coalescent genealogy, Gyx, we

compute the sum of the lengths of all the branches (in coalescent units) on which mutations would lead to polymorphisms unique to sample 1 or sample 2 (L1 and L2, respectively), alleles shared by both samples (L3) and alleles fixed in either sample (L4). Under the infinite-sites mutation model, the number of mutations, Sk, randomly placed along the branches of type k ∈ [1, 4] is Poisson distributed with mean Lk u v Z θ1. Conditional on a genealogy, the counts S1, S2, S3 and S4 are independent, so the

probability of the data Dy for the locus y is given by:

$$p(D_y \mid G_{yx}, \Theta) = \prod_{k=1}^{4} \mathrm{Po}(S_k \mid L_k u v Z \theta_1). \tag{3.2}$$

Equation 3.2 also applies to a recombining locus, but in this case, Gyx is an ARG and Lk is computed as follows: with recombination, a locus of size Z has R segments of length Zj, j ∈ [1, R], with different genealogical histories. The genealogy of segment j has branch lengths Ljk, such that $L_k = \sum_{j=1}^{R} L_{jk} Z_j / Z$ for the ARG.
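The two calculations above (the length-weighted branch lengths of an ARG and the Poisson product of Equation 3.2) can be sketched as follows. The function names and input layout are assumptions made for illustration; u and v are simply treated as given locus-specific scalars.

```python
import math

def poisson_log_pmf(s, mean):
    """Log of the Poisson probability of observing s events given the mean."""
    if mean == 0.0:
        return 0.0 if s == 0 else -math.inf
    return s * math.log(mean) - mean - math.lgamma(s + 1)

def arg_branch_lengths(segments, Z):
    """L_k = sum_j L_jk * Z_j / Z for k = 1..4.

    segments: list of (Z_j, (L_j1, L_j2, L_j3, L_j4)), one entry per
    non-recombining segment of the locus (hypothetical input format).
    """
    return tuple(sum(zj * lj[k] for zj, lj in segments) / Z for k in range(4))

def log_prob_summaries(S, L, u, v, Z, theta1):
    """Log of Equation 3.2: sum over k of log Po(S_k | L_k * u * v * Z * theta1)."""
    return sum(poisson_log_pmf(s, l * u * v * Z * theta1) for s, l in zip(S, L))
```

A genealogy with no branch of type k (Lk = 0) but Sk > 0 receives probability zero, which is the case that step I2b of the algorithm checks for.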

Our prior distributions for the parameters, p(Θ), are uniform over a bounded

support (except for P and a uniform prior on ln(M)). For the MCMC, we use random

walk Metropolis transition kernels to propose parameter values, so that the proposed

value of a parameter is taken from a normal distribution with mean its previous value

and variance defined to maximize the acceptance rate (after exploratory simulations;

Gilks et al., 1996). If a parameter value lies outside the support of the prior, the

proposed set of parameters is rejected. In turn, P is a nuisance parameter and its

values are either fixed (when ρ is fixed), or drawn from the distribution described

above (see Model, subsection 3.5.1); in the MCMC, the values of P are sampled

independently at each step from the prior.

Our approach follows a number of Bayesian methods based on summaries of the

data, developed in other contexts (e.g., Tavare et al., 1997; Pritchard et al., 1999;

Beaumont et al., 2002; Przeworski, 2003). It differs in that we update the parameters

using MCMC rather than sampling them independently from the prior. This gen-

eral approach was described by Beaumont (2003). As pointed out to us by Matthew

Stephens (personal communication), our approach can also be viewed as an MCMC on

the set of all genealogies, G = (G11, · · · , G1X , · · · , GY 1, · · · , GY X), and the param-

eters. In this case, the X ARGs are independent samples from the coalescent prior

across the Y independent loci. Thus, for the MCMC, the set of ARGs is updated

using the transition kernel q(G → G′) = p(G′|Θ), while the parameters of interest

are updated using Metropolis transition kernels. We sample sets (G,Θ) from the


following target distribution:

$$\pi(G, \Theta) \propto \left[\prod_{y=1}^{Y} \frac{1}{X} \sum_{x=1}^{X} p(D_y \mid G_{yx}, \Theta)\right] p(\Theta)\, p(G \mid \Theta). \tag{3.3}$$

The marginal distribution of sampled values of Θ from the target distribution is

π(Θ|D) (as shown in the Supplemental Materials; see also the Appendix of Beaumont,

2003). A nice feature of viewing our approach in this way is that it demonstrates

that the stationary distribution of the Markov Chain is the correct distribution, i.e.,

that we are exploring the true posterior distribution rather than an approximation.

MIMAR − MCMC estimation of the Isolation-Migration model Allowing for

Recombination

To sample from the target distribution π(G,Θ), we use an MCMC approach

(MIMAR). In the initial step, Θ is chosen from the prior, p(Θ), and G is sampled

from the coalescent with those parameters. Subsequent sets (G,Θ) are updated

following a Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970).

More specifically, we proceed as follows:

I1. If now at (G,Θ), propose a move to (G′,Θ′) according to the transition kernels

q(Θ → Θ′) and q(G → G′) (i.e., generate X ARGs given the parameters Θ′

for each of the Y loci).

I2. For the yth locus:

a. Calculate p(Dy|G′yx,Θ′) for each of the X ARGs using Equation 3.2.

b. If the average of p(Dy|G′yx,Θ′) over the X ARGs is 0, record (G,Θ) and

go to I1 ; else go to I2a for the locus y + 1.


I3. Calculate:

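$$h = \min\left(1, \frac{A'}{A}\right), \tag{3.4}$$

where

$$A' = \left[\prod_{y=1}^{Y} \frac{1}{X} \sum_{x=1}^{X} p(D_y \mid G'_{yx}, \Theta')\right] p(\Theta')\, p(G' \mid \Theta')\, q(\Theta' \to \Theta)\, q(G' \to G)$$

$$A = \left[\prod_{y=1}^{Y} \frac{1}{X} \sum_{x=1}^{X} p(D_y \mid G_{yx}, \Theta)\right] p(\Theta)\, p(G \mid \Theta)\, q(\Theta \to \Theta')\, q(G \to G').$$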

I4. Move to (G′,Θ′) with probability h (i.e. record (G′,Θ′)) or else record (G,Θ).

Return to I1.

The choice of proposal distribution for G and P and normal kernel distributions

and uniform prior distributions for the parameters of interest lead to the following

simplification of Equation 3.4:

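$$h = \min\left(1, \frac{\prod_{y=1}^{Y} \left(\frac{1}{X} \sum_{x=1}^{X} p(D_y \mid G'_{yx}, \Theta')\right)}{\prod_{y=1}^{Y} \left(\frac{1}{X} \sum_{x=1}^{X} p(D_y \mid G_{yx}, \Theta)\right)}\right). \tag{3.5}$$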

In practice, we consider X = 100 (or X = 50 or 5, see below), thus generating

100 (50 or 5) ARGs given the locus-specific parameters. Generating so many ARGs

is computationally demanding, but we find that this approach has improved mixing

over X = 1.
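One iteration of steps I1 to I4, with the simplified acceptance probability of Equation 3.5, might be sketched as below. This is an illustrative outline under stated assumptions, not the MIMAR source code: `propose`, `in_prior_support` and `estimate_log_lik` are hypothetical callables supplied by the user, the priors are taken to be uniform on the scale on which the random walk is made, and the likelihood estimate of the current state is kept from the step at which that state was accepted, since the current set of ARGs is part of the state.

```python
import math
import random

def metropolis_step(theta, log_lik, propose, in_prior_support, estimate_log_lik, rng=random):
    """One update of (G, Theta) following steps I1-I4.

    theta: current parameter values; log_lik: log of the likelihood estimate
    stored for the current state. All three callables are hypothetical:
    propose(theta, rng) draws a symmetric random-walk proposal,
    in_prior_support(theta) checks the bounds of the uniform priors, and
    estimate_log_lik(theta, rng) simulates fresh ARGs and returns the log of
    the averaged likelihood (minus infinity if the average is 0 at any locus).
    """
    theta_new = propose(theta, rng)                     # I1: propose new parameters
    if not in_prior_support(theta_new):
        return theta, log_lik                           # outside the prior: reject
    log_lik_new = estimate_log_lik(theta_new, rng)      # I1-I2: new ARGs and their likelihood
    if log_lik_new == -math.inf:
        return theta, log_lik                           # I2b: zero likelihood, record current state
    log_h = min(0.0, log_lik_new - log_lik)             # I3: Equation 3.5 on the log scale
    if rng.random() < math.exp(log_h):                  # I4: accept with probability h
        return theta_new, log_lik_new
    return theta, log_lik
```

Calling this function repeatedly and recording the returned parameters at each step gives the chain; in this sketch, the burn-in steps would simply be discarded by the caller.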

We note that our approach presents the advantage of being flexible, since it can

easily be extended to consider any summaries for which p(Dy|Gyx,Θ) can be readily

calculated, such as the allele frequency spectrum at each locus.

MIMAR and its documentation are available at http://mplab.bsd.uchicago.edu/dataNprograms.htm and http://pps-spud.uchicago.edu/~cbecquet/download.html.


Assessing the performance of the estimator

To assess the performance of our method, we ran MIMAR on simulated data sets

with two independent seeds (see below). We considered that MIMAR reached conver-

gence when the posterior distributions from the two independent runs were highly

similar (e.g., Supplemental Figure 3.9). In the documentation provided with MIMAR, we describe a number of other criteria that can be used to assess convergence and proper mixing. We took the mode and the central 95% interval of the marginal

posterior distribution averaged over the two independent runs as the point estimate

and measure of uncertainty, respectively.
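A rough sketch of these two summaries, computed from recorded MCMC samples of a single parameter, is given below. It rests on assumptions that differ from the procedure described above: the samples from the two runs are assumed to have been pooled beforehand, and the mode is taken from a simple histogram rather than from a smoothed density. The function names are hypothetical.

```python
def central_interval(samples, level=0.95):
    """Central interval of the samples (2.5th to 97.5th percentile for level=0.95)."""
    xs = sorted(samples)
    tail = (1.0 - level) / 2.0

    def quantile(q):
        # Linear interpolation between the two nearest order statistics.
        pos = q * (len(xs) - 1)
        i = int(pos)
        frac = pos - i
        if i + 1 >= len(xs):
            return xs[-1]
        return xs[i] * (1.0 - frac) + xs[i + 1] * frac

    return quantile(tail), quantile(1.0 - tail)

def histogram_mode(samples, lower, upper, bins=50):
    """Midpoint of the fullest histogram bin over the prior support [lower, upper]."""
    width = (upper - lower) / bins
    counts = [0] * bins
    for x in samples:
        if lower <= x <= upper:
            counts[min(int((x - lower) / width), bins - 1)] += 1
    best = max(range(bins), key=counts.__getitem__)
    return lower + (best + 0.5) * width
```

The number of bins plays the role of a smoothing parameter here, so the reported mode can shift slightly with that choice.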

3.5.4 Simulated data and performance analyses

We generated simulated data sets under the isolation-migration model using a modified version of the program ms (called MIMARsim, available at http://pps-spud.uchicago.edu/~cbecquet/download.html; Hudson, 2002). Unless otherwise indicated, we considered 20 loci of 1 kb each, and sampled 20 chromosomes from each of two populations.

Performance of MIMAR under allopatry. We generated 30 simulated data sets

with no recombination and fixed parameter values relevant for Drosophila yakuba and

D. santomea (Llopart et al., 2005), assuming a per base pair per generation mutation

rate of µ = 2 × 10⁻⁹ and 20 generations per year (Andolfatto and Przeworski, 2000). We analyzed the 30 simulated data sets for 60 hours with 1 × 10⁵ burn-in steps and

prior distributions as indicated (see Supplemental Figure 3.8).

Comparison to IM under allopatry. In order to compare our estimates with

those generated by IM (Hey and Nielsen, 2004), which does not allow for intra-locus


recombination, we set the population recombination rate, ρ, to 0. To be comparable to

IM, we also chose uniform prior distributions with 0 as the lower limit. We generated

30 simulated data sets with parameters relevant for Drosophila species as above,

setting M to 0 and drawing the other parameters from prior distributions: θ1 and

θ2 from U(0, 0.01] and θA from U(0, 0.02] per base pair and T from U(0, 1.5 × 10⁷] generations. We analyzed those 30 simulated data sets with MIMAR and IM using the same prior distributions as used when simulating the data sets, 4 × 10⁶ recorded steps and 5 × 10⁵ burn-in steps.

Assessing the evidence for gene flow. We generated 40 data sets, consisting of

40 recombining loci with parameter values relevant for apes (see below). We assumed

that µ = 2 × 10⁻⁸ per base pair per generation to translate coalescent time units

into generations (Nachman and Crowell, 2000). This mutation rate estimate is also

obtained assuming a most recent common ancestor of human and chimpanzee of 7

Mya and an average nucleotide divergence of 1.28% (The Chimpanzee Sequencing and

Analysis Consortium, 2005). The intra-locus recombination rate was set for each locus

by choosing r = c/µ from the prior exp(1/0.6) (assuming that the mean c is 1.2 × 10⁻⁸; Kong et al., 2002). The other parameter values were sampled from the following prior distributions: θ1, θ2 and θA from U(0.0006, 0.006] per base pair and T from U(0, 1 × 10⁵] generations. M was either fixed to 0 (for 20 data sets simulated under the allopatric model) or to 1 (for 20 data sets simulated with parapatric divergence). We analyzed the 40 simulated data sets with MIMAR, choosing ln(M) from U[−5, 2] and the other parameters from the same prior distributions as used when simulating the data sets, with the number of ARGs per locus set to X = 50, 4 × 10⁶ recorded steps and 5 × 10⁵ burn-in steps.
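To make the sampling scheme for these simulations concrete, the sketch below draws one parameter set from the distributions listed above. It assumes that exp(1/0.6) denotes an exponential distribution with rate 1/0.6, that is, with mean 0.6 = (1.2 × 10⁻⁸)/(2 × 10⁻⁸); the function name and the dictionary layout are hypothetical, and the simulation of the sequence data itself (done with MIMARsim) is not shown.

```python
import random

def draw_simulation_parameters(rng, n_loci=40, mean_r=0.6):
    """Draw one parameter set for the ape-like simulations (illustrative only)."""
    theta_lo, theta_hi = 0.0006, 0.006           # per base pair
    params = {
        # The text uses half-open intervals (a, b]; random.uniform may return
        # either endpoint, a difference that is negligible for this sketch.
        "theta1": rng.uniform(theta_lo, theta_hi),
        "theta2": rng.uniform(theta_lo, theta_hi),
        "thetaA": rng.uniform(theta_lo, theta_hi),
        "T": rng.uniform(0.0, 1e5),              # split time in generations
    }
    # One ratio r = c / mu per locus, from an exponential with mean mean_r.
    params["r"] = [rng.expovariate(1.0 / mean_r) for _ in range(n_loci)]
    return params
```

Passing a seeded generator, for example `draw_simulation_parameters(random.Random(1))`, makes the draws reproducible.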


Effect of uncertainty in the intra-locus recombination rates. We generated

16 simulated data sets, consisting of 10 recombining loci with parameter values rel-

evant for Drosophila species. The intra-locus recombination rate was set for each

locus by choosing r = c/µ from the prior exp (1/10) (assuming that the mean c is

2 × 10⁻¹⁰; Andolfatto and Przeworski, 2000). M was fixed to 0 and the other parameter values were sampled from the following prior distributions: θ1, θ2 and θA from U(0.001, 0.01] per base pair and T from U(0, 1 × 10⁶] generations. We then analyzed

the data sets with MIMAR in three ways: i) the locus-specific population recombination

rates were fixed to their true values, ii) the locus-specific population recombination

rates were sampled from the same prior as used when generating the simulated data

and iii) the locus-specific population recombination rates were set to 0. For the three

sets of analysis, we fixed θ1, θ2 and θA to their true value and used the same prior

distribution for T as when generating the simulated data. MIMAR was run with X = 5

(i and ii) or X = 100 (iii), 4.5 × 10⁵ recorded steps and 5 × 10⁴ burn-in steps.

3.5.5 Analysis of ape polymorphism data

Polymorphism data. We analyzed the ape polymorphism data reported in Fischer

et al. (2006), Thalmann et al. (2006) and Yu et al. (2003). The first set was kindly pro-

vided by A. Fischer (Max Planck Institute for Evolutionary Anthropology, Leipzig,

Germany); we downloaded the two other data sets from GenBank (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=Nucleotide). The data from Fischer et al. (2006) (and Thalmann et al., 2006) and Yu et al. (2003) consisted of loci of median length ∼780 bp and ∼470 bp, respectively. The data sets were as follows (see Tables 3.3−3.5): 69 loci surveyed in nine unrelated bonobos (pygmy chimpanzee, Pan paniscus), 26 loci in ten and 43 loci in six Western chimpanzees (P. t. verus), 26 loci in ten and 42 in five Central chimpanzees (P. t. troglodytes), 26 loci in ten Eastern chimpanzees (P. t. schweinfurthii), 15 loci in 15 Western gorillas (Gorilla gorilla) and three Eastern gorillas (G. beringei), and 19 loci in six Sumatran orangutans (Pongo pygmaeus abelii) and ten Bornean orangutans (P. p. pygmaeus).

For each locus, we obtained two outgroup sequences. For the chimpanzee data sets,

one orangutan and one human sequences were available for 26 and 19 loci, respectively

(Fischer et al., 2006); one human (Yu et al., 2002) and one gorilla sequences (G. g.

gorilla, Yu et al., 2004) were obtained for 43 loci. We blasted the seven remaining loci

and downloaded a homologous human sequence for each of them (BLASTN, http://www.ncbi.nlm.nih.gov/BLAST, Altschul et al., 1990). For the gorilla data set,

one orangutan and one human sequences were available for all loci; for the orangutan

data set, one chimpanzee and one human sequences were available for all loci (Fischer

et al., 2006). We used CLUSTALW in MEGA3.1 (Thompson et al., 1994; Kumar

et al., 2001) to align the resequencing data and the two outgroup sequences. We

then wrote a Perl script that requires both outgroup sequences to be consistent to

infer the ancestral state at each site, thus minimizing error in the reconstruction of

the ancestral state. We ignored sites with gaps, missing data and more than two

variants. (There was only one site with three alleles in the entire gorilla data set,

three in the orangutan data, and six in the chimpanzee/bonobo data.) We used a

Perl script to calculate for each locus the four statistics S1, S2, S3 and S4 (see above)

and FST (Hudson et al., 1992) for pairs of populations, as well as the mean pairwise

differences (π; Nei and Li, 1979) and Tajima’s D (Tajima, 1989) in each population.

Estimates of mutation rate variation. To allow for variation in mutation rates,

we used the scalars v described above. To do so, we calculated the mean pairwise


divergence per site between a human sequence and an ape sequence (div), using a Perl

script. We obtained the expected locus divergence given the number of base pairs,

E(div) · Z, where E(div) is the mean divergence per base pair over the loci, and

performed a goodness-of-fit test using Pearson’s χ² (Frisse et al., 2001). The gorilla and orangutan data did not deviate significantly from expectation (p-values = 0.24 and

0.85 respectively); therefore, we set v = 1 for all loci in the analysis of these two data

sets. However, data from the three common chimpanzee populations and the bonobo

rejected the null hypothesis of a homogeneous mutation rate across loci (at the 5%

level). Thus, for a pair of chimpanzee populations or species, we set v at a locus to

the observed divergences per base pair divided by E(div).
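The calculation of the scalars v and of the Pearson statistic can be sketched as follows. Several details are assumptions rather than a description of the Perl script that was used: E(div) is taken as the unweighted mean of the per-locus divergences per base pair, the observed count at a locus is the number of differences between the human and ape sequences, and the degrees of freedom are set to the number of loci minus one. The function name is hypothetical, and converting the statistic to a p-value would require a χ² distribution function that is not included here.

```python
def mutation_scalars(differences, lengths):
    """Return (v, chi2, df) for a set of loci.

    differences[y]: number of human-ape differences at locus y;
    lengths[y]: number of base pairs Z_y compared at locus y.
    """
    per_bp = [d / z for d, z in zip(differences, lengths)]
    mean_div = sum(per_bp) / len(per_bp)            # E(div), assumed unweighted mean
    expected = [mean_div * z for z in lengths]      # E(div) * Z for each locus
    chi2 = sum((d - e) ** 2 / e for d, e in zip(differences, expected))
    v = [p / mean_div for p in per_bp]              # observed divergence / E(div)
    return v, chi2, len(lengths) - 1
```

With this definition, the scalars v average to 1 across loci, so θ1 keeps its interpretation as an average per-base-pair parameter.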

Recombination rates. Won and Hey (2005) found evidence of recombination in

bonobos, Central and Western chimpanzees in ten of the 43 short segments surveyed

by Yu et al. (2003) used in this study. We estimated the locus-specific recombination rate in the data sets using MAXDIP (http://genapps.uchicago.edu/maxdip/index.html, Hudson, 2001), setting 0.005 as the initial value and the gene conversion rate to 0. From each species, we chose the subspecies with the largest estimate

of the mean recombination rate across loci, which were Central chimpanzees, West-

ern gorillas and Sumatran orangutans. Then, to assess whether the point estimates

were significantly greater than 0, we simulated 1,000 data sets using ms (Hudson, 2002), setting the number of segregating sites to what was observed and ρ to 0. We ran MAXDIP on the simulated data sets and calculated how many times the estimate of ρ from MAXDIP was equal to or larger than the observed estimate under the standard neutral model. By this approach, we rejected ρ = 0 for four out of 35 loci in Central

chimpanzees, one out of seven loci in Western gorillas and three out of 15 loci in

Sumatran orangutans (at the 5% level). Given the small sample sizes, our power


to detect recombination was limited. Nonetheless, our results suggest that ignoring

recombination will result in a loss of data − even in species in which ρ/θ is rela-

tively small. In the analyses of the ape data, we chose r = ρ/θ1 for each locus from

exp (1/0.6) (see above). We chose this distribution because it has been shown to be

a good description of fine-scale recombination rate variation in humans and may also

apply to a number of other organisms, notably to other apes (Coop and Przeworski,

2007).
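The simulation-based test of ρ = 0 described in this paragraph amounts to an empirical p-value. A minimal sketch is given below, assuming that the estimates of ρ from the simulated data sets are already available as a list; the function name is hypothetical, and neither ms nor MAXDIP is called here.

```python
def empirical_p_value(observed, simulated):
    """Fraction of simulated estimates that are equal to or larger than the observed one."""
    return sum(1 for s in simulated if s >= observed) / len(simulated)
```

Under this sketch, ρ = 0 would be rejected at the 5% level for a locus when the returned value is below 0.05.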

Analyses. We ran MIMAR for 2 × 10⁷ recording steps with 1 × 10⁶ burn-in steps,

X = 50, recording the parameters every 50 steps and using prior distributions chosen

after preliminary analyses. We repeated our analyses for two independent seeds, and

considered that convergence was reached when the posterior distributions of both

runs were very similar (results not shown). Results reported are for the average from

the two independent runs. We obtained estimates of the effective population sizes

and split times in years for all the ape species and subpopulations using µ = 2 × 10⁻⁸

per base pair per generation and assuming 20 years per generation for chimpanzees

and orangutans (Gage, 1998; Fischer et al., 2004) and 15 years per generation for

gorillas (Thalmann et al., 2006).
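The conversions in this paragraph can be illustrated with a short sketch. It assumes the standard autosomal relationship θ = 4Nµ per base pair and a split time T already expressed in generations; both function names are hypothetical.

```python
def effective_size(theta_per_bp, mu=2e-8):
    """Effective population size N from theta = 4 * N * mu (assumed relationship)."""
    return theta_per_bp / (4.0 * mu)

def split_time_years(t_generations, years_per_generation):
    """Split time in years, given T in generations and the generation time."""
    return t_generations * years_per_generation
```

For instance, θ = 0.0008 per base pair corresponds to N of about 10,000 under these assumptions, and a split 50,000 generations ago corresponds to 1 million years with 20 years per generation or 750,000 years with 15 years per generation.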

Goodness-of-fit test. We investigated how well the data fit the estimated model

by generating the posterior predictive distributions of the four statistics S1, S2, S3,

and S4 summed over all loci, the mean FST (Hudson et al., 1992), and, in each

population, the mean pairwise differences (π; Nei and Li, 1979) and the mean Tajima’s

D (Tajima, 1989) across loci. To do so, we simulated data sets under the isolation-

migration model, sampling the parameters from the posterior distribution estimated

by MIMAR. We then compared the observed values of the statistics to the simulated


distribution (see Supplemental Figure 3.10), conservatively considering the model to

be a poor fit if the observed value of a data summary fell in the 2.5th percentile tails

of any statistic. We note that, since this goodness-of-fit test takes into account the

uncertainties associated with the estimates, it is similar to the Bayesian posterior

predictive p-value (e.g., Meng, 1994).
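The tail criterion used in this goodness-of-fit test can be sketched as follows, assuming that the values of one statistic from the posterior predictive simulations are available as a list. The function name is hypothetical, and the treatment of ties (counting simulated values equal to the observed one in both tails) is a choice made for this sketch.

```python
def is_poor_fit(observed, simulated, tail=0.025):
    """True if the observed value lies in the lower or upper 2.5% tail of the simulations."""
    n = len(simulated)
    lower = sum(1 for s in simulated if s <= observed) / n   # mass at or below the observation
    upper = sum(1 for s in simulated if s >= observed) / n   # mass at or above the observation
    return min(lower, upper) < tail
```

Applying the function to each statistic in turn (S1 to S4, FST, π and Tajima's D) and flagging the model if any of them returns True reproduces the conservative criterion described above.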

3.6 Acknowledgments

Many thanks to G. Coop, R. Hudson, J. Novembre, J. Pritchard, D. Reich, M.

Stephens, K. Teshima and K. Zeng, as well as three reviewers for helpful discussions

and/or comments on earlier versions of the manuscript. This work was supported by

an Alfred P. Sloan fellowship in Computational Molecular Biology to M.P. C.B. would

also like to acknowledge support from the Summer Institute in Statistical Genetics

(2006).


3.7 Appendix A: Supplemental Materials

Recall that our goal is to sample from the posterior distribution:

π(Θ|D) ∝ p(D|Θ)p(Θ).

To do so, we perform MCMC on the set of all genealogies,

G = (G11, · · · , G1X , · · · , GY 1, · · · , GY X),

and the parameters, Θ = (θ1, θ2, θA, T,M,P), by sampling sets (G,Θ) from the tar-

get distribution, π(G,Θ) (see Equation 3.3 in Methods). By integrating Equation

3.3 over genealogies G, we show that the marginal distribution of sampled values of

Θ from the target distribution is the posterior distribution of interest, π(Θ|D):

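$$
\begin{aligned}
\int \pi(G,\Theta)\,dG &\propto \int \left[\prod_{y=1}^{Y} \frac{1}{X} \sum_{x=1}^{X} p(D_y \mid G_{yx},\Theta)\right] p(\Theta)\, p(G \mid \Theta)\,dG \\
&\quad \text{(realizing that the integration is over independent } G_{yx}\text{)} \\
&= \left[\prod_{y=1}^{Y} \frac{1}{X} \sum_{x=1}^{X} \int p(D_y \mid G_{yx},\Theta)\, p(G_{yx} \mid \Theta)\,dG_{yx}\right] p(\Theta) \\
&= \left[\prod_{y=1}^{Y} \frac{1}{X} \sum_{x=1}^{X} p(D_y \mid \Theta)\right] p(\Theta) \\
&= p(D \mid \Theta)\, p(\Theta).
\end{aligned}
$$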

3.8 Appendix B: Supplemental Figures


[Figure 3.8(a): panels for θ1, θ2, θA and T.]

Figure 3.8: Smoothed posterior distributions estimated by MIMAR from simulated data sets. The distributions are the average over two independent runs. Each panel (a−c) shows results for 10 simulated data sets consisting of 20 loci surveyed at 20 chromosomes from each population, with no intra-locus recombination or migration and θ1 = θ2 = 0.005 per base pair, a) with θA = 0.005 per base pair and T = 2.5 × 10⁶ generations, b) with θA = 0.005 and T = 5 × 10⁶ and c) with θA = 0.008 and T = 2.5 × 10⁶. The range of the x-axis corresponds to the support of the prior and the red vertical line denotes the true value of the parameter (see Methods in main text for details).

[Figure 3.8(b): panels for θ1, θ2, θA and T.]

Figure 3.8 − continued: Results for 10 data sets simulated with no intra-locus recombination or migration, θ1 = θ2 = θA = 0.005 per base pair and T = 5 × 10⁶ generations.

[Figure 3.8(c): panels for θ1, θ2, θA and T.]

Figure 3.8 − continued: Results for 10 data sets simulated with no intra-locus recombination or migration, θ1 = θ2 = 0.005 and θA = 0.008 per base pair and T = 2.5 × 10⁶ generations.

[Figure 3.9(a): panels for θ1, θ2, θA and T.]

Figure 3.9: Smoothed posterior distributions estimated by IM (black) and MIMAR (grey) from simulated data sets. Results from two independent runs are shown. Both methods ran for the same number of steps and were smoothed similarly. a) and b) are the results for two data sets chosen from the 30 data sets simulated under the allopatric model without intra-locus recombination to show a case in which IM (a) or MIMAR (b) performed better (see Figure 3.2 in main text). The range of the x-axis corresponds to the support of the prior and the red vertical line denotes the true value of the parameter (see Methods in main text for details).

[Figure 3.9(b): panels for θ1, θ2, θA and T.]

Figure 3.9 − continued: A case in which MIMAR performed better.

[Figure 3.10(a): panels for the S statistics (S1, S2, S3 and S4), Fst, π and D.]

Figure 3.10: Goodness-of-fit of the isolation-migration model for the ape species and subspecies data. Distributions of the four statistics used by MIMAR (the polymorphisms specific to the first and second sample and shared and fixed between the two samples, S1, S2, S3 and S4, respectively), as well as Fst, π and Tajima’s D (see Methods in main text for details). The vertical lines correspond to the observed values. Shown are results for a) bonobos (blue) and Western chimpanzees (red), b) bonobos and Central chimpanzees, c) bonobos and Eastern chimpanzees, d) Western and Central chimpanzees, e) Western and Eastern chimpanzees, f) Central and Eastern chimpanzees, g) Western and Eastern gorillas and h) Sumatran and Bornean orangutans. In this case, the estimated model does not seem to provide a good fit to D for bonobos.

[Figure 3.10(b): panels for the S statistics (S1, S2, S3 and S4), Fst, π and D.]

Figure 3.10 − continued: Results for bonobos (blue) and Central chimpanzees (red). In this case, the estimated model does not seem to provide a good fit to D for both bonobos and Central chimpanzees, as well as to π for Central chimpanzees.

[Figure 3.10(c): panels for the S statistics (S1, S2, S3 and S4), Fst, π and D.]

Figure 3.10 − continued: Results for bonobos (blue) and Eastern chimpanzees (red). In this case, the estimated model does not seem to provide a good fit to D for bonobos.

[Figure 3.10(d): panels for the S statistics (S1, S2, S3 and S4), Fst, π and D.]

Figure 3.10 − continued: Results for Western (blue) and Central chimpanzees (red).

[Figure 3.10(e): panels for the S statistics (S1, S2, S3 and S4), Fst, π and D.]

Figure 3.10 − continued: Results for Western (blue) and Eastern chimpanzees (red).

[Figure 3.10(f): panels for the S statistics (S1, S2, S3 and S4), Fst, π and D.]

Figure 3.10 − continued: Results for Central (blue) and Eastern chimpanzees (red). In this case, the estimated model appears to provide a poor fit to the observed FST, and D for Central chimpanzees.

[Figure 3.10(g): panels for the S statistics (S1, S2, S3 and S4), Fst, π and D.]

Figure 3.10 − continued: Results for Western (blue) and Eastern (red) gorillas.

[Figure 3.10(h): panels for the S statistics (S1, S2, S3 and S4), Fst, π and D.]

Figure 3.10 − continued: Results for Sumatran (blue) and Bornean (red) orangutans.

CHAPTER 4

CAN WE LEARN ABOUT MODES OF SPECIATION BY

COMPUTATIONAL APPROACHES?


4.1 Abstract

An enduring debate in evolutionary biology centers on the question of whether the

early stages of speciation can occur in the presence of gene flow. With this question

in mind, a number of recent papers have used multi-locus polymorphism data from

closely related species to estimate parameters of simple speciation models, usually an

isolation-migration model. Application of such computational approaches to a variety

of species has yielded extensive evidence for migration, with the results interpreted

as supporting the widespread occurrence of parapatric speciation. Here, we conduct

a simulation study to assess the reliability of such inferences, using a program that

we recently developed (MIMAR) as well as the program IM of Hey and Nielsen (2004).

We find that when one of many assumptions of the isolation-migration model is

violated, the methods tend to yield biased estimates of the parameters, potentially

lending spurious support for parapatric speciation. To some extent, this problem

can be avoided by carefully testing the fit of estimated models to the data. When a

parapatric model is appropriate, we propose a test that may help to pinpoint regions

involved in reproductive isolation.

4.2 Introduction

The biological species concept proposed by Mayr (1963) defines species as “groups

of interbreeding natural populations that are reproductively isolated from other such

groups”. Reproductive isolation can act either prezygotically, through mating pref-

erences for example, or postzygotically, in which case it stems from decreased hybrid

fitness (Coyne and Orr, 2004d). The decrease in hybrid fitness, in turn, may be due

to extrinsic environmental interactions, such as a higher predation on hybrids, or to


intrinsic genetic incompatibilities, i.e., genetic interactions between heterospecific al-

leles present in a hybrid genome that directly reduce the fitness of hybrids (so-called

Dobzhansky-Muller (DM) incompatibilities; Dobzhansky, 1937; Muller, 1940). To

date, the specific mechanisms by which loci involved in pre- or post-zygotic reproduc-

tive isolation fix between nascent species remain poorly understood. In particular,

an enduring question is how often reproductive isolation factors begin to accumulate

between populations in the face of gene flow rather than in complete isolation (e.g.,

Orr, 2005).

A common model for the origin of new species assumes that the early stage of

divergence occurs in the absence of gene flow, i.e., in allopatry. Many cases of al-

lopatric divergence have been documented and, until recently, it was believed to be

the only − or at least main − process by which species originate (e.g., Coyne and

Orr, 2004a). In this model, an exogenous barrier (e.g., a river or mountain) leads to

the isolation of two populations, and divergence then occurs homogeneously as the

genomes of the species accumulate differences. Both DM incompatibilities that fix

through genetic drift (i.e., neutrally) and loci that underlie local adaptation will con-

tribute to the reproductive isolation of species should they come back into secondary

contact but importantly, in this scenario, reproductive isolation is a by-product of

divergence rather than its cause (Mayr, 2001; Orr, 2001).

Alternatives are that species start diverging while occupying either the same geo-

graphical area (i.e., they are in sympatry) or non-overlapping, contiguous areas (i.e.,

in parapatry). Because they are in close geographic proximity, the populations ex-

change migrants and form hybrids. In this context, divergence occurs in a number of

stages, which have recently been characterized as the “genic view of speciation” (Wu,

2001b). First, natural selection leads to the fixation of alleles that contribute to the

differential adaptation of the two populations. These differentially adapted loci are

sometimes referred to as “speciation genes” because they actively reduce the fitness

of the hybrids, either through differential geographic adaptation or by contributing

to DM incompatibilities. These alleles create an effective barrier to gene flow in the

linked genomic regions, while other, unlinked regions can introgress without causing

decreased hybrid fitness (Coyne and Orr, 1998; Wu, 2001a,b; Orr et al., 2004; Coyne

and Orr, 2004c; Wu and Ting, 2004). Thus, some regions of the genome experience

a reduced effective gene flow rate between populations relative to others. As the

populations become increasingly diverged and adapted to their local environments,

more genes contributing to reproductive isolation accumulate − until the final stage

of speciation, when gene flow is prevented throughout the genome. Thus, in contrast

to the classical models of allopatric speciation, models where speciation occurs in the

presence of gene flow assume that natural selection plays a major role in the accumu-

lation of reproductive isolation loci between species. They predict that the genomes

of closely related species should be mosaics of recently introgressed and highly di-

verged regions, leading to far greater heterogeneity than expected under a model of

allopatric speciation.

Theoretical modeling suggests that speciation can occur in the presence of high levels

of gene flow (e.g., Dieckmann and Doebeli, 1999; Navarro and Barton, 2003), and the

possibility of gene flow at the initial stage of speciation is supported by the observa-

tion that species parapatrically or sympatrically distributed tend to show pre-mating

isolation more often than species distributed allopatrically (Coyne and Orr, 1989).

However, these alternative models of speciation remain controversial (Wu, 2001a,b; Mayr,

2001; Orr, 2001; Mallet, 2005), in part because of the difficulty in distinguishing para-

patry or sympatry from secondary contact after an allopatric phase (e.g., Goodman

et al., 1999; Llopart et al., 2002; Austin et al., 2004; Aagaard et al., 2005; Hall, 2005;

Knaden et al., 2005; Petren et al., 2005; Geraldes et al., 2006; Fritz et al., 2007). There

is indirect evidence that the early stages of speciation may often occur in the presence

of some gene flow (e.g., Dixon et al., 2007; Nosil, 2008). Notably, molecular studies

have recently characterized several genes that contribute to the reproductive isolation

of sister species in model organisms such as Drosophila (Barbash et al., 2003; Presgraves

et al., 2003; Ting et al., 1998; Wang et al., 1999; Sawamura et al., 1993), Mus

(Fossella et al., 2000; Greene-Till et al., 2000) and Xiphophorus (Wittbrodt et al.,

1989; Kallman and Kazianis, 2006) and the reproductive isolation factors all show ev-

idence of strong positive selection (Wu, 2001a; Orr, 2005). While it remains unclear if

these factors drove speciation or accumulated since the species formed, these findings

suggest that Darwinian selection is the major force in the emergence of reproductive

isolation (Orr, 2005), consistent with models of parapatric speciation. Unfortunately,

fewer than ten reproductive isolation factors have been identified to date, limiting the

generality of the conclusions that can be reached.

An alternative approach, facilitated by recent technical advances, is to examine

patterns of genetic variation at multiple loci surveyed in closely related species or

recently diverged populations. Such analyses suggest that divergence levels are highly

heterogeneous, consistent with the notion that speciation often occurs in the presence of

gene flow (e.g., Machado et al., 2002; Emelianov et al., 2004; Mallet, 2005; Turner

et al., 2005; Barluenga et al., 2006; Savolainen et al., 2006; Stadler et al., 2008).

However, it remains unclear whether the detected introgressions occurred in the early

or late stages of speciation (Barton, 2001), and whether the variation in divergence

rates across loci is due to introgression or to the persistence of polymorphism that

was present in the common ancestor of the two populations.

Computational approaches have recently been developed to help distinguish be-

tween these two cases by estimating the parameters of divergence models (e.g., Wake-

ley, 1996; Wakeley and Hey, 1997; Kliman et al., 2000). The simplest such model,

referred to as the isolation model, is illustrated in Figure 4.1a: It describes the case

where an ancestral population suddenly split into two populations, which subse-

quently diverged neutrally in total isolation, and assumes panmixia and constant

effective population sizes. Under this cartoon of allopatric divergence, the descen-

dant populations will differ by an increasing number of fixed polymorphisms and

will share fewer polymorphisms as the divergence time increases, with some variation

across loci due to genetic drift. For example, after approximately 9−12N generations

(where N is the effective size of the descendant population), more than 95% of loci

will not share polymorphisms across the descendant populations (Hudson and Coyne,

2002). A second model, termed isolation-migration, also assumes a sudden

split from an ancestral population, but the two descendant populations have experienced

a constant rate of gene flow since the split (Figure 4.1b). This scenario predicts that regions

that experienced gene flow in the history of the sample will harbor shared polymor-

phisms and few fixed polymorphisms, while regions that did not experience gene flow

will have more fixed polymorphisms, with any shared polymorphisms resulting from

the persistence of ancestral polymorphisms. Thus, in the isolation-migration model,

the variation in number of shared and fixed polymorphisms across loci is expected to

be much greater than in the isolation model. In that sense, the isolation-migration

model can be thought of as a neutral approximation to parapatric (or with a high

enough migration rate, sympatric) speciation.

Currently, there are two main methods, IM (Hey and Nielsen, 2004) and MIMAR

(Becquet and Przeworski, 2007), to estimate parameters of the isolation-migration

model (Figure 4.1b). Both use multi-locus polymorphism data collected in two species

or populations and Markov Chain Monte Carlo to estimate posterior distributions of

the parameters. The first method, IM, relies on the full polymorphism data from

a variety of genetic markers (e.g., AFLP, microsatellites or resequencing data) and

allows for a wide range of complex demographic models, including population bottle-

necks and growth. However, it assumes no intra-locus recombination, so that regions

with evidence of recombination in the polymorphism data need to be excluded and

non-detectable recombination is ignored. This limitation can lead to biased estimates

of the isolation-migration model parameters (Takahata and Satta, 2002; Hey and

Nielsen, 2004; Becquet and Przeworski, 2007). IM has nonetheless been widely used

to analyze autosomal data, by considering subsets of the data with no apparent re-

combination (Hey and Nielsen, 2004). While some studies did not detect gene flow

(e.g., Dolman and Moritz, 2006), most did (e.g., Llopart et al., 2005; Won et al., 2005;

Bull et al., 2006; Geraldes et al., 2006; Kronforst et al., 2006; Carling and Brumfield,

2008; Mazzoni et al., 2008). The authors usually interpreted the gene flow as recent

or as reflecting secondary contact (e.g., Llopart et al., 2005; Geraldes et al., 2006),

under the assumption that the early stages of speciation likely occur in the absence

of gene flow. However, two recent reviews have taken the evidence to mean that the

gene flow occurred in the first stages of species formation, concluding that parapatric

speciation may be common (Hey, 2006b; Niemiller et al., 2008).

To circumvent the assumption of no intra-locus recombination made by IM, we

recently developed a second approach, MIMAR (Becquet and Przeworski, 2007). In

contrast to IM, the method uses summaries of the polymorphism data (from each

of the independently-evolving loci) rather than the full data. Specifically, it uses

four statistics known to be sensitive to the parameters of the isolation-migration

model (Wakeley and Hey, 1997; Leman et al., 2005). In simulations, MIMAR performs

comparably to IM for medium to large size data sets, providing reliable estimates of

the parameters and reasonable power to detect gene flow. MIMAR, like the original

version of IM, assumes a constant and uniform gene flow rate since the split, so

cannot distinguish between models of speciation with gene flow occurring at an early

or late stage. Thus, interpreting the results in terms of modes of speciation is not

straightforward.

In drawing inferences about speciation from computational methods, a further

complication is that the approaches may not be robust to realistic departures from the

assumptions of the models. For instance, in many analyses of the isolation-migration

model published to date (e.g., Hey et al., 2004; Hey, 2005; Won and Hey, 2005;

Thalmann et al., 2006; Becquet and Przeworski, 2007), the estimate of the ancestral

effective population size is more than two fold larger than that of the descendant

populations. Although in some cases these results may reflect a true reduction in

effective population sizes after the split, this is unlikely to be the case for so many

populations. Instead, the large estimates of ancestral population size may indicate

that a salient and fairly common demographic feature is being ignored. For instance,

geographical structure in the ancestral population or changing gene flow rates since

the split could lead to biased estimates of the parameters when not taken into account

(e.g., Innan and Watanabe, 2006).

Motivated by these considerations, we wanted to assess the reliability of these two

methods and in particular to investigate how realistic violations of model assumptions

affect parameter estimates. We find that parameter estimates tend to be biased in

the presence of realistic complications. In particular, non-zero gene flow is often

detected under models that would not be considered cases of parapatric or sympatric

speciation. When a model of parapatry is appropriate, we propose a simple approach

to identify candidate regions that contribute to reproductive isolation between species.

4.3 Method

4.3.1 The isolation model and violations

Figure 4.1a depicts an isolation model in which, T generations ago, a panmictic an-

cestral population of constant effective population size NA suddenly split into two

panmictic populations of constant effective population sizes N1 and N2, respectively,

which subsequently diverged neutrally in total reproductive isolation (i.e., the pro-

portion of a population that is replaced by migrants from the other population each

generation, m, is 0). This model can be viewed as representing a simple case of

allopatric speciation.

We wanted to assess the reliability of the methods with parameters that are

roughly similar to those of real data sets analyzed to date. For example, a num-

ber of papers have analyzed common chimpanzee and bonobo polymorphism data,

two species whose divergence time is estimated to be approximately 1 million years

and whose effective population sizes are on the order of N = 10,000 (Won and Hey,

2005; Becquet and Przeworski, 2007). Assuming 20 years per generation (Gage, 1998;

Fischer et al., 2004), these great ape species therefore split ∼5N generations ago. In

turn, one of the most studied cases of Drosophila speciation is that of D. santomea and

D. yakuba, which are thought to have split roughly 500 thousand years ago (Kya), i.e., about

5 million generations ago (assuming 10 generations per year). Their effective

population sizes are about N = 2 × 10⁶ (Bachtrog et al., 2006), so they split

roughly 2N generations ago. Motivated by these two cases, we arbitrarily considered

T = 3.2N. With this parameter value, on average ∼25−34% of loci will have complete

lineage sorting (i.e., will not share polymorphisms) between the two species under the

isolation model (Hudson and Coyne, 2002). We further set all effective population

sizes to be the same, at N1 = N2 = NA = 6.25 × 10⁵. Computational burden limited

our ability to consider other parameters, but we did run a subset of the models with

T = 0.32N generations, and results were qualitatively similar (results not shown).
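The conversion of split times into units of N generations used above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name is ours, and the inputs are the approximate values quoted in the text.

```python
def split_time_in_units_of_n(years_since_split, years_per_generation, effective_size):
    """Return a split time expressed in units of N generations."""
    generations = years_since_split / years_per_generation
    return generations / effective_size


# Chimpanzee and bonobo: ~1 million years, 20 years per generation, N = 10,000.
chimp_bonobo = split_time_in_units_of_n(1e6, 20, 1e4)

# D. santomea and D. yakuba: ~500 thousand years, 10 generations per year
# (i.e., 0.1 years per generation), N = 2 x 10^6.
drosophila = split_time_in_units_of_n(5e5, 0.1, 2e6)

print(f"chimpanzee-bonobo: {chimp_bonobo:.2f} N generations")
print(f"D. santomea-D. yakuba: {drosophila:.2f} N generations")
```

With these inputs the two cases come out at about 5N and 2.5N generations, which bracket the value T = 3.2N considered here.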

For the performance of IM and MIMAR to be comparable, we set the intra-locus

recombination rate to 0 (since IM does not allow for intra-locus recombination).

Thus, in what follows, we consider data sets of 20 independently-evolving and non-

recombining 1kb loci. In real data, there usually is intra-locus recombination so a

data set of this size would generally contain more information than modeled here.

Given these parameters, we generated 20 data sets under the isolation model. We

further simulated 10 data sets for each parameter combination under two models in which one of the assumptions of the

isolation model is violated. The parameters were as above, other than in the following

respects: we took T1 ≈ 1.33, 2 or 4·T (i.e., 4.3N, 6.4N or 12.8N generations; Figure 4.1c) or T2 = 0.25, 0.50 or 0.75·T (Figure

4.1d), where the split time, T, was fixed in all cases to 3.2N generations, and the

gene flow rate after T1 (or T2) was m1 (or m2) = 4 × 10⁻⁷ or 4 × 10⁻⁶, while directly

after the split m = 0. For these gene flow rates, the average number of migrants

exchanged each generation by the two populations, M1 = 4N1m1 (or M2 = 4N1m2),

is 1 or 10, respectively. We used a modification of the program ms (Hudson, 2002) to

generate the simulated data sets. The command lines used to generate each case are

given in the Supplemental Materials (Appendix A).
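As a check on the values quoted above, the population migration rate M = 4N1m can be computed directly from N1 and m. The snippet below is a minimal sketch with names of our choosing. It also expresses the split time in units of 4N generations, which is the time scale used by coalescent simulators such as ms; this rescaling is shown only as an illustration and is not taken from the command lines in Appendix A.

```python
def migrants_per_generation(effective_size, migration_fraction):
    """Population migration rate M = 4 * N * m."""
    return 4 * effective_size * migration_fraction


N1 = 6.25e5                      # effective population size used in the simulations
migration_fractions = (4e-7, 4e-6)

M_values = [migrants_per_generation(N1, m) for m in migration_fractions]
for m, M in zip(migration_fractions, M_values):
    print(f"m = {m:g}  ->  M = 4*N1*m = {M:g}")

# Split time T = 3.2N generations, rescaled to units of 4N generations.
T_generations = 3.2 * N1
T_in_units_of_4N = T_generations / (4 * N1)
print(f"T = {T_in_units_of_4N:g} in units of 4N generations")
```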

Figure 4.1: Simple models of speciation. a) The isolation model, b) The isolation-migration model, c) A model of isolation from a structured ancestral population, d) A model of isolation followed by secondary contact and e) A model of isolation with migration only at an early stage. Note that in c) the population structure for T < t < T1 differs from that for t < T. The models displayed in a), c) and d) are neutral cartoons of allopatric speciation, while models b) and e) have been viewed as neutral cartoons of parapatric speciation. m, m1 and m2 denote the generational fraction of a population that is replaced by migrants from the other population. N1, N2 and NA are the effective population sizes, T is the divergence time and T1 and T2 are times of other events in generations; see Methods for further detail.

Simulations under a model of isolation from a structured ancestral pop-

ulation (Figure 4.1c). We generated simulated data sets under a neutral model

of isolation from a structured ancestral population. Specifically, we assumed that,

T1 generations ago, an ancestral population became structured into two populations

that exchanged migrants at constant rate m1 > 0. Then, T generations ago, this

structured ancestral population suddenly split into two descendant populations that

subsequently diverged in total reproductive isolation (i.e., m = 0). Note that the

population structure in the ancestral population does not correspond to the two

populations that subsequently diverged. This scenario can be viewed as allopatric

speciation from a structured ancestral population, since the split leading to the two

contemporary populations occurs in the absence of gene flow.

Simulations under a model of isolation followed by secondary contact (Fig-

ure 4.1d). We generated simulated data sets under a neutral model of isolation

followed by secondary contact as follows: Starting T generations ago, the descendant

populations diverged in total reproductive isolation (i.e., m = 0) until T2 generations

ago, when they experienced gene flow at constant rate m2 > 0. This scenario can be

viewed as allopatric speciation followed by secondary contact, since the split initially

occurs in the absence of gene flow.

4.3.2 The isolation-migration model and violations

In Figure 4.1b, we represent an isolation-migration model, in which T generations

ago, the ancestral population suddenly split into two populations, which subsequently

diverged in presence of a constant rate of gene flow (i.e., a constant m > 0). This

model can be viewed as representing a neutral cartoon of parapatric speciation (see

Introduction). We generated 20 simulated data sets under this model, each consisting

of 20 non-recombining 1 kb loci (note that, to make the results comparable, the

results for this model in Figure 4.5 are for a randomly chosen subset of 10 data

sets). The parameter values were as described above for the isolation model, but

with m = 4 × 10⁻⁷ since the split.

Simulations under a model of isolation with migration at an early stage

(Figure 4.1e). We generated simulated data sets under an isolation-migration

model, but with changing gene flow rates over time: Starting T generations ago,

the descendant populations diverged in the presence of gene flow at constant rate

m > 0 until T2 generations ago, after which they diverged in total reproductive

isolation (i.e., m2 = 0). This scenario can be viewed as a neutral representation

of parapatric speciation, since the split initially occurs in the presence of gene flow.

Six sets of 10 data sets were simulated under this model, one for each combination of

T2 = 0.25, 0.50 or 0.75·T and m = 4 × 10⁻⁷ or 4 × 10⁻⁶.
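The five scenarios of Figure 4.1 differ only in when gene flow occurs. The sketch below summarizes them as a piecewise function of time; it is our own illustration (the function name, the model labels "a" to "e", and the convention of returning None when the ancestral population is panmictic are assumptions), not the code used to run the simulations.

```python
def migration_rate(model, t, T, m=0.0, T1=None, T2=None):
    """Gene flow rate at time t (generations before present) under models a-e.

    T is the split time; T1 (> T) is the onset of ancestral structure in model c;
    T2 (< T) is the time at which gene flow starts (model d) or stops (model e).
    Returns None when there is a single panmictic ancestral population.
    """
    if model == "c":
        if t < T:
            return 0.0               # descendants diverge in isolation
        return m if t < T1 else None  # structured ancestor between T and T1
    if t >= T:
        return None                  # before the split: one ancestral population
    if model == "a":
        return 0.0                   # isolation
    if model == "b":
        return m                     # constant gene flow since the split
    if model == "d":
        return m if t < T2 else 0.0  # secondary contact: recent gene flow only
    if model == "e":
        return 0.0 if t < T2 else m  # gene flow only at an early stage
    raise ValueError(f"unknown model: {model!r}")


# Example: T = 3.2 (in units of N generations), T2 = 0.25 * T, m = 4e-7.
T = 3.2
for model in "abde":
    print(model, migration_rate(model, 0.5, T, m=4e-7, T2=0.25 * T),
          migration_rate(model, 2.0, T, m=4e-7, T2=0.25 * T))
```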

Simulations under an isolation-migration model with a locus with an un-

usual history. We generated simulated data sets, consisting of 20 non-recombining

1 kb loci, under the isolation-migration model, but varying the introgression rates

among loci. Specifically, the polymorphism data for 19 of the loci were simulated

under the isolation-migration model with the same parameters as described above for

the simple isolation-migration model (i.e., the model depicted in Figure 4.1b, with m = 4 × 10⁻⁷

since the split). The remaining locus was used to mimic a region closely linked to a

reproductive isolation factor, which experienced restricted effective migration because

of selection against introgression. To do so, we simply generated polymorphism data

at the remaining locus under the same model as the 19 other loci, but with the gene

flow rate, m, set to 0 (model i, see Figure 4.1a).

We were also interested in considering a model in which the data included a locus

near which a recent selective sweep had occurred (e.g., at a reproductive isolation

factor). We did so by simulating a reduced effective population size in one of the

populations (Galtier et al., 2000). In this model ii, the data were generated with the

same parameters as above for the simple isolation-migration model (i.e., the model depicted in

Figure 4.1b, with m = 4 × 10⁻⁷ since the split), but the data at one of the 20 loci

were simulated with N1 = (1/10)N2, while the data at the other 19 were generated with

N1 = N2.

4.3.3 Estimating the parameters of the isolation-migration model

From the simulated data, we calculated the statistics used by MIMAR, namely for each

locus: the number of derived polymorphisms unique to the samples from populations

1 and 2 (S1 and S2, respectively), the number of shared derived alleles between the

two samples (Ss), and the number of fixed derived alleles in either sample (Sf ),

assuming a known ancestral state (Becquet and Przeworski, 2007).
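To make the four statistics concrete, the sketch below classifies the sites of one locus from two samples of haplotypes coded as 0 (ancestral) and 1 (derived). It is an illustration under our own assumptions rather than the MIMAR implementation: in particular, we assume that a site segregating in one sample and monomorphic in the other counts towards S1 or S2 whichever allele is fixed in the other sample, and that Sf counts sites where the derived allele is fixed in one sample and absent from the other.

```python
def summarize_locus(sample1, sample2):
    """Count S1, S2, Ss and Sf for one locus.

    sample1, sample2: lists of haplotypes; each haplotype is a sequence of
    0 (ancestral) / 1 (derived) states over the same sites.
    """
    n1, n2 = len(sample1), len(sample2)
    counts = {"S1": 0, "S2": 0, "Ss": 0, "Sf": 0}
    for site in range(len(sample1[0])):
        d1 = sum(hap[site] for hap in sample1)   # derived copies in sample 1
        d2 = sum(hap[site] for hap in sample2)   # derived copies in sample 2
        segregating1 = 0 < d1 < n1
        segregating2 = 0 < d2 < n2
        if segregating1 and segregating2:
            counts["Ss"] += 1                    # shared polymorphism
        elif segregating1:
            counts["S1"] += 1                    # polymorphic in sample 1 only
        elif segregating2:
            counts["S2"] += 1                    # polymorphic in sample 2 only
        elif (d1 == n1 and d2 == 0) or (d1 == 0 and d2 == n2):
            counts["Sf"] += 1                    # derived allele fixed in one sample only
    return counts


# Toy example with three haplotypes per sample and four sites.
pop1 = [[1, 0, 1, 1], [0, 0, 1, 1], [0, 0, 1, 0]]
pop2 = [[0, 1, 0, 1], [0, 0, 0, 1], [0, 0, 0, 0]]
toy_counts = summarize_locus(pop1, pop2)
print(toy_counts)
```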

We then analyzed the summarized simulated data sets using MIMAR and the full

simulated data sets using IM to estimate the five parameters of the isolation-migration

model (see Figure 4.1b). For the two methods to be comparable the five parameters of

the model, including m, were sampled from the same uniform prior distributions with

0 as the lower limit (as required by IM). Each set of 10 (or 20) data sets generated

under a specific model was analyzed using the same uniform prior distributions. In

a subset of cases, we increased the range of the priors after preliminary analyses

to ensure that marginal posterior distributions estimated by both methods fit in

the prior ranges for all 10 (or 20) data sets. Specifically, we set the upper limit of

the uniform prior distributions to be either 1.25 × 10⁶ or 2.5 × 10⁶ for N1 and N2,

2 × 10⁻⁵, 4 × 10⁻⁶ or 4 × 10⁻⁵ for m, 2.5 × 10⁶ for NA and 1 × 10⁷ generations

for T (i.e., either 4N1 or 8N1). We ran both MIMAR and IM with two different seeds

for each analysis, to check whether the methods had converged properly. We also

varied the number of burn-in and recorded steps to this end. We note that when

the assumptions of the isolation model are violated, MIMAR sometimes takes much

longer to reach convergence than when the model is valid. The problem is even more

acute with IM, which often requires more burn-in and recorded steps to reach convergence,

presumably because the approach uses the full polymorphism data. For instance,

we ran the analyses with MIMAR on a cluster of 2.60 GHz dual and 2.40 GHz quad

AMD processors, using 5 × 10⁵ burn-in and 4 × 10⁶ recorded steps. When the data

were simulated under an isolation model (Figure 4.1a), the analyses took about 3

days on average; in turn, when the data were simulated under a model of isolation

followed by secondary contact (Figure 4.1d), MIMAR ran for about 5 and 13 days when

m2 = 4 × 10⁻⁷ and 4 × 10⁻⁶, respectively. We ran IM on a cluster of 2.40 GHz Intel

processors, which is about 3.8 times slower than the cluster described above. IM

required 5 × 10⁷ burn-in and 5 × 10⁷ recorded steps for most of the analyses to reach

convergence. When the data were simulated under an isolation model (Figure 4.1a),

IM analyses took about 9.5 days on average; in turn, when the data were simulated

under a model of isolation followed by secondary contact (Figure 4.1d), IM required

5 independent heated chains to converge and took about 22.5 and 25.5 days when

m2 = 4 × 10⁻⁷ and 4 × 10⁻⁶, respectively (see also Results). In this particular case,

we do not report results for 11 of the 60 simulated data sets, as the estimates of the

posterior distribution did not seem to have converged, even after 5 × 10⁷ burn-in and

5 × 10⁷ recorded steps for 5 independent heated chains.

We used the mode of the marginal posterior distribution estimated by MIMAR or

IM as our point estimate of the parameter. We considered that the method detected

evidence of gene flow when the point estimate of m was ≥ 4 × 10⁻⁸ (corresponding to an estimated M = 4N1m ≥ 0.1).

We introduced this admittedly arbitrary criterion in previous work (Becquet and

Przeworski, 2007), because it captures the approach taken more informally in other

studies (e.g., Hey and Nielsen, 2004; Won et al., 2005; Bull et al., 2006; Geraldes

et al., 2006; Kronforst et al., 2006; Carling and Brumfield, 2008; Mazzoni et al.,

2008). Other criteria, such as whether most of the density of the estimated marginal

posterior distribution of m clusters around 0 or peaks away from 0 (e.g., Hey and

Nielsen, 2004) yield the same qualitative results (not shown).
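The point estimate and detection rule described above can be sketched as follows. This is a minimal illustration, assuming that the marginal posterior is available as a list of recorded MCMC values and that its mode is approximated by the midpoint of the fullest histogram bin; the function names and the binning scheme are ours and do not describe the internals of MIMAR or IM.

```python
from collections import Counter


def posterior_mode(samples, lower, upper, n_bins=100):
    """Approximate the mode of a marginal posterior by the fullest histogram bin."""
    width = (upper - lower) / n_bins
    # Assign each recorded value to a bin; values at the upper limit go in the last bin.
    bins = Counter(min(int((x - lower) / width), n_bins - 1) for x in samples)
    fullest_bin = max(bins, key=bins.get)
    return lower + (fullest_bin + 0.5) * width


def detects_gene_flow(m_estimate, threshold=4e-8):
    """Detection criterion used in the text: point estimate of m >= 4 x 10^-8."""
    return m_estimate >= threshold


# Toy posterior sample for m, with a uniform prior on (0, 4e-6).
recorded_m = [1.5e-7, 1.6e-7, 1.7e-7, 1.8e-7, 6.0e-7, 2.3e-6]
m_hat = posterior_mode(recorded_m, 0.0, 4e-6, n_bins=20)
print(m_hat, detects_gene_flow(m_hat))
```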

4.3.4 Goodness-of-fit test

Brief description of the test. We previously described the use of posterior predic-

tive probabilities to test whether the model estimated by MIMAR provides an adequate

fit to the data (Becquet and Przeworski, 2007). In this study, we used this goodness-

of-fit test on models estimated by MIMAR and IM from a simulated data set (hereafter

referred to as the “observed” data). We first summarized the polymorphism data

at each locus by the four statistics described above (S1, S2, Ss and Sf ). We also

calculated the following statistics: a measure of the differentiation between the two

populations, FST (Hudson et al., 1992), and, in each population, the number of pair-

wise differences, π (Nei and Li, 1979) and Tajima’s D (Tajima, 1989). From those, we

obtained a set $\{S_1^+, S_2^+, S_s^+, S_f^+, \overline{F}_{ST}, \overline{\pi}_1, \overline{\pi}_2, \overline{D}_1, \overline{D}_2\}$, where each element is the sum

($X^+$) or mean value ($\overline{X}$) of statistic X across the 20 loci in the data set. We wanted

to use these statistics to gauge how well the estimated isolation-migration model fits

the data. To do so, we generated 5,000 or more data sets by sampling the parameters

from the posterior distributions estimated by MIMAR (or IM) from the observed data

set. From these simulations, we obtained the distributions of the nine statistics under

the estimated model. These distributions can be used to obtain “posterior predictive”

p-values (Meng, 1994), i.e., to calculate the probability of the “observed statistic” or

a more extreme value under the estimated model.
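For reference, the three additional statistics can be computed from aligned haplotypes as sketched below. This is an illustrative implementation, not the code used in this study: we assume that FST is computed as 1 − Hw/Hb, with Hw taken as the unweighted average of the two within-population values of π, and that each sample contains at least four haplotypes when computing Tajima's D.

```python
from itertools import combinations
from math import sqrt


def n_differences(hap_a, hap_b):
    """Number of sites at which two aligned haplotypes differ."""
    return sum(a != b for a, b in zip(hap_a, hap_b))


def mean_pairwise_differences(sample):
    """pi: average number of differences over all pairs within one sample."""
    pairs = list(combinations(sample, 2))
    return sum(n_differences(a, b) for a, b in pairs) / len(pairs)


def tajimas_d(sample):
    """Tajima's D; returns None when the sample has no segregating sites."""
    n = len(sample)
    s = sum(1 for column in zip(*sample) if len(set(column)) > 1)
    if s == 0:
        return None
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    pi = mean_pairwise_differences(sample)
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


def hudson_fst(sample1, sample2):
    """F_ST = 1 - Hw / Hb; returns None when no between-sample differences exist."""
    h_within = (mean_pairwise_differences(sample1)
                + mean_pairwise_differences(sample2)) / 2
    between = [n_differences(a, b) for a in sample1 for b in sample2]
    h_between = sum(between) / len(between)
    if h_between == 0:
        return None
    return 1 - h_within / h_between


singleton = ["1000", "0000", "0000", "0000"]   # one rare variant: negative D
balanced = ["1000", "1000", "0000", "0000"]    # intermediate frequency: positive D
print(tajimas_d(singleton), tajimas_d(balanced))
```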

Some of the statistics that we considered are discrete and take on few values, so

that the p-values obtained in this way are not continuous. We therefore calculated

randomized probabilities, PR, as follows: we obtained the probabilities P(X < X*)

and P(X ≤ X*), where X* is the value of the statistic X in the “observed” data

set, then chose a randomized probability uniformly on (P(X < X*), P(X ≤ X*)).

We conservatively considered the model to be a poor fit if the observed value of

any statistic fell in either 2.5% tail of its simulated distribution, i.e., if PR(X < X*) < 0.025

or PR(X < X*) > 0.975.
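The randomization step can be written in a few lines. The sketch below is our own illustration of the procedure described above: the function names are hypothetical, and the simulated values stand in for the posterior predictive distribution of one statistic.

```python
import random


def randomized_probability(simulated, observed, rng=random):
    """Draw PR uniformly between P(X < X*) and P(X <= X*).

    simulated: values of a statistic in data sets simulated under the
    estimated model; observed: its value X* in the "observed" data set.
    """
    n = len(simulated)
    p_less = sum(x < observed for x in simulated) / n
    p_less_or_equal = sum(x <= observed for x in simulated) / n
    return rng.uniform(p_less, p_less_or_equal)


def is_poor_fit(p_r, tail=0.025):
    """Flag a statistic whose observed value lies in either tail."""
    return p_r < tail or p_r > 1 - tail


# Toy example: a discrete statistic that takes few values.
simulated_values = [0, 0, 1, 1, 1, 2, 3, 3, 4, 5]
p_r = randomized_probability(simulated_values, observed=1, rng=random.Random(1))
print(p_r, is_poor_fit(p_r))
```

For a continuous statistic the two bounds coincide, so PR reduces to the usual one-sided posterior predictive probability.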

Locus-specific goodness-of-fit test. We took a simple approach to identify loci

with an unusual history. The strategy is the same as described for the goodness-

of-fit of the model, but here, the idea is to apply the test to each locus, i.e., use

the simulated distribution of the nine statistics for a single locus (as opposed to

the full data sets of 20 loci) to look for unusual patterns. To do so, we considered

the randomized probabilities PR(X < X^l) and PR(Y > Y^l), where X^l and Y^l

are the values of the statistics X and Y at locus l, X ∈ {S1, S2, Ss, π1, π2, D1, D2}

and Y ∈ {Sf, FST}. We focused on these p-values to detect potentially interesting

loci because one expects fewer shared and more fixed polymorphisms and larger FST

values at loci involved in the reproductive isolation of two species (e.g., Hey, 1991,

2006b), and fewer polymorphisms and lower π and D values at loci that have been

subject to recent positive selection (Maynard Smith and Haigh, 1974).

A concern is whether such probabilities behave as regular p-values, i.e., are uni-

formly distributed under the null model. To assess this, we performed a locus-specific

goodness-of-fit test at each of 20 loci in 20 data sets simulated under the isolation

and isolation-migration model (Figure 4.1a and b and see above), and obtained the

posterior predictive probabilities of the nine statistics for each locus for both models.

We then assessed whether the posterior predictive p-values for a given statistic are

approximately uniformly distributed (using a Kolmogorov-Smirnov test). We found

that, while they are not strictly uniform, the fit is reasonable (results not shown), and

we therefore used these Bayesian posterior predictive p-values as we would regular

p-values, considering a locus to be significant when PR(X) < 0.05. We also investi-

gated the power to detect loci with an unusual history by combining the statistics.

To do so, we took the product of the randomized probabilities PR(X < X^l) and

PR(Y > Y^l) even though the statistics are not independent (following Voight et al.,

2005).
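Combining the statistics at a locus then amounts to multiplying its randomized probabilities and ranking loci by the result. The sketch below illustrates this with hypothetical locus names and values. Because the statistics are not independent, we treat the product as a ranking score rather than as a p-value in its own right.

```python
from math import prod


def combined_score(randomized_probabilities):
    """Product of the randomized probabilities obtained at one locus."""
    return prod(randomized_probabilities)


def rank_loci(per_locus_probabilities):
    """Return locus names sorted from most to least unusual (smallest product first)."""
    scores = {locus: combined_score(values)
              for locus, values in per_locus_probabilities.items()}
    return sorted(scores, key=scores.get)


# Hypothetical values of PR(X < X^l) and PR(Y > Y^l) for three loci.
example = {
    "locus_1": [0.50, 0.40],
    "locus_2": [0.01, 0.20],
    "locus_3": [0.90, 0.90],
}
ranking = rank_loci(example)
print(ranking)
```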

4.4 Results

We previously simulated data sets under the isolation-migration model and used

MIMAR to estimate parameters, finding that the method provides reliable estimates

of the parameters, and that, in the absence of recombination, the performance is

comparable to that of IM for medium to large data sets. We further found that

MIMAR has reasonable power to distinguish between a model with and without gene

flow since the split (Tables 1 and 2 of Becquet and Przeworski, 2007). In the present

study, we investigated the reliability of such methods when the data do not conform

to all the assumptions of the model.

4.4.1 Performance of MIMAR and IM under the isolation and

isolation-migration models

We began by investigating the reliability of MIMAR and IM under the simple isolation

(Figure 4.1a) and isolation-migration models (Figure 4.1b). We confirmed our previ-

ous results that MIMAR reliably distinguishes between a model with or without gene

flow, usually estimating m < 4 × 10⁻⁸ for the isolation model and m ≥ 4 × 10⁻⁸ for

the isolation-migration model (Methods and Supplemental Table 4.3 in Appendix B;

see also Table 2 of Becquet and Przeworski, 2007); as expected, IM also distinguishes

well between the two models. Both methods usually provide comparable and reliable

estimates of the parameters of interest. In particular, the estimates of the ancestral

effective population size tend to be fairly precise and are close to unbiased

(Figure 4.2 and Supplemental Table 4.3). In turn, the gene flow rates are

over-estimated slightly more by MIMAR than by IM under the isolation model (results not

shown), but tend to be well estimated under the isolation-migration model (results

not shown).

The least reliable estimate appears to be that of the split time, which is under-

estimated by MIMAR (and to a lesser extent, IM) under the isolation-migration model

(Supplemental Table 4.3 and Figure 4.2; see also Table 2 of Becquet and Przeworski,

2007). Because the split time, ancestral effective population size and gene flow rate

are correlated, estimates of other parameters also tend to be off in cases where T is

poorly estimated.

We performed a goodness-of-fit test of the estimated models and found that in

over 90% of the simulations, the models estimated by MIMAR or IM fit the data. In

the remaining few cases, the model provided a poor fit to Tajima’s D calculated in

each population.

Figure 4.2: Estimates provided by MIMAR and IM from data simulated under the isolation (Figure 4.1a) and isolation-migration models (Figure 4.1b). The y-axis shows the estimate provided by the mode of the marginal posterior distributions divided by the true value. The results are shown for two parameters: the ancestral effective population size, NA (a), and the divergence time, T (b), for 20 data sets simulated under the isolation model (m = 0, black) and the isolation-migration model (m = 4 × 10⁻⁷, blue).

In summary, under the isolation-migration model, the models estimated by MIMAR

and IM are usually comparable, providing concordant marginal posterior distributions

(results not shown), and a good fit to the data. Both methods detect whether or not

the data were generated in the presence of a constant gene flow rate since the split

and provide close to unbiased estimates of the ancestral effective population size.

However, MIMAR tends to under-estimate the time of divergence when there is some

gene flow since the split.

4.4.2 Effect of violations of the isolation model (Figure 4.1a)

Allopatric divergence from a structured ancestral population

Isolation-migration models estimated by MIMAR or IM from empirical data often

yield an estimate of the ancestral effective population size that is much larger than

that of either descendant population (e.g., Hey et al., 2004; Hey, 2005; Won and

Hey, 2005; Thalmann et al., 2006; Becquet and Przeworski, 2007). Yet, MIMAR and

IM do not tend to over-estimate this parameter when the data are generated under

the isolation-migration model (Figure 4.3a and Supplemental Table 4.3; and see Table

2 of Becquet and Przeworski, 2007), so these results do not simply reflect a bias in

the estimators. Instead, they suggest that an aspect of real data not taken into

account by the models considered by the two methods leads to large estimates of the

ancestral effective population size. We investigated whether such results could reflect

geographical structure in the ancestral population. To do so, we simulated data sets

under a model of isolation from a structured ancestral population (Figure 4.1c), for

a range of parameter values. We then applied MIMAR and IM to these data sets to

estimate the parameters of the isolation-migration model (see Methods).

As shown in Figure 4.3, the estimates provided by MIMAR and IM tend to be less

reliable and precise than when the data are simulated under a simple isolation model

(case T1 = T in Figure 4.3; see also Supplemental Table 4.3). For instance, MIMAR

can yield over-estimates of the ancestral effective population size and large under-

or over-estimates of the divergence time, depending on the parameter values (Figure

4.3). In contrast to MIMAR, IM tends to systematically over-estimate the divergence

time but can either under- or over-estimate the ancestral effective population size. We

note that the marginal posterior distributions of the parameters provided by IM and

MIMAR are often markedly dissimilar in this case (results not shown). Perhaps most

importantly, MIMAR and IM reject the isolation model in all cases (i.e., m ≥ 4×10−8),

even though the only gene flow in the model occurs in the ancestral populations, before

the split that led to the two descendant populations (Table 4.1).
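
The detection criterion used here reduces to a threshold on the point estimate of m. A minimal sketch of how the proportions in Table 4.1 could be tallied is given below; the function names are hypothetical, and only the cut-off of 4 × 10−8 comes from the text.

```python
GENE_FLOW_CUTOFF = 4e-8  # arbitrary cut-off on the estimated gene flow rate m, as in the text


def gene_flow_detected(m_estimate, cutoff=GENE_FLOW_CUTOFF):
    """True if the estimated gene flow rate reaches the cut-off."""
    return m_estimate >= cutoff


def proportion_detected(m_estimates, cutoff=GENE_FLOW_CUTOFF):
    """Proportion of analyses in which non-zero gene flow is called."""
    return sum(gene_flow_detected(m, cutoff) for m in m_estimates) / len(m_estimates)
```

A proportion of 1.00 for a model without gene flow since the split therefore means that every analysis yielded an estimate of m at or above the cut-off.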

[Figure 4.3, four panels: estimated NA divided by true NA for (a) MIMAR and (b) IM, and estimated T divided by true T for (c) MIMAR and (d) IM. The x-axis shows T1 = 3.2N (T1 = T), 4.3N, 6.4N and 12.8N for m1 = 4 × 10−7 and m1 = 4 × 10−6.]

Figure 4.3: Estimates provided by MIMAR (a and c) and IM (b and d) from data simulated under models of isolation from a structured ancestral population (Figure 4.1c). See legend of Figure 4.2 for details. The results are for NA (a−b) and T (c−d) under six models of isolation from a structured ancestral population (see Methods) in which T1 = (1/0.75), 2 or 4·T and m1 = 4 × 10−7 (blue) or 4 × 10−6 (red). For comparison, we show the estimates provided by the methods for a randomly chosen subset of 10 data sets generated under the correct model, i.e., the isolation model (Figure 4.1a, in which T1 = T). Note, these are preliminary results as all analyses were not finished or did not reach convergence in time for this draft.

Model c: Ancestral structure
                       m1 = 4 × 10−7            m1 = 4 × 10−6
T1 in generations      4.3N   6.4N   12.8N      4.3N   6.4N   12.8N
MIMAR                  1.00   1.00   1.00       1.00   1.00   1.00
IM                     1.00   1.00   1.00       NA     1.00   NA

                       m or m2 = 4 × 10−7       m or m2 = 4 × 10−6
T2 in generations      0.8N   1.6N   2.4N       0.8N   1.6N   2.4N
Model d: Secondary contact
MIMAR                  1.00   1.00   1.00       1.00   1.00   1.00
IM                     1.00   1.00   1.00       1.00   1.00   1.00
Model e: Early gene flow
MIMAR                  0.80   0.20   0.00       1.00   0.40   0.00
IM                     0.20   0.00   0.00       0.00   0.00   0.00

Table 4.1: Proportion of analyses in which non-zero gene flow was detected. Proportion of analyses for which the estimate of the gene flow rate m ≥ 4 × 10−8 (see Methods and Becquet and Przeworski, 2007, for details). We show the results provided by MIMAR and IM on data sets simulated under three models illustrated in Figures 4.1c−e with each of the six combinations of parameters that we considered (see Methods). Note, the results for model c) are preliminary as all analyses were not finished or did not reach convergence in time for this draft, and for model d) we show the results of the 49 of 60 IM analyses that reached convergence.

For some parameters, the goodness-of-fit test (notably using FST ) could detect

that the models estimated by MIMAR do not fit data simulated under models of isolation

from a structured ancestral population. But for others (e.g., T1 = (1/0.75)·T = 4.2N

generations and m1 = 4×10−7), our goodness-of-fit test usually tended not to detect

that the estimated model is invalid (Table 4.1 and Supplemental Tables 4.4−4.5).

In summary, the simulations indicate that the large estimates of the ancestral

effective population size provided by IM and MIMAR on real data could reflect structure

in the ancestral population − even structure unrelated to subsequent divergence.

Worryingly, however, the results from both methods could be interpreted incorrectly

as rejecting a model of allopatric speciation. Overall, MIMAR seems to be slightly more

sensitive to model misspecification than IM (Supplemental Tables 4.4−4.5), which is

perhaps not surprising since MIMAR considers only a subset of the information used

by IM when there is no intra-locus recombination, such as in these cases.

Allopatric divergence followed by secondary contact

When allopatric species have recently come back into secondary contact,

the data could be incorrectly interpreted as reflecting parapatric speciation. This

motivated us to investigate the effect of isolation followed by secondary contact on

estimates provided by MIMAR and IM. To do so, we simulated sets of 10 data sets

under such a model for several combinations of parameter values (see Methods and

Figure 4.1d).

As shown in Figure 4.4, the estimates provided by the two methods become less

reliable and precise than when the data are simulated under the simple isolation model

(case T2 = 0 in Figure 4.4; see also Supplemental Table 4.3). The two methods

tend to under-estimate the ancestral effective population size; in turn, MIMAR provides

under-estimates of the time of divergence while IM tends to slightly over-estimate this

parameter. We also find that MIMAR estimates substantial gene flow rates, in all cases

rejecting the isolation model; while IM also detects gene flow in all cases, the estimates

tend to be closer to our arbitrary cut-off m = 4×10−8 (results not shown). Moreover,

the goodness-of-fit test could usually not detect that the estimated model is incorrect

(Supplemental Tables 4.6−4.7). In this case, MIMAR seems to be more affected than

IM by the misspecification of the model. The posterior distributions estimated by

MIMAR are often not well resolved: the marginal posterior distribution for the split

time is often bimodal and, for other parameters, the distributions are often flat (data

not shown). Similarly, IM has great trouble reaching convergence in this case (see

Methods). These observations likely reflect the difficulty of fitting a simple isolation-

migration model to data simulated under a more complex model. Also in this case

the marginal posterior distributions provided by IM and MIMAR are often markedly

dissimilar.

4.4.3 Effect of violations of the isolation-migration model (Figure

4.1b): Parapatry with gene flow only at an early stage

Thus far, we considered cases where models of allopatry give rise to parameter esti-

mates that are likely to be interpreted as supporting parapatric speciation. But the

reverse may also occur: In some cases of parapatric speciation, early gene flow may

not be detected by IM or MIMAR and thus the data could be incorrectly interpreted

as reflecting allopatric speciation. In addition, the large estimates of the ancestral

effective population size provided by the methods on real data could also result from

such cases, for example if gene flow rates have decreased since the population

split (e.g., Innan and Watanabe, 2006). To assess this possibility, we investigated

the effect of a change in gene flow rates over time, simulating 10 data sets under a

model of isolation with migration only at an early stage, for several combinations of

parameter values (see Methods and Figure 4.1e).

[Figure 4.4, four panels: estimated NA divided by true NA for (a) MIMAR and (b) IM, and estimated T divided by true T for (c) MIMAR and (d) IM. The x-axis shows T2 = 0N (T2 = 0), 0.8N, 1.6N and 2.4N for m2 = 4 × 10−7 and m2 = 4 × 10−6.]

Figure 4.4: Estimates provided by MIMAR (a and c) and IM (b and d) from data simulated under models of divergence in isolation followed by secondary contact (Figure 4.1d). Results for six models of isolation followed by secondary contact are shown, with combinations of T2 = 0.25, 0.50 and 0.75·T and m2 = 4 × 10−7 (blue) or 4 × 10−6 (red) (see legend of Figure 4.3 and Methods for details). For comparison, we show the estimates provided by the methods for a randomly chosen subset of 10 data sets simulated under the correct model, i.e., the isolation model (Figure 4.1a, in which T2 = 0). Note that only 49 of 60 IM analyses reached convergence in this case.

IM (but not MIMAR) tends to slightly over-estimate the ancestral effective popu-

lation size in this case (Supplemental Table 4.9). Since IM does not over-estimate

this parameter under the isolation-migration model, these results are likely due to

the presence of gene flow shortly after the split. Moreover, IM, and to a lesser extent

MIMAR, tend not to detect gene flow in this case (using m ≥ 4 × 10−8 as a crite-

rion). In this case again, both methods often provide markedly dissimilar marginal

posterior distributions for the parameters of the model, with IM apparently slightly

more affected by the misspecification of the model (Figure 4.5 and Supplemental Ta-

bles 4.8−4.9). This result presumably reflects that the summary statistics considered

by MIMAR are less sensitive to the presence of gene flow only at the early stage of

divergence.

We also performed a goodness-of-fit test and found that in most cases the models

estimated by MIMAR or IM fit the data. So, our simple goodness-of-fit test cannot

detect that the data did not conform to the assumption of a constant gene flow rate

since the split.

In summary, our results suggest that the large estimates of the ancestral effective

population size provided by IM on real data could reflect changing gene flow rates since

the split. Moreover, IM tends to not detect gene flow in this case, and so the results

would incorrectly be interpreted as consistent with allopatric speciation. In contrast,

MIMAR does not seem to provide upwardly biased estimates of this parameter (at least

for the models we investigated here) and usually detects some gene flow, despite the

fact that the introgression events are old.

[Figure 4.5, four panels: estimated NA divided by true NA for (a) MIMAR and (b) IM, and estimated T divided by true T for (c) MIMAR and (d) IM. The x-axis shows T2 = 0N, 0.8N, 1.6N, 2.4N and 3.2N for gene flow rates of 4 × 10−7 and 4 × 10−6, with the reference cases T2 = T and T2 = 0.]

Figure 4.5: Estimates provided by MIMAR (a and c) and IM (b and d) when data are simulated under models of isolation with migration only at an early stage (Figure 4.1e). The results for six models of isolation with migration at an early stage are shown: T2 = 0.25, 0.50 and 0.75·T, and m = 4 × 10−7 (blue) or 4 × 10−6 (red) (see legend of Figure 4.3 and Methods for details). For comparison, we show the estimates provided by the methods for a randomly chosen subset of 10 data sets generated under the correct models, i.e., the isolation-migration model with m = 4 × 10−7 (Figure 4.1b, in which T2 = 0) and the isolation model with m = 0 (Figure 4.1a, in which T2 = T). The methods tend to under-estimate T, but this is not an effect of the model misspecification since this parameter is also under-estimated when the gene flow is constant after the split.

4.4.4 Detecting loci with unusual history

To date, only a handful of genes involved in the reproductive isolation between pairs

of species have been characterized, through time-consuming molecular approaches.

To investigate whether computational approaches such as MIMAR could be used to

help identify candidate regions for further investigation, we assumed that our set

of independently-evolving loci contains one locus closely linked to a reproductive

isolation factor. To model this, we simulated polymorphism data under an isolation-

migration model. We further assumed that one locus out of the set has either experi-

enced no gene flow since the split (model i), or that it has a lower effective population

size than other loci (model ii; see Methods for details). Applying MIMAR to these data

yields estimates similar to those obtained when the data are simulated under a simple

isolation-migration model (i.e., when all 20 loci have the same history; Supplemental

Figure 4.6 in Appendix C). Moreover, the estimated model fits the data at the 19 loci

with the same history as assessed by our goodness-of-fit test. Although these results

suggest that the inclusion of one locus with a different history has little effect on the

demographic inference for model ii, there appears to be an effect when one locus did

not experience gene flow (model i; see Supplemental Table 4.10).

Next, we wanted to investigate whether one could detect the locus with an unusual

history. We first performed a goodness-of-fit test on the full data sets and found that

the estimated models fit the full data sets well (i.e., of 20 loci, including the one

with an unusual history). The results would thus be interpreted as consistent with

parapatric speciation. Next, we applied a goodness-of-fit test to each locus separately.

Specifically, for each locus, we obtained the posterior predictive p-values for nine

statistics at this locus given the model estimated by MIMAR (see Methods). As shown

in Table 4.2, using the number of shared or fixed polymorphisms as our statistic (i.e.,


Ss and Sf ), a locus with no gene flow since the split had a significant PR value at

the 5% level in approximately two-thirds of the cases; the use of other statistics leads

to lower power. To detect a locus with a reduced effective population size, in turn,

measures of diversity (i.e., S1 and π1) appear to afford most power.
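
The per-locus test described above can be sketched as follows. The exact definition of the posterior predictive p-value is given in the Methods, which are not reproduced here, so this sketch assumes a two-sided tail probability of the observed statistic among values simulated under the estimated model; the function names and the two-sided form are assumptions.

```python
def posterior_predictive_p(observed, simulated):
    """Two-sided tail probability of `observed` among `simulated` values (assumed form)."""
    n = len(simulated)
    lower = sum(1 for s in simulated if s <= observed) / n
    upper = sum(1 for s in simulated if s >= observed) / n
    return min(1.0, 2 * min(lower, upper))


def flag_loci(observed_by_locus, simulated_by_locus, alpha=0.05):
    """Indices of loci whose statistic is significant at level alpha."""
    return [
        i
        for i, (obs, sims) in enumerate(zip(observed_by_locus, simulated_by_locus))
        if posterior_predictive_p(obs, sims) < alpha
    ]
```

Here `simulated_by_locus[i]` would hold one statistic (for instance Ss or Sf) computed on many data sets simulated for locus i under parameters drawn from the estimated posterior.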

We also investigated whether the test can be improved by combining statistics

(see Methods). We find that if we combine five statistics (S1, π1, Ss, Sf , and FST ),

the power becomes quite high (Supplemental Table 4.11) and the false positive rate

remains low.
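
The three quantities reported for these tests (power, false positive rate and false discovery rate, as defined in the legend of Table 4.2) can be computed from per-locus outcomes as in the sketch below. The function name and the input layout (two parallel lists of booleans, one entry per locus pooled over data sets) are assumptions made for illustration.

```python
def summarize_tests(significant, unusual):
    """Return (power, false positive rate, false discovery rate).

    significant[i] is True if locus i was flagged by the test;
    unusual[i] is True if locus i was simulated with an unusual history.
    """
    true_pos = sum(1 for s, u in zip(significant, unusual) if s and u)
    false_pos = sum(1 for s, u in zip(significant, unusual) if s and not u)
    n_unusual = sum(1 for u in unusual if u)
    n_regular = len(unusual) - n_unusual
    power = true_pos / n_unusual            # fraction of unusual loci that are significant
    fpr = false_pos / n_regular             # fraction of regular loci that are significant
    flagged = true_pos + false_pos
    fdr = false_pos / flagged if flagged else float("nan")  # fraction of significant loci not unusual
    return power, fpr, fdr
```

Because each data set contains 19 regular loci for every unusual one, even a low false positive rate can translate into a substantial false discovery rate.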

Thus, this simple approach seems helpful as a first step in detecting candidate

reproductive isolation loci between species pairs. We note that this method is a

simple quantification of the approach used informally in previous papers (Won and

Hey, 2005; Geraldes et al., 2006; Niemiller et al., 2008). However, we do not put

weight on the specific estimates of power or false discovery rates of the tests because

our models of loci with a reduced effective gene flow rate are highly simplified cartoons

of reproductive isolation factors. One would need much more realistic models to gain

an accurate sense of the performance of our approach, but these results suggest that

extensions of this method are worth investigating.

                                                 S1           S2          S1/S2         Ss           Sf
Model                                          i     ii     i     ii     i     ii     i     ii     i     ii
Fraction of unusual loci that are significant  0.00  0.75   0.00  0.00   0.05  0.60   0.65  0.20   0.70  0.00
Fraction of regular loci that are significant  0.02  0.01   0.02  0.02   0.05  0.03   0.01  0.01   0.01  0.01
Fraction of significant loci not unusual       1.00  0.12   1.00  1.00   0.95  0.48   0.19  0.56   0.22  1.00

                                                 FST          π1           π2           D1           D2
Model                                          i     ii     i     ii     i     ii     i     ii     i     ii
Fraction of unusual loci that are significant  0.40  0.20   0.15  0.85   0.05  0.10   0.05  0.05   0.05  0.05
Fraction of regular loci that are significant  0.03  0.07   0.02  0.03   0.04  0.05   0.04  0.06   0.04  0.04
Fraction of significant loci not unusual       0.58  0.88   0.73  0.43   0.94  0.90   0.93  0.96   0.94  0.94

Table 4.2: Results of the locus-specific goodness-of-fit tests. Results are for 20 data sets of 20 loci when the data at 19 loci were simulated under a simple isolation-migration model (with constant rate m = 4 × 10−7 since the split), while the data at one locus were generated with an unusual history, either under a model with no gene flow since the split (model i), or with 1/10th the effective population size of other loci in population 1 (model ii). We report for each statistic the fraction of unusual loci that are significant (this corresponds to the power of the test), the fraction of other loci that are significant (this corresponds to the false positive rate) and the fraction of significant loci that are not unusual (this corresponds to the false discovery rate). See Methods for details.

4.5 Discussion

We showed previously that MIMAR provides reliable estimates of the parameters of the

isolation-migration model and has reasonable power to distinguish between the isolation and

isolation-migration models (Becquet and Przeworski, 2007). Here, we confirm this

observation for MIMAR and show similar results for IM (see also Hey and Nielsen, 2004).

However, real data are unlikely to conform to the numerous assumptions of these

simple models of divergence. To investigate the robustness of parameter estimates

when the model assumptions are violated, we simulated data for cases with popula-

tion structure in the ancestral population or where gene flow rates change over time,

either preceding or following a period of isolation.

We find that the methods become unreliable, often providing biased estimates of

the parameters of interest, such as the time of divergence and the ancestral effec-

tive population size. In particular, IM tends to over-estimate the ancestral effective

population size when there is gene flow only at an early stage of divergence or geo-

graphic structure in the ancestral population. The reason for this can be understood

as follows: large estimates of the effective population sizes reflect a low rate of genetic

drift or, viewed another way, longer and more variable coalescence times. Under as-

sumptions of panmixia and constant population size, estimates of the parameter N

thus have a reasonably straightforward interpretation. However, many factors can

lengthen and broaden the distribution of coalescence times, notably geographic structure, inflating

estimates of N. In fact, in these cases, the parameter N may no longer have a clear

meaning.
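
The effect of structure on coalescence times can be illustrated with a standard textbook result that is not part of the analyses reported here. As a sketch, assume a symmetric two-deme island model with N diploid individuals per deme and a per-generation migration rate m (both symbols used here in this illustrative sense only), and write T_w and T_b for the coalescence times of two lineages sampled from the same deme and from different demes:

```latex
\begin{align}
E[T_w] &= \frac{1}{\tfrac{1}{2N} + 2m} + \frac{2m}{\tfrac{1}{2N} + 2m}\,E[T_b], \\
E[T_b] &= \frac{1}{2m} + E[T_w], \\
\Rightarrow\quad E[T_w] &= 4N, \qquad E[T_b] = 4N + \frac{1}{2m}.
\end{align}
```

Under these assumptions, two lineages from the same deme coalesce on average in 4N generations, as they would in a panmictic population of the total size 2N, whereas two lineages from different demes take a further 1/(2m) generations on average. The lower the migration rate, the longer the between-deme coalescence times, which a panmictic model can only accommodate through a larger estimate of N.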

We note that the applications to real data tend to yield very large estimates

of the ancestral effective population size (e.g., Hey et al., 2004; Hey, 2005; Won

and Hey, 2005; Thalmann et al., 2006; Becquet and Przeworski, 2007), while in our


simulations, we find estimates that are at most 2-fold higher than the truth. Thus,

while the simulations presented here point to features that may be salient in the

interpretation of MIMAR and IM, they seem, somewhat unsurprisingly, to be missing

additional aspects of real data. Alternatively, the parameter values we considered in

this simulation study may simply not have as large an effect on this parameter as those that apply to real data.

A second problem highlighted by this simulation study is that MIMAR and IM tend

to detect some gene flow, even when the data are simulated under a model of allopatric

speciation (Figures 4.1c−d). Thus, application of these computational methods may

lead researchers to incorrectly conclude that the populations have evolved in para-

patry when in fact speciation initiated in the absence of gene flow. Indeed, recent

reviews have highlighted the evidence of gene flow found by IM as evidence that spe-

ciation often initiates in the presence of gene flow (Hey, 2006b; Niemiller et al., 2008).

The latter interpretation is consistent with models of parapatric or sympatric specia-

tion and the idea that natural selection plays a major role in speciation (Orr, 2005).

While it may well be correct, the present study suggests that the computational evi-

dence per se is unreliable. The problem is that the signal relied on by computational

methods (ours as well as others) is whether a null model of allopatric speciation can

be rejected based on excess variation in divergence time among loci. This excess is

judged relative to the isolation model, i.e., with panmixia in the ancestral population,

among other features. Any departures from the null model that increase the variance

in divergence times can lead to spurious rejections of allopatry.

To try to address this problem, we used a simple goodness-of-fit test to assess

whether the estimated model does not fit the data (i.e., if the gene flow varied over

time). However, our goodness-of-fit test is often unable to detect that assumptions of

the model are violated. Better goodness-of-fit tests can undoubtedly be constructed.


For now, an alternative, when possible (i.e., when there is little intra-locus recombina-

tion in the data), may be to apply both IM and MIMAR to a data set, as our simulations

suggest that the methods provide incongruent estimates of the parameters when the

data violate assumptions of the model, but tend to yield more similar results un-

der the correct model. Using both methods may therefore help gauge whether the

estimates are readily interpretable.

Another approach would be to obtain estimates of the times of introgression

events. IM has recently been extended to allow the estimation of locus-specific gene

flow rates (Won and Hey, 2005), a feature that can potentially be used to identify loci

with restricted effective gene flow rates (i.e., putative reproductive isolation factors)

and to date introgression events (Won and Hey, 2005; Geraldes et al., 2006; Niemiller

et al., 2008). We attempted to use this feature on data where one locus out of 20

did not experience gene flow (model i), but IM did not estimate an unusually low

gene flow rate at the locus with an unusual history for the data sets we investigated

(results not shown). More generally, variation data at a locus contain only so much

information and it remains unclear if the estimates of locus-specific gene flow rates

and their timing are reliable (Becquet and Przeworski, 2007).

In summary, our results show that using computational methods to learn about

speciation mechanisms may be misleading in ways that are not always detectable

by simple goodness-of-fit tests. What is needed are methods that use more of the

polymorphism data than does MIMAR, but that allow for intra-locus recombination.

Also needed are more realistic models, allowing for varying migration rates over time

and across the genome, as well as demographic complications such as those already

considered by IM.

In cases where the isolation-migration model is found to be appropriate, computational

approaches may help to pinpoint candidate loci closely linked to reproductive

isolation factors. As a first step in this direction, we proposed a goodness-of-fit test

to detect loci with unusual evolutionary histories. The results from simulated data

are encouraging. A positive control of sorts might be to collect polymorphism data

from reproductive isolation factors as well as a series of unlinked regions. By apply-

ing our goodness-of-fit approach to the data, one could then verify that such loci are

indeed detectable by this approach.

4.6 Acknowledgments

We thank G. Coop, J. Wall and Chung-I Wu for helpful discussions and XXX for

comments on the manuscript.

4.7 Appendix A: Supplemental Materials

Command lines to generate simulations under the isolation model and

violations

To generate the data sets under the isolation model, we used a modification of the

program ms (Hudson, 2002), called MIMARsim (available at

http://pps-spud.uchicago.edu/~cbecquet/download.html, and see documentation therein

for details) with the following command line:

./mimarsim 20 -lf input20loci -u 2e-9 -w .05 -t .005 -ej 2e6 -o input4IM > input4MIMAR

The flag “-o iminputfilename” is not described in the documentation of MIMARsim,

but writes an output file in the format of the input file required by IM.

Command lines to generate simulations under a model of isolation from a

structured ancestral population (Figure 4.1c in main text). The command

line for the model with T1 = (1/0.75)·T and m1 = 4 × 10−7 was:

./mimarsim 20 -lf input20loci -q 3 -u 2e-9 -w .05 -t .005 -ej 2666700 -M 4e-07 -eh 0.75 -es 0.75 .5 -eM 0.75 0 -o input4IM > input4MIMAR

The flag “-q 3” is not described in the documentation of MIMARsim, but allows the

gene flow rate to be specified using the parameter m instead of M. The command line

for the model with T1 = 4·T and m1 = 4 × 10−6 was:

./mimarsim 20 -lf input20loci -q 3 -u 2e-9 -w .05 -t .005 -ej 8000000 -M 4e-06 -eh 0.25 -es 0.25 .5 -eM 0.25 0 -o input4IM > input4MIMAR

In turn, the command lines for the other models of isolation from a structured

ancestral population were similar, changing the values of the flags “-ej”, “-M”, “-eh”,

“-es” and “-eM”.
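
The flag values in these command lines can be related to the units used in the figures (T in units of N generations, and the scaled rate M). The sketch below assumes that “-t” is the population mutation rate 4Nu per site, that “-u” is the mutation rate u per site per generation, that “-ej” is a time in generations, and that M = 4Nm; these readings are inferred from the values themselves and from the convention of ms, so the MIMARsim documentation remains the authority. The function name is hypothetical.

```python
def scaled_parameters(theta, mu, t_generations, m):
    """Convert command-line values to N, T in units of N, and M = 4*N*m.

    Assumes theta = 4*N*mu per site (value of -t), mu per site per
    generation (value of -u), and t_generations as given to -ej.
    """
    n = theta / (4 * mu)
    return {
        "N": n,
        "T_in_N": t_generations / n,
        "M": 4 * n * m,
    }


# Values taken from the command lines above: -t .005, -u 2e-9, -ej 2e6, m = 4e-7.
params = scaled_parameters(theta=0.005, mu=2e-9, t_generations=2e6, m=4e-7)
```

Under these assumptions N is 625,000, the split time of 2 × 10^6 generations corresponds to T = 3.2N, and m = 4 × 10−7 corresponds to M = 1, which agrees with the T = 3.2N shown on the figure axes.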

Command lines to generate simulations under a model of isolation followed

by secondary contact (Figure 4.1d in main text). The command line for the

model with T2= 0.25·T and m2 = 4× 10−7 was:

./mimarsim 20 -lf input20loci -q 3 -u 2e-9 -w .05 -t .005 -ej 2e6 -M 4e-07 -eM 0.25 0 -o input4IM > input4MIMAR

In turn, the command line for the model with T2= 0.75·T and m2 = 4 × 10−6

was:


./mimarsim 20 -lf input20loci -q 3 -u 2e-9 -w .05 -t .005 -ej 2e6 -M 4e-06 -eM 0.75 0 -o input4IM > input4MIMAR

The command lines for the other models of isolation followed by secondary contact

were similar, changing the values of the flags “-M” and “-eM”.

Command lines to generate simulations under the isolation-migration

model and violations

To generate the data sets under the isolation-migration model, we used the following

command line:


4e-07 -o input4IM > input4MIMAR

Command lines to generate simulations under a model of isolation with

migration at an early stage (Figure 4.1e in main text). The command line

for the model with T2= 0.25·T and m2 = 4× 10−7 was:

./mimarsim 20 -lf input20loci -u 2e-9 -w .05 -t .005 -ej 2e6 -eM .25 1 -o input4IM > input4MIMAR

In turn, the command line for the model with T2 = 0.75·T and m2 = 4 × 10−6 was:

./mimarsim 20 -lf input20loci -u 2e-9 -w .05 -t .005 -ej 2e6 -eM 0.75 10 -o input4IM > input4MIMAR

The command lines for the other models of isolation with migration at an early

stage were similar, changing the values of the flag “-eM”.


Command lines to generate simulations under an isolation-migration model

with a locus with an unusual history. To generate the data at 19 loci for models

i and ii, we used the following command line:


4e-07 -N .005 -n .005 -o input4IM > input4MIMAR

To generate the data at the unusual locus for model i, we used the following

command line (see Methods in main text for details):

./mimarsim 1 -lf input1loci -q 3 -u 2e-9 -w .05 -t .005 -ej 2e6 -N .005 -n .005 -o input4IM > input4MIMAR

In turn, to generate the data at the unusual locus for model ii, we used the

following command line (see Methods in main text for details):


4e-6 -N .005 -n .005 -o input4IM > input4MIMAR


4.8 Appendix B: Supplemental Tables

Performance of MIMAR and IM under the isolation and

isolation-migration models

                    Isolation          Isolation-migration
                    MIMAR    IM        MIMAR    IM
N1                  1.00     1.00      1.00     1.00
N2                  1.00     1.00      1.00     1.00
NA                  0.95     1.00      0.80     0.80
T                   1.00     1.00      0.50     0.95
m ≥ 4 × 10−8†       0.00     0.00      1.00     1.00
Poor fit            0.15     0.05      0.10     0.05

Table 4.3: Proportion of MIMAR and IM analyses with parameter estimates within two-fold of their true value when the data are simulated under the isolation and isolation-migration models (Figure 4.1a−b in main text). Results for the estimates of effective population sizes, N1, N2 and NA, the divergence time, T, and the gene flow rate, m. We report the proportion of 20 analyses in which the estimated model did not fit the data for one of the statistics.
† For m, we report the proportion of 20 analyses in which gene flow was detected, i.e., where the estimate of the gene flow rate m ≥ 4 × 10−8 (see Methods in main text for details).

Effect of violations of the isolation model

Allopatric divergence from a structured ancestral population

                     m1 = 4 × 10−7            m1 = 4 × 10−6
T1 in generations    4.3N   6.4N   12.8N      4.3N   6.4N   12.8N
N1                   1.00   1.00   1.00       NA     1.00   NA
N2                   1.00   1.00   1.00       NA     1.00   NA
NA                   0.50   0.70   0.10       NA     0.14   NA
T                    1.00   0.30   0.20       NA     0.00   NA
Poor fit             0.00   0.10   0.30       NA     0.14   NA

Table 4.4: Proportion of MIMAR analyses with parameter estimates within two-fold of their true value when data are simulated under models of isolation from a structured ancestral population (Figure 4.1c in main text). See legend of Supplemental Table 4.3 for details. Note, these are preliminary results as all analyses were not finished or did not reach convergence in time for this draft.

                     m1 = 4 × 10−7            m1 = 4 × 10−6
T1                   4.3N   6.4N   12.8N      4.3N   6.4N   12.8N
N1                   1.00   1.00   1.00       NA     NA     NA
N2                   1.00   1.00   1.00       NA     NA     NA
NA                   0.50   0.57   0.13       NA     NA     NA
T                    1.00   0.29   0.25       NA     NA     NA
Poor fit             0.00   0.14   0.33       NA     NA     NA

Table 4.5: Proportion of IM analyses with parameter estimates within two-fold of their true value when data are simulated under models of isolation from a structured ancestral population (Figure 4.1c in main text). See legend of Supplemental Table 4.3 for details. Note, these are preliminary results as all analyses were not finished or did not reach convergence in time for this draft.

Allopatric divergence followed by secondary contact

                     m2 = 4 × 10−7            m2 = 4 × 10−6
T2                   0.8N   1.6N   2.4N       0.8N   1.6N   2.4N
N1                   1.00   1.00   1.00       1.00   1.00   1.00
N2                   1.00   1.00   1.00       0.70   0.90   1.00
NA                   0.60   0.70   0.50       0.60   0.90   0.60
T                    0.70   0.30   0.70       0.40   0.10   0.20
Poor fit             0.20   0.10   0.10       0.20   0.30   0.00

Table 4.6: Proportion of MIMAR analyses with parameter estimates within two-fold of their true value when data are simulated under models of isolation followed by secondary contact (Figure 4.1d in main text). See legend of Supplemental Table 4.3 for details.

                     m2 = 4 × 10−7            m2 = 4 × 10−6
T2                   0.8N   1.6N   2.4N       0.8N   1.6N   2.4N
N1                   1.00   1.00   1.00       1.00   1.00   1.00
N2                   1.00   1.00   1.00       1.00   1.00   1.00
NA                   0.70   0.60   0.80       0.71   0.80   0.43
T                    1.00   1.00   1.00       1.00   1.00   0.86
Poor fit             0.20   0.10   0.10       0.10   0.00   0.00

Table 4.7: Proportion of IM analyses with parameter estimates within two-fold of their true value when data are simulated under models of isolation followed by secondary contact (Figure 4.1d in main text). See legend of Supplemental Table 4.3 for details. Note that only 49 of 60 analyses reached convergence in this case.

Effect of violations of the isolation-migration model: Parapatry with

gene flow only at an early stage

                     m = 4 × 10−7             m = 4 × 10−6
T2                   0.8N   1.6N   2.4N       0.8N   1.6N   2.4N
N1                   1.00   1.00   1.00       1.00   1.00   1.00
N2                   1.00   1.00   1.00       1.00   1.00   1.00
NA                   0.90   0.90   1.00       1.00   0.90   1.00
T                    0.90   1.00   1.00       0.50   1.00   1.00
Poor fit             0.40   0.00   0.00       0.20   0.00   0.00

Table 4.8: Proportion of MIMAR analyses with parameter estimates within two-fold of their true value when data are simulated under models of isolation with migration only at an early stage (Figure 4.1e in main text). See legend of Supplemental Table 4.3 for details.

              m = 4×10^−7             m = 4×10^−6
T2          0.8N   1.6N   2.4N      0.8N   1.6N   2.4N
N1          1.00   1.00   1.00      1.00   1.00   1.00
N2          1.00   1.00   1.00      1.00   1.00   1.00
NA          0.90   1.00   1.00      1.00   1.00   1.00
T           0.60   1.00   1.00      0.00   0.80   1.00
Poor fit    0.10   0.00   0.00      0.00   0.00   0.00

Table 4.9: Proportion of IM analyses with parameter estimates within twofold of their true value when data are simulated under models of isolation with migration only at an early stage (Figure 4.1e in main text). See legend of Supplemental Table 4.3 for details.

Detecting loci with unusual history

                  Isolation-migration   Model i        Model ii
N1                1.00                  1.00           1.00
N2                1.00                  1.00           1.00
NA                0.80                  0.75           0.80
T                 0.50                  0.75           0.50
m ≥ 4×10^−8 †     1.00                  1.00           1.00
Poor fit          0.10                  0.05 (0.05)    0.10 (0.25)

Table 4.10: Proportion of MIMAR analyses with parameter estimates within twofold of their true value when data at a locus with an unusual history are simulated under models i and ii. Results are for 20 data sets with the data at 19 loci simulated under an isolation-migration model with a constant gene flow rate m = 4×10^−7 since the split and the data at another locus simulated without gene flow since the split (model i) or with a lower effective population size than the other loci (model ii). For comparison, we also show the results for 20 data sets simulated under a simple isolation-migration model (i.e., when the 20 loci have the same history). The "Poor fit" row reports the proportion of the 20 analyses for which the estimated model did not fit the data for at least one of the statistics, considering the 20 (isolation-migration model) or 19 (models i and ii) loci with a similar history and, in parentheses, the data including the locus with a different history. See legend of Supplemental Table 4.3 for details.

Statistics   Ss  Ss  S1  S1  Sf  FST  π1  π1  Combined Ss Sf FST

Model                                              i      ii     i      ii     i      ii     i      ii
Fraction of unusual loci that are significant      0.75   0.15   0.80   0.40   0.15   0.90   0.75   0.90
Fraction of regular loci that are significant      0.01   0.02   0.02   0.04   0.03   0.02   0.02   0.02
Fraction of significant loci not unusual           0.25   0.73   0.36   0.68   0.80   0.31   0.29   0.28

Table 4.11: Results of the best locus-specific goodness-of-fit tests. See legend of Table 4.2 and Methods in main text for details.

4.9 Appendix C: Supplemental Figure

[Figure 4.6, panel (a): estimated NA relative to the true NA (y-axis, 0.0 to 1.5) for each model (x-axis: isolation-migration, model i, model ii). Panel (b): estimated T relative to the true T (y-axis, 0.5 to 3.5) for the same three models.]

Figure 4.6: Estimates provided by MIMAR when the data at one locus with a different history are simulated with models i (blue) and ii (red). See legend of Figure 4.3 in main text for details. For comparison, we show the results for 20 data sets simulated under a simple isolation-migration model (black).

CHAPTER 5

ESTIMATING THE DEMOGRAPHIC PARAMETERS OF

HUMAN POPULATIONS


5.1 Abstract

In recent years, there has been an extensive effort to gather large data sets of human genetic diversity. The largest have so far consisted of microsatellite and SNP genotyping data in human samples from 52 populations. While these data yielded deep knowledge about the diversity and structure of human populations, they suffered from various drawbacks that made them less than ideal for learning about human demographic history. Recently, a data set was generated specifically to study human demographic history, consisting of resequencing data at 40 unlinked autosomal and X-linked loci sampled in 90 individuals from six geographically distant human populations. In this study, we apply our method MIMAR to the autosomal data to estimate demographic parameters for these six populations. MIMAR provides results consistent with previous knowledge from other sources: We estimate that the split between African and non-African populations occurred about 40−70 Kya, while the Tsumkwe San and two other African populations split 60−80 Kya. Although the six populations are found far apart from one another today, MIMAR detects extensive evidence of migration between them, suggesting that it may be problematic to ignore gene flow in models of human demographic history.

5.2 Introduction

Several data sets extensively catalogue human genetic diversity. For instance, 993 autosomal microsatellite markers were genotyped in 1056 individuals from 52 populations by the HGDP-CEPH (Cann et al., 2002; Rosenberg et al., 2002, 2005). This data set yielded important insights into the genetic structure of human populations and the extent of differentiation between them. The International HapMap Consortium, in turn, has genotyped 4 million SNPs in a panel of 270 individuals from four populations (Consortium, 2005; Frazer et al., 2007). The knowledge accumulated from such data sets has led to powerful selection scans and association studies (e.g., Voight et al., 2006). However, such studies have some limitations with respect to answering questions related to human demographic history. For instance, data sets such as HapMap suffer from an ascertainment bias, which is largely unknown. In addition, some of the sampled populations are admixed (e.g., Akey et al., 2004; Consortium, 2005), which can affect the estimates of demographic parameters such as population growth rates (Ptak and Przeworski, 2002) and therefore complicates any demographic inference from such data sets.

Wall et al. (2008) recently published a large resequencing data set from six human

populations. The data were gathered specifically in order to learn about human

demography: The populations were chosen to avoid cryptic population structure and

the 20 autosomal and 20 X-linked loci were chosen to be intergenic and far from

genes in an attempt to avoid the confounding effect of natural selection (e.g., Voight

et al., 2005; Barreiro et al., 2008). In this study, we apply MIMAR to the data from

the autosomal loci to estimate human demographic parameters.

5.3 Methods

5.3.1 Raw data description

The data set consists of 20 autosomal regions of about 20 kb located at least 100 kb away from genic regions. Three discrete segments (a locus trio) of ∼4−6 kb were sequenced to span most of each 20 kb region (Figure 5.1). These regions were chosen in areas of the human genome with medium or high recombination, making MIMAR a better method to apply to these data than IM (Hey and Nielsen, 2004), since MIMAR allows for intra-locus recombination and IM does not. The 20 locus trios were resequenced in three Sub-Saharan African populations (17 Biaka Pygmies, 18 Mandenka and 15 Tsumkwe San), a European population (16 French Basque), an Asian population (16 Han Chinese), and an Oceanian population (16 Nan Melanesian).

[Figure 5.1 labels: region of ~20 kb; 3 loci of 4−6 kb each]

Figure 5.1: Raw data description. The cartoon shows a region of about 20 kb and the locus trio (black boxes) resequenced for this region.

5.3.2 Data processing

Although MIMAR allows for intra-locus recombination, it assumes that the loci are independent, and cannot easily incorporate linkage disequilibrium (LD) information among segments. Thus, without modifying the program, we could not use the data for all three loci of each region. Instead, we selected one or two loci per region and maximized the information as follows: We tested for significant pairwise LD between the loci within a region using DnaSP (Rozas et al., 2003) in the Nan Melanesian, known to have the highest LD levels of the 52 populations considered in Conrad et al. (2006), and in the French Basque, for which there were more data (i.e., more chromosomes). We considered the LD between all pairs of polymorphic sites in the two most distant loci: If fewer than 5% of the comparisons in the Nan Melanesian and fewer than 10% of the comparisons in the French Basque showed significant LD at the 5% level (true for 6 regions), the two most distant loci were assumed to be effectively independent, so we retained both. The analyses below are for 26 loci: the two most distant loci for six regions, and the longest locus for the remaining 14 regions. Samples from related individuals, as well as samples with missing data over the full length of a locus, were removed from the data set (Tables 5.4−5.8). The remaining missing data were inferred with fastPHASE, using known haplotypes, incorporating the subpopulation labels, turning off the sampling of haplotypes and using the default values for other parameters (Scheet and Stephens, 2006). On those augmented data, we calculated FST, π (Nei and Li, 1979) and Tajima’s D (Tajima, 1989) for each population, as well as the four summaries of polymorphism required by MIMAR: the polymorphisms specific to the first sample, specific to the second sample, shared between the two samples and fixed between the two samples (S1, S2, S3 and S4, respectively; Becquet and Przeworski, 2007).
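To make the four summaries of polymorphism concrete, the following is a minimal Python sketch of how they could be counted for one locus. It is an illustration only and not the MIMAR code: the function name `polymorphism_summaries` is hypothetical, and the sketch assumes aligned haplotypes of equal length, biallelic sites and no missing data (i.e., data after the inference step described above).

```python
from typing import Sequence, Tuple


def polymorphism_summaries(sample1: Sequence[str],
                           sample2: Sequence[str]) -> Tuple[int, int, int, int]:
    """Return (S1, S2, S3, S4) for one locus.

    S1: sites polymorphic only in sample 1
    S2: sites polymorphic only in sample 2
    S3: sites polymorphic in both samples (shared)
    S4: sites monomorphic in each sample but for different alleles (fixed)
    Assumes biallelic sites and no missing data (illustrative assumption).
    """
    if not sample1 or not sample2:
        raise ValueError("both samples need at least one haplotype")
    length = len(sample1[0])
    if any(len(h) != length for h in list(sample1) + list(sample2)):
        raise ValueError("all haplotypes must have the same length")

    s1 = s2 = s3 = s4 = 0
    for site in range(length):
        alleles1 = {h[site] for h in sample1}
        alleles2 = {h[site] for h in sample2}
        poly1, poly2 = len(alleles1) > 1, len(alleles2) > 1
        if poly1 and poly2:
            s3 += 1          # segregating in both samples
        elif poly1:
            s1 += 1          # segregating in sample 1 only
        elif poly2:
            s2 += 1          # segregating in sample 2 only
        elif alleles1 != alleles2:
            s4 += 1          # fixed difference between the samples
    return s1, s2, s3, s4


# Toy example with two haplotypes per sample and five sites.
toy = polymorphism_summaries(["AAGTA", "ATGTC"], ["AAGCA", "AACCC"])
```

In the toy example, each of the four categories is represented by exactly one site, and the first site is monomorphic and identical in both samples, so it contributes to none of the counts.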

5.3.3 Analyses

We applied MIMAR to the data sets to estimate the parameters of the isolation-migration model for the 15 pairs of human populations. MIMAR provides estimates of the population mutation rates of the ancestral and descendant populations (e.g., for population one, θ1 = 4N1µ, where N1 is the effective population size of population one and µ is the mutation rate per base pair per generation), the split time in generations, T, and the symmetrical number of migrants exchanged each generation by the populations, M = 4N1m, where m is the fraction of a population made up of migrants each generation (for details, see Becquet and Przeworski, 2007). Here, we report the estimates of the effective population sizes, the split time in years (assuming 25 years per generation (e.g., Voight et al., 2005) and µ = 2×10^−8 (Nachman and Crowell, 2000)) and the gene flow rate parameter m. We performed a goodness-of-fit test that rejected the null hypothesis of a homogeneous mutation rate across the loci (Becquet and Przeworski, 2007). So, for each locus, we fixed the scalar for mutation rate variation to the ratio of observed to expected divergence. We obtained the average recombination rates estimated by Myers et al. (2005) for one or more segments of the genome that included a specific locus. In the analyses with MIMAR, we fixed the recombination scalar for the locus to this value (see MIMAR documentation available at http://pps-spud.uchicago.edu/~cbecquet/MIMARdoc.pdf). We ran MIMAR for 5×10^6 burn-in steps and 9.5×10^7 recorded steps, with the variances for the kernel distributions set to one fiftieth of the ranges of the prior distributions. The prior distributions were uniform and their widths correspond to the x-axes of Supplemental Figures 5.3−5.17 (Appendix A).
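The rescaling from the scaled parameters to the reported quantities follows directly from the definitions θ1 = 4N1µ and M = 4N1m. The short Python sketch below shows the arithmetic; the function name `rescale_estimates` and the input values are hypothetical and chosen only for illustration, and θ1 is assumed to be expressed per base pair so that it matches the per-base-pair µ.

```python
def rescale_estimates(theta1: float,
                      split_generations: float,
                      migrants: float,
                      mu: float = 2e-8,
                      years_per_generation: float = 25.0):
    """Convert scaled estimates to (N1, split time in years, m).

    theta1            = 4 * N1 * mu   (per base pair, by assumption)
    split_generations = T, the split time in generations
    migrants          = M = 4 * N1 * m
    """
    n1 = theta1 / (4.0 * mu)                      # effective population size
    split_years = split_generations * years_per_generation
    m = migrants / (4.0 * n1)                     # per-generation migrant fraction
    return n1, split_years, m


# Hypothetical inputs, not estimates from this study.
n1, split_years, m = rescale_estimates(theta1=8e-4,
                                       split_generations=2000,
                                       migrants=8.0)
```

With these illustrative inputs, N1 is about 10,000, the split time is 50,000 years, and m is about 2×10^−4.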

5.3.4 Goodness-of-fit test

To examine whether the isolation-migration model is an appropriate description of the history of the human populations, we simulated 10,000 data sets for parameters sampled from the posterior distributions estimated by MIMAR, and compared the simulated data to the actual data (for details, see Becquet and Przeworski, 2007).
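The comparison between simulated and observed data can be thought of as a posterior predictive check. The Python sketch below shows one generic way to summarize such a comparison for a single statistic; it is an assumption made for illustration and not necessarily the exact test of Becquet and Przeworski (2007). The function name `tail_probability` is hypothetical, and the simulated values are taken as given rather than produced by a coalescent simulator.

```python
from typing import Sequence


def tail_probability(observed: float, simulated: Sequence[float]) -> float:
    """Two-sided tail probability of an observed statistic.

    `simulated` holds the same statistic computed on data sets simulated
    with parameters drawn from the posterior distribution. A small return
    value indicates that the estimated model rarely reproduces the
    observed statistic, i.e., a poor fit for that statistic.
    """
    n = len(simulated)
    if n == 0:
        raise ValueError("need at least one simulated value")
    lower = sum(1 for s in simulated if s <= observed) / n
    upper = sum(1 for s in simulated if s >= observed) / n
    return min(1.0, 2.0 * min(lower, upper))


# Toy values standing in for, e.g., FST in simulated data sets.
sims = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
central = tail_probability(5.0, sims)     # observed value well inside the range
extreme = tail_probability(100.0, sims)   # observed value outside the range
```

An observed value outside the range of the simulated values, as in the second call, gives a tail probability of zero, which would flag the model as a poor fit for that statistic.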

Encouragingly, in most cases, the isolation-migration model appears to provide a reasonable fit to the sum over the 26 loci of the four statistics used by MIMAR, as well as to the mean over the 26 loci of FST, π (Nei and Li, 1979) and Tajima’s D (Tajima, 1989) calculated for both populations in a data set (results not shown). However, there are some exceptions: For the analyses of Nan Melanesian and Mandenka (Supplemental Figure 5.18a), French Basque and Nan Melanesian (Supplemental Figure 5.18c) and Han Chinese and Nan Melanesian (Supplemental Figure 5.18d), the estimated model leads to lower FST than in the data. These results suggest that in some of these cases (e.g., for the non-African populations), an isolation-by-distance model may describe the data better. Alternatively, since these three cases involve the Nan Melanesians, these results may reflect the small amount of data available for this population: Only 18 chromosomes were used in the analyses (as opposed to 28 or more in the other samples, Tables 5.4−5.8), resulting in few Melanesian-specific polymorphisms (since rare alleles are less likely to be observed). This tends to inflate the ratio of unique to shared polymorphisms and may bias parameter estimates towards models leading to low FST values. For the analyses of French Basque and Han Chinese (Supplemental Figure 5.18b) and French Basque and Nan Melanesian (Supplemental Figure 5.18c), the estimated model tends to provide lower Tajima’s D values than observed for the French Basque. This could be due to the demographic history of the French Basque, if this population experienced strong bottlenecks in the recent past (e.g., Alonso et al., 1998).

5.4 Results

The smoothed marginal posterior distributions estimated by MIMAR for pairs of populations and the estimates of the joint posterior distributions for combinations of gene flow rate, split time in years and ancestral population size are shown in Supplemental Figures 5.3−5.17. The mode and the central 95% interval of the marginal posterior distributions for the parameters of the isolation-migration model are reported in Tables 5.1−5.3. The ranges of joint parameters with the highest probability density are reported in Tables 5.4−5.8. For some pairs of populations, the marginal posterior distributions were flat, suggesting that the data were not sufficiently informative (e.g., Supplemental Figure 5.3a). The estimates of the parameters from unconvincing posterior distributions are reported in italics in the tables and should be taken with caution.

5.4.1 Split between African populations

The Tsumkwe San are estimated to have split from the common ancestor of the two other African populations ∼70−80 thousand years ago (Kya). In contrast, the Biaka Pygmies and Mandenka are estimated to have split more recently, ∼40 Kya (see Table 5.3). We find extensive evidence of gene flow between the three pairs of African populations, of the order of 10−20 migrants each generation. However, the estimates of this parameter may not be reliable due to the wide ranges of values that fit the data (see Supplemental Figures 5.3−5.5). The ancestral effective population size estimates for these populations are ∼15,500 and are consistent across the analyses (Table 5.2).

In turn, the estimates of the effective population sizes vary with the analysis. When estimated in analyses with other African populations, the estimates of the effective population sizes were >13,000 for the Biaka Pygmies, ∼10,000 for Mandenka and ∼20,000 for Tsumkwe San (Table 5.1). In contrast, when estimated in analyses with non-African populations, these estimates seem less precise and become >13,000, >14,000 and >11,000, respectively (see Table 5.1 and Supplemental Figures 5.6−5.14). MIMAR may be less reliable for pairs of African and non-African populations because it ignores the very different demographic histories of these two groups (e.g., growth or constant effective population size in Africa vs. bottleneck and recovery after the out-of-Africa migration; Schaffner et al., 2005). In any case, these estimates between 10,000 and 20,000 are consistent with previous reports (e.g., Maynard Smith and Haigh, 1974; Schaffner et al., 2005).

                           Biaka Pygmies   Mandenka    Tsumkwe San   French Basque   Han Chinese   Nan Melanesian
Biaka Pygmies    mode      32,400          49,100^a    15,700        35,400          25,000        36,700
                 lower     13,500          13,400^b    11,200        12,800          14,300        15,900
                 upper     47,300          48,400^b    43,600        48,200          48,000        48,300
Mandenka         mode      10,100          20,000      10,200        17,600          30,500        31,500
                 lower     7,920           12,000      7,410         13,700          14,500        16,300
                 upper     36,600          40,700      22,000        48,200          48,200        48,400
Tsumkwe San      mode      16,300          22,500      19,500        20,000          20,100        18,800
                 lower     11,800          12,600      12,000        11,400          12,100        12,300
                 upper     47,500          47,600      47,400        47,700          47,200        47,200
French Basque    mode      2,730           3,720       2,980         3,570           3,720         4,710
                 lower     1,860           2,680       2,060         2,520           2,590         3,390
                 upper     4,440           6,100       4,770         19,900          39,300        45,100
Han Chinese      mode      4,710           5,700       4,220         7,680           12,700        41,000
                 lower     3,360           4,280       3,130         5,810           4,620         6,490
                 upper     7,170           9,660       6,530         47,200          23,700        48,000
Nan Melanesian   mode      3,910           4,220       3,290         6,200           4,810         4,480
                 lower     2,410           2,990       2,220         5,380           3,930         3,380
                 upper     6,100           7,720       5,450         47,400          47,000        22,700

Table 5.1: Estimates of the descendant effective population sizes. The modes (a) and 95% confidence intervals (b) of the estimated marginal posterior distributions are indicated. The effective population sizes estimated for the populations in rows are shown for the analyses with each of the five other populations in columns. The numbers in italics show estimates that are unreliable because, e.g., of flat posterior distributions (see Supplemental Figures 5.3−5.17). The diagonal in bold shows the means over five analyses of the effective population sizes of the six human populations.

                           Biaka Pygmies   Mandenka   Tsumkwe San   French Basque   Han Chinese
Mandenka         mode      15,700
                 lower     12,100
                 upper     19,400
Tsumkwe San      mode      15,700          15,600
                 lower     11,100          8,240
                 upper     21,000          20,800
French Basque    mode      14,600          10,700     17,100
                 lower     10,100          5,710      11,100
                 upper     23,000          17,400     23,200
Han Chinese      mode      14,100          12,600     15,600        10,200
                 lower     7,870           4,820      10,500        6,490
                 upper     19,500          16,000     21,700        14,600
Nan Melanesian   mode      13,800          10,700     16,300        9,670           10,200
                 lower     7,800           4,800      10,400        6,360           8,320
                 upper     19,200          15,600     22,300        12,200          12,400

Table 5.2: Estimates of the ancestral effective population sizes. See legend of Table 5.1 for details.

                           Biaka Pygmies   Mandenka     Tsumkwe San   French Basque   Han Chinese   Nan Melanesian
Biaka Pygmies    mode      −               1.91×10^−4   8.14×10^−5    2.30×10^−4      1.64×10^−4    1.62×10^−4
                 lower                     1.21×10^−7   1.68×10^−7    7.25×10^−6      1.26×10^−6    3.10×10^−6
                 upper                     2.21×10^−4   2.18×10^−4    3.19×10^−4      2.49×10^−4    2.68×10^−4
Mandenka         mode      40,700          −            2.13×10^−4    2.95×10^−4      2.43×10^−4    2.69×10^−4
                 lower     30,300                       4.35×10^−7    4.78×10^−6      1.01×10^−6    4.46×10^−6
                 upper     387,000                      3.84×10^−4    4.05×10^−4      3.79×10^−4    4.64×10^−4
Tsumkwe San      mode      78,200          67,600       −             1.39×10^−4      7.12×10^−5    7.48×10^−5
                 lower     57,300          52,900                     2.08×10^−6      1.20×10^−6    1.82×10^−6
                 upper     392,000         469,000                    1.90×10^−4      1.78×10^−4    1.72×10^−4
French Basque    mode      42,600          47,600       68,000        −               9.91×10^−4    7.20×10^−4
                 lower     39,800          46,400       51,200                        3.31×10^−7    1.86×10^−7
                 upper     482,000         480,000      476,000                       1.40×10^−3    1.06×10^−3
Han Chinese      mode      67,600          62,600       72,600        12,600          −             1.03×10^−4
                 lower     54,100          51,400       59,600        8,350                         1.22×10^−7
                 upper     466,000         460,000      469,000       454,000                       4.19×10^−4
Nan Melanesian   mode      65,700          57,600       78,200        17,600          10,600        −
                 lower     53,500          49,800       57,200        8,470           2,960
                 upper     475,000         478,000      473,000       429,000         203,000

Table 5.3: Estimates of the split time (lower half) and gene flow rate (upper half) for each population pair. See legend of Table 5.1 for details.

Analysis                       n1^a    n2      Joint    m^b                        T*^c              NA^d
Biaka Pygmies × Mandenka       28−30   30      m×T*     2.00×10^−7 − 3.20×10^−7    25,000−50,000
                                               m×NA     1.30×10^−4 − 2.00×10^−4                      13,000−15,000
                                               T×NA                                25,000−50,000     15,000−18,000
Biaka Pygmies × Tsumkwe San    28−30   18−28   m×T*     3.80×10^−5 − 6.00×10^−5    75,000−100,000
                                               m×NA     9.70×10^−5 − 1.60×10^−4                      13,000−15,000
                                               T×NA                                75,000−100,000    15,000−18,000
Mandenka × Tsumkwe San         30      18−28   m×T*     6.20×10^−7 − 1.00×10^−6    50,000−75,000
                                               m×NA     2.00×10^−4 − 3.30×10^−4                      10,000−13,000
                                               T×NA                                50,000−75,000     15,000−18,000

Table 5.4: Estimates from joint posterior distributions for the African populations. The ranges of the estimates of m, T* and NA with the highest probability density in the joint posterior distributions estimated by MIMAR are reported. The numbers in italics should be taken with caution (see text).
a: n1 and n2 are the number of chromosomes in the first and second population of the analysis, respectively (the sample size varies because of missing data, see Methods for details).
b: m corresponds to the fraction of a population that is replaced by migrants from the other population each generation.
c: T* is the estimate of the time since the populations split, in years.
d: NA is the estimate of the ancestral effective population size.

5.4.2 Split between African and non-African populations

The estimates of the split times of Han Chinese and Nan Melanesian with the Biaka Pygmies and Mandenka range from 60 to 70 Kya; in turn, the split times with the Tsumkwe San range from 70 to 80 Kya (Supplemental Figures 5.6−5.14). In contrast, the split times estimated between the French Basque and the Biaka Pygmies and Mandenka range from 40 to 50 Kya; with the Tsumkwe San, they are ∼70 Kya (Table 5.3). These results likely reflect different migratory routes leading to the Asian and European populations and are consistent with the results of the previous section, namely the finding that the Tsumkwe San split first from the rest of Africa ∼70 Kya.

The estimates of the ancestral effective population size of African and non-African populations are generally consistent and range from 8,000 to 18,000 (Table 5.2). We also note that there is some evidence of gene flow between all pairs of populations, of the order of 2−10 migrants per generation between Biaka Pygmies and Mandenka and the non-African populations and ∼1 migrant per generation between Tsumkwe San and non-African populations (Table 5.3). As mentioned above, this parameter is difficult to estimate and it is not clear whether these estimates reflect ongoing gene flow, range expansion, or more recent migratory events.

Analysis                          n1    n2      Joint    m                          T*                 NA
French Basque × Biaka Pygmies     32    28−30   m×T*     2.40×10^−4 − 3.50×10^−4    480,000−500,000
                                                m×NA     1.60×10^−4 − 2.40×10^−4                       13,000−15,000
                                                T×NA                                25,000−50,000      15,000−18,000
Han Chinese × Biaka Pygmies       32    28−30   m×T*     4.00×10^−5 − 6.00×10^−5    75,000−100,000
                                                m×NA     1.30×10^−4 − 2.00×10^−4                       10,000−13,000
                                                T×NA                                50,000−75,000      15,000−18,000
Nan Melanesian × Biaka Pygmies    18    28−30   m×T*     2.30×10^−4 − 3.50×10^−4    450,000−480,000
                                                m×NA     1.50×10^−4 − 2.30×10^−4                       10,000−13,000
                                                T×NA                                50,000−75,000      15,000−18,000

Table 5.5: Estimates from joint posterior distributions between the Biaka Pygmies and non-African populations. See legend of Table 5.4 for details.

Analysis                     n1    n2    Joint    m                          T*                 NA
French Basque × Mandenka     32    30    m×T*     3.90×10^−4 − 5.90×10^−4    450,000−480,000
                                         m×NA     2.60×10^−4 − 3.90×10^−4                       7,900−10,000
                                         T×NA                                50,000−75,000      13,000−15,000
Han Chinese × Mandenka       32    30    m×T*     6.50×10^−5 − 1.00×10^−4    75,000−100,000
                                         m×NA     1.60×10^−4 − 2.50×10^−4                       7,900−10,000
                                         T×NA                                50,000−75,000      13,000−15,000
Nan Melanesian × Mandenka    18    30    m×T*     3.30×10^−4 − 5.20×10^−4    480,000−500,000
                                         m×NA     2.20×10^−4 − 3.30×10^−4                       7,900−10,000
                                         T×NA                                50,000−75,000      13,000−15,000

Table 5.6: Estimates from joint posterior distributions between the Mandenka and non-African populations. See legend of Table 5.4 for details.

Analysis                        n1    n2      Joint    m                          T*                 NA
French Basque × Tsumkwe San     32    18−28   m×T*     1.70×10^−4 − 2.50×10^−4    480,000−500,000
                                              m×NA     1.20×10^−4 − 1.70×10^−4                       13,000−15,000
                                              T×NA                                50,000−75,000      15,000−18,000
Han Chinese × Tsumkwe San       32    18−28   m×T*     3.70×10^−5 − 5.40×10^−5    100,000−130,000
                                              m×NA     8.00×10^−5 − 1.20×10^−4                       13,000−15,000
                                              T×NA                                75,000−100,000     15,000−18,000
Nan Melanesian × Tsumkwe San    18    18−28   m×T*     1.10×10^−4 − 1.60×10^−4    450,000−480,000
                                              m×NA     7.50×10^−5 − 1.10×10^−4                       13,000−15,000
                                              T×NA                                75,000−100,000     15,000−18,000

Table 5.7: Estimates from joint posterior distributions between the Tsumkwe San and non-African populations. See legend of Table 5.4 for details.

5.4.3 Split between non-African populations

French Basque are estimated to have split from the ancestral population of Han Chinese and Nan Melanesian 13−18 Kya, and Han Chinese and Nan Melanesian to have split ∼10 Kya (Table 5.3). The estimates of the ancestral population sizes are ∼10,000 for the three non-African population pairs (Table 5.2). However, these estimates may be misleading because, although there is evidence of gene flow in the three analyses of the order of 5−15 migrants per generation, these estimates are highly unreliable and T, NA and m tend to be correlated (Supplemental Figures 5.15−5.17). Moreover, in those three cases, the estimated models do not fit the data well (see Methods and Supplemental Figures 5.18b−d).

The estimates of the effective population size for Han Chinese range from 4,200 to 7,700. These estimates are generally consistent across analyses, with the exception of the analysis with Nan Melanesian, for which this parameter is not well estimated (Supplemental Figure 5.17a), perhaps because the different bottlenecks that are known to have affected these populations are not included in the model or because a model of isolation-by-distance is more appropriate in these cases (e.g., see Methods and Supplemental Figure 5.18). The estimates of the effective population size for French Basque and Nan Melanesian range from 2,700 to 4,700 and from 2,000 to 6,200, respectively, but these estimates are usually smaller in the analyses with African populations than with the non-African populations (Table 5.1), which may again reflect the demographic histories of these populations. We note that these estimates are generally consistent with those found in other studies (e.g., Schaffner et al., 2005).

Analysis                          n1    n2    Joint    m                          T*           NA
French Basque × Han Chinese       32    32    m×T*     4.50×10^−5 − 7.80×10^−5    50−25,000
                                              m×NA     7.50×10^−4 − 1.30×10^−3                 7,900−10,000
                                              T×NA                                50−25,000    7,900−10,000
French Basque × Nan Melanesian    32    18    m×T*     4.90×10^−5 − 8.60×10^−5    50−25,000
                                              m×NA     4.90×10^−5 − 8.60×10^−5                 7,900−10,000
                                              T×NA                                50−25,000    7,900−10,000
Han Chinese × Nan Melanesian      32    18    m×T*     4.00×10^−5 − 7.00×10^−5    50−25,000
                                              m×NA     7.00×10^−5 − 1.20×10^−4                 7,900−10,000
                                              T×NA                                50−25,000    7,900−10,000

Table 5.8: Estimates from joint posterior distributions for the non-African populations. See legend of Table 5.4 for details.

5.5 Conclusions

Figure 5.2 summarizes the broad-scale history of the six human populations estimated by MIMAR. This model is consistent with previous observations. Specifically, it was thought that the Tsumkwe San are an ancient population (e.g., Passarino et al., 1998) and MIMAR estimates that they split ∼70 Kya from other African groups. MIMAR also estimates the split between African and Asian populations at 60−70 Kya and between African and European populations at ∼40 Kya, which fits well with previous estimates for the waves of migration out of Africa of anatomically modern humans (e.g., Cavalli-Sforza and Feldman, 2003). While the split times and other demographic parameters between the non-African populations are not well estimated by MIMAR, these estimates are roughly consistent with previous reports (Voight et al., 2005; Keinan et al., 2007). This said, the estimated split times are likely underestimates, as we recently showed that MIMAR tends to underestimate this parameter in the presence of migration (Becquet and Przeworski, 2008). In addition, the difficulty in estimating the parameters of the isolation-migration model for the non-African population pairs likely reflects the violation of numerous assumptions of the model, e.g., panmixia and constant effective population sizes. Alternatively, an isolation-migration model may be inappropriate in this case: the estimated models between the non-African populations lead to lower FST than in the data, suggesting that a different model, perhaps an isolation-by-distance model, may describe the data better (see Methods and Supplemental Figure 5.18).

Interestingly, MIMAR found evidence of gene flow between all pairs of populations, regardless of how differentiated or geographically distant they were. This is notable because commonly used models of human demographic history usually do not include gene flow between populations after the split (e.g., Schaffner et al., 2005). The results of this study suggest that it may be problematic to ignore migration when dealing with data from human populations.

[Figure 5.2 labels: lineages for Tsumkwe San, Biaka Pygmies, Mandenka and the non-African populations; time axis from Past to Present; split between Tsumkwe San and the other African populations 70−80 Kya; split between African and non-African populations 40−70 Kya.]

Figure 5.2: Cartoon summarizing pairwise models estimated by MIMAR for six human populations. The main split events are indicated. The colored lineages represent the history of a population considered in this study. The three non-African populations are represented by a single lineage as their split times were poorly estimated (see Results). The grey arrows represent gene flow between the populations as they diverge.

The effective population size estimates provided by MIMAR, when well estimated,
are also consistent with previous estimates for human populations (e.g., Maynard Smith
and Haigh, 1974; Schaffner et al., 2005): ∼10,000 for the African populations and for the
ancestral populations of the non-African populations, ∼5,000 or less for all the
non-African populations, and ∼15,000−20,000 for the ancestors of the African populations.
We noted that in many analyses, MIMAR did not provide reliable estimates of the
effective population sizes in the descendant populations (Table 5.1). Human populations
certainly do not mate randomly, and their effective population sizes have changed
often, as populations experienced bottlenecks during range expansions, e.g., during
the out-of-Africa migration of anatomically modern humans, and recent population
growth, at least since the emergence of agriculture. Since the data considered here
undoubtedly violate the assumptions of panmixia and constant effective population
sizes of the isolation-migration model, the estimates of effective population sizes need
to be considered with caution. The results may also be affected by the choice of data
processing we used here. For instance, if there were systematic errors in the inferred
data, this could have introduced a bias. However, the alternative of removing the
missing data would bias the ratios Si/L, where L is the number of base pairs at a
locus, i ∈ {1, 2, 3, 4} and Si is the number of polymorphisms of type i (see Methods), thus
leading to a different bias in the estimates of the parameters of interest.
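As a rough illustration of the ratios Si/L discussed above, the following Python sketch counts the four types of polymorphisms for one locus from two small samples of aligned haplotypes and divides each count by the locus length L. The site classification (polymorphic only in the first sample, only in the second, in both, or fixed between the samples) is an assumption based on the description of S1 to S4 in the legend of Supplemental Figure 5.18. The function name `count_s_statistics`, the 0/1 encoding and the toy haplotypes are hypothetical and are not taken from MIMAR.

```python
from typing import List, Tuple


def count_s_statistics(sample1: List[str], sample2: List[str]) -> Tuple[int, int, int, int]:
    """Count (S1, S2, S3, S4) for one locus.

    Each sample is a list of aligned haplotypes of equal length, one character
    per site (e.g. '0' ancestral, '1' derived). Hypothetical helper written for
    illustration; it is not MIMAR's own implementation.
    """
    s1 = s2 = s3 = s4 = 0
    # zip(*sample) iterates over sites (alignment columns) of a sample.
    for column1, column2 in zip(zip(*sample1), zip(*sample2)):
        alleles1, alleles2 = set(column1), set(column2)
        poly1, poly2 = len(alleles1) > 1, len(alleles2) > 1
        if poly1 and poly2:
            s3 += 1  # polymorphism shared by both samples
        elif poly1:
            s1 += 1  # polymorphic only in the first sample
        elif poly2:
            s2 += 1  # polymorphic only in the second sample
        elif alleles1 != alleles2:
            s4 += 1  # fixed difference between the two samples
    return s1, s2, s3, s4


# Toy locus of L = 5 base pairs (made-up data, two haplotypes per sample).
sample_a = ["01000", "11100"]
sample_b = ["00010", "00100"]

counts = count_s_statistics(sample_a, sample_b)
locus_length = len(sample_a[0])
ratios = [s / locus_length for s in counts]
print(counts, ratios)
```

In this toy example each of the four site types occurs once, so every ratio Si/L equals 1/5.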

We note that MIMAR is limited to estimating demographic parameters between

pairs of populations. We therefore presented the results for the pairwise comparisons

between six human populations, although their histories are not independent (e.g.,

Ramachandran et al., 2005; Schaffner et al., 2005). Estimating the demographic

parameters for a divergence model for all six populations together would likely be

more appropriate and provide more reliable estimates. In particular, some of the

parameter estimates provided by MIMAR may be unreliable if there is gene flow between

a pair and other, unsampled populations: We observed for instance that estimates

obtained for a population across pairwise comparisons are quite variable. Therefore,

our interpretation of the history of the six human populations from the pairwise

comparisons needs to be taken with caution (Figure 5.2). It is unclear how the
parameter estimates are affected by unsampled populations, and a simulation study would
undoubtedly help determine how such populations affect the estimates provided by
MIMAR.


Some of the difficulty discussed above in estimating demographic parameters for

the human populations may be resolved by a larger data set. We plan to apply MIMAR

to the X-linked fraction of the data and also to modify the method to incorporate

sex-specific migration rates to allow its application to the full data set of 40 loci

(Wall et al., 2008). If the results are consistent with those presented here, the fact
that some parameters are not well estimated may simply reflect the complexity
of human population histories, i.e., the demographic histories of some of these
populations are not well approximated by the simple isolation-migration model
considered by MIMAR. In contrast, if the results vary dramatically depending on the data
(i.e., 20 X-linked versus 20 autosomal loci), this could point to different histories of
the X chromosomes and autosomes due, for example, to sex-specific migration rates.

Finally, in this study, we fit isolation-migration models to populations that are

not totally geographically isolated from each other, as attested by the substantial

evidence for gene flow detected by MIMAR and the independent knowledge that many

unsampled populations that regularly exchange migrants lie between them (Rosenberg

et al., 2002; Ramachandran et al., 2005). However, since the six human populations that we
investigated here were chosen specifically because they are geographically far
apart from each other (Wall et al., 2008), their histories may be well approximated

by an isolation-migration model (with some exceptions perhaps, see Methods and

Supplemental Figure 5.18). For this reason, we do not explicitly investigate whether

other models, e.g., an isolation-by-distance model, would fit the data better than the

isolation-migration models estimated by MIMAR. However, if pairs of populations were

sampled along the gradient separating the six populations presented in this study,

their demographic histories may not be well approximated by the simple isolation-migration
model and MIMAR may provide unreliable estimates. In this case, the fit of
various models should be investigated since, e.g., isolation-by-distance models may be
better approximations than isolation-migration models (e.g., Ramachandran et al.,
2005).

5.6 Acknowledgments

We thank Jeffrey Wall for the useful discussions and for providing us with the data.

5.7 Appendix A: Supplemental Figures

Estimated isolation-migration models for the 15 pairs of human populations

[Figure 5.3, panels: (a) population size (Biaka Pygmies, Mandenka, ancestral population); (b) migration; (c) time; (d) T versus m; (e) NA versus m; (f) NA versus T.]

Figure 5.3: Estimated model from the Biaka Pygmies and Mandenka population data. a−c) Smoothed marginal posterior distributions estimated by MIMAR, when the first population is used as the reference (Biaka Pygmies in this case). d−f) Joint posterior distributions of the gene flow rate and split time (d), of the gene flow rate and the ancestral effective population size (e) and of the split time and the ancestral effective population size (f).

Figure 5.4: Estimated model from the Biaka Pygmies and Tsumkwe San population data. For details, see legend of Supplemental Figure 5.3.

Figure 5.5: Estimated model from the Mandenka and Tsumkwe San population data. For details, see legend of Supplemental Figure 5.3.

Figure 5.6: Estimated model from the French Basque and Biaka Pygmies population data. For details, see legend of Supplemental Figure 5.3.

Figure 5.7: Estimated model from the Han Chinese and Biaka Pygmies population data. For details, see legend of Supplemental Figure 5.3.

Figure 5.8: Estimated model from the Nan Melanesian and Biaka Pygmies population data. For details, see legend of Supplemental Figure 5.3.

Figure 5.9: Estimated model from the French Basque and Mandenka population data. For details, see legend of Supplemental Figure 5.3.

Figure 5.10: Estimated model from the Han Chinese and Mandenka population data. For details, see legend of Supplemental Figure 5.3.

Figure 5.11: Estimated model from the Nan Melanesian and Mandenka population data. For details, see legend of Supplemental Figure 5.3.

Figure 5.12: Estimated model from the French Basque and Tsumkwe San population data. For details, see legend of Supplemental Figure 5.3.

Figure 5.13: Estimated model from the Han Chinese and Tsumkwe San population data. For details, see legend of Supplemental Figure 5.3.

Figure 5.14: Estimated model from the Nan Melanesian and Tsumkwe San population data. For details, see legend of Supplemental Figure 5.3.

Figure 5.15: Estimated model from the French Basque and Han Chinese population data. For details, see legend of Supplemental Figure 5.3.

Figure 5.16: Estimated model from the French Basque and Nan Melanesian population data. For details, see legend of Supplemental Figure 5.3.

Figure 5.17: Estimated model from the Han Chinese and Nan Melanesian population data. For details, see legend of Supplemental Figure 5.3.

Estimated models with a poor fit to the data

[Figure 5.18(a), panels: S statistics (S1 for Nan Melanesian, S2 for Mandenka, S3 shared, S4 fixed); FST; π for each population; Tajima’s D for each population.]

Figure 5.18: Isolation-migration models estimated by MIMAR with a poor fit to the human population data. We show for a pair of populations, the distributions of the sum over the 26 loci of the four statistics used by MIMAR (the polymorphisms specific to the first and second sample and shared and fixed between the two samples, S1, S2, S3 and S4, respectively), as well as the mean over the 26 loci of FST and π and Tajima’s D calculated for each population (see Methods for details). Results are for a) Nan Melanesian (blue) and Mandenka (red), b) French Basque and Han Chinese, c) French Basque and Nan Melanesian and d) Han Chinese and Nan Melanesian. The vertical lines correspond to the values of the statistics in the human data sets. In this case, the model does not seem to fit the data for the mean FST.

[Figure 5.18(b), panels: S statistics (S1 for French Basque, S2 for Han Chinese, S3 shared, S4 fixed); FST; π for each population; Tajima’s D for each population.]

Figure 5.18 − continued: Results for French Basque (blue) and Han Chinese (red). In this case, the model does not seem to fit the data for the mean Tajima’s D in the French Basque.

[Figure 5.18(c), panels: S statistics (S1 for French Basque, S2 for Nan Melanesian, S3 shared, S4 fixed); FST; π for each population; Tajima’s D for each population.]

Figure 5.18 − continued: Results for French Basque (blue) and Nan Melanesian (red). In this case, the model does not seem to fit the data for the mean FST and the mean Tajima’s D in the French Basque.

[Figure 5.18(d), panels: S statistics (S1 for Han Chinese, S2 for Nan Melanesian, S3 shared, S4 fixed); FST; π for each population; Tajima’s D for each population.]

Figure 5.18 − continued: Results for Han Chinese (blue) and Nan Melanesian (red). In this case, the model does not seem to fit the data for the mean FST.
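Two of the per-population summaries named in the legend of Supplemental Figure 5.18, π and Tajima’s D, can be sketched as follows. This is a minimal Python illustration using the standard textbook formulas for a single locus; the function names and the toy haplotypes are hypothetical, and the exact estimators and averaging over loci described in the Methods may differ.

```python
from itertools import combinations
from math import sqrt
from typing import List


def pairwise_diversity(haplotypes: List[str]) -> float:
    """Mean number of pairwise differences (pi) among aligned haplotypes."""
    pairs = list(combinations(haplotypes, 2))
    total = sum(sum(a != b for a, b in zip(h1, h2)) for h1, h2 in pairs)
    return total / len(pairs)


def segregating_sites(haplotypes: List[str]) -> int:
    """Number of sites that are polymorphic in the sample."""
    return sum(len(set(column)) > 1 for column in zip(*haplotypes))


def tajimas_d(haplotypes: List[str]) -> float:
    """Tajima's D for one locus; NaN when there is no segregating site."""
    n = len(haplotypes)
    s = segregating_sites(haplotypes)
    if s == 0:
        return float("nan")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    # Difference between pi and Watterson's estimate, scaled by its standard deviation.
    return (pairwise_diversity(haplotypes) - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


# Toy sample (made-up data): five haplotypes, four sites, all variants singletons.
toy_sample = ["0000", "1000", "0100", "0010", "0001"]
pi_value = pairwise_diversity(toy_sample)
d_value = tajimas_d(toy_sample)
print(pi_value, d_value)
```

With only singleton variants, as in this toy sample, D is negative, which is the kind of departure that the vertical lines in the D panels are compared against.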

CHAPTER 6

CONCLUSIONS


This collection of projects describes computational approaches and their uses in
learning about speciation and, more generally, about the demographic history of
populations. Despite the ambivalent conclusions of Chapter 4 about the reliability of such
methods in learning about modes of speciation, I remain confident that population
genetic approaches can help us learn about the processes of speciation, provided that
one is appropriately cautious when defining the underlying models and interpreting
the results. In this regard, the new generations of polymorphism data, such as full
genome short-read resequencing, will likely help resolve some of the problems met
with the available polymorphism data since they carry orders of magnitude more
information on genetic variation. Of course, the size of new data sets will give rise
to new and challenging issues, and I look forward to investigating those and
developing computational approaches to help extract useful information and answer
unresolved questions in evolutionary biology.

Here, I would like to point out that the question of whether speciation can be
initiated in the presence of gene flow is directly relevant to our history, as attested by
the highly debated hypothesis that early hominids and chimpanzees hybridized after
they split (Patterson et al., 2006b; Barton, 2006; Wakeley, 2008). I focused part of my
PhD projects on great apes with this ultimate goal in mind: since the great apes are
our closest living evolutionary relatives, they are the best models to help us
understand the evolution of human biology and culture. Unraveling the mechanisms of
great ape species and subspecies formation is thus a crucial step in understanding
how anatomically modern humans evolved.


Finally, I would like to draw the readers’ attention to the fact that all great ape
species and subspecies are highly endangered (e.g.,
http://www.greatapetrust.org/save/statistics.php and citations therein). The situation
is disastrous: we know so little, and so much remains to be discovered about non-human
great apes, that if any one subspecies were to disappear, the human species would
deprive itself of a wealth of knowledge about its origin. I hope that the method and
results presented in this thesis will help the conservation biology community in
preserving the biodiversity of our (still) green planet and specifically the rapidly
disappearing species and subspecies of great apes.

REFERENCES

Aagaard, S. M., Sastad, S. M., Greilhuber, J., and Moen, A. (2005). A secondary hybrid zone between diploid Dactylorhiza incarnata ssp. cruenta and allotetraploid D. lapponica (Orchidaceae). Heredity, 94(5):488–96.

Akey, J. M., Eberle, M. A., Rieder, M. J., Carlson, C. S., Shriver, M. D., Nickerson, D. A., and Kruglyak, L. (2004). Population history and natural selection shape patterns of genetic variation in 132 genes. PLoS Biol., 2(10):e286.

Albrecht, G. H. and Miller, J. M. A. (1993). Geographic variation in primates. A review with implications for interpreting fossils. In Kimbel, W. H. and Martin, L. B., editors, Species, species concepts, and primate evolution, pages 123–161. New York: Plenum Press.

Alonso, S., Fernandez-Fernandez, I., Castro, A., and De Pancorbo, M. M. (1998). Genetic characterization of apob and d17s5 aflp loci in a sample from the Basque Country (northern Spain). Hum Biol, 70(3):491–505.

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990). Basic local alignment search tool. J. Mol. Biol., 215(3):403–410.

Andolfatto, P. and Przeworski, M. (2000). A genome-wide departure from the standard neutral model in natural populations of Drosophila. Genetics, 156(1):257–268.

Andolfatto, P. and Wall, J. D. (2003). Linkage disequilibrium patterns across a recombination gradient in African Drosophila melanogaster. Genetics, 165(3):1289–1305.

Austin, J. D., Lougheed, S. C., and Boag, P. T. (2004). Controlling for the effects of history and nonequilibrium conditions in gene flow estimates in northern bullfrog (Rana catesbeiana) populations. Genetics, 168(3):1491–1506.

Bachtrog, D., Thornton, K. R., Clark, A. G., and Andolfatto, P. (2006). Extensive introgression of mitochondrial DNA relative to nuclear genes in the Drosophila yakuba species group. Evolution Int. J. Org. Evolution, 60(2):292–302.

Barbash, D. A., Siino, D. F., Tarone, A. M., and Roote, J. (2003). A rapidly evolving MYB-related protein causes species isolation in Drosophila. Proc. Natl. Acad. Sci., 100(9):5302–5307.

Barluenga, M., Stolting, K. N., Salzburger, W., Muschick, M., and Meyer, A. (2006). Sympatric speciation in Nicaraguan crater lake cichlid fish. Nature, 439(7077):719–723.


Barreiro, L. B., Laval, G., Quach, H., Patin, E., and Quintana-Murci, L. (2008). Natural selection has driven population differentiation in modern humans. Nat Genet, 40(3):340–345.

Barton, N. H. (2001). The role of hybridization in evolution. Molecular Ecology, 10(3):551–568.

Barton, N. H. (2006). Evolutionary biology: How did the human species form? Curr. Biol., 16(16):R647–R650.

Barton, N. H. and Bengtsson, B. O. (1986). The barrier to genetic exchange between hybridising populations. Heredity, 57(Pt 3):357–376.

Beadle, L. C. (1981). The inland waters of tropical Africa: An introduction to tropical limnology. Longman Group Ltd. London, 2nd edition.

Beaumont, M. A. (2003). Estimation of population growth or decline in genetically monitored populations. Genetics, 164(3):1139–1160.

Beaumont, M. A., Zhang, W., and Balding, D. J. (2002). Approximate Bayesian computation in population genetics. Genetics, 162(4):2025–2035.

Becquet, C., Patterson, N., Stone, A., Przeworski, M., and Reich, D. (2007). Genomic analysis of chimpanzee population structure. PLoS Genet., 3(4):e66.

Becquet, C. and Przeworski, M. (2007). A new approach to estimate parameters of speciation models with application to apes. Genome Res., 17(10):1505–1519.

Becquet, C. and Przeworski, M. (2008). Can we learn about modes of speciation by computational approaches? Evolution Int. J. Org. Evolution. In prep.

Bull, V., Beltran, M., Jiggins, C. D., McMillan, W. O., Bermingham, E., and Mallet, J. (2006). Polyphyly and gene flow between non-sibling Heliconius species. BMC Biol., 4:11.

Cann, H. M., de Toma, C., Cazes, L., Legrand, M.-F., Morel, V., Piouffre, L., Bodmer, J., Bodmer, W. F., Bonne-Tamir, B., Cambon-Thomsen, A., Chen, Z., Chu, J., Carcassi, C., Contu, L., Du, R., Excoffier, L., Friedlaender, J. S., Groot, H., Gurwitz, D., Herrera, R. J., Huang, X., Kidd, J., Kidd, K. K., Langaney, A., Lin, A. A., Mehdi, S. Q., Parham, P., Piazza, A., Pistillo, M. P., Qian, Y., Shu, Q., Xu, J., Zhu, S., Weber, J. L., Greely, H. T., Feldman, M. W., Thomas, G., Dausset, J., and Cavalli-Sforza, L. L. (2002). A human genome diversity cell line panel. Science, 296(5566):261b–262.


Carling, M. D. and Brumfield, R. T. (2008). Integrating phylogenetic and population genetic analyses of multiple loci to test species divergence hypotheses in Passerina buntings. Genetics, 178(1):363–377.

Carothers, A. D., I., R., Kolcic, I., Polasek, O., Hayward, C., Wright, A. F., Campbell, H., Teague, P., Hastie, N., and Weber, J. (2006). Estimating human inbreeding coefficients: Comparison of genealogical and marker heterozygosity approaches. Ann. Hum. Genet., 70:666–676.

Cavalli-Sforza, L., Menozzi, P., and Piazza, A. (1994). The history and geography of human genes. Princeton (N. J.): Princeton University Press.

Cavalli-Sforza, L. L. and Feldman, M. W. (2003). The application of molecular genetic approaches to the study of human evolution. Nat. Genet., 33:266–275.

Clark, A. G. (1997). Neutral behavior of shared polymorphism. Proc. Natl. Acad. Sci., 94(15):7730–7734.

Conrad, D. F., Jakobsson, M., Coop, G., Wen, X., Wall, J. D., Rosenberg, N. A., and Pritchard, J. K. (2006). A worldwide survey of haplotype variation and linkage disequilibrium in the human genome. Nat Genet, 38(11):1251–1260.

Consortium, T. I. H. (2005). A haplotype map of the human genome. Nature, 437(7063):1299–1320.

Coop, G. and Przeworski, M. (2007). An evolutionary view of human recombination. Nat. Rev. Genet., 8(1):23–24.

Coyne, J. A. and Orr, H. A. (1989). Patterns of speciation in Drosophila. Evolution Int. J. Org. Evolution, 43(2):362–381.

Coyne, J. A. and Orr, H. A. (1998). The evolutionary genetics of speciation. Philos. Trans. R. Soc. Lond., B, Biol. Sci., 353:287–305.

Coyne, J. A. and Orr, H. A. (2004a). Allopatric and parapatric speciation. In Speciation, chapter 3, pages 83–123. SINAUER ASSOCIATES, Inc., Sunderland, MA.

Coyne, J. A. and Orr, H. A. (2004b). Ecological isolation. In Speciation, pages 179–210. SINAUER ASSOCIATES, Inc., Sunderland, MA.

Coyne, J. A. and Orr, H. A. (2004c). Genic incompatibilities. In Speciation, pages 267–276. SINAUER ASSOCIATES, Inc., Sunderland, MA.

Coyne, J. A. and Orr, H. A. (2004d). Speciation. SINAUER ASSOCIATES, Inc., Sunderland, MA.


D’Andrade, R. and Morin, P. A. (1996). Chimpanzee and human mitochondrial DNA: A principal components and individual-by-site analysis. Am. Anthropologist, 98(2):352–370.

de Waal, F. B. M. (1997). Bonobo: The Forgotten Ape. University of California Press.

Degnan, J. H. and Rosenberg, N. A. (2006). Discordance of species trees with their most likely gene trees. PLoS Genet., 2(5):e68.

Dieckmann, U. and Doebeli, M. (1999). On the origin of species by sympatric speciation. Nature, 400(6742):354–357.

Dixon, C. J., Schonswetter, P., and Schneeweiss, G. M. (2007). Traces of ancient range shifts in a mountain plant group (Androsace halleri complex, Primulaceae). Mol. Ecol., 16(18):3890–3901.

Dobzhansky, T. (1937). Genetics and the Origin of Species. New York, Columbia Univ. Press.

Dolman, G. and Moritz, C. (2006). A multilocus perspective on refugial isolation and divergence in rainforest skinks (Carlia). Evolution Int. J. Org. Evolution, 60(3):573–82.

Emelianov, I., Marec, F., and Mallet, J. (2004). Genomic evidence for divergence with gene flow in host races of the larch budmoth. Proc. R. Soc. Lond., B, Biol. Sci., 271(1534):97–105.

Excoffier, L., Novembre, J., and Schneider, S. (2000). Computer note. SIMCOAL: A general coalescent program for the simulation of molecular data in interconnected populations with arbitrary demography. J. Hered., 91(6):506–509.

Fischer, A., Pollack, J., Thalmann, O., Nickel, B., and Paabo, S. (2006). Demographic history and genetic differentiation in apes. Curr. Biol., 16(11):1133–1138.

Fischer, A., Wiebe, V., Paabo, S., and Przeworski, M. (2004). Evidence for a complex demographic history of chimpanzees. Mol. Biol. Evol., 21(5):799–808.

Fossella, J., Samant, S. A., Silver, L. M., King, S. M., Vaughan, K. T., Olds-Clarke, P., Johnson, K. A., Mikami, A., Vallee, R. B., and Pilder, S. H. (2000). An axonemal dynein at the hybrid sterility 6 locus: implications for t haplotype-specific male sterility and the evolution of species barriers. Mamm. Genome, 11(1):8–15.


Frazer, K. A., Ballinger, D. G., Cox, D. R., Hinds, D. A., Stuve, L. L., Gibbs, R. A., Belmont, J. W., Boudreau, A., Hardenbol, P., Leal, S. M., Pasternak, S., Wheeler, D. A., Willis, T. D., Yu, F., Yang, H., Zeng, C., Gao, Y., Hu, H., Hu, W., Li, C., Lin, W., Liu, S., Pan, H., Tang, X., Wang, J., Wang, W., Yu, J., Zhang, B., Zhang, Q., Zhao, H., Zhao, H., Zhou, J., Gabriel, S. B., Barry, R., Blumenstiel, B., Camargo, A., Defelice, M., Faggart, M., Goyette, M., Gupta, S., Moore, J., Nguyen, H., Onofrio, R. C., Parkin, M., Roy, J., Stahl, E., Winchester, E., Ziaugra, L., Altshuler, D., Shen, Y., Yao, Z., Huang, W., Chu, X., He, Y., Jin, L., Liu, Y., Shen, Y., Sun, W., Wang, H., Wang, Y., Wang, Y., Xiong, X., Xu, L., Waye, M. M. Y., Tsui, S. K. W., Xue, H., Wong, J. T.-F., Galver, L. M., Fan, J.-B., Gunderson, K., Murray, S. S., Oliphant, A. R., Chee, M. S., Montpetit, A., Chagnon, F., Ferretti, V., Leboeuf, M., Olivier, J.-F., Phillips, M. S., Roumy, S., Sallee, C., Verner, A., Hudson, T. J., Kwok, P.-Y., Cai, D., Koboldt, D. C., Miller, R. D., Pawlikowska, L., Taillon-Miller, P., Xiao, M., Tsui, L.-C., Mak, W., Song, Y. Q., Tam, P. K. H., Nakamura, Y., Kawaguchi, T., Kitamoto, T., Morizono, T., Nagashima, A., Ohnishi, Y., Sekine, A., Tanaka, T., Tsunoda, T., Deloukas, P., Bird, C. P., Delgado, M., Dermitzakis, E. T., Gwilliam, R., Hunt, S., Morrison, J., Powell, D., Stranger, B. E., Whittaker, P., Bentley, D. R., Daly, M. J., de Bakker, P. I. W., Barrett, J., Chretien, Y. R., Maller, J., McCarroll, S., Patterson, N., Pe’er, I., Price, A., Purcell, S., Richter, D. J., Sabeti, P., Saxena, R., Schaffner, S. F., Sham, P. C., Varilly, P., Altshuler, D., Stein, L. D., Krishnan, L., Smith, A. V., Tello-Ruiz, M. K., Thorisson, G. A., Chakravarti, A., Chen, P. E., Cutler, D. J., Kashuk, C. S., Lin, S., Abecasis, G. R., Guan, W., Li, Y., Munro, H. M., Qin, Z. S., Thomas, D. J., McVean, G., Auton, A., Bottolo, L., Cardin, N., Eyheramendy, S., Freeman, C., Marchini, J., Myers, S., Spencer, C., Stephens, M., Donnelly, P., Cardon, L. R., Clarke, G., Evans, D. M., Morris, A. P., Weir, B. S., Tsunoda, T., Mullikin, J. C., Sherry, S. T., Feolo, M., Skol, A., Zhang, H., Zeng, C., Zhao, H., Matsuda, I., Fukushima, Y., Macer, D. R., Suda, E., Rotimi, C. N., Adebamowo, C. A., Ajayi, I., Aniagwu, T., Marshall, P. A., Nkwodimmah, C., Royal, C. D. M., Leppert, M. F., Dixon, M., Peiffer, A., Qiu, R., Kent, A., Kato, K., Niikawa, N., Adewole, I. F., Knoppers, B. M., Foster, M. W., Clayton, E. W., Watkin, J., Gibbs, R. A., Belmont, J. W., Muzny, D., Nazareth, L., Sodergren, E., Weinstock, G. M., Wheeler, D. A., Yakub, I., Gabriel, S. B., Onofrio, R. C., Richter, D. J., Ziaugra, L., Birren, B. W., Daly, M. J., Altshuler, D., Wilson, R. K., Fulton, L. L., Rogers, J., Burton, J., Carter, N. P., Clee, C. M., Griffiths, M., Jones, M. C., McLay, K., Plumb, R. W., Ross, M. T., Sims, S. K., Willey, D. L., Chen, Z., Han, H., Kang, L., Godbout, M., Wallenburg, J. C., L’Archeveque, P., Bellemare, G., Saeki, K., Wang, H., An, D., Fu, H., Li, Q., Wang, Z., Wang, R., Holden, A. L., Brooks, L. D., McEwen, J. E., Guyer, M. S., Wang, V. O., Peterson, J. L., Shi, M., Spiegel, J., Sung, L. M., Zacharia, L. F., Collins, F. S., Kennedy, K., Jamieson, R., and Stewart, J. (2007). A second generation human haplotype map of over 3.1 million SNPs. Nature, 449(7164):851–861.

Frisse, L., Hudson, R. R., Bartoszewicz, A., Wall, J. D., Donfack, J., and Di Rienzo, A. (2001). Gene conversion and different population histories may explain the contrast between polymorphism and linkage disequilibrium levels. Am. J. Hum. Genet., 69(4):831–843.

Fritz, U., Ayaz, D., Buschbom, J., Kami, H. G., Mazanaeva, L. F., Aloufi, A. A., Auer, M., Rifai, L., Silic, T., and Hundsdorfer, A. K. (2007). Go east: phylogeographies of Mauremys caspica and M. rivulata – discordance of morphology, mitochondrial and nuclear genomic markers and rare hybridization. J. Evol. Biol., page e660.

Gage, T. B. (1998). The comparative demography of primates: with some comments on the evolution of life histories. Annu. Rev. Anthropol., 27:197–221.

Gagneux, P. (2002). The genus Pan: population genetics of an endangered outgroup. Trends Genet., 18(7):327–330.

Galtier, N., Depaulis, F., and Barton, N. H. (2000). Detecting bottlenecks and selective sweeps from DNA sequence polymorphism. Genetics, 155(2):981–987.

Geraldes, A., Ferrand, N., and Nachman, M. W. (2006). Contrasting patterns of introgression at X-linked loci across the hybrid zone between subspecies of the European rabbit (Oryctolagus cuniculus). Genetics, 173(2):919–933.

Ghebranious, N., Vaske, D., Yu, A., Zhao, C., Marth, G., and Weber, J. (2003). STRP screening sets for the human genome at 5 cM density. BMC Genomics, 24:6.

Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (1996). Implementation. In Markov Chain Monte Carlo In Practice, chapter 1.4, pages 8–19. Chapman and Hall/CRC, Boca Raton, Florida.

Goebel, T. (2007). Anthropology: The missing years for modern humans. Science, 315(5809):194–196.

Goldstein, D. B. and Pollock, D. D. (1997). Launching microsatellites: A review of mutation processes and methods of phylogenetic inference. J. Hered., 88(5):335–342.

Gonder, M. K., Disotell, T. R., and Oates, J. F. (2006). New genetic evidence on the evolution of chimpanzee populations and implications for taxonomy. Int. J. Primatol., 27:1103–1127.

Gonder, M. K., Oates, J. F., Disotell, T. R., Forstner, M. R. J., Morales, J. C., and Melnick, D. J. (1997). A new West African chimpanzee subspecies? Nature, 388(6640):337–337.

Goodman, S. J., Barton, N. H., Swanson, G., Abernethy, K., and Pemberton, J. M. (1999). Introgression through rare hybridization: A genetic study of a hybrid zone between red and sika deer (Genus Cervus) in Argyll, Scotland. Genetics, 152(1):355–371.

Greene-Till, R., Zhao, Y., and Hardies, S. C. (2000). Gene flow of unique sequences between Mus musculus domesticus and Mus spretus. Mamm. Genome, 11(3):225–230.

Groves, C. P. (1970). Population systematics of the gorilla. J. Zool. Lond., 161:287–300.

Groves, C. P. (1971). Pongo pygmaeus. Mamm. Species, 4(4):1–6.

Groves, C. P. (2001). Primate taxonomy. Washington (D. C.): Smithsonian Institution Press.

Grubb, P., Butynski, T. M., Oates, J. F., Bearder, S. K., Disotell, T. R., Groves, C. P., and Struhsaker, T. T. (2003). Assessment of the diversity of African primates. Int. J. Primatol., 24(6):1301–1357.

Hall, J. P. (2005). Montane speciation patterns in Ithomiola butterflies (Lepidoptera: Riodinidae): are they consistently moving up in the world? Proc. Biol. Sci., 272(1580):2457–2466.

Hastings, W. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97–109.

Hey, J. (1991). The structure of genealogies and the distribution of fixed differences between DNA sequence samples from natural populations. Genetics, 128(4):831–840.

Hey, J. (2005). On the number of New World founders: A population genetic portrait of the peopling of the Americas. PLoS Biol., 3(6):e193.

Hey, J. (2006a). On the failure of modern species concepts. Trends Ecol. Evol., 21(8):447–450.

Hey, J. (2006b). Recent advances in assessing gene flow between diverging populations and species. Curr. Opin. Genet. Dev., 16(6):592–596.

Hey, J. and Nielsen, R. (2004). Multilocus methods for estimating population sizes, migration rates and divergence time, with applications to the divergence of Drosophila pseudoobscura and D. persimilis. Genetics, 167(2):747–760.

Hey, J., Won, Y. J., Sivasundar, A., Nielsen, R., and Markert, J. A. (2004). Using nuclear haplotypes with microsatellites to study gene flow between recently separated cichlid species. Mol. Ecol., 13(4):909–919.

Hill, W. C. O. (1969). The nomenclature, taxonomy and distribution of chimpanzees. In Bourne, G. H., editor, The chimpanzee; a series of volumes on the chimpanzee, volume 1, pages 22–49. Basel, New York: S. Karger A. G.

Hobolth, A., Christensen, O. F., Mailund, T., and Schierup, M. H. (2006). Genomic relationships and speciation times of human, chimpanzee and gorilla inferred from a coalescent Hidden Markov Model. PLoS Genet., preprint(2006):e7.eor.

Hosono, S., Faruqi, A., Dean, F., Du, Y., Sun, Z., Wu, X., Du, J., Kingsmore, S., Egholm, M., and Lasken, R. (2003). Unbiased whole-genome amplification directly from clinical samples. Genome Res., 13:954–964.

Hudson, R. R. (1983). Properties of a neutral allele model with intragenic recombination. Theor. Popul. Biol., 23(2):183–201.

Hudson, R. R. (2001). Two-locus sampling distributions and their application. Genetics, 159(4):1805–1817.

Hudson, R. R. (2002). Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics, 18(2):337–338.

Hudson, R. R. and Coyne, J. A. (2002). Mathematical consequences of the genealogical species concept. Evolution Int. J. Org. Evolution, 56(8):1557–1565.

Hudson, R. R. and Kaplan, N. L. (1985). Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics, 111(1):147–164.

Hudson, R. R., Slatkin, M., and Maddison, W. P. (1992). Estimation of levels of gene flow from DNA sequence data. Genetics, 132(2):583–589.

Hughes, P. D., Woodward, J. C., and Gibbard, P. L. (2006). Quaternary glacial history of the Mediterranean mountains. Progr. Phys. Geogr., 30(3):334–364.

Innan, H. and Watanabe, H. (2006). The effect of gene flow on the coalescent time in the human-chimpanzee ancestral population. Mol. Biol. Evol., 23(5):1040–1047.

Kaessmann, H., Wiebe, V., and Paabo, S. (1999). Extensive nuclear DNA sequence diversity among chimpanzees. Science, 286(5442):1159–1162.

Kaessmann, H., Wiebe, V., Weiss, G., and Paabo, S. (2001). Great ape DNA sequences reveal a reduced diversity and an expansion in humans. Nat. Genet., 27(2):155–156.

Kallman, K. D. and Kazianis, S. (2006). The genus Xiphophorus in Mexico and Central America. Zebrafish, 3(3):271–285.

Keinan, A., Mullikin, J. C., Patterson, N., and Reich, D. (2007). Measurement of the human allele frequency spectrum demonstrates greater genetic drift in East Asians than in Europeans. Nat Genet, 39(10):1251–1255.

Kimura, M. (1969). The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics, 61(4):893–903.

Kittles, R. A. and Weiss, K. M. (2003). Race, ancestry and genes: Implications for defining disease risk. Annu. Rev. Genom. Hum. Genet., 4:33–67.

Kliman, R. M., Andolfatto, P., Coyne, J. A., Depaulis, F., Kreitman, M., Berry, A. J., McCarter, J., Wakeley, J., and Hey, J. (2000). The population genetics of the origin and divergence of the Drosophila simulans complex species. Genetics, 156(4):1913–1931.

Knaden, M., Tinaut, A., Cerda, X., Wehner, S., and Wehner, R. (2005). Phylogeny of three parapatric species of desert ants, Cataglyphis bicolor, C. viatica, and C. savignyi: a comparison of mitochondrial DNA, nuclear DNA, and morphological data. Zoology (Jena), 108(2):169–177.

Kong, A., Gudbjartsson, D. F., Sainz, J., Jonsdottir, G. M., Gudjonsson, S. A., Richardsson, B., Sigurdardottir, S., Barnard, J., Hallbeck, B., Masson, G., Shlien, A., Palsson, S. T., Frigge, M. L., Thorgeirsson, T. E., Gulcher, J. R., and Stefansson, K. (2002). A high-resolution recombination map of the human genome. Nat. Genet., 31(3):241–247.

Kronforst, M. R., Young, L. G., Blume, L. M., and Gilbert, L. E. (2006). Multilocus analyses of admixture and introgression among hybridizing Heliconius butterflies. Evolution Int. J. Org. Evolution, 60(6):1254–68.

Kumar, S., Tamura, K., Jakobsen, I. B., and Nei, M. (2001). MEGA2: molecular evolutionary genetics analysis software. Bioinformatics, 17(12):1244–1245.

Leman, S. C., Chen, Y., Stajich, J. E., Noor, M. A. F., and Uyenoyama, M. K. (2005). Likelihoods from summary statistics: Recent divergence between species. Genetics, 171(3):1419–1436.

Llopart, A., Elwyn, S., Lachaise, D., and Coyne, J. A. (2002). Genetics of a difference in pigmentation between Drosophila yakuba and Drosophila santomea. Evolution Int. J. Org. Evolution, 56(11):2262–2277.

Llopart, A., Lachaise, D., and Coyne, J. A. (2005). Multilocus analysis of introgression between two sympatric sister species of Drosophila: Drosophila yakuba and D. santomea. Genetics, 171(1):197–210.

Lockwood, J. R., Roeder, K., and Devlin, B. (2001). A Bayesian hierarchical model for allele frequencies. Genet. Epidemiol., 20:13–33.

Machado, C. A., Kliman, R. M., Markert, J. A., and Hey, J. (2002). Inferring the history of speciation from multilocus DNA sequence data: The case of Drosophila pseudoobscura and close relatives. Mol. Biol. Evol., 19(4):472–488.

Mallet, J. (2005). Hybridization as an invasion of the genome. Trends Ecol. Evol., 20(5):229–237.

Maynard Smith, J. and Haigh, J. (1974). The hitch-hiking effect of a favourable gene. Genet Res, 23(1):23–35.

Mayr, E. (1963). Animal Species and Evolution. The Belknap press, Cambridge, MA.

Mayr, E. (2001). Wu’s genic view of speciation. J. Evol. Biol., 14(6):866–867.

Mazzoni, C., Araki, A., Ferreira, G., Azevedo, R., Barbujani, G., and Peixoto, A. (2008). Multilocus analysis of introgression between two sand fly vectors of leishmaniasis. BMC Evol Biol, 8(1):141.

McBrearty, S. and Jablonski, N. G. (2005). First fossil chimpanzee. Nature, 437(7055):105–108.

McVean, G. A. T., Myers, S. R., Hunt, S., Deloukas, P., Bentley, D. R., and Donnelly, P. (2004). The fine-scale structure of recombination rate variation in the human genome. Science, 304(5670):581–584.

Meng, X. L. (1994). Posterior predictive p-values. Ann. Stat., 22(3):1142–1160.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). Equation of state calculation by fast computing machines. J. Chem. Phys., 21:1087–1092.

Miller, S. R., Purugganan, M. D., and Curtis, S. E. (2006). Molecular population genetics and phenotypic diversification of two populations of the thermophilic cyanobacterium Mastigocladus laminosus. Appl. Environ. Microbiol., 72(4):2793–2800.

Morin, P. A., Moore, J. J., Chakraborty, R., Jin, L., Goodall, J., and Woodruff, D. S. (1994). Kin selection, social structure, gene flow, and the evolution of chimpanzees. Science, 265(5176):1193–1201.

Muir, C., Galdikas, B., and Andrew, T. (2000). mtDNA sequence diversity of orangutans from the islands of Borneo and Sumatra. J. Mol. Evol., 51(5):471–480.

Muller, H. J. (1940). Bearing of the Drosophila work on systematics. In Huxley, J., editor, The New Systematics, pages 185–268. Clarendon Press, Oxford, UK.

Myers, S., Bottolo, L., Freeman, C., McVean, G., and Donnelly, P. (2005). A Fine-Scale Map of Recombination Rates and Hotspots Across the Human Genome. Science, 310(5746):321–324.

Myers Thompson, J. A. (2003). A model of the biogeographical journey from Proto-pan to Pan paniscus. Primates, 44(2):191–197.

Nachman, M. W. and Crowell, S. L. (2000). Estimate of the mutation rate per nucleotide in humans. Genetics, 156(1):297–304.

Navarro, A. and Barton, N. H. (2003). Accumulating postzygotic isolation genes in parapatry: a new twist on chromosomal speciation. Evolution Int. J. Org. Evolution, 57(3):447–459.

Nei, M. and Li, W. H. (1979). Mathematical model for studying genetic variation in terms of restriction endonucleases. Proc. Natl. Acad. Sci., 76(10):5269–5273.

Nielsen, R. and Signorovitch, J. (2003). Correcting for ascertainment biases when analyzing SNP data: applications to the estimation of linkage disequilibrium. Theor. Popul. Biol., 63(3):245–255.

Nielsen, R. and Wakeley, J. (2001). Distinguishing migration from isolation: A Markov Chain Monte Carlo approach. Genetics, 158(2):885–896.

Niemiller, M. L., Fitzpatrick, B. M., and Miller, B. T. (2008). Recent divergence with gene flow in Tennessee cave salamanders (Plethodontidae: Gyrinophilus) inferred from gene genealogies. Molecular Ecology, 17(9):2258–2275.

Nordborg, M. and Tavare, S. (2002). Linkage disequilibrium: what history has to tell us. Trends Genet., 18(2):83–90.

Nosil, P. (2008). Speciation with gene flow could be common. Molecular Ecology, 17(9):2103–2106.

Orr, H. A. (2001). Some doubts about (yet another) view of species. J. Evol. Biol., 14(6):870–871.

Orr, H. A. (2005). The genetic basis of reproductive isolation: Insights from Drosophila. Proc. Natl. Acad. Sci., 102(suppl 1):6522–6526.

Orr, H. A., Masly, J. P., and Presgraves, D. C. (2004). Speciation genes. Curr. Opin. Genet. Dev., 14(6):675–679.

Passarino, G., Semino, O., Quintana-Murci, L., Excoffier, L., Hammer, M., and Santachiara-Benerecetti, A. S. (1998). Different genetic components in the Ethiopian population, identified by mtDNA and Y-chromosome polymorphisms. Am J Hum Genet, 62(2):420–434.

Patterson, N., Price, A. L., and Reich, D. (2006a). Population structure and eigenanalysis. PLoS Genet., 2:e190.

Patterson, N., Richter, D. J., Gnerre, S., Lander, E. S., and Reich, D. (2006b). Genetic evidence for complex speciation of humans and chimpanzees. Nature, 441(7097):1103–1108.

Petren, K., Grant, P. R., Grant, B. R., and Keller, L. F. (2005). Comparative landscape genetics and the adaptive radiation of Darwin’s finches: the role of peripheral isolation. Mol. Ecol., 14(10):2943–2957.

Pollard, D. A., Iyer, V. N., Moses, A. M., and Eisen, M. B. (2006). Widespread discordance of gene trees with species tree in Drosophila: Evidence for incomplete lineage sorting. PLoS Genet., 2(10):e173.

Presgraves, D. C., Balagopalan, L., Abmayr, S. M., and Orr, H. A. (2003). Adaptive evolution drives divergence of a hybrid inviability gene between two species of Drosophila. Nature, 423(6941):715–719.

Price, T. (2007). Speciation in birds. Roberts and Company Publishers.

Pritchard, J. K., Seielstad, M. T., Perez-Lezaun, A., and Feldman, M. W. (1999). Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Mol. Biol. Evol., 16(12):1791–1798.

Pritchard, J. K., Stephens, M., and Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics, 155(2):945–959.

Przeworski, M. (2003). Estimating the time since the fixation of a beneficial allele. Genetics, 164(4):1667–1676.

Ptak, S. E. and Przeworski, M. (2002). Evidence for population growth in humans is confounded by fine-scale population structure. Trends Genet, 18(11):559–563.

Putnam, A. S., Scriber, M., and Andolfatto, P. (2007). Discordant divergence times among Z chromosome regions between two ecologically distinct swallowtail butterfly species. Evolution Int. J. Org. Evolution, 61(4):912–927.

Ramachandran, S., Deshpande, O., Roseman, C. C., Rosenberg, N. A., Feldman, M. W., and Cavalli-Sforza, L. L. (2005). Support from the relationship of genetic and geographic distance in human populations for a serial founder effect originating in Africa. Proc. Natl. Acad. Sci., 102(44):15942–15947.

Reinartz, G. E., Karron, J. D., Phillips, R. B., and Weber, J. L. (2000). Patterns of microsatellite polymorphism in the range-restricted bonobo (Pan paniscus): Considerations for interspecific comparison with chimpanzees (P. troglodytes). Mol. Ecol., 9:315–328.

Rosenberg, N. A., Li, L. M., Ward, R., and Pritchard, J. K. (2003). Informativeness of genetic markers for inference of ancestry. Am. J. Hum. Genet., 73(6):1402–1422.

Rosenberg, N. A., Mahajan, S., Ramachandran, S., Zhao, C., Pritchard, J. K., and Feldman, M. W. (2005). Clines, clusters, and the effect of study design on the inference of human population structure. PLoS Genet., 1(6):e70.

Rosenberg, N. A., Pritchard, J. K., Weber, J. L., Cann, H. M., Kidd, K. K., Zhivotovsky, L. A., and Feldman, M. W. (2002). Genetic structure of human populations. Science, 298(5602):2381–2385.

Rozas, J., Sanchez-DelBarrio, J. C., Messeguer, X., and Rozas, R. (2003). DnaSP, DNA polymorphism analyses by the coalescent and other methods. Bioinformatics, 19(18):2496–2497.

Savolainen, V., Anstett, M.-C., Lexer, C., Hutton, I., Clarkson, J. J., Norup, M. V., Powell, M. P., Springate, D., Salamin, N., and Baker, W. J. (2006). Sympatric speciation in palms on an oceanic island. Nature, advanced online publication.

Sawamura, K., Watanabe, T., and Yamamoto, M. (1993). Hybrid lethal systems in the Drosophila melanogaster species complex. Genetica, 88(2-3):175–185.

Schaffner, S. F., Foo, C., Gabriel, S., Reich, D., Daly, M. J., and Altshuler, D. (2005). Calibrating a coalescent simulation of human genome sequence variation. Genome Res, 15(11):1576–1583.

Scheet, P. and Stephens, M. (2006). A fast and flexible statistical model for large-scale population genotype data: Applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet., 78(4):629–644.

Schneider, S., Roessli, D., and Excoffier, L. (2000). ARLEQUIN: A software for population genetics data analysis, version 2000 [Computer Program]. Geneva: Department of Anthropology, University of Geneva.

Shea, B. T., Leigh, S. R., and Groves, C. P. (1993). Multivariate craniometric variation in chimpanzees: implications for species identification in paleoanthropology. In Kimbel, W. and Mar, L., editors, Species, species concepts, and primate evolution, pages 265–296. New York: Plenum Press.

Slatkin, M. (1995). A measure of population subdivision based on microsatellite allele frequencies. Genetics, 139(1):457–462.

Smith, R. J. and Pilbeam, D. R. (1980). Evolution of the orang-utan. Nature, 284(5755):447–448.

Stadler, T., Arunyawat, U., and Stephan, W. (2008). Population genetics of speciation in two closely related wild tomatoes (Solanum section Lycopersicon). Genetics, 178(1):339–350.

Stephens, M., Smith, N. J., and Donnelly, P. (2001). A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet., 68(4):978–89.

Stone, A. C., Griffiths, R. C., Zegura, S. L., and Hammer, M. F. (2002). High levels of Y-chromosome nucleotide diversity in the genus Pan. Proc. Natl. Acad. Sci., 99(1):43–48.

Stone, A. C. and Verrelli, B. C. (2006). Focusing on comparative ape population genetics in the post-genomic age. Curr. Opin. Genet. Dev., 16(6):586–591.

Tajima, F. (1989). Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics, 123(3):585–595.

Takahata, N. and Satta, Y. (2002). Pre-speciation coalescence and the effective size of ancestral populations. In Slatkin, M. and Veuille, M., editors, Modern Developments in Theoretical Population Genetics, pages 52–71. Oxford University Press, Oxford.

Tavare, S., Balding, D. J., Griffiths, R. C., and Donnelly, P. (1997). Inferring coalescence times from DNA sequence data. Genetics, 145(2):505–518.

Thalmann, O., Fischer, A., Lankester, F., Paabo, S., and Vigilant, L. (2006). The complex evolutionary history of gorillas: Insights from genomic data. Mol. Biol. Evol., 24(1):146–158.

The Chimpanzee Sequencing and Analysis Consortium (2005). Initial sequence of the chimpanzee genome and comparison with the human genome. Nature, 437(7055):69–87.

Thompson, J. D., Higgins, D. G., and Gibson, T. J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucl. Acids Res., 22(22):4673–4680.

Ting, C.-T., Tsaur, S.-C., Wu, M.-L., and Wu, C.-I. (1998). A rapidly evolving homeobox at the site of a hybrid sterility gene. Science, 282(5393):1501–1504.

Turner, T. L., Hahn, M. W., and Nuzhdin, S. V. (2005). Genomic islands of speciation in Anopheles gambiae. PLoS Biol., 3(9):e285.

Valdes, A. M., Slatkin, M., and Freimer, N. B. (1993). Allele frequencies at microsatellite loci: The stepwise mutation model revisited. Genetics, 133(3):737–749.

Voight, B. F., Adams, A. M., Frisse, L. A., Qian, Y., Hudson, R. R., and Di Rienzo, A. (2005). Interrogating multiple aspects of variation in a full resequencing data set to infer human population size changes. Proc. Natl. Acad. Sci., 102(51):18508–18513.

Voight, B. F., Kudaravalli, S., Wen, X., and Pritchard, J. K. (2006). A map of recent positive selection in the human genome. PLoS Biol, 4(3):e72.

Wakeley, J. (1996). Distinguishing migration from isolation using the variance of pairwise differences. Theor. Popul. Biol., 49(3):369–386.

Wakeley, J. (2008). Complex speciation of humans and chimpanzees. Nature, 452(7184):E3–E4; discussion E4.

Wakeley, J. and Hey, J. (1997). Estimating ancestral population parameters. Genetics, 145(3):847–855.

Wall, J. D. (2000). Detecting ancient admixture in humans using sequence polymorphism data. Genetics, 154(3):1271–1279.

Wall, J. D. (2003). Estimating ancestral population sizes and divergence times. Genetics, 163(1):395–404.

Wall, J. D., Cox, M. P., Mendez, F. L., Woerner, A., Severson, T., and Hammer, M. F. (2008). A novel DNA sequence database for analyzing human demographic history. Genome Research, page gr.075630.107.

Wang, R. L., Stec, A., Hey, J., Lukens, L., and Doebley, J. (1999). The limits of selection during maize domestication. Nature, 398(6724):236–239.

Wittbrodt, J., Adam, D., Malitschek, B., Maueler, W., Raulf, F., Telling, A., Robertson, S. M., and Schartl, M. (1989). Novel putative receptor tyrosine kinase encoded by the melanoma-inducing Tu locus in Xiphophorus. Nature, 341(6241):415–421.

Won, Y.-J. and Hey, J. (2005). Divergence population genetics of chimpanzees. Mol. Biol. Evol., 22(2):297–307.

Won, Y. J., Sivasundar, A., Wang, Y., and Hey, J. (2005). On the origin of Lake Malawi cichlid species: A population genetic analysis of divergence. Proc. Natl. Acad. Sci., 102(suppl 1):6581–6586.

Wu, C.-I. (2001a). Genes and speciation. J. Evol. Biol., 14(6):889–891.

Wu, C.-I. (2001b). The genic view of the process of speciation. J. Evol. Biol., 14(6):851–865.

Wu, C.-I. and Ting, C.-T. (2004). Genes and speciation. Nat. Rev. Genet., 5(2):114–122.

Yu, N., Fu, Y. X., and Li, W. H. (2002). DNA polymorphism in a worldwide sample of human X chromosomes. Mol. Biol. Evol., 19(12):2131–2141.

Yu, N., Jensen-Seaman, M. I., Chemnick, L., Kidd, J. R., Deinard, A. S., Ryder, O., Kidd, K. K., and Li, W. H. (2003). Low nucleotide diversity in chimpanzees and bonobos. Genetics, 164(4):1511–1518.

Yu, N., Jensen-Seaman, M. I., Chemnick, L., Ryder, O., and Li, W. H. (2004). Nucleotide diversity in gorillas. Genetics, 166(3):1375–1383.

Zhang, Y., Ryder, O. A., and Zhang, Y. (2001). Genetic divergence of orangutan subspecies (Pongo pygmaeus). J. Mol. Evol., 52(6):516–26.

Zhi, L., Karesh, W. B., Janczewski, D. N., Frazier-Taylor, H., Sajuthi, D., Gombek, F., Andau, M., Martenson, J. S., and O’Brien, S. J. (1996). Genomic differentiation among natural populations of orang-utan (Pongo pygmaeus). Curr. Biol., 6(10):1326–1336.

Zhivotovsky, L. A., Rosenberg, N. A., and Feldman, M. W. (2003). Features of evolution and expansion of modern humans, inferred from genomewide microsatellite markers. Am. J. Hum. Genet., 72(5):1171–1186.