DESISLAVA PETKOVA
INFERRING EFFECTIVE MIGRATION FROM GEOGRAPHICALLY INDEXED GENETIC DATA
THE UNIVERSITY OF CHICAGO
Contents
1 Population Structure in Genetic Variation
2 Population Structure due to Migration
3 Genetic Dissimilarities and Distance Matrices
4 Estimating Effective Rates of Migration
5 Simulations of Structured Genetic Data
6 Empirical Results
7 Appendices
8 Bibliography
List of Figures
2.1 A genealogy describes the ancestral history of a genotyped sample
2.2 A random walk approximates the migration process in a population graph
2.3 Effective resistances approximate expected coalescence times: relative error
5.1 Population structure under uniform migration
5.2 Population structure due to a barrier to migration
5.3 Uncertainty in the inferred migration surface
5.4 Barrier to migration with ascertainment bias
5.5 Population structure due to differences in deme size
5.6 A past demographic event results in a barrier to effective migration
5.7 Barrier to migration with uneven sampling
6.1 Habitat of the red-backed fairywren with the Carpentarian barrier
6.2 PCA and STRUCTURE analysis of the red-backed fairywren data
6.3 Distance scatterplot for the red-backed fairywren data
6.4 Triangular population graph spans the habitat of the red-backed fairywren
6.5 Inferred effective migration surface for the red-backed fairywren
6.6 Uncertainty in the inferred effective migration of the red-backed fairywren
6.7 Triangular population graph spans the habitat of the African elephant
6.8 PCA analysis of the elephant data
6.9 Inferred effective migration surface for the African elephant
6.10 Effective migration rates at each of sixteen microsatellites
6.11 Inferred effective migration surface for the savanna and forest elephants
6.12 GENELAND analysis of the African elephant data
6.13 STRUCTURE analysis of the African elephant data
6.14 Distance scatterplots for the African elephant data
6.15 Sample configuration and PCA analysis of the European and African data
6.16 Distance scatterplots for the European and African data
6.17 Inferred effective migration for human populations in Europe and Africa
6.18 Sample configuration and PCA analysis of Arabidopsis thaliana data
6.19 Inferred effective migration surfaces for Arabidopsis thaliana
6.20 Distance scatterplots for the Arabidopsis thaliana data
7.1 ms command: uniform migration on a regular triangular grid
7.2 ms command: barrier to migration on a regular triangular grid
7.3 ms command: barrier to effective migration due to differences in population size
7.4 ms command: uniform effective migration despite differences in population size
7.5 ms command: barrier to effective migration due to a split in time
Genetic data often exhibit patterns that are broadly consistent with "isolation by distance", a phenomenon where genetic similarity tends to decay with geographic distance. In a heterogeneous habitat, decay may occur more quickly in some regions than others: for example, barriers to gene flow such as mountains or deserts could accelerate the genetic differentiation between neighboring groups. In this thesis we present a method to quantify and visualize variation in effective migration across the habitat, and, under further assumptions, to infer the presence or absence of barriers to migration, from geographically indexed large-scale genetic data.
First and foremost, I would like to express my deepest gratitude to my supervisor, Professor Matthew Stephens, for his guidance and encouragement throughout the development of this project. His tremendous support made this formidable journey less complicated, if not easier.
I am also grateful to my colleagues at the Departments of Statistics and Human Genetics for their inspiring companionship, for their collective critical eye, but above all, for their advice, assistance and guidance on many difficult problems.
I am indebted to so many but I especially wish to acknowledge the efforts, love and encouragement of my parents. They watched me from a distance while I worked towards my degree. The completion of this thesis would mean a lot to them, so I dedicate this project to my parents, Ema and Ivo.
And finally, I would like to thank my sister and my friends for allowing me to realize my own potential and without whose love, affection and encouragement this thesis and many other pursuits would not have been successful.
1
Population Structure in Genetic Variation
The term "population structure" is used to describe nonrandom patterns of genetic similarity (or alternatively, dissimilarity) between individuals from the same species. One task is to detect such patterns; this is often done in association studies because systematic ancestry differences between cases and controls that are not genetic risk factors for the disease can bias the results of the study [Price et al., 2010]. A more challenging task is to explain population structure as the outcome of events in the evolutionary history of the species such as splits or admixture (events in time) and/or migration (events in space) [Lawson and Falush, 2012]. [Admixture is partial ancestry from two or more distinct subpopulations as the result of interbreeding.]
Two widely used approaches for inferring genetic ancestry are principal components analysis and model-based clustering. In both cases, interpretation of results and inference of demography are founded on the assumption that sample structure is evidence for population structure, to the exclusion of other possible sources such as family structure, cryptic relatedness or sample processing artifacts.
Principal component analysis (PCA) was first used in population genetics to summarize human genetic variation across continents [Menozzi et al., 1978, Cavalli-Sforza et al., 1994]. Their synthetic maps of allele frequency variation show gradients that could support hypotheses for specific migration events such as the spread of Neolithic farming. This interpretation of PCA maps is not universally accepted because PCA can produce similar wave patterns in simulated spatial data, where gradients result from local dispersal and not directed migration [Novembre and Stephens, 2008]. However, even though PCA might not explain what processes generated the structured variation in genetic data, the method has been successfully applied to detect population stratification and infer genetic ancestry. For example, the top principal components of the sample covariance matrix across a large number of (randomly selected) SNPs align well with geographic distribution in some datasets [Novembre et al., 2008, Wang et al., 2012].
Alternatively, population structure can be analyzed with a model-based clustering approach. For example, STRUCTURE [Pritchard et al., 2000] assigns individuals into $K$ genetically homogeneous subpopulations [i.e., randomly mating and hence in Hardy-Weinberg equilibrium], with individual-specific ancestry proportions. As a clustering algorithm, STRUCTURE assumes the number of clusters is known. Even more importantly, it uses a discrete model of population structure that is most applicable where high levels of divergence have resulted in well-differentiated clusters.
Both PCA and STRUCTURE can produce results that are difficult to interpret. For example, STRUCTURE can fail if the population consists of distinct groups characterized by small differences in allele frequencies, or a single population where the distribution of allele frequencies varies continuously across space. In both cases, it is hard for a clustering algorithm to distinguish between clusters, or find the correct number of clusters. The study design can also influence the extent of observed "clusteredness" as many datasets consist of multiple observations from few locations [Serre and Pääbo, 2004]. If the sample configuration does not represent the geographic distribution of the species, much naturally occurring genetic variation remains unobserved. And indeed the genetic differentiation of a widely distributed species such as humans is likely to exhibit evidence for both clusters, which correspond to discontinuous jumps in allele frequencies across large barriers such as oceans or the Himalayas, and clines, which reflect smooth gradations in allele frequencies across unbroken geographic regions [Rosenberg et al., 2005].
In the case of low differentiation between clusters, STRUCTURE results can be improved with a stronger, more informative prior on cluster membership. For example, [Hubisz et al., 2009] introduces a prior that places more weight on cluster assignments that are correlated with sampling locations (because origin is often informative about ancestry). In another modification of STRUCTURE that incorporates geographic information, [Guillot et al., 2005] explicitly models the distribution of clusters across the habitat to encourage spatially continuous clusters (because subspecies often occupy locally connected areas).
In the case of smoothly varying population structure, it is not appropriate to assign individuals to a fixed number of distinct clusters, even if the clustering method allows fractional membership. PCA is effective in presenting continuous variation and PC projections are related to the underlying genealogical process [McVean, 2009]. However, the algorithm is not based on a population genetics model, so it does not estimate relevant demographic parameters, and its results are strongly affected by uneven sampling.
Genetic data often exhibit patterns that are broadly consistent with "isolation by distance" [Weiss and Kimura, 1965, Rousset, 1997] where genetic similarity tends to decay with geographic distance. That is, a population in which the exchange of migrants is constant in both space and time still has structure as individuals that are close together are, on average, more genetically similar than individuals that are far apart [if reproduction and dispersal tend to occur locally/over small distances in every generation].
In a heterogeneous habitat, genetic similarity may decrease faster in some regions than others because a barrier to migration could accelerate the genetic differentiation between neighboring groups, thus creating patterns of population structure that are not consistent with uniform migration. Here we develop a method aimed at investigating this kind of scenario. Specifically, we introduce a parametric model for genetic structure that attempts to explain the spatial structure observed in geographically indexed large-scale genetic data in terms of effective rates of migration. We say "effective" because the model's applicability to genetic data is motivated under a series of assumptions [most importantly, equilibrium in time] that mean estimated rates cannot be interpreted as actual rates of migration unless the assumptions are reasonably satisfied. However, even when estimated population parameters are not directly interpretable in terms of demographic history, our method provides an intuitive and informative way to quantify and visualize spatial patterns of population structure.
Figure 2.1: This genealogy specifies one possible history for a sample of size 6. Since exactly one mutation occurs, denoted by •, we observe a pattern of both 0s and 1s. Regardless of which branch carries the mutation, the events 'only 0s' and 'only 1s' are excluded. [Figure labels: leaf genotypes 1 ($i$), 1 ($j$), 1, 0, 0, 0; branch lengths $t_{ij}$ and $t_{mrca} - t_{ij}$.]
2
Population Structure due to Migration
In this background chapter we explain how population structure is reflected in observed genetic data via the genealogy of the sample and review briefly a mathematical model for spatially structured populations.
In natural populations, mating is not random due to a complex mixture of evolutionary and ecological factors. Non-random mating creates structure in genetic variation as closely related individuals tend to be more similar genetically than distantly related individuals. Thus shared ancestry leads to genetic similarity [Section 2.1].
An important factor for non-random mating is geographic distance as two individuals located close in space are more likely to reproduce than two individuals far apart. Thus geographic proximity leads to genetic similarity, a phenomenon called isolation by distance [Section 2.2].
A population genetics model that exhibits isolation by distance is Kimura's stepping-stone model [Section 2.3]. It represents a spatially distributed population as a graph where vertices are groups of randomly mating individuals (called demes) and edges are direct routes of migration. Thus demes that are closer together in the graph tend to be more similar.
In fact, the stepping-stone model can capture the effect of both geographic distance and heterogeneous habitat on genetic similarity as edges can have different migration rates to reflect heterogeneity in gene flow. This weighted population graph describes precisely what it means for two demes to be "close together" [Section 2.4].
2.1 Pairwise expected coalescence times explain population structure
The more closely related two individuals are, the more genetically similar they are. Therefore, the genetic similarities observed in a sample contain information about the evolutionary processes undergone by the entire population. In this section we explain the connection between genealogical histories and genetic similarities; the review is largely based on [McVean, 2009].
Let $z_1, \dots, z_n$ be the genotypes of $n$ individuals at a single segregating locus. For simplicity, assume the genetic markers are biallelic (e.g., SNPs): each individual carries either the ancestral allele, labeled '0', or the derived allele, labeled '1'.
Although life occurs forward in time and in discrete generations, it is often more convenient to model the ancestry of a sample backwards in time using a continuous-time process called the coalescent [Kingman, 1982b,a] that traces the lineages backwards in time until their convergence into a single common ancestor. Thus the coalescent constructs the history of the sample, at a single locus, in the form of a genealogical tree [Figure 2.1]. The most important demographic functions of the genealogy are
• the time to the most recent common ancestor $t_{mrca}$ [the height of the tree];
• the total size of the tree $t_{tot}$ [the sum of all branches];
• the pairwise time to coalescence $t_{ij}$ for every pair $(i, j)$ [the length of the path from $i$, or equivalently from $j$, to the most recent common ancestor of $i$ and $j$].
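The three quantities in the list above can be read off a tree directly. The following is a minimal sketch on a hypothetical four-leaf genealogy; the node labels, node ages and the `parent`/`time` dictionaries are made up for illustration and do not come from the thesis.

```python
# Hypothetical toy genealogy: leaves 0..3, internal nodes 4..6 (6 is the root).
# parent[v] is the parent of node v; time[v] is the age of node v (leaves have age 0).
parent = {0: 4, 1: 4, 2: 5, 3: 5, 4: 6, 5: 6, 6: None}
time = {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.5, 5: 1.0, 6: 2.0}


def ancestors(node):
    """Return the nodes on the path from `node` up to the root (inclusive)."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def t_mrca():
    """Height of the tree: the age of the root."""
    return max(time.values())


def t_tot():
    """Total size of the tree: the sum of all branch lengths."""
    return sum(time[parent[v]] - time[v] for v in parent if parent[v] is not None)


def t_pair(i, j):
    """Pairwise coalescence time: age of the most recent common ancestor of i and j."""
    above_i = set(ancestors(i))
    for node in ancestors(j):  # walk up from j until the path meets i's lineage
        if node in above_i:
            return time[node]
    raise ValueError("nodes are not in the same tree")


print(t_mrca(), t_tot(), t_pair(0, 1), t_pair(0, 2))
```

Storing parent pointers and node ages keeps each quantity a one-pass computation; a real analysis would take the genealogy from a coalescent simulator instead of hard-coding it.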
With a slight abuse of notation, let $K_t$ denote the number of mutations that occur on a path with length $t$. If mutations are generated by a Poisson process with intensity [mutation rate] $\theta$, the probability that the path accumulates a mutation depends only on its relative length, not on its position within the genealogy. In particular, $P\{K_t = 0\} = E\{e^{-\theta t}\}$. Similarly, let $K_{tot} \equiv K_{t_{tot}}$ denote the number of mutations that occur throughout the genealogy. Thus $\{K_{tot} > 0\}$ is the event that the site segregates in the sample. For a fixed mutation rate $\theta$, the probability that at least one mutation occurs on a path with length $t$ and none in the rest of the tree is
$$P\{K_t > 0,\ K_{t_{tot} - t} = 0\} = E\{(1 - e^{-\theta t})\, e^{-\theta (t_{tot} - t)}\}. \tag{2.1}$$
Similarly, the probability that the site segregates is
$$P\{K_{tot} > 0\} = E\{1 - e^{-\theta t_{tot}}\}, \tag{2.2}$$
where the expectation is with respect to all possible genealogies of the sample.
[Nielsen, 2000] argues that if we assume the mutation rate is low and condition on the site segregating in the sample, then the mutation rate $\theta$ is of little interest and so it can be treated as a nuisance parameter. Following [Nielsen, 2000] we can eliminate $\theta$ from the analysis by taking the limit $\theta \to 0$. Under the infinitely-many-sites model [Kimura, 1969], the event of at least one mutation is equivalent to the event of exactly one mutation. [Alternatively, without explicitly making the infinitely-many-sites assumption, we can ignore the probability of the event $\{K_t > 1\}$ if the mutation rate $\theta$ is very low.] Therefore, $\{K_t = 0\}$ and $\{K_t = 1\}$ are complementary events. Together with the low mutation limit $\theta \to 0$, this implies
$$P\{K_t = 1 \mid K_{tot} = 1\} = \frac{P\{K_t > 0,\ K_{tot} > 0\}}{P\{K_{tot} > 0\}} = \frac{P\{K_t > 0,\ K_{t_{tot} - t} = 0\}}{P\{K_{tot} > 0\}} \tag{2.3a}$$
$$= \lim_{\theta \to 0} \frac{\theta^{-1} E\{e^{-\theta (t_{tot} - t)} - e^{-\theta t_{tot}}\}}{\theta^{-1} E\{1 - e^{-\theta t_{tot}}\}} \tag{2.3b}$$
[Interchange limit and expectation, valid if $E\{t_{tot}\} < \infty$, and use the Taylor approximation $e^{-x} \sim 1 - x$.]
$$= \frac{E\{t_{tot} - (t_{tot} - t)\}}{E\{t_{tot}\}} = \frac{E\{t\}}{E\{t_{tot}\}} \equiv \frac{T}{T_{tot}}, \tag{2.3c}$$
where for convenience we denote the expectation of coalescence time $t$ by $T$.
Therefore, for biallelic markers and under the conditions specified above, there is a relationship between expected coalescence times and the probability that a particular branch in the genealogy carries the derived allele. We will use it to derive the first two moments of the genotype vector $Z = (Z_1, \dots, Z_n)$.
Proposition 2.1 Suppose that a sample of size $n$ is collected from a population that evolves according to the neutral infinitely-many-sites model, where mutations are generated by a Poisson process with low mutation rate. At segregating sites where exactly one mutation occurs in the sample, the allele carried by individual $i$, denoted by $Z_i$, is a binary random variable such that
$$E^{*}\{Z_i\} = \frac{T_{mrca}}{T_{tot}}. \tag{2.4}$$
Furthermore, for two distinct individuals $i$ and $j$,
$$E^{*}\{Z_i Z_j\} = \frac{T_{mrca} - T_{ij}}{T_{tot}}. \tag{2.5}$$
Here the symbol $*$ indicates that the expectation is with respect to all possible sample genealogies with exactly one mutation.
Proof. In a genealogical tree with exactly one mutation, the $i$th lineage carries the derived allele if the mutation occurs anywhere on the path from the $i$th external branch to the most recent common ancestor of the entire sample. This path has length $t_{mrca}$ for all $i$; its average length is $T_{mrca}$. Therefore, the conditional probability of observing the derived allele is the same for every individual and
$$E^{*}\{Z_i\} = P\{K_{t_{mrca}} = 1 \mid K_{tot} = 1\} = \frac{T_{mrca}}{T_{tot}}. \tag{2.6}$$
That is, the genotypes at a biallelic marker are Bernoulli random variables with frequency $T_{mrca}/T_{tot}$. Furthermore, since the genotypes are binary, the event $\{Z_i = 1, Z_j = 1\} \equiv \{Z_i Z_j = 1\}$ implies that the mutation occurs on the branch from the pair's most recent common ancestor to the most recent common ancestor of the sample. This ancestral branch has length $t_{mrca} - t_{ij}$. Therefore, the conditional expectation that two individuals $i$ and $j$ carry a common mutation at a biallelic marker is given by
$$E\{Z_i Z_j \mid K_{tot} = 1\} = P\{K_{t_{mrca} - t_{ij}} = 1 \mid K_{tot} = 1\} = \frac{E\{t_{mrca}\} - E\{t_{ij}\}}{E\{t_{tot}\}}. \tag{2.7}$$
∎
Thus the individual genotypes have the same marginal distribution: the $Z_i$s are identically distributed but not independent. Finally, in equations (2.4) and (2.5) the expected coalescence times $T_{mrca}$, $T_{tot}$, $T_{ij}$ are marginal expectations with respect to all possible histories [genealogies] of the sample, not only histories that can induce the observed pattern of 0s and 1s.
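As a worked instance of equations (2.4) and (2.5), the sketch below evaluates both moments for a sample of size 6. It assumes a single randomly mating population under the standard coalescent, with time scaled so that a pair of lineages coalesces at rate 1; the thesis has not imposed that assumption at this point, and the function names are hypothetical.

```python
from fractions import Fraction


def kingman_expected_times(n):
    """Expected T_mrca, T_tot and T_ij for n lineages under the standard coalescent.

    Time is measured so that a pair of lineages coalesces at rate 1.
    """
    # While k lineages remain, the waiting time has mean 1 / C(k, 2) = 2 / (k (k - 1)).
    waits = {k: Fraction(2, k * (k - 1)) for k in range(2, n + 1)}
    T_mrca = sum(waits.values())                  # height of the tree
    T_tot = sum(k * w for k, w in waits.items())  # k branches grow during each wait
    T_pair = Fraction(1)                          # a single pair coalesces at rate 1
    return T_mrca, T_tot, T_pair


def genotype_moments(n):
    """First two moments from Proposition 2.1: E*{Z_i} and E*{Z_i Z_j}."""
    T_mrca, T_tot, T_pair = kingman_expected_times(n)
    return T_mrca / T_tot, (T_mrca - T_pair) / T_tot


mean_Zi, mean_ZiZj = genotype_moments(6)
print(mean_Zi, mean_ZiZj)  # 50/137 20/137
```

Exact fractions make the ratio structure of the proposition visible: both moments share the denominator $T_{tot}$ and differ only in the length of the branch that must carry the mutation.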
The principle behind equation (2.5) states that the more history two individuals share, the more genetically similar they are. Here we should interpret "shared history" precisely as "common ancestral branch" in the genealogy rather than broadly as a "demographic past" in the sense of evolutionary history. Different models can produce the same expected genealogy. For example, a long branch separating two samples could correspond to a split into distinct subpopulations some time in the past or constant migration between two locations at a low rate. Conversely, without further assumptions, patterns of similarities and differentiation observed in genetic data reveal information about the underlying genealogies, and hence, indirectly, about the demographic model that generated them. In this thesis we average observed genetic similarities across markers; thus we ignore information (e.g., the variance) that could in principle improve the ability to distinguish demographic models.
2.1.1 Bias due to SNP ascertainment
Ascertainment bias refers to systematic deviations in the SNP discovery process where a small number of individuals are used to find sites polymorphic in the entire population [Clark et al., 2005]. In particular, rare SNPs are harder to ascertain and more likely to be underrepresented. Furthermore, the genetic variation in a geographic region could be misrepresented in a panel with unbalanced sample configuration. [McVean, 2009] observes that two samples are effectively involved in ascertainment: first a panel to discover SNPs for genotyping on a microchip and then a sample to genotype. We condition on sites that segregate in both samples and this can distort (the average shape of) the observed genealogies and thus produce misleading results. In this thesis we ignore SNP ascertainment as a potential source of sample structure.
2.2 Isolation by distance in a spatially distributed population
Geographic separation can act as a genetic barrier because in a natural population migration tends to be local rather than long-range. If long-distance migration events are rare, a mutation that arises in one area might take a long time to spread throughout the habitat (if at all). Consequently, individuals that are closer together tend to be more similar genetically than those that are far apart. This phenomenon is known as isolation by distance. However, the relationship between geography and genetic similarity also depends on dispersal. If the habitat is homogeneous and migration is characterized by the same dispersal density everywhere, genetic similarity decreases as a function of relative distance.
The effect of subdivision on population structure is often quantified in terms of a statistic called $F_{ST}$ that measures the genetic variation among subpopulations relative to the total genetic variation. Several definitions of $F_{ST}$ have been proposed [Wright, 1943, Cockerham, 1969, Nei, 1973]. [In [Wright, 1943], $F_{ST}$ is introduced as the statistic $\mathrm{var}\{p\}/[\bar{p}(1 - \bar{p})]$ where $\mathrm{var}\{p\}$ is the variance in allele frequency among subpopulations and $\bar{p}$ is the overall mean allele frequency in the population. Intuitively, $F_{ST}$ is high when individuals are similar within subpopulations and different between subpopulations.] We use Nei's definition where $F_{ST}$ is a function of the probabilities of identity within and between subpopulations. [Two lineages are identical, at a given locus, if they carry the same allele.] The $F$-statistic is defined as
$$F_{ST} = \frac{Q_0 - Q}{1 - Q}, \tag{2.8}$$
where $Q$ is the probability of identity for two individuals chosen at random without reference to geography, and $Q_0$ is the probability of identity for two individuals chosen at random from the same subpopulation.
As a measure of genetic differentiation, the $F$-statistic is related to coalescence times because identity means neither lineage accumulates a mutation in the time to the most recent common ancestor. If the mutation process is Poisson with low mutation rate $\theta$,
$$Q(\theta) = E\{e^{-\theta t}\} \approx 1 - \theta E\{t\}. \tag{2.9}$$
In this case, by substituting $Q_0 \approx 1 - \theta T_0$ and $Q \approx 1 - \theta T$ in equation (2.8), [Slatkin, 1991] derives the approximation
$$F_{ST} \approx \frac{T - T_0}{T}, \tag{2.10}$$
where $T_0$, $T$ are the expected coalescence times for a pair of distinct lineages sampled at random from the same subpopulation and from the entire population, respectively. The coalescent-based approximation to the $F$-statistic is very general: [Slatkin, 1991] derives it in the low mutation limit but otherwise makes no assumptions about the demographic model. Thus, the approximation holds under a subdivided population at equilibrium, a growing population, or a population that has undergone a split some time in the past.
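A quick numerical check of the low-mutation approximation can make the limit concrete. The sketch below assumes, purely for illustration, that coalescence times are exponentially distributed, so that $E\{e^{-\theta t}\} = 1/(1 + \theta E\{t\})$ in closed form; the expected times `T0` and `T` are made-up values and the function names are hypothetical.

```python
def identity_probability(theta, mean_time):
    """Q(theta) = E{exp(-theta t)} when t is exponential with the given mean."""
    return 1.0 / (1.0 + theta * mean_time)


def fst_from_identity(q0, q):
    """Equation (2.8): F_ST from probabilities of identity."""
    return (q0 - q) / (1.0 - q)


def fst_from_times(t0, t):
    """Equation (2.10): low-mutation approximation from expected coalescence times."""
    return (t - t0) / t


T0, T = 2.0, 3.0  # hypothetical expected coalescence times (within, overall)
for theta in (1e-1, 1e-2, 1e-4):
    exact = fst_from_identity(identity_probability(theta, T0),
                              identity_probability(theta, T))
    print(theta, exact, fst_from_times(T0, T))
```

Under this exponential assumption the identity-based value equals $(T - T_0)/[T(1 + \theta T_0)]$, so the gap to the coalescent-based approximation shrinks linearly in $\theta$.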
By analogy, [Rousset, 1997] considers the $F$-statistic for two demes separated by distance $x$,
$$F_{ST}(x) = \frac{Q_0 - Q_x}{1 - Q_x} \approx \frac{T_x - T_0}{T_x}, \tag{2.11}$$
as well as the linearized $F$-statistic given by
$$\frac{F_{ST}(x)}{1 - F_{ST}(x)} \approx \frac{T_x - T_0}{T_0}. \tag{2.12}$$
[Rousset, 1997] analyzes the relationship between genetic differentiation, $F_{ST}$, and geographic distance, $x$, in a spatially-homogeneous stepping-stone model where demes are equally sized and regularly spaced (on a ring in one dimension and a torus in two dimensions), and migration is determined by a symmetric dispersal kernel. The important demographic parameters are the effective population density $D$ per length/area unit and the mean squared dispersal distance $\sigma^2$, which determines the speed at which two lineages with a common ancestor move away from each other in a generation. By symmetry, the probability of identity for two randomly sampled individuals is also a function of the relative distance $x$. [Rousset, 1997, 2004] derives the following large-distance approximations to the linearized $F_{ST}$,
$$\frac{F_{ST}(x)}{1 - F_{ST}(x)} \approx \frac{x}{4 D \sigma^2} + C_1 \quad \text{(in one dimension)}, \tag{2.13a}$$
$$\frac{F_{ST}(x)}{1 - F_{ST}(x)} \approx \frac{\ln(x/\sigma)}{4 \pi D \sigma^2} + C_2 \quad \text{(in two dimensions)}, \tag{2.13b}$$
where the constants $C_1$ and $C_2$ depend on the population density and the dispersal distribution but not on the population sizes or the mutation rate.
Therefore, if migration is uniform, the linearized $F_{ST}$ increases with geographic distance. This relationship is appropriate only for homogeneous habitats as it ignores the effect of barriers (or corridors) to migration: two demes separated by a barrier would appear to be more genetically dissimilar than relative distance would suggest. In other words, we need a measure of effective distance to describe the patterns of movement across the habitat.
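Under uniform migration, equations (2.13a) and (2.13b) suggest a simple regression: the slope of the linearized $F_{ST}$ against $x$ (one dimension) or $\ln x$ (two dimensions) determines the product $D\sigma^2$. The sketch below illustrates this with ordinary least squares; the helper names are hypothetical and the data are synthetic, generated from a made-up value of $D\sigma^2$ only to show that the slope recovers it.

```python
import math


def least_squares_line(xs, ys):
    """Ordinary least-squares fit y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def estimate_d_sigma2(distances, fst, dimensions):
    """Estimate D * sigma^2 from the slope of linearized F_ST against distance.

    (2.13a): slope = 1 / (4 D sigma^2) against x in one dimension.
    (2.13b): slope = 1 / (4 pi D sigma^2) against ln(x) in two dimensions;
             the term -ln(sigma) is absorbed into the intercept.
    """
    linearized = [f / (1.0 - f) for f in fst]
    if dimensions == 1:
        _, slope = least_squares_line(distances, linearized)
        return 1.0 / (4.0 * slope)
    _, slope = least_squares_line([math.log(x) for x in distances], linearized)
    return 1.0 / (4.0 * math.pi * slope)


true_d_sigma2 = 5.0  # made-up value used only to generate synthetic data
distances = [1.0, 2.0, 4.0, 8.0, 16.0]
linearized = [math.log(x) / (4.0 * math.pi * true_d_sigma2) + 0.01 for x in distances]
fst = [y / (1.0 + y) for y in linearized]  # invert y = F / (1 - F)
print(estimate_d_sigma2(distances, fst, dimensions=2))  # close to 5.0
```

A single slope summarizes the whole habitat, which is exactly why this regression cannot detect a barrier: pairs separated by a barrier show up only as points above the fitted line.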
2.3 The stepping-stone model of population subdivision
Section 2.1 explains that coalescence times represent population structure because genetically similar individuals are likely to have a recent common ancestor and thus a shorter coalescence time. The relationship between genetic correlation and coalescence times in equation (2.5) is very general. For example, [McVean, 2009] considers a model of population split in which groups derived from a common ancestor do not exchange migrants and thus develop independently after the split. In this thesis we aim to analyze the spatial structure of genetic variation, and therefore, we need to model [and apply equation (2.5) to] a spatially distributed population.
Kimura's stepping-stone model [Kimura and Weiss, 1964] represents a population across the span of its habitat as a connected grid of panmictic (randomly mating) demes (colonies) which exchange migrants in a fixed pattern. For simplicity, in this chapter we consider a haploid population. [A haploid organism has a single copy of its genome; a diploid organism has two copies, one inherited from the father and the other from the mother.] To extend the framework, a diploid individual can be represented as the sum of two independent haplotypes, one from each parent.
The stepping-stone model makes the following assumptions:
• There are $d$ demes and deme $\alpha$ consists of $N_\alpha$ randomly mating individuals. The total population number is $N_T = \sum_\alpha N_\alpha$ and the average deme size is $N_0 = N_T/d$. The demes remain constant in size and $N_\alpha \sim \mathcal{O}(N_0)$ for all $\alpha$.
• The mutation rate per site per generation is $u$ and the scaled mutation rate for two distinct lineages in $N_0$ generations is $\theta = 2 N_0 u$.
• The coalescence rate for a pair of distinct lineages drawn at random from deme $\alpha$ is $q_\alpha = N_0/N_\alpha \sim \mathcal{O}(1)$. Two lineages coalesce when they merge into a common ancestor and in a single generation this event has probability $1/N_\alpha$. [The ancestral process develops backwards in time, from the present towards the past. A coalescence event means that two individuals have the same parent and a migration event means that an individual from $\alpha$ has a parent from $\beta$.]
• The migration rate for a lineage to move from deme $\alpha$ to deme $\beta \neq \alpha$ is $m_{\alpha\beta} \sim \mathcal{O}(1)$. The migration matrix $M = (m_{\alpha\beta})$, where $m_{\alpha\alpha} = -\sum_{\beta : \beta \neq \alpha} m_{\alpha\beta}$, describes the transition process of a single lineage backwards in time.
All rate parameters are constant in time and on the scale of $N_0$ generations. The assumptions $q_\alpha \sim \mathcal{O}(1)$ for every deme $\alpha$ and $m_{\alpha\beta} \sim \mathcal{O}(1)$ for every pair $(\alpha \neq \beta)$ imply that migration is weak. That is, the probability of multiple migration and/or coalescence events occurring in the same generation [before scaling by $N_0$] is $\mathcal{O}(N_0^{-2})$ and can be ignored.
The stepping-stone model describes how a spatially distributed population evolves under equilibrium in time, i.e., under the condition that both migration and coalescence rates are the same in every generation. Therefore, the model can characterize systematic differences between the groups due to gene flow but not due to splits or admixture events. In other words, the stepping-stone model can represent population structure in space but not in time. [As we show through simulations in Chapter 5, temporal structure can be explained as spatial structure, in terms of effective rates of migration.]
If demes of constant size exchange migrants at fixed rates as required under equilibrium, the number of individuals to emigrate is equal to the number of individuals to immigrate, i.e., migration is conservative [Nagylaki, 1980]. Mathematically,
$$\sum_{\beta : \beta \neq \alpha} m_{\alpha\beta}/q_\alpha = \sum_{\beta : \beta \neq \alpha} m_{\beta\alpha}/q_\beta \iff M' q^{-1} = 0, \tag{2.14}$$
where $q^{-1} = (q_\alpha^{-1}) = (N_\alpha/N_0)$ is the vector of inverse coalescence rates, i.e., relative deme sizes.
In a general stepping-stone model, migration is not necessarily symmetric. However, in this thesis we assume that $m_{\alpha\beta} = m_{\beta\alpha}$ for all edges $(\alpha, \beta)$. The condition that migration is both symmetric and conservative implies that all demes have the same size: on one hand, $M q^{-1} = M' q^{-1} = 0$, and on the other, $M 1 = 0$ as $M$ is a Laplacian matrix; since the population graph is connected, the null space of $M$ is spanned by the vector of ones and hence $q \propto 1$. Thus the average deme size $N_0 = N_T/d$ is a convenient choice for the coalescent timescale.
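The conservation condition (2.14) is easy to check numerically. The sketch below builds a symmetric migration matrix for a hypothetical three-deme path graph (the edge rates and deme sizes are made up) and evaluates $M' q^{-1}$ for equal and unequal relative deme sizes.

```python
def migration_matrix(d, edges):
    """Build M = (m_ab) from symmetric edge rates, with m_aa = -sum_{b != a} m_ab."""
    M = [[0.0] * d for _ in range(d)]
    for (a, b), rate in edges.items():
        M[a][b] = M[b][a] = rate
    for a in range(d):
        M[a][a] = -sum(M[a][b] for b in range(d) if b != a)
    return M


def conservation_residual(M, inv_q):
    """Return M' q^{-1}; migration is conservative when every entry is zero (2.14)."""
    d = len(M)
    return [sum(M[b][a] * inv_q[b] for b in range(d)) for a in range(d)]


# Hypothetical path graph 0 - 1 - 2 with a slow edge between demes 1 and 2.
M = migration_matrix(3, {(0, 1): 1.0, (1, 2): 0.1})
equal_sizes = [1.0, 1.0, 1.0]    # q^{-1} proportional to the vector of ones
unequal_sizes = [2.0, 0.5, 0.5]  # relative deme sizes N_a / N_0 that differ

print(conservation_residual(M, equal_sizes))    # zero up to floating-point rounding
print(conservation_residual(M, unequal_sizes))  # clearly nonzero entries
```

With symmetric rates, only equal deme sizes balance emigration and immigration in every deme, which is the argument made in the paragraph above.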
The stepping-stone model characterizes dispersal not in terms of an explicit dispersal density but indirectly through the combined effect of the graph topology and the migration rates. It may not seem natural to represent the geographic distribution of organisms with a graph. However, discrete models for migration are common in population genetics. In fact, a continuous model of isolation by distance (with normal dispersal and continuous spatial distribution) can lead to inconsistencies [Felsenstein, 1975].
2.3.1 Expected coalescence times in a subdivided population
In Section 2.1 we described how the probability that two individuals both carry the derived allele is related to the expected coalescence time to their most recent common ancestor. We will use this connection between genetic similarity and shared ancestry to analyze the spatial structure in genetic variation, and in particular, to estimate migration rates. The inference procedure requires that we express pairwise coalescence times as functions of migration rates.
The coalescent process can be extended to represent the ancestry of a sample from the stepping-stone model [Notohara, 1990, 1993]. This version, called the structured coalescent, describes the movement of lineages between demes as well as their coalescence into common ancestors. We can use the properties of the structured coalescent [as we do in Appendix 7.1] to derive the following system of linear equations for the pairwise expected coalescence times $T = (T_{\alpha\beta})$ as a function of the coalescence rates $q = (q_\alpha)$ and the migration rates $M = (m_{\alpha\beta})$:
$$\operatorname{diag}\{q\} \operatorname{diag}\{T\} - M T - T M' = 1 1'. \tag{2.15}$$
Furthermore, if the migration rates are symmetric, as we assume throughout, then there is no variation in coalescence rates across demes, i.e., $q = 1$, $M = M'$ and
$$\operatorname{diag}\{T\} - M T - T M = 1 1'. \tag{2.16}$$
In equation (2.15), $T_{\alpha\beta}$ is the expected coalescence time between two randomly chosen lineages, one from $\alpha$ and the other from $\beta$. In equation (2.5), $T_{ij}$ is the expected coalescence time between two sampled individuals $i \in \alpha_i$ and $j \in \alpha_j$. Crucially, the pairwise coalescence times do not depend on the sample configuration, $\alpha = (\alpha_1, \ldots, \alpha_n)$, because individuals are exchangeable within each deme. [Individuals are exchangeable within demes but not across demes because the sample location is informative about the alleles an individual carries.] Therefore, the expected coalescence time for an observed pair $(i \in \alpha_i, j \in \alpha_j)$ is the same as the expected coalescence time for any pair $(i' \in \alpha_i, j' \in \alpha_j)$ from the subdivided population:
$$T_{ij} = T_{\alpha_i \alpha_j}. \tag{2.17}$$
Notation: We use Greek letters [$\alpha, \beta$] to denote subpopulations and Latin letters [$i, j$] to denote sampled individuals. We also distinguish between the population matrix $T = (T_{\alpha\beta} : \text{demes } \alpha, \beta)$ and the sample matrix $\mathbf{T} = (T_{ij} : \text{individuals } i, j)$, where $\mathbf{T} = T(\alpha) - \operatorname{diag}\{T(\alpha)\}$. The diagonal is subtracted because the coalescence time with self is always 0.
In any population graph, $T_{\alpha\beta} > T_{\alpha\alpha}$ because coalescence is possible only for lineages in the same deme. Moreover, if $\alpha$ and $\beta$ are separated by a barrier, fewer migrants move between $\alpha$ and $\beta$, and so the pairwise coalescence time $T_{\alpha\beta}$ would be larger than the time expected under isolation by distance, i.e., uniform migration. Thus, the matrix of pairwise coalescence times $T = (T_{\alpha\beta})$ would contain evidence for habitat heterogeneity.
Since longer coalescence times mean less genetic similarity, coalescence times are a natural measure of genetic dissimilarity and hence of population structure. For the stepping-stone model we can compute the matrix of expected coalescence times, $T$, given the migration rates $M$ and the coalescence rates $q$ using equation (2.15). Alternatively, there exists a computationally efficient method to approximate $T$, which we discuss next.
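As a minimal numerical sketch of equation (2.15), the linear system can be solved directly for a small graph by vectorizing $T$. The function name `coalescence_times` is illustrative only. The code assumes that the input holds the off-diagonal rates $m_{\alpha\beta}$ and that the diagonal of $M$ is set to $-\sum_{\beta \neq \alpha} m_{\alpha\beta}$, so that every row sums to zero; this convention is an assumption and is not stated explicitly in the text above.

```python
import numpy as np

def coalescence_times(rates, q=None):
    """Solve diag{q} diag{T} - M T - T M' = 11' for T (equation 2.15)."""
    rates = np.asarray(rates, dtype=float)   # off-diagonal migration rates
    d = rates.shape[0]
    q = np.ones(d) if q is None else np.asarray(q, dtype=float)
    # Assumption: the diagonal of M makes every row sum to zero.
    M = rates - np.diag(rates.sum(axis=1))
    I = np.eye(d)
    # Row-major vectorization: vec(M T) = (M kron I) vec(T) and
    # vec(T M') = (I kron M) vec(T).
    A = -(np.kron(M, I) + np.kron(I, M))
    # diag{q} diag{T} only involves the entries T_aa, at positions a*d + a.
    pos = np.arange(d) * (d + 1)
    A[pos, pos] += q
    return np.linalg.solve(A, np.ones(d * d)).reshape(d, d)

# Two demes exchanging migrants at rate 0.5 (a two-island model).
T = coalescence_times([[0.0, 0.5], [0.5, 0.0]])
print(T)
```

For two demes with rate $m$, the system gives a within-deme time of 2 and a between-deme time of $2 + 1/(2m)$, so the example should return 2 on the diagonal and 3 off the diagonal. The dense Kronecker construction is only practical for small graphs; it is meant to make the structure of the system visible, not to be efficient.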
2.4 Isolation by resistance is a metric for gene flow
Isolation by resistance (IBR) [McRae, 2006, McRae et al., 2008] draws an analogy between a subdivided population in which neighboring demes exchange migrants and an electrical network in which current flows through conductors. [In other words, between Kimura's stepping-stone model and an undirected random walk.] To understand the analogy better, concepts in electrical networks can be given a population genetic interpretation [Table 2.1]. Using this correspondence between population genetics and circuit theory, McRae develops IBR to test whether putative barriers to gene flow affect genetic differentiation.
Isolation by resistance predicts effective distances from a raster grid of landscape resistance (or friction): each cell in the grid specifies how difficult it is for an animal to migrate locally, and these values are assigned based on expert knowledge about the species and the habitat. If the effective distances agree with the observed genetic dissimilarities, then the hypothesized grid explains the data well. Such a raster map could be hard to produce, especially at fine scales, and if the agreement is low, there is no method to facilitate improving the map of resistances. However, IBR does provide a useful and efficient approximation to expected coalescence times.
| Electrical term | Ecological interpretation |
|---|---|
| conductance $c_{xy}$, $\forall (x, y) \in E$ | direct migration $m_{\alpha\beta}$: the number of migrants exchanged between two neighboring demes $\alpha$ and $\beta$ in a single generation. (On the coalescent timescale $m_{\alpha\beta} = N_0 \tilde{m}_{\alpha\beta}$, where $\tilde{m}_{\alpha\beta}$ is the probability that a lineage in $\alpha$ has a parent from $\beta$.) |
| resistance $r_{xy} = 1/c_{xy}$ | cost $1/m_{\alpha\beta}$: measure of local landscape friction in the direction from $\alpha$ to $\beta$. If migration is symmetric, $m_{\alpha\beta} = m_{\beta\alpha}$. (Since $N_\alpha = N_0$, the $m_{\alpha\beta}$'s are comparable across the habitat.) |
| effective conductance $C_{xy}$, $\forall (x, y) \in V \times V$ | effective migration $M_{\alpha\beta}$: the number of migrants that would produce the same level of genetic differentiation between $\alpha$ and $\beta$ if these two demes made up a two-deme system. |
| effective resistance $R_{xy} = 1/C_{xy}$ | distance metric $R_{xy}$: quantifies the genetic differentiation between a pair of demes $(\alpha, \beta)$ by taking into account the existence of multiple pathways between them. |
Table 2.1: Circuit theory concepts and their ecological interpretation, adapted from [McRae et al., 2008]. McRae specifies the edge conductances as $c_{\alpha\beta} = m_{\alpha\beta}/q_\alpha$. However, it is natural to define conductances only in terms of the migration process because lineages cannot coalesce until they meet.
To begin with, consider a stepping-stone model that has only two demes, $\alpha$ and $\beta$, with equal size and a single edge with migration rate $m_{\alpha\beta}$. In this population, also known as a two-island model,
$$m_{\alpha\beta} = \frac{(T_{\alpha\alpha} + T_{\beta\beta})/8}{T_{\alpha\beta} - (T_{\alpha\alpha} + T_{\beta\beta})/2}. \tag{2.18a}$$
[This follows from the system of linear equations (2.15).] The two-island model is a very special case and equation (2.18a) does not hold more generally. In fact, unless the population graph is fully connected, many pairs of demes do not exchange migrants directly and then $m_{\alpha\beta} = 0$. However, [McRae, 2006] extends the relevance of the relationship (2.18a) to the general stepping-stone model by introducing the concept of effective migration $M_{\alpha\beta}$ between $\alpha$ and $\beta$. It is given by
$$M_{\alpha\beta} \equiv \frac{(T_{\alpha\alpha} + T_{\beta\beta})/8}{T_{\alpha\beta} - (T_{\alpha\alpha} + T_{\beta\beta})/2}. \tag{2.19}$$
That is, the effective migration $M_{\alpha\beta}$ is the number of migrants that would produce the actual genetic differentiation between $\alpha$ and $\beta$ in a hypothetical two-island system. Since two lineages take the same time to reach their common ancestor, $T_{\alpha\beta} = T_{\beta\alpha}$, and the definition (2.19) implies that effective migration is always symmetric even though the underlying true migration patterns might not be symmetric.
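The definition (2.19) is easy to evaluate once a matrix of expected coalescence times is available. The sketch below is illustrative: the function name `effective_migration` is hypothetical, and the choice to return NaN on the diagonal (where the definition divides zero by zero) is an assumption made here for convenience.

```python
import numpy as np

def effective_migration(T):
    """Effective migration for every pair of demes, equation (2.19)."""
    T = np.asarray(T, dtype=float)
    within = np.diag(T)
    # (T_aa + T_bb) / 2 for every pair of demes (a, b).
    avg_within = (within[:, None] + within[None, :]) / 2.0
    excess = T - avg_within                 # T_ab - (T_aa + T_bb) / 2
    eff = np.full(T.shape, np.nan)          # left undefined for a == b
    off = ~np.eye(T.shape[0], dtype=bool)
    eff[off] = (avg_within[off] / 4.0) / excess[off]
    return eff

# Two-island coalescence times for migration rate 0.5:
# within-deme time 2, between-deme time 2 + 1/(2 * 0.5) = 3.
print(effective_migration([[2.0, 3.0], [3.0, 2.0]]))
```

In the two-island case the effective migration recovers the direct rate, as equation (2.18a) requires. The result is symmetric whenever $T$ is symmetric.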
It is natural to relate the concept of effective migration in a subdivided population, $M_{\alpha\beta}$, and the concept of effective conductance in an electrical circuit, $C_{\alpha\beta}$. In circuit theory, $C_{\alpha\beta}$ is the conductance in a two-node, single-conductor network required to produce the same amount of current between $\alpha$ and $\beta$ as in the original network.
Proposition 2.2 Consider a population graph $(V, E)$ with symmetric migration rates $\{m_{\alpha\beta} : \forall (\alpha, \beta) \in E\}$. This corresponds to a circuit network $(V, E)$ with conductances $\{c_{\alpha\beta} = m_{\alpha\beta}\}$. For every pair $(\alpha, \beta) \in V \times V$, the effective conductance $C_{\alpha\beta}$ in the circuit is a measure of the effective migration $M_{\alpha\beta}$ in the population:
$$M_{\alpha\beta} \approx C_{\alpha\beta}. \tag{2.20}$$
And thus the resistance distance $R_{\alpha\beta} = 1/C_{\alpha\beta}$ is a measure of genetic differentiation.
Proof. The relationship between effective migration and effective conductance is exact only if migration is isotropic, i.e., demes are equivalent with respect to the size and pattern of movement. Here we assume only that migration is symmetric and conservative.
The migration process can be represented as a continuous-time discrete-space random walk on an undirected graph [Levin et al., 2008]. Then $M = (m_{\alpha\beta})$ specifies the transition kernel of the embedded jump chain, which determines the sequence of locations occupied by the lineage, and $m_\alpha = \sum_{\beta : \beta \neq \alpha} m_{\alpha\beta}$ are the rates of the holding distributions, which determine the waiting times before jumps. Let $m = (1/d) \sum_\alpha m_\alpha$ be the average holding rate.
Since migration is symmetric and conservative, the demes have the same size $N_0$, which is also a convenient choice for the coalescent timescale. Let $T_0$ be the average within-deme expected coalescence time. Then by Strobeck's theorem [see equation (7.8) in Appendix 7.1],
$$T_0 \equiv \sum_\alpha T_{\alpha\alpha}/d = d. \tag{2.21}$$
Thus $T_0$ does not depend on the migration process.
Furthermore, let $S_{\alpha\beta}$ be the expected time for two lineages, one from $\alpha$ and the other from $\beta$, to occupy the same deme. Then
$$(T_{\alpha\alpha} + T_{\beta\beta})/2 \approx T_0, \tag{2.22a}$$
$$T_{\alpha\beta} - (T_{\alpha\alpha} + T_{\beta\beta})/2 \approx S_{\alpha\beta}. \tag{2.22b}$$
These two approximations are exact if migration is isotropic: since the demes are equivalent with respect to the migration process, the within-deme coalescence times $T_{\alpha\alpha}$ must be equal by symmetry. Hence, $T_{\alpha\alpha} = T_0$, $T_{\alpha\beta} = S_{\alpha\beta} + T_0$, and once the lineages meet for the first time, we can restart the random walk with two lineages in the same deme chosen at random.
Under the coalescent process, two lineages (one from $\alpha$ and another from $\beta$) move simultaneously until they coalesce into a common ancestor. Suppose that they meet for the first time in deme $\gamma$. Together the paths $\alpha \to \gamma$ and $\beta \to \gamma$ have half the length of a commute between $\alpha$ and $\beta$ that passes through $\gamma$. Therefore, the expected time to first meet, $S_{\alpha\beta}$, can be related to the expected commute length, $K_{\alpha\beta}$, in the corresponding circuit network:
$$S_{\alpha\beta} \approx K_{\alpha\beta}/(4m), \tag{2.23}$$
where $K_{\alpha\beta}$ is the expected number of jumps in a random walk that starts at $\alpha$, visits $\beta$ and returns to $\alpha$, and $1/(2m)$ is the average waiting time before either lineage jumps. The relationship is approximate because the waiting time varies across vertices.
Finally, by [Chandra et al., 1996] for an undirected graph [whether isotropic or not],
$$K_{\alpha\beta} = c_G R_{\alpha\beta} = c_G / C_{\alpha\beta}, \tag{2.24}$$
where $R_{\alpha\beta}$ is the effective resistance between nodes $\alpha$ and $\beta$, $C_{\alpha\beta}$ is the effective conductance, and $c_G$ is the total conductance of the network given by
$$c_G = \sum_\alpha \sum_{\beta : \beta \neq \alpha} c_{\alpha\beta} = \sum_\alpha m_\alpha = dm. \tag{2.25}$$
Therefore,
$$M_{\alpha\beta} = \frac{(T_{\alpha\alpha} + T_{\beta\beta})/8}{T_{\alpha\beta} - (T_{\alpha\alpha} + T_{\beta\beta})/2} \approx \frac{T_0}{4 S_{\alpha\beta}} \approx \frac{d/4}{R_{\alpha\beta}\,(dm)/(4m)} = C_{\alpha\beta}. \tag{2.26}$$
$\square$
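The commute-time identity (2.24) can be checked numerically on a small weighted graph. The sketch below is an illustration under stated assumptions: the three-vertex path graph and the helper names `hitting_steps`, `commute_length` and `effective_resistance` are hypothetical, and the effective resistance is computed from the pseudo-inverse of the graph Laplacian, which is one standard route rather than necessarily the one used in this thesis.

```python
import numpy as np

def hitting_steps(c, target):
    """Expected number of jumps to first reach `target` from each vertex."""
    c = np.asarray(c, dtype=float)
    d = c.shape[0]
    P = c / c.sum(axis=1, keepdims=True)      # jump-chain transition matrix
    keep = [v for v in range(d) if v != target]
    Q = P[np.ix_(keep, keep)]                 # transitions that avoid target
    h = np.zeros(d)
    h[keep] = np.linalg.solve(np.eye(d - 1) - Q, np.ones(d - 1))
    return h

def commute_length(c, a, b):
    """Expected jumps to go from a to b and back to a."""
    return hitting_steps(c, b)[a] + hitting_steps(c, a)[b]

def effective_resistance(c):
    """Effective resistances from the Laplacian pseudo-inverse."""
    c = np.asarray(c, dtype=float)
    L_pinv = np.linalg.pinv(np.diag(c.sum(axis=1)) - c)
    g = np.diag(L_pinv)
    return g[:, None] + g[None, :] - 2.0 * L_pinv

# Path graph 0 - 1 - 2 with conductances 1 and 2 (hypothetical values).
c = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 2.0],
              [0.0, 2.0, 0.0]])
c_G = c.sum()                                 # sum over ordered pairs
K = commute_length(c, 0, 2)
R = effective_resistance(c)[0, 2]
print(K, c_G * R)
```

Here the two resistors are in series, so $R_{02} = 1 + 1/2$, the total conductance is $c_G = 6$, and both printed values should equal 9 up to rounding.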
Essentially, McRae's approximation splits the between-deme coalescence time, $T_{\alpha\beta}$, into the time to first meet, $S_{\alpha\beta}$, and the average within-deme coalescence time, $T_0$. However, since the population graph is not necessarily symmetric, not every deme $\gamma$ is equally likely to be the deme where two lineages, starting from $\alpha$ and $\beta$, meet for the first time. Furthermore, the within-deme coalescence times are not necessarily equal. Therefore, the effective resistance metric reflects the migration process accurately but ignores the fact that the lineages do not necessarily coalesce at their first opportunity. On the other hand, the coalescence time metric correctly captures the effect of both processes because Kingman's coalescent models migration and coalescence by explicitly tracking both lineages until their common ancestor. Since higher rates imply faster mixing, we can conclude that the higher the migration rates are, the better McRae's approximation is. See Figures 2.2 and 2.3.
2.4.1 Effective resistance approximates expected coalescence time
McRae's method approximates the ancestral process of two lineages evolving simultaneously in terms of one lineage evolving at twice the rate. However, one random walk cannot represent a coalescence event where two lineages merge into their most recent common ancestor. Thus, while effective resistance, $R_{\alpha\beta}$, provides a measure for the genetic differentiation between demes, it does not capture the genetic differentiation between individuals from the same deme [$R_{\alpha\alpha} = 0$ for every deme $\alpha$]. However, it follows directly from McRae's approximation that
$$T_{\alpha\beta} \approx S_{\alpha\beta} + T_0 \approx T_0 (R_{\alpha\beta}/4 + 1), \tag{2.27}$$
or equivalently in matrix notation,
$$T \approx T_0 (R/4 + \mathbf{1}\mathbf{1}'). \tag{2.28}$$
The main advantage of approximating coalescence times in terms of effective resistances is computational efficiency. To compute $T$, we solve a linear system of equations $At = x$ with $d(d+1)/2$ unknowns that corresponds to eq. (2.16). In this problem $A$ is sparse (because the population graph $G$ is sparse) and positive definite, and so we can use an iterative preconditioned gradient method. There are several methods to compute $R$; we use a method that inverts the $d \times d$ matrix $M + \mathbf{1}\mathbf{1}'$ [Babić et al., 2002]. Since $A$ is of higher order than $M$, it is more efficient to compute $R$. Furthermore, $R$ gives a very good approximation to $T$ when migration rates are high, and it is more appropriate than other distance metrics such as Euclidean distance and least-cost path. Therefore, effective resistance offers a compromise between accuracy of representation and efficiency of computation.
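The following sketch illustrates the resistance computation and the approximation (2.28) on a dense matrix. It assumes that the matrix being inverted is the graph Laplacian plus $\mathbf{1}\mathbf{1}'$ (the sign convention for the diagonal of $M$ is an assumption here), that $T_0 = d$ as in equation (2.21), and the function names `effective_resistance` and `ibr_coalescence_times` are hypothetical.

```python
import numpy as np

def effective_resistance(rates):
    """Pairwise effective resistances for a connected, undirected graph."""
    c = np.asarray(rates, dtype=float)        # conductances c_ab = m_ab
    d = c.shape[0]
    laplacian = np.diag(c.sum(axis=1)) - c    # assumed sign convention
    # Adding 11' removes the null space; the extra constant it introduces
    # cancels in H_aa + H_bb - 2 H_ab.
    H = np.linalg.inv(laplacian + np.ones((d, d)))
    h = np.diag(H)
    return h[:, None] + h[None, :] - 2.0 * H

def ibr_coalescence_times(rates):
    """Approximation (2.28): T is roughly T_0 (R / 4 + 11') with T_0 = d."""
    R = effective_resistance(rates)
    d = R.shape[0]
    return d * (R / 4.0 + 1.0)

# Two-island model with rate 0.5: the approximation is exact here.
approx = ibr_coalescence_times([[0.0, 0.5], [0.5, 0.0]])
print(approx)
```

For the two-island model the resistance between the demes is $1/m = 2$, so the approximation returns 2 within demes and 3 between demes, which matches the exact solution of (2.15). On larger, non-isotropic graphs the two would differ, most noticeably at low migration rates.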
In this chapter we introduced two important components of our method for analyzing spatial population structure: the stepping-stone model and the effective resistance metric. In the next chapters we describe how we can estimate and visualize effective rates of migration from geographically referenced genetic data.
[Figure 2.2 (image not reproduced): a 3 × 3 array of scatterplots. Rows: isotropic on a circle; uniform on a grid; barrier on a grid. Columns: m = 0.01, m = 1, m = 100. x-axis: $T_{\alpha\beta} - (T_{\alpha\alpha} + T_{\beta\beta})/2$; y-axis: $R_{\alpha\beta}(d/4)$.]
Figure 2.2: On the $x$-axis, $T_{\alpha\beta} - (T_{\alpha\alpha} + T_{\beta\beta})/2$ is the expected time to reach the same deme; on the $y$-axis, $R_{\alpha\beta}(d/4)$ is the (appropriately scaled) effective resistance. As the migration rate increases, $R_{\alpha\beta}$ becomes a better approximation of the expected time to first meet, $S_{\alpha\beta}$, even if migration is not isotropic. [Results for a 5 × 4 regular triangular grid with uniform migration rate $m$ = 0.01, 1 or 100.]
[Figure 2.3 (image not reproduced): a 3 × 3 array of scatterplots. Rows: isotropic on a circle; uniform on a grid; barrier on a grid. Columns: m = 0.01, m = 1, m = 100. x-axis: $T_{\alpha\beta}$; y-axis: $d(R_{\alpha\beta}/4 + 1)$.]
Figure 2.3: On the $x$-axis, $T_{\alpha\beta}$ is the expected time to coalescence; on the $y$-axis, $d(R_{\alpha\beta}/4 + 1)$ is the IBR approximation. The approximation to the within-deme coalescence times, $T_{\alpha\alpha}$, is always $T_0 = d$; these are the points closest to the origin at $T_0 = 20$ in a 5 × 4 grid. Although the pattern does not change as the migration rate increases, the relative error $\Delta T_{\alpha\beta}/T_{\alpha\beta}$ decreases.
3
Genetic Dissimilarities and Distance Matrices
Habitat heterogeneity can shape genetic variation by reducing or increasing gene flow. The stepping-stone model is a natural representation of a spatially distributed population and the effects of gene flow on its genetic structure. In this thesis a population is a graph $G = (V, E, m)$ composed of vertices $V$ [randomly mating demes of equal size], edges $E$ [symmetric routes of migration between neighboring demes] and a weight function $m : E \to \mathbb{R}^+$ that specifies the rates at which migrants are exchanged.
Throughout, we will assume that the population graph $G$ is embedded in a two-dimensional habitat, with the vertex set $V$ and the edge set $E$ both fixed. In practice, this graph is not known and does not necessarily exist. For example, it might not be possible to split the population into distinct groups that satisfy the random mating assumption. Instead, we cover the habitat with a regular triangular grid in which vertices do not represent actual colonies. This simplification indicates that we should interpret the migration parameters carefully: as effective rather than actual rates of migration.
Thus the topology of the graph is determined by the shape of the habitat [and the somewhat arbitrary choice that the graph is triangular and regularly spaced] and not by the sample configuration or the sample "clusteredness". And so we construct the graph differently from methods that aim to subdivide the population into clusters that are similar within and dissimilar between. However, if we make the grid $(V, E)$ sufficiently fine, we can reasonably assume that each vertex represents a randomly mating group without further structure. In this case, individuals would be similar within demes but not necessarily dissimilar between demes.
In a habitat with uniform migration, the genetic differentiation between individuals from the same species is positively correlated with the distance between their origins; in a heterogeneous habitat, landscape features such as barriers or corridors create spatial structure in genetic variation. For example, individuals separated by a barrier are less closely related, and therefore less genetically similar, than if the barrier were absent. The stepping-stone model can represent such effects because some edges in the population graph can have high migration rates and others low. In this thesis we develop a Bayesian procedure to estimate the effective migration rates in a fixed grid $(V, E)$ of equally sized demes, from geographically indexed genetic data. The function $m$ measures the relative rate at which two connected demes exchange migrants; we call $m$ a migration surface.
To analyze population structure, we will assume that all genotyped sites develop under the same evolutionary process, which determines the expected structure in genetic correlations (or equivalently, genetic distances). In contrast, many methods for association testing assume that individuals are independent while sites are correlated. (In population genetics, the systematic association between loci is called linkage disequilibrium.) The problem at hand determines which assumption is appropriate to make. [The PCA decomposition of the observed covariance matrix $ZZ'$ can be used to correct for population stratification [Price et al., 2006]: incorporate the leading eigenvectors in a regression analysis that tests for association between sites and disease.]
To find associations between disease status and genetic makeup, it is reasonable to assume that the disease develops under the same mechanism in all sampled cases but not all sites contribute to the disease and not with equal effect. To analyze population structure, it is reasonable to assume that the same evolutionary process underlies all genotyped sites but not all sampled individuals are genetically similar to an equal degree.
In this chapter, let $z = (z_i : i = 1, \ldots, n)$ be a vector of $n$ genotypes at a single polymorphic site. We will consider multiple sites in the next chapter. Also let $\alpha = (\alpha_1, \ldots, \alpha_n)$ denote the sample configuration, in which $\alpha_i$ is the sampling location of the $i$th haplotype.
3.1 Mean and covariance of genotype vectors: SNPs
First we consider the simplest case: a haploid population from which we have a sample of $n$ individuals genotyped at a single nucleotide polymorphism (SNP). Following [McVean, 2009], we make the following assumptions:
A1. SNPs are identically distributed: Since all sites evolve under the same demographic model, the observed genotype $z_i$ at any SNP is a realization of the same random variable $Z_i$.
A2. SNPs segregate in the sample: Since exactly one mutation occurs in every sampled genealogy, we observe both the ancestral allele '0' and the derived allele '1' at every site.
A3. The scaled mutation rate $\theta$ is low: Since A2 and A3 together imply $\theta$ is a nuisance parameter [Nielsen, 2000], we can take the limit $\theta = 2 N_0 u \to 0$ and thus ignore small differences in mutation rate across SNPs.
Under these assumptions, the probability that individuals $i$ and $j$ share the derived mutation at a randomly chosen segregating site is given by
$$E_\ast\{Z_i Z_j\} = \frac{T_{mrca} - T_{ij}}{T_{tot}}, \tag{3.1}$$
where $T_{mrca}$ and $T_{tot}$ are the height and the size of the expected genealogy of the sample, and $T_{ij}$ is the expected time for $i$ and $j$ to coalesce in a sample of size 2 [McVean, 2009]. The symbol $\ast$ indicates the condition that both 0s and 1s are observed, i.e., the expectation on the left in equation (3.1) is with respect to all possible genealogies (observed or not) with exactly one mutation. The expectations on the right in equation (3.1) are unconditional. The relevance is that for Kimura's stepping-stone model there is an explicit formula for pairwise coalescence times, $T_{ij}$, and a good approximation in terms of effective resistances, $R_{ij}$.
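Furthermore, since the $Z_i$s are binary random variables and the time to coalescence with self is always 0,
$$E_\ast\{Z_i\} = E_\ast\{Z_i^2\} = \frac{T_{mrca}}{T_{tot}}. \tag{3.2}$$
Therefore, the expected genealogy fully specifies the first two moments of the allele count vector $Z = (Z_i)$ at a particular segregating SNP. In matrix notation,
$$E_\ast\{Z\} = \frac{T_{mrca}}{T_{tot}} \mathbf{1} \equiv \mu \mathbf{1}, \tag{3.3a}$$
$$\operatorname{var}_\ast\{Z\} = \frac{T_{mrca}}{T_{tot}} \left(1 - \frac{T_{mrca}}{T_{tot}}\right) \mathbf{1}\mathbf{1}' - \frac{1}{T_{tot}} \mathbf{T} \equiv \sigma^2 (\mathbf{1}\mathbf{1}' - \gamma \mathbf{T}). \tag{3.3b}$$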
For a sample with configuration $\alpha$ from a population with model $G$, the parameters are given by
$$\mu = \frac{T_{mrca}}{T_{tot}}, \qquad \sigma^2 = \frac{T_{mrca}}{T_{tot}} \left(1 - \frac{T_{mrca}}{T_{tot}}\right), \qquad \gamma \sigma^2 = \frac{1}{T_{tot}}, \tag{3.4}$$
where $\mathbf{T} = (T_{ij})$ is the matrix of expected pairwise coalescence times between sampled individuals. That is, $T_{ij}$ is the expected time to coalescence between $i \in \alpha_i$ and $j \in \alpha_j$ in a sample of size 2, regardless of the composition of the entire sample $\alpha$. Since $T_{ij}$ does not depend on the sample configuration or even the sample size $n$, it is completely determined by the population model $G$. However,
• The expected height and size of the sample genealogy, $T_{mrca}$ and $T_{tot}$, depend on both the population model $G$ and the sample configuration $\alpha$. In particular, they are strongly influenced by uneven sampling. Therefore, $T_{mrca}/T_{tot}$ and $1/T_{tot}$ are nuisance parameters because it would be very hard to decouple the effects of population structure from the effects of uneven sampling. The confounding of population and sample-specific information also makes it difficult to interpret PCA projections in terms of a (historic) demographic process [Novembre et al., 2008, McVean, 2009].
• The matrix $\mathbf{T} = (T_{ij} : \text{individuals } i, j)$ describes the expected genetic differentiation in the sample and has a block structure which depends on how many individuals, if any, we observe from each deme. On the other hand, $T = (T_{\alpha\beta} : \text{demes } \alpha, \beta)$ specifies how genetic variation increases with geographic distance for all pairs of demes, whether they are sampled from or not. Thus $T$ is a dissimilarity matrix that characterizes the entire population. Although $\mathbf{T}$ is a function of the sample configuration, it depends on $\alpha$ in a straightforward way:
$$\mathbf{T} = J T J' - \operatorname{diag}\{J T J'\}, \tag{3.5}$$
where $J \equiv J(\alpha) = (J_{i\alpha}) \in \mathbb{Z}^{n \times d}$ is an indicator matrix such that $J_{i\alpha} = 1$ if $i \in \alpha$ and 0 otherwise. We remove the diagonal because the coalescence time with self is always 0.
The demographic model $G$, which describes the population, determines the coalescent process and hence the expected pairwise coalescence times $T_{\alpha\beta}$ for all deme pairs $(\alpha, \beta)$. On the other hand, both the model $G$ and the configuration $\alpha$ determine the genealogical statistics $T_{mrca}$ and $T_{tot}$, which are generally not of interest as the goal is to estimate population-level features of $G$ (such as the migration rates between pairs of connected demes) while accounting for the sample-specific features of $\alpha$. In this thesis $G = (V, E, m)$ is always a population graph with equally sized demes $V$, undirected edges $E$ and effective migration rates $m : E \to \mathbb{R}^+$.
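Equation (3.5) is a purely mechanical mapping from the population matrix to the sample matrix, which the short sketch below makes concrete. The function name `sample_distance_matrix` and the convention that the configuration is a list of 0-based deme indices are assumptions made for this illustration.

```python
import numpy as np

def sample_distance_matrix(T, config):
    """Sample matrix J T J' - diag{J T J'} from equation (3.5)."""
    T = np.asarray(T, dtype=float)
    config = np.asarray(config)              # deme index of each individual
    n, d = config.size, T.shape[0]
    J = np.zeros((n, d))
    J[np.arange(n), config] = 1.0            # J[i, a] = 1 if i sampled in a
    JTJ = J @ T @ J.T
    return JTJ - np.diag(np.diag(JTJ))       # coalescence with self is 0

# Two demes; individuals 0 and 1 come from deme 0, individual 2 from deme 1.
T = [[2.0, 3.0], [3.0, 2.0]]
print(sample_distance_matrix(T, [0, 0, 1]))
```

Two distinct individuals from the same deme keep the within-deme time $T_{\alpha\alpha}$ as their off-diagonal entry; only the entries of an individual with itself are set to 0, which produces the block structure described above.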
We have shown that the expected mean and variance of a genotype vector are computable functions of the effective migration rates $m$. Next we derive similar expressions for the mean and the variance as functions of expected coalescence times in the case of diploid SNPs and microsatellites.
3.1.1 The case of diploid data
Since a diploid individual is the offspring of a pair of diploid parents, we can represent the genotype of a diploid as the sum of two haploids, each drawn randomly from the same location, i.e., $Z_i = Z_i^{(1)} + Z_i^{(2)} \in \{0, 1, 2\}$ where the superscript indicates one of two haplotypes. However, since we do not distinguish between the haplotype inherited from the mother and the haplotype inherited from the father, this assumption is reasonable only for autosomal SNPs (and not for sex-linked ones) in outbred individuals.
A sample $Z_1, \ldots, Z_n$ of $n$ diploid individuals is polymorphic if
$$\{Z_1, \ldots, Z_n : \text{at least one } Z_i \geq 1\} \iff \{Z_1^{(1)}, Z_1^{(2)}, \ldots, Z_n^{(1)}, Z_n^{(2)} : Z_i^{(1)} = 1 \text{ or } Z_i^{(2)} = 1\}. \tag{3.6}$$
That is, a segregating SNP in a diploid sample of size $n$ is equivalent to exactly one mutation in a haploid sample of size $2n$. [This excludes the possibility that all individuals carry the same allele, either ancestral or derived.]
Furthermore, at a segregating site in a diploid sample, the copies $Z_i^{(1)}$ and $Z_i^{(2)}$, which constitute $Z_i$, are not independent: the event that one carries the mutation but not the other is informative for the time to their most recent common ancestor. Therefore,
$$E_\ast\{Z_i\} = E_\ast\{Z_i^{(1)}\} + E_\ast\{Z_i^{(2)}\} = 2 E_\ast\{Z_i^{(1)}\} = 2\mu, \tag{3.7a}$$
$$\operatorname{var}_\ast\{Z_i\} = 2 \operatorname{var}_\ast\{Z_i^{(1)}\} + 2 \operatorname{cov}_\ast\{Z_i^{(1)}, Z_i^{(2)}\} = 4\sigma^2 - 2\gamma\sigma^2 T_{ii}, \tag{3.7b}$$
$$\operatorname{cov}_\ast\{Z_i, Z_j\} = 4 \operatorname{cov}_\ast\{Z_i^{(1)}, Z_j^{(1)}\} = 4\sigma^2 - 4\gamma\sigma^2 T_{ij}, \tag{3.7c}$$
where the symbol $\ast$ indicates the condition that there is exactly one mutation in a sample of $2n$ haplotypes [and $T_{ii}$ is the expected coalescence time for two distinct lineages with the same origin as individual $i$]. In matrix notation,
$$E_\ast\{Z\} = 2\mu \mathbf{1}, \qquad \operatorname{var}_\ast\{Z\} = 4\sigma^2 (\mathbf{1}\mathbf{1}' - \gamma \mathbf{T}_2), \tag{3.8}$$
where
$$\mathbf{T}_2 = J T J' - \tfrac{1}{2} \operatorname{diag}\{J T J'\}. \tag{3.9}$$
The subscript 2 indicates that the matrix of pairwise coalescence times corresponds to a diploid population. Here the mean does not depend on the location. (This is the case for haploid data as well.) However, the variance $\operatorname{var}_\ast\{Z_i\}$ can vary with location unless the demographic model implies $T_{\alpha\alpha} = T_0$ for all demes $\alpha$, i.e., isotropic migration.
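A small sketch can confirm that the matrix form (3.8)-(3.9) reproduces the entrywise expressions (3.7b) and (3.7c). The function names, the 0-based deme indices and the numerical values of $\sigma^2$ and $\gamma$ below are hypothetical and chosen only for illustration.

```python
import numpy as np

def diploid_distance_matrix(T, config):
    """Diploid sample matrix J T J' - (1/2) diag{J T J'}, equation (3.9)."""
    T = np.asarray(T, dtype=float)
    config = np.asarray(config)              # deme index of each individual
    JTJ = T[np.ix_(config, config)]          # same entries as J T J'
    return JTJ - 0.5 * np.diag(np.diag(JTJ))

def diploid_covariance(T, config, sigma2, gamma):
    """Covariance 4 sigma^2 (11' - gamma T_2) from equation (3.8)."""
    T2 = diploid_distance_matrix(T, config)
    n = T2.shape[0]
    return 4.0 * sigma2 * (np.ones((n, n)) - gamma * T2)

T = [[2.0, 3.0], [3.0, 2.0]]
config = [0, 1, 1]                           # one individual in deme 0, two in deme 1
V = diploid_covariance(T, config, sigma2=0.25, gamma=0.1)
print(V)
```

The diagonal of `V` equals $4\sigma^2 - 2\gamma\sigma^2 T_{ii}$ and the off-diagonal entries equal $4\sigma^2 - 4\gamma\sigma^2 T_{ij}$, so halving the diagonal in (3.9) is exactly what accounts for the two copies carried by one diploid individual.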
3.2 Mean and covariance of genotype vectors: microsatellites
Microsatellites (also called short tandem repeats) are repeating sequences of a particular short DNA segment. Mutation can increase or decrease the number of repeats, and each repeat count corresponds to an allele.
To model microsatellites, we assume that a locus $s$ evolves from its ancestral allele $A_s$ according to a symmetric stepwise mechanism where mutations occur with rate $\theta_s$ and each mutation increases or decreases the number of repeats by exactly one, with equal probability. Here we consider the evolution at a particular site, and for simplicity of notation, we omit the subscript $s$ in the rest of this section.
The ancestral allele $A$ and the mutation rate $\theta$ are unknown site-specific parameters while the genealogy $\mathcal{T}$ has a distribution determined by Kingman's coalescent. As we did for SNPs, we assume that the microsatellites are neutral and hence their genealogies are identically distributed. On the other hand, microsatellites are usually highly variable markers (i.e., with high mutation rates), so we cannot take the low-mutation limit.
Conditional on the mutation rate $\theta$ and the genealogical tree $\mathcal{T}$ of the sample, mutations occur independently and the number of mutations on a branch with length $t$ is a Poisson random variable with mean $\theta t$. This follows from the assumption that mutations are generated by a Poisson process with intensity [mutation rate] $\theta$. For example, the total number of mutations is
$$K_{tot} \mid \theta, \mathcal{T} \sim \operatorname{Po}(\theta t_{tot}), \tag{3.10}$$
while the number of mutations carried by individual $i$ is
$$K_i \mid \theta, \mathcal{T} \sim \operatorname{Po}(\theta t_{mrca}). \tag{3.11}$$
All lineages share the same Poisson mean parameter because every branch from a lineage to the most recent common ancestor of the entire sample has length $t_{mrca}$.
Let $\mathcal{K}$ denote the set of all mutations that occur in the genealogy, with $|\mathcal{K}| = K_{tot}$. Also, let $\mathcal{K}_i \subseteq \mathcal{K}$ denote the set of mutations carried by individual $i$, with $|\mathcal{K}_i| = K_i$. Since each mutation is equally likely to decrease or increase the allele length by 1, the $i$th allele is
$$Z_i = A + \sum_{k \in \mathcal{K}_i} \epsilon_k, \tag{3.12}$$
where $\epsilon_k = \pm 1$ with probability 1/2 and thus $E\{\epsilon_k\} = 0$ and $\operatorname{var}\{\epsilon_k\} = E\{\epsilon_k^2\} = 1$.
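First we derive the mean and variance of allele $Z_i$ given the mutation rate, the ancestral allele and the genealogy. The binary variables, $\epsilon_k$, are independent of the sample history, so $E\{\epsilon_k \mid \theta, A, \mathcal{T}\} = E\{\epsilon_k\}$ and $\operatorname{var}\{\epsilon_k \mid \theta, A, \mathcal{T}\} = \operatorname{var}\{\epsilon_k\}$. Furthermore, conditional on the number of mutations, the $\epsilon_k$s are mutually independent. Therefore,
$$E\{Z_i \mid \theta, A, \mathcal{T}\} = A + E\Big\{E\Big\{\sum_{k \in \mathcal{K}_i} \epsilon_k \,\Big|\, K_i\Big\}\Big\} = A + E\Big\{\sum_{k=1}^{K_i} E\{\epsilon_k\}\Big\} = A, \tag{3.13a}$$
$$\operatorname{var}\{Z_i \mid \theta, A, \mathcal{T}\} = E\Big\{\sum_{k=1}^{K_i} E\{\epsilon_k^2\}\Big\} + E\Big\{\sum_{k \neq k'} E\{\epsilon_k \epsilon_{k'}\}\Big\} = E\{K_i\} = \theta t_{mrca}, \tag{3.13b}$$
because $K_i$ is a Poisson random variable with mean $\theta t_{mrca}$ by equation (3.11). [Since the mutations are independent, $E\{\epsilon_k \epsilon_{k'}\} = E\{\epsilon_k\} E\{\epsilon_{k'}\} = 0$ for $k \neq k'$.]
Let $\mathcal{K}_{i \triangle j}$ be the set of mutations that occur in one lineage but not the other, with $|\mathcal{K}_{i \triangle j}| = K_{i \triangle j}$. Such mutations occur on the branch from $i$ to $mrca(i, j)$ or on the branch from $j$ to $mrca(i, j)$. Therefore, $K_{i \triangle j}$ has mean $2\theta t_{ij}$. Similarly, let $\mathcal{K}_{i \setminus j}$ be the set of mutations carried by $i$ but not $j$. Then
$$E\{(Z_i - Z_j)^2 \mid \theta, A, \mathcal{T}\} = E\Big\{\Big(\sum_{k \in \mathcal{K}_{i \setminus j}} \epsilon_k - \sum_{k \in \mathcal{K}_{j \setminus i}} \epsilon_k\Big)^2\Big\} = E\Big\{\sum_{k=1}^{K_{i \triangle j}} E\{\epsilon_k^2\}\Big\} = E\{K_{i \triangle j}\} = 2\theta t_{ij}, \tag{3.14a}$$
$$\operatorname{cov}\{Z_i, Z_j \mid \theta, A, \mathcal{T}\} = \operatorname{var}\{Z_i \mid \theta, A, \mathcal{T}\} - \tfrac{1}{2} E\{(Z_i - Z_j)^2 \mid \theta, A, \mathcal{T}\} = \theta t_{mrca} - \theta t_{ij}. \tag{3.14b}$$
[Again, the cross terms in (3.14a) are 0 by mutual independence.]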
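The conditional moments (3.13b), (3.14a) and (3.14b) can be checked by simulation on a fixed genealogy. The sketch below is a Monte Carlo illustration only: the function name `simulate_pair`, the branch lengths and the mutation rate are hypothetical, and the net effect of the $\pm 1$ steps on a branch is drawn as the difference of two independent Poisson counts, which is equivalent to assigning each mutation a random sign.

```python
import numpy as np

def simulate_pair(theta, t_mrca, t_ij, ancestral, n_rep, seed=0):
    """Simulate alleles of two individuals under the stepwise model."""
    rng = np.random.default_rng(seed)

    def net_steps(length):
        # +1 and -1 mutations arrive as independent Poisson streams,
        # each with half of the total intensity theta * length.
        lam = theta * length / 2.0
        return rng.poisson(lam, n_rep) - rng.poisson(lam, n_rep)

    shared = net_steps(t_mrca - t_ij)        # branch above mrca(i, j)
    z_i = ancestral + shared + net_steps(t_ij)
    z_j = ancestral + shared + net_steps(t_ij)
    return z_i, z_j

theta, t_mrca, t_ij = 2.0, 3.0, 1.0
z_i, z_j = simulate_pair(theta, t_mrca, t_ij, ancestral=10, n_rep=200_000)
cov = np.cov(z_i, z_j)
print("var:", cov[0, 0], "expected", theta * t_mrca)
print("cov:", cov[0, 1], "expected", theta * (t_mrca - t_ij))
print("E(Zi - Zj)^2:", np.mean((z_i - z_j) ** 2), "expected", 2 * theta * t_ij)
```

The simulated values should be close to $\theta t_{mrca}$, $\theta (t_{mrca} - t_{ij})$ and $2\theta t_{ij}$ respectively, with Monte Carlo error that shrinks as the number of replicates grows.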
Now we have expressions for the mean, variance and covariance of the genotypes at a particular microsatellite, given the site-specific mutation rate $\theta$, ancestral allele $A$ and genealogy $\mathcal{T}$. We treat $\theta$ and $A$ as nuisance parameters to be estimated and we marginalize the genealogy out. The goal is to express the model in terms of the expected coalescence times rather than the coalescence times at a particular site. We took the same approach for SNP data, but in that case $A = 0$ for every segregating site and $\theta$ is eliminated in the small mutation limit $\theta \to 0$. Finally,
$$E\{Z_i \mid \theta, A\} = E\{A \mid \theta, A\} = A, \tag{3.15a}$$
$$\operatorname{var}\{Z_i \mid \theta, A\} = E\{\theta t_{mrca} \mid \theta, A\} + \operatorname{var}\{A \mid \theta, A\} = \theta T_{mrca}, \tag{3.15b}$$
$$\operatorname{cov}\{Z_i, Z_j \mid \theta, A\} = E\{\theta t_{mrca} - \theta t_{ij} \mid \theta, A\} + \operatorname{var}\{A \mid \theta, A\} = \theta (T_{mrca} - T_{ij}). \tag{3.15c}$$
[These steps use $E\{X\} = E\{E\{X \mid Y\}\}$, $\operatorname{var}\{X\} = E\{\operatorname{var}\{X \mid Y\}\} + \operatorname{var}\{E\{X \mid Y\}\}$, and $\operatorname{cov}\{X, Y\} = E\{\operatorname{cov}\{X, Y \mid W\}\} + \operatorname{cov}\{E\{X \mid W\}, E\{Y \mid W\}\}$.]
In the case of microsatellites, we do not condition on observing variability in the sample, i.e., on the event $\{K_{tot} > 0\}$, as microsatellites have higher mutation rates and we can estimate the parameter rather than take its limit to 0. For SNPs such that we observe exactly one mutation at every site, the "variability" condition is explicitly modeled because it modifies the genealogy distribution. Intuitively, it "stretches" the tree and thus changes (proportionally) all branches $t \in \mathcal{T}$.
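Therefore, the genotype vector of $n$ sampled individuals at a particular microsatellite has mean and variance
$$E_\ast\{Z\} = \mu \mathbf{1}, \qquad \operatorname{var}_\ast\{Z\} = \sigma^2 (\mathbf{1}\mathbf{1}' - \gamma \mathbf{T}), \tag{3.16}$$
where the symbol $\ast$ indicates conditioning on the ancestral allele $A$ and the mutation rate $\theta$, and the parameters are given by
$$\mu = A, \qquad \sigma^2 = \theta T_{mrca}, \qquad \gamma = \frac{1}{T_{mrca}}. \tag{3.17}$$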
As for SNP data, the mean and the variance of genotypes at a particular locus do not depend on the origin of an individual. However, for microsatellite data, the mean and the variance vary across sites because the ancestral allele $A$ and the mutation rate $\theta$ are both site-specific parameters. On the other hand, the scale $\gamma$ is shared across sites and therefore every site has the same correlation matrix $\Sigma \equiv \mathbf{1}\mathbf{1}' - \gamma \mathbf{T}$.
With this parametrization, the demographic parameters are estimable up to a proportionality constant. If we multiply the migration and coalescence rates by 2, we speed up the structured coalescent process by a factor of 2, and hence we decrease the expected coalescence times by a factor of 2. However, the correlation matrix $\Sigma$ remains unchanged because the dissimilarity matrix $\mathbf{T}$ is appropriately scaled.
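The scaling argument can be written out in one line from equation (2.15). In the sketch below, $c > 0$ is an arbitrary rescaling constant introduced only for this illustration (the text uses $c = 2$), and the second line assumes the microsatellite parametrization (3.17), in which $\gamma = 1/T_{mrca}$ and $T_{mrca}$ rescales like every other expected coalescence time.

```latex
% c > 0 is a hypothetical rescaling constant (c = 2 in the text).
\[
\operatorname{diag}\{cq\}\,\operatorname{diag}\{T/c\} - (cM)(T/c) - (T/c)(cM)'
  = \operatorname{diag}\{q\}\,\operatorname{diag}\{T\} - MT - TM'
  = \mathbf{1}\mathbf{1}' ,
\]
% so T/c solves the system with rates (cM, cq); with gamma = 1/T_mrca,
\[
\gamma\,\mathbf{T} \;\longmapsto\; (c\gamma)\,(\mathbf{T}/c) = \gamma\,\mathbf{T},
\qquad\text{hence}\qquad
\Sigma = \mathbf{1}\mathbf{1}' - \gamma\,\mathbf{T} \ \text{is unchanged.}
\]
```

In other words, only the ratios between rates are identifiable from $\Sigma$, which is why the demographic parameters are estimable only up to a proportionality constant.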
3.3 Effective migration can explain spatial structure in genetic variation
In the previous section, we discussed how to specify the mapping from the stepping-stone model ๐บ = (๐, ๐ธ, ๐) to the genetic covariance matrix cor{๐} = ฮฃ, for bothSNP and microsatellite data. Briefly, we followed three steps. First, ๐บ = (๐, ๐ธ, ๐)determines ๐ = (๐๐ผ๐ฝ) through the system of linear equations (2.15). en, in turn,the expected coalescence times between demes, ๐, determine the expected coalescencetimes between sampled individuals, ๐, through equation (3.5). Finally, the distancematrix determines the correlation matrix ฮฃ = 11โฒ โ ๐๐ by equation (3.3b) where ๐ isan appropriately chosen scalar parameter that guarantees ฮฃ is positive definite.
Our goal is to estimate the effective migration rates $M$ across the habitat; these are sample-independent (population-level) features of the population graph $G$. The mean $\mu$ and the variance $\sigma^2$ of derived alleles as well as the scale factor $\kappa$ of expected coalescence times can be treated as nuisance parameters because they are sample-dependent and shared by all individuals in the sample. For example, for haploid SNPs the overall mean is $\mu = T_{mrca}/T_{tot}$ [with $\sigma^2 = \mu(1 - \mu)$] and the scale factor is $\kappa = 1/T_{tot}$, so $(\mu, \sigma^2, \kappa)$ contain some information about $G$. Although the scalars $T_{tot}$ and $T_{mrca}$ are, formally, functions of the effective migration rates $M$, they are very difficult to compute.
On the other hand, the matrix $T = (T_{ij})$ of pairwise coalescence times is a computable function of $M$. This matrix is also a pairwise dissimilarity (distance) matrix [and formally, a semivariogram]: the more genetically dissimilar two individuals are, the longer the time to their most recent common ancestor because the probability that the branch $T_{ij}$ accumulates a mutation is proportional to its relative length in the average genealogy tree. The property that $T$ is a distance matrix is important because it can explain genetic dissimilarities (correlations) as a linear function of distances between locations. Expected coalescence time is a particular choice of distance metric motivated by coalescent theory [McVean, 2009]. We can consider other metrics such as effective resistance [McRae, 2006].
$$
\begin{aligned}
& \mathcal{M} \equiv \left\{ \begin{array}{l} M = \{\text{migration rates } m(e)\} \\ C = \{\text{conductances } c(e)\} \end{array} : \forall e \in E \right\}, && \mathcal{M} \in \mathbb{S}_d \text{ is a symmetric matrix of weights;} \\
\xrightarrow{(1)}\ & \Delta \equiv \left\{ \begin{array}{l} T = \{\text{coalescence times } T_{\alpha\beta}\} \\ R = \{\text{effective resistances } R_{\alpha\beta}\} \end{array} : \forall (\alpha, \beta) \in V \times V \right\}, && \Delta \in \mathbb{D}_d \text{ is the population distance matrix;} \\
\xrightarrow{(2)}\ & \Delta \equiv \left\{ \begin{array}{l} T = \{\text{coalescence times } T_{ij}\} \\ R = \{\text{effective resistances } R_{ij}\} \end{array} : \text{all sampled pairs } (i, j) \right\}, && \Delta \in \mathbb{D}_n \text{ is the sample distance matrix;} \\
\xrightarrow{(3)}\ & \Sigma \equiv 11' - \kappa\Delta, && \Sigma \in \mathbb{S}_n \text{ is the sample covariance matrix.}
\end{aligned}
$$
The first step, denoted by (1), is to compute all $d(d + 1)/2$ pairwise distances between $d$ demes. This operation is expensive even for medium-size grids. However, the covariance matrix $\Sigma$ is a function of the sample distance matrix $\Delta = (\Delta_{ij})$, not the population distance matrix $\Delta = (\Delta_{\alpha\beta})$. That is, in principle, we could avoid computing the full $d \times d$ dissimilarity matrix, especially for sparsely sampled habitats. [This is the advantage of $R$ over $T$.]
In a certain sense, $T$ is an "appropriate" dissimilarity measure for population structure as genetically similar individuals are likely to have a recent common ancestor and thus shorter coalescence time. For the stepping-stone model we can obtain the matrix of pairwise coalescence times $T$ exactly or approximate it with the matrix of effective resistances, $R$. However, the stepping-stone model itself does not represent the true history of the population: the grid is placed arbitrarily and there are underlying assumptions, including equilibrium in time, low mutation rate and no selection. Therefore, in a manner similar to McRae's definition of the effective migration rate, $m_{\alpha\beta}$, for a pair of demes, we should interpret the migration rate function $M = \{m_{\alpha\beta} : (\alpha, \beta) \in V \times V\}$ as an effective migration surface because it would produce the observed patterns of genetic differentiation if the population were evolving under the stepping-stone model.
3.4 Related methods for analyzing population structure
We have shown that genetic correlations can be modeled in terms of a distance matrix. This representation is motivated by the relationship between genetic similarities and expected coalescence times. However, we can consider other distance metrics (on the population graph) as long as they capture relevant features of a spatially heterogeneous habitat, and effective resistance is particularly useful because it approximates the coalescent-based metric and is efficient to compute.
Here we discuss briefly two related methods for analyzing spatially distributed pop-ulations.
3.4.1 MIGRATE
[Beerli and Felsenstein, 2001] develop an approach to estimate migration rates among demes, and more generally, to compare and rank structured population models. Their method MIGRATE is also based on the structured coalescent but it makes different assumptions about the spatial distribution and the migration pattern.
In MIGRATE the demes are sampling locations and all demes potentially exchange migrants, so the population graph is constructed without explicit geographic information. [Some edges can be excluded to test and compare various migration patterns.] Every deme in the resulting graph has a size parameter and every edge has two migration parameters. [MIGRATE allows asymmetric gene flow.] Thus for a graph with $d$ demes, the most complex model to test has $d(d - 1)$ migration rates and $d$ deme sizes.
In contrast, our method uses a regular triangular grid constructed independently of the sampling configuration [or an a priori grouping of individuals into subpopulations]. Migration is symmetric and constrained to occur only between neighboring demes but not all demes need to be sampled. Edges are grouped via a Voronoi tessellation of the habitat to encourage parameter sharing and locally constant migration. This representation is flexible and the number of (unique) migration rates varies with the number of tiles.
[Margin note: A Voronoi tessellation of a Euclidean space is a partition into $T$ convex polygons (tiles) generated by $T$ distinct points (centers). The region associated with the $t$th center $u$ is the set of points closer to $u$ than any other center. Boundary points are equidistant to two centers [Okabe et al., 2000].]
3.4.2 GENELAND
[Guillot et al., 2005] also use Voronoi tiling to model the spatial structure in genetic variation but their method GENELAND is cluster-based and thus best suited to analyze discrete structure. Since individuals sampled from geographically close locations are more likely to come from the same subpopulation, GENELAND attempts to find clusters that are both genetically and geographically coherent. Compared with a spatial representation in terms of a population graph, such clusters can correspond to single demes in the graph (e.g., if migration is low and even demes close in space are clearly differentiated); or they can correspond to groups of demes where allele frequency distributions are indistinguishable (e.g., if gene flow is high so that a mutation that arises in one deme can quickly "spread" to nearby locations).
4
Estimating Effective Rates of Migration
In this chapter we introduce a likelihood function and prior distributions to perform Bayesian inference for the effective migration surface $M$ based on the similarities observed in georeferenced genetic data. The posterior estimate of $M$ can represent graphically population-level features such as barriers to migration, or more generally, the combined effect of evolutionary processes on genetic differentiation.
Our method assumes that we have data for $n$ individuals sampled from a spatially distributed population at locations $(x_1, y_1), \ldots, (x_n, y_n)$ and genotyped at $p$ loci, either SNPs or microsatellites. The geographic information is used to assign individuals to the closest deme in the population graph $(V, E)$; this defines the sample configuration $\alpha = (\alpha_1, \ldots, \alpha_n)$. Given $G = (V, E, M)$ with symmetric migration rates $M = (m_{\alpha\beta})$ we can compute the pairwise distance matrix for the entire population, $\Delta = (\Delta_{\alpha\beta})$; given $\Delta$ and the deme indicators $\alpha$ we can obtain the expected pairwise distances for the observed sample, $\Delta = (\Delta_{ij})$. Notation: Here we discuss the likelihood of the sample, so we will write simply $\Delta$ throughout as there is no need to distinguish between the population and the sample distance matrices.
In the previous chapter we derived expressions for the mean and variance of the allele count vector $Z = (Z_i)$ at a segregating site [eq. (3.3) for single nucleotide polymorphisms; eq. (3.16) for microsatellites]. Recall that
$$\mathrm{E}\{Z\} = \mu 1, \qquad \mathrm{var}\{Z\} = \sigma^2 (11' - \kappa\Delta), \tag{4.1}$$
where $\mu$ is the allele frequency and $\sigma^2$ is the variance in allele frequency [in the sample, not the population]. It is convenient to normalize $\Delta$ so that $1'\Delta^{-1}1 = 1$; then the correlation matrix $\Sigma = 11' - \kappa\Delta$ is positive definite for $\kappa \in (0, 1)$ [Appendix 7.2].
Recall further that neutral sites (not under selection) develop under the same coalescent process, and therefore, the genotype vectors $Z = (Z_1, \ldots, Z_p) \in \mathbb{Z}^{n \times p}$ at $p$ segregating sites have the same correlation matrix $\Sigma$. The scalar parameters $\mu, \sigma^2$ can vary across sites. For microsatellites $\mu$ is the ancestral allele and $\sigma^2$ depends on the mutation rate $\theta$, and both are site specific. For SNPs $\mu$ is the expected allele frequency if the derived allele is coded as 1; but the labels might not be consistent as usually the minor allele is coded as 1.
Our aim here is to incorporate these expressions for the mean and variance into a likelihood function in order to infer effective migration rates from observed data. Note that every individual has mean $\mu$ regardless of location; intuitively, the shared parameter $\mu$ contains little information about patterns of genetic differentiation between individuals, as we discuss in Section 4.4. So, to simplify, assume that we observe the pairwise differences, $Z_i - Z_j$, rather than the allele counts $Z_i$. Equivalently, assume that we observe $LZ$ where $L \in \mathbb{R}^{(n-1) \times n}$ is a basis for contrasts, e.g., $L = (e_2 - e_1, e_3 - e_1, \ldots, e_n - e_1)'$ where $e_i$ is the standard basis vector with 1 in the $i$th coordinate and 0 otherwise. Note that
$$\mathrm{E}\{LZ\} = 0, \qquad \mathrm{var}\{LZ\} = -\sigma^* L\Delta L', \tag{4.2}$$
where we define $\sigma^* = \kappa\sigma^2$ because the variance and the scale are no longer separately identifiable. The matrix $-L\Delta L'$ is positive definite, and thus a valid covariance matrix, because the distance matrix $\Delta$ is negative definite on contrasts and $L'v$ is a contrast for every nonzero $v \in \mathbb{R}^{n-1}$.
Therefore, it might be natural to assume a Normal likelihood for the pairwise differences,
$$LZ \mid \sigma^*, \Delta \sim \mathrm{N}_{n-1}(0, -\sigma^* L\Delta L'). \tag{4.3}$$
Suppose further that the genotyped markers are independent; then it is straightforward to extend the Normal likelihood (4.3) for one locus to multiple loci. In particular, for SNP data where usually there are many more SNPs than individuals and mutation rates are low, let $S = ZZ'/p$ be the observed similarity matrix averaged across $p$ SNPs. Then $LSL'$ is a scatter matrix of pairwise differences and
$$LSL' \mid \sigma^*, \Delta \sim \mathrm{W}_{n-1}\left(p, -\frac{\sigma^*}{p}(L\Delta L')\right), \tag{4.4}$$
where the degrees of freedom are the number of independent SNPs and the scale parameter $\sigma^*$ is shared. Therefore, by considering the pairwise differences, we avoid estimating a nuisance parameter $\mu$ with dimensionality that grows with the number of markers $p$. In practice we also gain efficiency with faster MCMC convergence.
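As an illustration of the contrast-based Wishart likelihood described above, the following minimal sketch evaluates the log density of $LSL'$ for a given distance matrix. It is not the thesis implementation: the function names, the toy distance matrix (Euclidean distances between random points) and the Gaussian toy data are assumptions made only for this example.

```python
import math
import numpy as np

def contrast_basis(n):
    """Rows e_{i+1} - e_1, i = 1..n-1: an (n-1) x n basis for contrasts."""
    L = np.zeros((n - 1, n))
    L[:, 0] = -1.0
    L[np.arange(n - 1), np.arange(1, n)] = 1.0
    return L

def log_multigamma(a, d):
    """Log of the multivariate gamma function Gamma_d(a)."""
    return d * (d - 1) / 4 * math.log(math.pi) + sum(
        math.lgamma(a - j / 2) for j in range(d))

def wishart_loglik(S, Delta, sigma_star, k):
    """Log density of L S L' under W_{n-1}(k, -(sigma_star / k) L Delta L')."""
    n = S.shape[0]
    d = n - 1
    L = contrast_basis(n)
    X = L @ S @ L.T                               # scatter matrix of contrasts
    W = -(sigma_star / k) * (L @ Delta @ L.T)     # Wishart scale matrix
    _, logdet_X = np.linalg.slogdet(X)
    _, logdet_W = np.linalg.slogdet(W)
    return ((k - d - 1) / 2 * logdet_X
            - np.trace(np.linalg.solve(W, X)) / 2
            - k * d / 2 * math.log(2)
            - k / 2 * logdet_W
            - log_multigamma(k / 2, d))

# Toy example (hypothetical data, not from the thesis).
rng = np.random.default_rng(1)
n, p = 4, 50
xy = rng.uniform(size=(n, 2))
Delta = np.linalg.norm(xy[:, None] - xy[None, :], axis=2)   # toy distance matrix
Z = rng.normal(size=(n, p))
S = Z @ Z.T / p
print(wishart_loglik(S, Delta, sigma_star=1.0, k=p))
```

Setting `k=p` corresponds to treating all sites as independent; a smaller `k` leaves the expected scatter matrix unchanged and only inflates the model variance.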
4.1 Effective degrees of freedom for SNP data
So far we have considered the case where the $p$ genotyped markers are independent (unlinked). The assumption of independence between loci is very strong and likely to be violated. In particular, SNPs in close proximity are often associated (in linkage disequilibrium) because individuals inherit long segments of unbroken DNA from their parents. For this reason, SNP data are often "thinned" by removing SNPs in high LD. We propose an alternative method to correct for model mis-specification due to both dependence between SNPs and non-normality of genotypes.
In the Wishart likelihood (4.4) the scatter matrix of contrasts, $LSL'$, has known degrees of freedom $p$. However, instead of fixing the degrees of freedom to the number of genotyped SNPs, we can estimate this parameter. The likelihood for the scatter matrix becomes
$$LSL' \mid k, \sigma^*, \Delta \sim \mathrm{W}_{n-1}\left(k, -\frac{\sigma^*}{k}(L\Delta L')\right), \tag{4.5}$$
with degrees of freedom $k \in (n, p)$. Both Wishart likelihoods (4.4) and (4.5) imply $\mathrm{E}\{LSL'\} = -\sigma^* L\Delta L'$. Therefore, estimating the degrees of freedom does not affect the expected pairwise differences as a function of effective migration. However, the Wishart variance is proportional to $(\sigma^*)^2/k$, so if we infer $k \in (n, p)$ rather than set $k = p$, the model variance increases, as we would expect if the data contain less information than the sample size suggests, or more generally, if the model is mis-specified. Under normality, $k = p$ implies that all sites are independent; otherwise, the variance increases by a factor of $p/k$.
4.2 Prior on migration surface represented as a Voronoi tessellation
We have proposed a model for population structure in terms of expected pairwise distances on a population graph $G = (V, E, M)$ where $(V, E)$ is a regular grid and $M$ assigns effective migration rates to edges in the graph. The goal is to estimate the effective migration surface $M$ so that the demographic model $G$ explains the observed genetic dissimilarities. The grid is fixed; the likelihood is defined in the previous section. Here we consider prior specification for $M$.
The regular grid $(V, E)$ is not determined by the sampling locations and it yields a high-dimensional, flexible representation so that fine features in the effective migration surface can emerge if supported by the data. To take advantage of this flexibility, we organize the edges in terms of a Voronoi tessellation of the habitat. Statistically, the Voronoi decomposition offers the advantages of parameter sharing and a locally smooth migration surface. Previous applications of Voronoi tiling in population genetics include [Guillot et al., 2005] and [Wasser et al., 2004].
A Voronoi tessellation of the migration surface $M$ is fully specified by the number of tiles $T$, their locations $u$ and migration rates $m$. Thus $m = \{m_t : t = 1, \ldots, T\}$ is the set of effective migration rates for the $T$ tiles in the partition. Furthermore, let edge $(\alpha, \beta) \in E$ have migration rate
$$m_{\alpha\beta} = \tfrac{1}{2} m_{t_\alpha} + \tfrac{1}{2} m_{t_\beta}, \tag{4.6}$$
where $t_\alpha$ denotes the tile deme $\alpha$ falls into. That is, the rate of an edge is the average rate of the two tiles it connects.
Migration rates are naturally positive and therefore we parametrize them on the log scale as differences from the overall mean rate $\bar{m}$,
$$\log_{10}(m_t) = \bar{m} + e_t. \tag{4.7}$$
If the effect of distance on differentiation is space-homogeneous and the tile-specific effects $e_t$ are (close to) 0, the migration pattern thus produced would correspond to isolation by distance.
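The mapping from a Voronoi tiling to edge rates, equations (4.6) and (4.7), can be sketched as follows. This is an illustrative sketch rather than the thesis code: the function name, the argument layout (deme coordinates as an array, edges as index pairs) and the toy values are assumptions.

```python
import numpy as np

def edge_rates(deme_xy, edges, centers, effects, mean_log10_rate):
    """Edge migration rates from a Voronoi tiling, following (4.6) and (4.7)."""
    # Tile rates on the original scale: log10(m_t) = mean + e_t.
    tile_rates = 10.0 ** (mean_log10_rate + np.asarray(effects, dtype=float))
    # Each deme falls into the tile whose center is closest.
    d2 = ((deme_xy[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    tile_of_deme = d2.argmin(axis=1)
    a, b = np.asarray(edges).T
    # The rate of an edge is the average rate of the two tiles it connects.
    return 0.5 * tile_rates[tile_of_deme[a]] + 0.5 * tile_rates[tile_of_deme[b]]

# Toy example: four demes on a line, two tiles (hypothetical values).
deme_xy = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
edges = [(0, 1), (1, 2), (2, 3)]
centers = np.array([[0.5, 0.0], [2.5, 0.0]])
rates = edge_rates(deme_xy, edges, centers, effects=[0.5, -0.5], mean_log10_rate=0.0)
```

With all tile effects equal to 0 every edge receives the same rate, which is the isolation-by-distance case mentioned above.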
Therefore, our model has the following parameters:
1. parameters of interest $\Theta_1$ that determine the effective migration rates $M$ and thus the effective pairwise distances $\Delta$. These are
• $(T, \bar{m}, \omega_m^2)$: number of tiles, mean and variance of tile migration rates on the log (base 10) scale.
• $\{(e_t, u_t) : t = 1, \ldots, T\}$: relative effect and center location for each Voronoi tile $t$. The dimensionality of this group of parameters changes with the number of Voronoi tiles $T$.
• $k$: effective degrees of freedom for SNP data where we observe more sites than individuals, i.e., $p > n$.
2. nuisance parameters $\Theta_0$ that do not depend on the demographic model. For SNP data this is the scale parameter $\sigma^*$; for microsatellite data each site has its own scale parameter $\sigma^*_j$ because mutation rates vary across sites and under the stepwise mutation model the scale $\sigma^*_j$ is the mutation rate $\theta_j$.
Using the Voronoi tessellation $\mathcal{V}(T, u, m)$ to represent $M$, we can have fewer than $|E|$ rate parameters to estimate but we do not know how many tiles we need and where their centers are. This depends on the patterns of genetic differentiation across the habitat.
To complete the Bayesian specification we place priors on the model parameters:
$$
\begin{aligned}
\text{(number of Voronoi tiles)} \quad & T \mid \lambda \sim \mathrm{Po}(\lambda), && \text{(4.8a)} \\
\text{(tile locations)} \quad & u \mid T \overset{iid}{\sim} \mathrm{U}(\mathcal{H}), && \text{(4.8b)} \\
\text{(tile effects)} \quad & e \mid \omega_m^2, T \overset{iid}{\sim} \mathrm{N}(0, \omega_m^2). && \text{(4.8c)}
\end{aligned}
$$
The hyperparameter $\lambda$ controls how much spatial heterogeneity the effective migration surface exhibits. The rate hyperparameters are
$$
\begin{aligned}
\text{(overall migration rate)} \quad & \bar{m} \sim \mathrm{U}(lob, upb), && \text{(4.9a)} \\
\text{(tile variance)} \quad & \omega_m^2 \sim \text{Inv-G}(a/2, b/2). && \text{(4.9b)}
\end{aligned}
$$
[For all results we report here $a = 6$, $b = 3$.] The lower and upper bounds on the mean log rate are chosen so that the mean migration rate varies in the range $[1/300, 300]$ on the original scale. The bounds are somewhat arbitrary but based on simulations of genetic data with ms [Hudson, 2002]. Restricting the support is necessary because the model is not numerically stable at the two extremes:
• When migration rates are very small (relative to coalescence rates), it takes a very long time on average for two lineages from different demes to coalesce. In the limit, the population is a collection of unrelated subpopulations that evolve independently.
• When migration rates are very large (relative to coalescence rates), the time it takes to move from one deme to another is negligible compared to the coalescence times. In the limit, the population behaves like a panmictic population without any structure.
The prior on the effective degrees of freedom is uniform on the log scale:
$$\text{(degrees of freedom)} \quad \pi(k) \propto \frac{1}{k}. \tag{4.10}$$
The prior is proper because $k$ is bounded: $k > n$ because $k$ is the degrees of freedom parameter in a Wishart distribution, and $k < p$ because $k$ should not exceed the number of observed sites (features). [The normalizing constant is therefore $\log(p) - \log(n)$.]
$$\text{(scale parameter)} \quad \sigma^* \sim \text{Inv-G}(c/2, d/2). \tag{4.11}$$
[For all results we report here $c = 1$, $d = 1$.]
We use reversible-jump MCMC to estimate $M$ as the dimension of both $u$ and $e$ changes as the number of tiles $T$ increases or decreases. Full details about the MCMC implementation are given in Appendix 7.6.
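To make the prior on the migration surface concrete, the sketch below draws one tessellation from priors of the form (4.8) and (4.9). Several choices are assumptions of this example and not statements about the thesis implementation: the habitat is taken to be a rectangle, a draw of zero tiles is rejected and redrawn, and the inverse gamma is read as having shape and scale parameters. All names are hypothetical.

```python
import numpy as np

def sample_prior(rng, lam, habitat, lob, upb, shape, scale):
    """Draw tile centers and log10 tile rates from the priors (4.8)-(4.9).

    habitat = (xmin, xmax, ymin, ymax) is a rectangle (assumption of this sketch).
    """
    n_tiles = 0
    while n_tiles == 0:                  # assumption: require at least one tile
        n_tiles = rng.poisson(lam)
    xmin, xmax, ymin, ymax = habitat
    centers = np.column_stack([rng.uniform(xmin, xmax, n_tiles),
                               rng.uniform(ymin, ymax, n_tiles)])
    # Inverse gamma with shape = shape/2 and scale = scale/2 (assumed reading):
    # the reciprocal of a Gamma draw with the same shape and rate = scale/2.
    tile_var = 1.0 / rng.gamma(shape / 2, 2.0 / scale)
    effects = rng.normal(0.0, np.sqrt(tile_var), n_tiles)
    mean_log10 = rng.uniform(lob, upb)
    return centers, mean_log10 + effects

rng = np.random.default_rng(0)
bound = np.log10(300.0)                  # mean rate in [1/300, 300] on the original scale
centers, log10_rates = sample_prior(rng, lam=5, habitat=(0, 12, 0, 8),
                                    lob=-bound, upb=bound, shape=6, scale=3)
```

Larger values of `lam` favor tessellations with more tiles, i.e., more spatially heterogeneous surfaces.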
4.3 Likelihood for distance matrices
The Wishart likelihood (4.5) is given in terms of the contrast basis $L$ but it does not depend on the choice of $L$. Instead, it can be written in terms of the observed similarities $S = ZZ'/p$, the model distances $\Delta$ and its orthogonal projection $Q$ given by
$$Q = I - \frac{11'\Delta^{-1}}{1'\Delta^{-1}1}. \tag{4.12}$$
In Appendix 7.4 we show that the Wishart log likelihood that corresponds to the model (4.5) can be written as
$$
\begin{aligned}
\ell(k, \sigma^*, \Delta) ={}& \tfrac{k}{2} \log\det\{-(L\Delta L')^{-1}/\sigma^*\} - \tfrac{k}{2}\,\mathrm{tr}\{-(L\Delta L')^{-1} LSL'/\sigma^*\} \\
& + \tfrac{k-n}{2} \log\det\{LSL'\} + \tfrac{k(n-1)}{2} \log(k/2) - \log\Gamma_{n-1}(k/2) \\
={}& \tfrac{k}{2} \log\mathrm{Det}\{-\Delta^{-1}Q/\sigma^*\} - \tfrac{k}{2}\,\mathrm{tr}\{-(\Delta^{-1}Q)S/\sigma^*\} \\
& + \tfrac{k-n}{2} \log\det\{S\} + \tfrac{k(n-1)}{2} \log(k/2) - \log\Gamma_{n-1}(k/2) \\
& + \tfrac{k-n}{2} \log\left(\frac{1'S^{-1}1}{1'1}\right) - \tfrac{n}{2} \log\det\{LL'\}
\end{aligned}
\tag{4.13}
$$
[Here Det denotes the product of the nonzero eigenvalues, since $\Delta^{-1}Q$ is singular.] The only term that involves the residual basis $L$ is $(n/2) \log\det\{LL'\}$. Regardless of the choice for $L$, this term does not depend on the parameters $(k, \sigma^*, \Delta)$. Full details about the likelihood computation are given in Appendix 7.5.
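The claim that the likelihood does not depend on the contrast basis rests on the identity $L'(L\Delta L')^{-1}L = \Delta^{-1}Q$. The short numerical check below illustrates it for two different bases; the toy distance matrix (Euclidean distances between random points) and all variable names are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
xy = rng.uniform(size=(n, 2))
Delta = np.linalg.norm(xy[:, None] - xy[None, :], axis=2)   # toy distance matrix

one = np.ones((n, 1))
Dinv = np.linalg.inv(Delta)
Q = np.eye(n) - one @ one.T @ Dinv / (one.T @ Dinv @ one)    # projection (4.12)

def via_basis(L):
    """L' (L Delta L')^{-1} L for a given contrast basis L."""
    return L.T @ np.linalg.inv(L @ Delta @ L.T) @ L

L1 = np.hstack([-np.ones((n - 1, 1)), np.eye(n - 1)])        # rows e_i - e_1
L2 = rng.normal(size=(n - 1, n - 1)) @ L1                    # another contrast basis

same_across_bases = np.allclose(via_basis(L1), via_basis(L2), atol=1e-6)
equals_projection = np.allclose(via_basis(L1), Dinv @ Q, atol=1e-6)
```

Both flags are expected to be true: any two contrast bases differ by an invertible transformation, which cancels inside the product.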
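4.3.1 Related model
This is the marginal likelihood for distance matrices introduced in [McCullagh, 2009]. Let $D$ be the $n \times n$ pairwise dissimilarity matrix given by
$$D = 1\,\mathrm{diag}(S)' + \mathrm{diag}(S)\,1' - 2S. \tag{4.14}$$
The orthogonal projection $Q = I - 11'\Sigma^{-1}/(1'\Sigma^{-1}1)$ satisfies
$$Q'\Sigma^{-1} = Q'\Sigma^{-1}Q = -\kappa^{-1} Q'\Delta^{-1}Q = -\kappa^{-1} \Delta^{-1}Q \tag{4.15}$$
since $\ker\{Q\} = \{1\}$ and thus $Q1 = 0$. Similarly, $QDQ' = -2QSQ'$. Therefore, for fixed $k = p$ and after we ignore all terms that do not involve $\Delta$ or $\sigma^*$,
$$
\begin{aligned}
\ell(\sigma^*, \Delta; S) = \ell(\sigma^2, \Sigma; D) &= \text{const} + \tfrac{k}{2} \log\mathrm{Det}\{-\Delta^{-1}Q/\sigma^*\} - \tfrac{k}{2}\,\mathrm{tr}\{-\Delta^{-1}QS/\sigma^*\} && \text{(4.16a)} \\
&= \text{const} + \tfrac{k}{2} \log\mathrm{Det}\{\Sigma^{-1}Q/\sigma^2\} + \tfrac{k}{4}\,\mathrm{tr}\{\Sigma^{-1}QD/\sigma^2\} && \text{(4.16b)}
\end{aligned}
$$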
where $\sigma^* = \kappa\sigma^2$.
Recently, [Hanks and Hooten, 2013] build this likelihood into a parametric model for isolation by resistance [McRae, 2006]. Briefly, the genetic data is generated by a Gaussian Markov random field on an undirected graph (circuit). The covariance structure is given by an intrinsic conditional autoregressive model, i.e., the conditional distribution of each node given the rest of the graph is normal with mean and variance that depend on its first-degree neighbors only. [Hanks and Hooten, 2013] specify the model so that the expected squared differences between nodes are exactly effective resistance distances on the population graph. In our notation, let $\Delta = R$ be the matrix of effective resistances. [Note that this is slightly different from the IBR-based approximation to expected coalescence times $\Delta = T$.] [Bapat, 2004] shows that
$$R^{-1} = -\tfrac{1}{2}L + \tau\tau' \tag{4.17}$$
where $L$ is the Laplacian of the graph $G = (V, E, M)$ [not the contrast basis] and $\tau\tau'$ is a rank-one update. Then
$$R^{-1}Q = R^{-1} - \frac{R^{-1}11'R^{-1}}{1'R^{-1}1} = \left(-\tfrac{1}{2}L + \tau\tau'\right) - \frac{(\tau(1'\tau))(\tau(1'\tau))'}{(1'\tau)^2} = -\tfrac{1}{2}L \tag{4.18}$$
$$
\begin{aligned}
(R + 11')^{-1} &= R^{-1}Q + \frac{R^{-1}11'R^{-1}}{1'R^{-1}1} - \frac{R^{-1}11'R^{-1}}{1 + 1'R^{-1}1} \\
&= R^{-1}Q + \frac{R^{-1}11'R^{-1}}{(1'R^{-1}1)(1 + 1'R^{-1}1)} \\
&= -\tfrac{1}{2}L + \frac{\tau\tau'}{1 + (1'\tau)^2}
\end{aligned}
\tag{4.19}
$$
$$Q[R + 11'] = I - \frac{1\tau'(1'\tau)}{1 + (1'\tau)^2} \cdot \frac{1 + (1'\tau)^2}{(1'\tau)^2} = I - \frac{1\tau'}{1'\tau} \tag{4.20}$$
$$(R + 11')^{-1}Q = -\tfrac{1}{2}L + \frac{\tau\tau'}{1 + (1'\tau)^2} - \frac{\tau\tau'}{1 + (1'\tau)^2} = -\tfrac{1}{2}L \tag{4.21}$$
That is, $B^{-1}Q[B]/4 = R^{-1}Q[R]$ where $B = R/4 + 11'$. [Hanks and Hooten, 2013] represent conductances between connected nodes as a function of landscape features, e.g., elevation. Instead we represent conductances [i.e., migration rates] through a colored Voronoi tessellation and estimate edge weights without reference to available ecological variables.
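The identities (4.18) and (4.21) can be checked numerically on a small graph. The sketch below computes effective resistances from the pseudoinverse of the graph Laplacian (a standard formula, used here as an assumption about how $R$ is obtained) and verifies that $R^{-1}Q[R] = -\tfrac{1}{2}L$ and that $B = R/4 + 11'$ gives the same matrix after division by 4. The graph and its weights are arbitrary.

```python
import numpy as np

# Symmetric conductances of a small connected graph (arbitrary toy values).
C = np.array([[0, 1, 2, 0],
              [1, 0, 1, 3],
              [2, 1, 0, 1],
              [0, 3, 1, 0]], dtype=float)
Lap = np.diag(C.sum(axis=1)) - C            # graph Laplacian
Lp = np.linalg.pinv(Lap)
diag = np.diag(Lp)
R = diag[:, None] + diag[None, :] - 2 * Lp  # effective resistances

def proj(B):
    """Q[B] = I - 1 1' B^{-1} / (1' B^{-1} 1)."""
    m = B.shape[0]
    one = np.ones((m, 1))
    Binv = np.linalg.inv(B)
    return np.eye(m) - one @ one.T @ Binv / (one.T @ Binv @ one)

lhs = np.linalg.inv(R) @ proj(R)            # R^{-1} Q[R]
B = R / 4 + np.ones_like(R)                 # B = R/4 + 1 1'
rhs = np.linalg.inv(B) @ proj(B) / 4        # B^{-1} Q[B] / 4
```

Both `lhs` and `rhs` should agree with `-Lap / 2` up to floating-point error.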
Modeling the dissimilarity matrix $D$ instead of the raw allele counts $Z$ is convenient because $D$ is invariant to rigid transformations of the genotypes:
• Suppose that $U$ is an orthogonal transformation (rotation or reflection). Then
$$S_U = (ZU)(ZU)'/p = ZUU'Z'/p = ZZ'/p = S$$
• Suppose $c = (c_1, \ldots, c_p)'$ defines a translation of every genotype vector. Then
$$D^c_{ij} = \lVert (z_i - c) - (z_j - c) \rVert^2/p = \lVert z_i - z_j \rVert^2/p = D_{ij}$$
Although the transformation from the entire data $Z$ to the summary $D$ is not a one-to-one transformation, we do not lose information about relative distances, i.e., population structure. Instead we lose information about some nuisance parameters. For example, $Z \to D$ makes the labeling of the alleles irrelevant.
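The invariance properties listed above are easy to confirm numerically. The following sketch builds $D$ from $S = ZZ'/p$ as in (4.14) and checks that it is unchanged by an orthogonal transformation, by a translation, and by flipping the 0/1 allele labels. The toy genotypes and all names are assumptions of this example.

```python
import numpy as np

def dissimilarity(Z):
    """D = 1 diag(S)' + diag(S) 1' - 2 S with S = Z Z' / p, as in (4.14)."""
    p = Z.shape[1]
    S = Z @ Z.T / p
    s = np.diag(S)
    return s[:, None] + s[None, :] - 2 * S

rng = np.random.default_rng(0)
n, p = 6, 40
Z = rng.integers(0, 2, size=(n, p)).astype(float)   # toy 0/1 genotypes
U, _ = np.linalg.qr(rng.normal(size=(p, p)))         # random orthogonal matrix
c = rng.normal(size=p)                               # translation vector

D = dissimilarity(Z)
D_rotated = dissimilarity(Z @ U)
D_shifted = dissimilarity(Z - c)
D_flipped = dissimilarity(1 - Z)                     # swap the allele labels
```

All four matrices coincide, whereas $S$ itself changes under translation and label flipping.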
4.4 What do we lose from ignoring the means?
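We can use the marginal likelihood (4.3) because sampled individuals are equally distant from the most recent common ancestor of the sample [the root of the genealogy tree], and therefore, share a common mean. Thus $K = 1$ is a basis for the mean space. [Recall that $L$ is a basis for the residual space of pairwise differences.] Therefore,
$$\begin{pmatrix} 1' \\ L \end{pmatrix} Z \sim \mathrm{N}_n\left( \begin{pmatrix} n\mu \\ 0 \end{pmatrix}, \sigma^2 \begin{pmatrix} 1'\Sigma 1 & 1'\Sigma L' \\ L\Sigma 1 & L\Sigma L' \end{pmatrix} \right) \tag{4.22}$$
Let $Y = 1'Z = \sum_{i=1}^n Z_i$ and $X = LZ$. Then
$$X = LZ \mid \sigma, \Sigma \sim \mathrm{N}_{n-1}(0, \sigma^2 L\Sigma L') \tag{4.23a}$$
$$
\begin{aligned}
Y \mid X, \Sigma &\sim \mathrm{N}\left(n\mu + 1'\Sigma L'(L\Sigma L')^{-1}X,\ \sigma^2[1'\Sigma 1 - 1'\Sigma L'(L\Sigma L')^{-1}L\Sigma 1]\right) \\
&= \mathrm{N}\left(n\mu + 1'QZ,\ \sigma^2\, 1'1\,(1'\Sigma^{-1}1)^{-1}\, 1'1\right) \\
&= \mathrm{N}\left(n\mu + 1'QZ,\ (1 - \kappa)\sigma^2 n^2\right)
\end{aligned}
\tag{4.23b}
$$
where we use $Q = \Sigma L'(L\Sigma L')^{-1}L = I - 1(1'\Sigma^{-1}1)^{-1}1'\Sigma^{-1}$ and $1'\Sigma^{-1}1 = 1'\Delta^{-1}1/(1 - \kappa)$ with the normalization $1'\Delta^{-1}1 = 1$.
The conditional distribution of $Y$ depends on $\Delta$ only through the bias term $1'QZ$. Therefore we choose to ignore it and use only the marginal distribution of $X$ to infer $\Delta$.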
4.5 Standardizing genotype data
Before performing PCA analysis for population structure it is common to standardize SNPs and to set the missing alleles to the observed average at the corresponding marker [McVean, 2009, Price et al., 2006]. The motivation is that without normalization SNPs with higher variance contribute more to the scatter matrix $ZZ'$. Therefore, the procedure tends to up-weight the influence of rare variants. Here we discuss why neither centering the genotypes to have mean 0 nor standardizing the variance is appropriate when analyzing population structure.
In matrix notation, let $C = I - 11'/n$ be the centering matrix for $n$ observations. Then multiplying by $C$ removes the mean:
$$X = CZ \overset{iid}{\sim} \mathrm{N}_n(\mu C1, \sigma^2 C\Sigma C) = \mathrm{N}_n(0, -\sigma^* C\Delta C), \tag{4.24}$$
This operation is convenient because $XX'$ has a central Wishart distribution. It also makes the labelling of alleles as ancestral or derived ['0' or '1'] irrelevant, up to a change in sign. Suppose that we "flip" the 0/1 labels at a particular site, i.e., $z^* = 1 - z$. Then $x^* = C(1 - z) = -Cz = -x$ because $C1 = (I - 11'/n)1 = 0$.
However, centering with $C$ assumes the individuals are independent and identically distributed, i.e., no population structure: if $\Sigma = I$, then the projection $Q$ onto the space of contrasts is $Q = I - 11'\Sigma^{-1}/(1'\Sigma^{-1}1) = C$. Since our model assumes the individuals are coupled with correlation given by $11' - \kappa\Delta$, it is not appropriate to naively center the genotypes to have mean 0 or to substitute the average allele frequency for missing values. For SNP datasets, it is better to impute missing SNP values before analyzing population structure. There are various imputation algorithms but they all take into account similarities across observed alleles to impute missing ones. For microsatellite datasets, which are usually much smaller and harder to impute, we use the likelihood for the observed pairwise distances only. [So that the sample configuration $\alpha$ is really site-specific.]
Furthermore, it might not be appropriate to standardize SNPs to have the same variance precisely because this up-weights the contribution of rare alleles [McVean, 2009]. A mutation in effect splits sampled individuals into two groups that are slightly different genetically: those that carry the mutation versus those that do not. Intuitively, newer and especially private mutations, which are carried by a single individual, are informative for structure that is too fine to represent with a model at the level of demes.
5
Simulations of Structured Genetic Data
In this chapter we describe several simulated scenarios that we use to evaluate the performance of our method for estimating effective migration as well as to illustrate some of its properties. We use the program ms [Hudson, 2002] to simulate genetic data under the structured coalescent. Given the model parameters (deme sizes and migration rates) and the sample configuration, ms first generates a random genealogy, which describes the history of the sample from the present to its most recent common ancestor, and then places a Poisson number of mutations uniformly (and independently of each other) on the tree.
We use ms to simulate independent and identically distributed genealogies under the stepping-stone model $G = (V, E, M)$ with conservative migration $M = (m_{\alpha\beta})$. Therefore, the iid assumption across sites holds but the normality assumption is violated. In all examples, we generate $p = 3000$ single nucleotide polymorphisms for $n = 300$ haploid individuals on a $12 \times 8$ regular triangular grid. [The corresponding ms commands, with detailed explanations, are given in Appendix 7.8.]
[Margin note: To generate histories with exactly one mutation, we choose a small mutation rate $\theta$ and discard genealogies that carry zero or multiple mutations.]
5.1 Spatial structure due to constant migration
First we generate data under different patterns of migration, either uniform or with a barrier, to confirm that the method performs accurately when the underlying demographic model is correct. That is, the population does evolve on a known grid $(V, E)$ of equally sized demes and unknown migration rates. In these simulations, therefore, effective migration rates are true migration rates [up to a constant of proportionality that depends on the coalescent timescale $N_0$. We set up the simulations so that this constant is 1.]. We report migration rates, as they are parametrized, on the log (base 10) scale, and the blue/brown color scheme is based on [Brewer et al., 2003].
Figure 5.1: Uniform migration rates and equal deme sizes: $N_\alpha = 1$ for all $\alpha \in V$ and $m_{\alpha\beta} = 1$ for all $(\alpha, \beta) \in E$. The size of the gray circles indicates the number of individuals sampled from the corresponding deme. [Panels: truth (uniform migration); inferred migration surface. Color scale: $\log_{10}(m)$ from -1.8 to 1.8.]
In Figures 5.1 and 5.2 we directly compare the truth (left) with the posterior mean (right). Not every deme in the population graph is necessarily observed but sampling is balanced because areas with higher migration are sampled with higher probability. In all three cases our method correctly captures the qualitative pattern of migration. And in Figure 5.3 we show samples from the posterior distribution on the colored Voronoi tessellation, to illustrate the uncertainty in the estimated effective migration surface.
Figure 5.2: Barrier to migration and equal deme sizes: migration rates vary between high, $m_{\alpha\beta} = 3$, and low, $m_{\alpha\beta} = 1/3$, in either a sharp or smooth pattern. [Panels: truth (sharp barrier to migration) and inferred migration surface; truth (smooth barrier to migration) and inferred migration surface. Color scale: $\log_{10}(m)$ from -1.8 to 1.8.]
Figure 5.3: Draws from the posterior distribution of effective migration. a) sharp barrier to migration; b) smooth barrier to migration.
5.2 Spatial structure due to variation in diversity
The next set of simulations demonstrates that effective migration reflects the combined effect of demographic processes on genetic differentiation. In particular, we use two examples to show how differences in effective population size can influence effective migration rates. In the first example, lower migration rates cancel the effect of bigger deme sizes, to produce uniform effective migration. In the second example, only deme sizes vary, which produces the effect of a barrier to migration.
To describe the simulations, consider the example graph with two groups of demes, $A$ (circles, smaller in size) and $B$ (squares, bigger in size), with deme sizes $N_A$ and $N_B$, respectively. Let $m_{AA}$ be the migration rate of all $A - A$ edges and $m_{BB}$ be the migration rate of all $B - B$ edges. We assign migration rates to the "across" edges $A - B$ and $B - A$ so that migration is conservative and deme sizes are constant over time, as required by the stepping-stone model. Formally,
$$\sum_{\gamma:\, \gamma \sim \alpha,\, \gamma \in A} m_{AA} N_A + \sum_{\gamma:\, \gamma \sim \alpha,\, \gamma \in B} m_{AB} N_A = \sum_{\gamma:\, \gamma \sim \beta,\, \gamma \in A} m_{BA} N_B + \sum_{\gamma:\, \gamma \sim \beta,\, \gamma \in B} m_{BB} N_B \tag{5.1}$$
by definition (2.14) where the coalescence rate is $q_A = 1/N_A$. A sufficient condition for conservative migration is that
$$m_{AB} N_A = m_{BA} N_B. \tag{5.2}$$
This condition preserves the symmetry as much as possible because the number of migrants from $\alpha \in A$ to $\beta \in B$ is equal to the number of migrants from $\beta$ to $\alpha$, i.e., migration is balanced across every edge. Therefore, given the deme sizes $N_A$ and $N_B$, we let $m_{AB} = 1/N_A$, $m_{BA} = 1/N_B$. [Or more generally, we can let $m_{AB} = m_C/N_A$, $m_{BA} = m_C/N_B$ for a given between-group rate $m_C$.]
In the first example, bigger demes exchange the same number of migrants as smaller demes. To achieve this, we set $N_B = 5N_A$, $m_{BB} = m_{AA}/5$ and thus $m_{AA} N_A = m_{BB} N_B$. All demographic parameters are scaled by the coalescent timescale $N_0$, so the effective migration rate of both $A - A$ and $B - B$ edges is approximately $m_{AA} N_A/N_0 = 1/N_0$. That is, differences in population size are canceled by differences in migration rate. Consequently, we expect the migration surface to be uniform, and indeed, this is what we observe in Figure 5.5 a).
In the second example, bigger demes exchange more migrants. To achieve this, we set $N_B = 5N_A$, $m_{BB} = m_{AA}$ and thus $m_{BB} N_B > m_{AA} N_A$. Since migration rates are relative to deme sizes, at the same migration rate bigger demes exchange a higher number of migrants, which results in higher effective migration. Therefore, coalescence times between $B$ demes [on the same side of the barrier but not across it] are shorter than coalescence times between $A$ demes. Genealogies with such topology are consistent with higher migration at equal coalescence rates because lineages that transition more often between demes have fewer chances to coalesce. Consequently, we expect a barrier to effective migration, and indeed, this is what we observe in Figure 5.5 b).
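The arithmetic behind the two examples can be written out explicitly. The sketch below uses the sufficient condition (5.2) to set the across-group rates and then compares the number of migrants exchanged within each group; the concrete values $N_A = 1$ and $m_{AA} = 1$ are assumptions chosen only to make the numbers simple.

```python
def across_rates(N_A, N_B, m_C=1.0):
    """Across-edge rates (m_AB, m_BA) with m_AB * N_A == m_BA * N_B, as in (5.2)."""
    return m_C / N_A, m_C / N_B

N_A, m_AA = 1.0, 1.0          # assumed toy values
N_B = 5 * N_A
m_AB, m_BA = across_rates(N_A, N_B)

# Example 1: m_BB = m_AA / 5, so both groups exchange the same number of migrants.
example1 = (m_AA * N_A, (m_AA / 5) * N_B)

# Example 2: m_BB = m_AA, so the bigger demes exchange more migrants.
example2 = (m_AA * N_A, m_AA * N_B)
```

In the first case the two entries of `example1` are equal (uniform effective migration); in the second, the entry for group $B$ in `example2` is five times larger, which appears as a barrier in effective migration.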
5.3 Spatial structure due to a split event
e final sequence of examples produces a barrier effect from a past demographic eventthat removes edges in the graph and thus splits the habitat into two regions that nolonger exchange migrants.
To describe the simulations, consider the example graph with two groups of demes,๐ด (circles) and ๐ต (squares), with the same deme size. [e demes in the middle, ๐ถ, arepart of the population but we collect no samples from that area which remains ''unob-served''.] ere are also two types of edges: the solid ones have constant migration rate๐; the dashed ones have migration rate 0 for ๐ฅ units of time (measured in ๐0 genera-tions) and migration rate ๐ from then on. Since Kingman's coalescent develops back-wards in time, this setup simulates a recent barrier to migration from the present topoint ๐ฅ in the past. Beyond time ๐ฅ the population graph is connected and has uniformmigration rate ๐.
In Figure 5.6 we increase the time of the split event from $x = 0.3$ to $x = 9$ units of time. If the split is too recent on the relative scale of the other parameters, its effect is hard to detect and the effective migration surface is uniform. Otherwise, the split is detected as a barrier to effective migration. [The truth is a temporary barrier, but the method infers a constant barrier.] In these simulations, an equilibrium phase of high migration followed by a recent interval of no migration produces genealogies that are dominated by a long branch between the common ancestor of $A$ lineages and that of $B$ lineages. Such topology is consistent with constant migration at low rates across the area separating $A$ and $B$.
5.4 The effect of SNP ascertainment
In this example we simulate the effect of ascertainment bias due to a very small discovery panel (Figure 5.4). In this case there is a true barrier to migration but the discovery panel comes from a very limited area on one side of the barrier. This skews the observed genealogies as we observe only sites that are polymorphic in both the ascertainment sample (in red) and the analysis sample (in black). This example shows that ascertainment, which is not an evolutionary process, can have an effect on the inferred effective migration, especially if the discovery panel is not representative of the population.
Figure 5.4: Barrier to migration with ascertainment bias. a) True migration pattern and the discovery panel in red; b) Estimated effective migration and the sample in black. [Color scale: log10(m), from -1.8 to 1.8.]
Figure 5.5: Population structure due to differences in deme size. In a) bigger demes exchange migrants at a lower rate and hence there is no variation in effective migration. In b) smaller demes exchange fewer migrants and hence there is an effective barrier to migration. [Color scale: log10(m), from -1.8 to 1.8.]
Figure 5.6: A past demographic event results in a barrier to effective migration. Here an ancestral population splits into subpopulations $A$ (east) and $B$ (west) at point $x$ in the past. The further back in time this event occurs, the more differentiated $A$ and $B$ are. a) $x = 0.3$; b) $x = 3$; c) $x = 9$ units of time, which is measured in $N_0$ generations. [Color scale: log10(m), from -1.8 to 1.8.]
5.5 The effect of uneven sampling on PCA projection and effective migration
It is well known that PCA projections are heavily influenced by irregular sampling [McVean, 2009]. To examine the impact of sample composition on effective migration, we simulate genetic data under the same barrier pattern as in Figure 5.2 but with various sampling schemes. We compare our method of estimating effective migration and PCA analysis of the observed covariance matrix in Figure 5.7. Even if sampling is biased towards one area of the habitat, the presence and location of the barrier are correctly detected as long as there are observations on both sides. On the other hand, the overall pattern of the PCA projections changes considerably. [Our method can be sensitive to the placement and coarseness of the population grid.]
Figure 5.7: Barrier to migration with uneven sampling; colors indicate sampling location. The final example illustrates that naturally the method cannot detect variation from uniform migration in areas where no genetic data is observed.
5.6 Summary
The simulations in this chapter illustrate that effective migration can represent the combined effect of various demographic processes and events on genetic similarity and that our method is robust to uneven sampling but not to ascertainment bias. However, effective migration does not help us to distinguish among possible histories as in this framework population structure is always explained with a stepping-stone model on a fixed grid of equally sized demes.
The examples also underline why it is difficult to interpret effective migration in terms of actual evolutionary history. As [McVean, 2009] emphasizes, very different demographic processes can produce very similar average genealogies. Our method, just like PCA, uses the information contained in pairwise comparisons averaged across sites and discards the sequential information contained in the ordering of sites along chromosomes, which can be helpful in selecting between possible histories.
6 Empirical Results
In this chapter we apply our method to four empirical datasets [three consist of SNPs and one of microsatellites] and we further demonstrate that effective migration rates can explain the spatial structure in genetic variation.
6.1 Red-backed fairywrens in Australia
First we present results for a sample of red-backed fairywrens (Malurus melanocephalus), a small passerine bird endemic to Australia [Figure 6.1]. The RBFW dataset was collected to study its population structure and demographic history across the Carpentarian barrier. Sampling and genotyping procedures as well as cluster-based analysis of population structure are described in [Lee and Edwards, 2008].
Figure 6.1: Habitat of the red-backed fairywren (Malurus melanocephalus), with the Carpentarian barrier (the black bar), in northern and eastern Australia. The map shows the ranges of two subspecies: M. m. cruentatus in yellow, M. m. melanocephalus in pink, and a broad hybrid zone in orange. The map also shows three major biogeographic regions: Top End (TE), Cape York (CY) and Eastern Forest (EF). The map is modified from [Lee and Edwards, 2008].
The Carpentarian barrier in northern Australia is a semi-arid region, roughly 150 km wide and extremely poor in vegetation [Lee and Edwards, 2008]. It has been argued that this region has had a primary effect on species distribution in northern and eastern Australia by acting as a barrier to migration, with secondary barriers along the coast. [Lee and Edwards, 2008] choose to study the demographic structure of the red-backed fairywren because its taxonomy, which is based mainly on plumage color, is not consistent with the Carpentarian hypothesis. The species has been traditionally categorized into two subspecies but their ranges do not lie on either side of the Carpentarian barrier, as we would expect if it has been the major barrier contributing to their divergence.
The dataset was made available to us by S. Edwards. After initial data processing, the RBFW dataset consists of $n = 27$ diploid individuals genotyped at $p = 1190$ bi-allelic, polymorphic SNPs. [As a reference to the original data, we remove 3 out of 30 individuals because most of their genotypes are missing and we also exclude monomorphic and tri-allelic SNPs.]
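The filtering steps described above can be sketched as follows. This is only an illustration under stated assumptions: the genotype encoding (a pair of allele labels, or `None` when missing), the 50% missingness threshold standing in for ''most genotypes are missing'', the toy data and the function name `filter_dataset` are all hypothetical and are not taken from the original processing pipeline.

```python
def filter_dataset(genotypes, max_missing=0.5):
    """Drop individuals with too many missing genotypes, then keep only
    sites with exactly two observed alleles among the remaining individuals.

    genotypes[i][j] is a tuple of two allele labels for individual i at
    site j, or None if the genotype is missing (assumed encoding).
    """
    n_sites = len(genotypes[0])
    kept_rows = [
        row for row in genotypes
        if sum(g is None for g in row) / n_sites <= max_missing
    ]
    kept_sites = []
    for j in range(n_sites):
        alleles = {a for row in kept_rows if row[j] is not None for a in row[j]}
        if len(alleles) == 2:   # excludes monomorphic and tri-allelic sites
            kept_sites.append(j)
    filtered = [[row[j] for j in kept_sites] for row in kept_rows]
    return filtered, kept_sites


# Toy data: 4 individuals, 4 sites (hypothetical values).
toy = [
    [("A", "A"), ("C", "T"), ("G", "G"), ("A", "C")],
    [("A", "G"), ("C", "C"), ("G", "G"), ("A", "T")],
    [None, None, None, ("C", "C")],            # mostly missing: removed
    [("G", "G"), ("T", "T"), ("G", "G"), None],
]
filtered, kept_sites = filter_dataset(toy)
print(len(filtered), "individuals;", "sites kept:", kept_sites)
```

In the toy data the third site is monomorphic and the fourth carries three alleles, so only the first two sites survive.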
Throughout we will refer to three subpopulations of red-backed fairywrens as identified according to location in [Lee and Edwards, 2008]: Top End (TE) in northern Australia to the west of the Carpentarian barrier, Cape York (CY) in northeastern Australia to the east of the Carpentarian barrier and including the hybrid zone, and Eastern Forest (EF) in eastern Australia to the south of the hybrid zone [Figure 6.1].
Figure 6.2: PCA and STRUCTURE analysis of the red-backed fairywren (RBFW) data. [Groups shown: Top End (TE), Cape York (CY), Eastern Forest (EF).]
First we perform principal components analysis (PCA) and cluster-based analysis (STRUCTURE). In Figure 6.2 (left) we plot the leading two principal components of the genetic covariance matrix, which explain 55% and 10% of the variance, respectively. PCA detects population structure but the results are difficult to interpret in terms of the evolutionary history of the species: there is some differentiation between the three subpopulations [in particular, Top End (TE) is better separated from Cape York (CY) than Eastern Forest (EF) is] but there are no clearly delineated clusters. Although the three biogeographic groups are about equally represented, the sample is very small and much of the observed variation is between individuals within groups.
In Figure 6.2 (right) we report STRUCTURE results with two and three clusters, and using the sampling locations as prior information [Pritchard et al., 2000, Hubisz et al., 2009]. As observed in [Lee and Edwards, 2008], if we use STRUCTURE to assign the samples into two distinct clusters, Cape York (CY) and Eastern Forest (EF) are grouped together, which possibly indicates that the Carpentarian barrier has played a role in shaping the genetic differentiation of the red-backed fairywren. When we use STRUCTURE to assign the samples into three distinct clusters, CY and EF individuals are estimated to be admixtures (with different proportions) of two ''ancestral'' populations. This suggests migration across the hybrid zone. [In our analysis the data has a slightly higher likelihood with three clusters rather than two as in [Lee and Edwards, 2008].]
While both STRUCTURE plots might be interpreted to provide support for the Carpentarian hypothesis, STRUCTURE does not model the geographic distribution of samples across the habitat and thus cannot account for isolation by distance. In a homogeneous habitat, where the population is spatially distributed with uniform migration, genetic differentiation tends to increase with geographic distance. The RBFW data exhibits the isolation by distance property, at least at short to medium distances. The relationship between geographic and genetic distances appears to plateau as the Euclidean distance increases [Figure 6.3].
Cluster-based methods, such as STRUCTURE [Pritchard et al., 2000] and GENELAND [Guillot et al., 2005], attempt to find sharp boundaries between clusters, to maximize similarity within versus between clusters, in terms of allele frequency distributions. [These methods can assign individuals to multiple clusters according to individual-specific fractional membership, but again the differences between clusters must be sharp in order to estimate these proportions with certainty.] Therefore, cluster-based methods are better suited to analyzing discrete population structure. However, genetic variation can exhibit continuous structure as genetic similarities tend to decay with distance and the decay can be gradual rather than sharp, as in Figure 6.3. In this case STRUCTURE effectively separates those individuals that are farthest apart in space, as Top End (TE) and Eastern Forest (EF) are assigned to different clusters.
Figure 6.3: Observed genetic distance vs. Euclidean distance for the $\binom{n}{2} = 351$ pairs in the RBFW dataset. Each point is colored according to group membership to emphasize that on average Cape York (CY) is closer to Eastern Forest (EF) than to Top End (TE). [Axes: Euclidean distance (horizontal), genetic distance (vertical); pair types: CY,CY; EF,EF; TE,TE; CY,EF; TE,CY; TE,EF.]
The spatial structure of genetic variation in the RBFW data is continuous and therefore it can be partially explained with isolation by distance. However, since the Carpentarian barrier may reduce gene flow between the TE and CY groups, we estimate the patterns of migration rather than assume genetic differentiation increases as a function of the Euclidean distance between sampling locations [or equivalently, migration is uniform].
Figure 6.4: Irregular triangular grid $(V, E)$ spanning the habitat of the red-backed fairywren. Samples are assigned to the closest deme. The method allows that sampling be both sparse and uneven. If the geographic information is coarse, it is appropriate to choose a coarse grid. [Deme labels in the figure list the indices of the samples assigned to each deme.]
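Assigning samples to the closest deme can be sketched in a few lines. The coordinates below are made up, the function name `assign_to_demes` is hypothetical, and plain Euclidean distance on the coordinates is an assumption (a great-circle distance may be more appropriate for widely spread samples).

```python
import math
from collections import Counter


def assign_to_demes(samples, demes):
    """Return, for each sample, the index of the closest deme.

    samples and demes are lists of (x, y) coordinates; ties go to the
    deme with the smallest index.
    """
    return [
        min(range(len(demes)), key=lambda d: math.dist(s, demes[d]))
        for s in samples
    ]


# Hypothetical grid of three demes and four sampled individuals.
demes = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.87)]
samples = [(0.1, 0.1), (0.9, -0.2), (0.5, 0.6), (0.05, 0.0)]

assignment = assign_to_demes(samples, demes)
counts = Counter(assignment)    # number of samples per deme
print(assignment, dict(counts))
```

Demes that receive no samples simply do not appear in `counts`, which mirrors the fact that sampling may be sparse and uneven.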
To apply our method for estimating effective migration rates, we first construct an irregular triangular grid $(V, E)$ to cover the known range of the red-backed fairywren [Figure 6.4]. After running the MCMC chain from multiple random starting points to monitor convergence and averaging the runs, we report the posterior mean of the effective migration rates $m = (m_{\alpha\beta})$ in Figure 6.5, on the log base 10 scale, with low migration in blue and high migration in brown. For this small dataset, it is computationally feasible to use the coalescent-based distance matrix (i.e., the expected coalescence times $T_{\alpha\beta}$) as well as its approximation in terms of effective resistances $R_{\alpha\beta}$. The two metrics produce very similar posterior estimates of the effective migration surface, shown in Figure 6.5 a) and b), respectively.
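To make the notion of effective resistance concrete, the sketch below computes it on a small weighted graph by solving the standard circuit equations: inject a unit current at one deme, withdraw it at another, and read off the voltage difference. Treating each edge's migration rate as its conductance is an assumption made for this illustration, and the function names and toy graphs are hypothetical; this is not the implementation used for the results reported here.

```python
def solve_linear_system(A, y):
    """Solve A x = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = s / M[r][r]
    return x


def effective_resistance(n, edges, a, b):
    """Effective resistance between nodes a and b of a connected graph.

    edges maps an undirected pair (i, j) to its conductance (assumed here
    to be the migration rate on that edge).
    """
    if a == b:
        return 0.0
    # Weighted graph Laplacian.
    L = [[0.0] * n for _ in range(n)]
    for (i, j), w in edges.items():
        L[i][i] += w
        L[j][j] += w
        L[i][j] -= w
        L[j][i] -= w
    # Ground the last node (voltage 0) and solve for the other voltages.
    ground = n - 1
    nodes = [v for v in range(n) if v != ground]
    A = [[L[i][j] for j in nodes] for i in nodes]
    rhs = [(1.0 if i == a else 0.0) - (1.0 if i == b else 0.0) for i in nodes]
    sol = solve_linear_system(A, rhs)
    voltage = {ground: 0.0}
    voltage.update(dict(zip(nodes, sol)))
    return voltage[a] - voltage[b]


path = {(0, 1): 1.0, (1, 2): 1.0}                  # three demes in a line
triangle = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 1.0}
barrier = {(0, 1): 1.0, (1, 2): 0.1}               # low migration on one edge

r_path = effective_resistance(3, path, 0, 2)
r_triangle = effective_resistance(3, triangle, 0, 1)
r_barrier = effective_resistance(3, barrier, 0, 2)
print(r_path, r_triangle, r_barrier)
```

Lowering the rate on a single edge raises the resistance between demes on either side of it, which is the behavior that lets a resistance-based distance register a barrier.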
Figure 6.5: Estimated relative rates of effective migration for the RBFW dataset using two distance metrics on the graph: a) expected coalescence time $T = (T_{\alpha\beta})$; b) effective resistance $R = (R_{\alpha\beta})$. [Panel titles: $\Delta = T$ and $\Delta = R$; regions labeled TE, CY and EF; color scale: effective migration rate on the log10 scale, from -1.8 to 1.8.]
6.1.1 What is the effect of the Carpentarian barrier on genetic differentiation?
Since we plot relative migration rates, a completely white migration surface would correspond to uniform migration; the colors indicate deviations from the expectation under uniform migration.
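The relative scale can be illustrated with a short sketch that centers log10 rates on their mean, so that a uniform surface maps to zero everywhere. The rates below are made up and the function name is hypothetical; centering on the mean of the log rates is the convention assumed here.

```python
import math


def relative_log10_rates(rates):
    """Express migration rates as log10 deviations from the mean log10 rate."""
    logs = [math.log10(m) for m in rates]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


uniform = relative_log10_rates([0.5, 0.5, 0.5, 0.5])   # all deviations are zero
varied = relative_log10_rates([0.1, 1.0, 10.0])         # tenfold steps around the mean
print(uniform)
print(varied)
```

A value of -1 on this scale means a rate ten times lower than the typical rate on the surface, whatever that typical rate is in absolute terms.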
For the RBFW dataset, the most interesting feature is the area of lower effective migration that roughly covers the Cape York (CY) biogeographic region and the Carpentarian barrier. This result is consistent with the hypothesis that the Carpentarian barrier affects genetic differentiation. It is also consistent with the hypothesis that the CY group has a slightly lower effective population size (similar to the simulations in Section 5.2). Furthermore, CY is relatively less similar to TE than it is to EF as CY and TE are separated by longer effective distance [i.e., darker blue color]. Although this can also be inferred from the PCA and STRUCTURE analysis, the effective migration plot combines information about genetic dissimilarities and geographic distances and thus is an intuitive representation of spatial patterns in genetic variation.
Finally, we show draws from the posterior distribution of effective migration [Figure 6.6]. Although in most instances the region of the Carpentarian barrier falls inside a tile of lower effective migration [relative to the rest of the habitat], there is a lot of variability in the location, shape and rate of the ''barrier''. One possible explanation is that the Carpentarian barrier does not have a strong effect on the genetic structure of this species. However, the RBFW dataset is small and it is also possible that our method cannot detect a strong barrier effect with certainty.
Figure 6.6: Draws from the posterior distribution of effective migration in red-backed fairywrens.
6.2 Forest and savanna elephants in Africa
Here we present results for a dataset of African elephants sampled throughout the range of the species in Sub-Saharan Africa. The sample includes both forest elephants (Loxodonta africana cyclotis) and savanna elephants (Loxodonta africana africana). Both subspecies are under threat, partly from poaching, and the data were collected to help assign contraband tusks to their location of origin [Wasser et al., 2004, 2007].
There is observational and genetic evidence that forest and savanna elephants hybridize in the areas where their ranges meet [Wasser et al., 2004]. Therefore, we remove putative hybrids so that the dataset we analyze consists of 223 forest elephants and 896 savanna elephants genotyped at 16 microsatellites. These genetic markers were chosen in part because they can be isolated and amplified in samples of low quality and thus microsatellite DNA can be extracted from a small piece of tusk [Wasser et al., 2004].
Figure 6.7: Irregular triangular grid $(V, E)$ spanning the habitat of the African elephant. The map shows five regions as identified in [Wasser et al., 2004]. The west and central regions comprise the range of the forest elephant. The north, east and south regions comprise the range of the savanna elephant. Samples are assigned to the closest deme in the grid.
[Wasser et al., 2004] show that forest and savanna elephants can be accurately discriminated. This is also evident in the PC scatterplot of the sample covariance matrix [Figure 6.8] where the leading principal component separates forest (West, Central) and savanna (North, East, South) and explains 29% of the observed genetic variation. PCA analysis also indicates that there is more genetic diversity in forest than in savanna elephants and suggests no further population structure within the two subspecies. However, the sample configuration is very uneven, with about 4 times as many savanna as forest elephants, so the PCA results might be biased [Section 4.5].
We applied our method to the data provided by [Wasser et al., 2004]. The results confirm that forest and savanna elephants are genetically differentiated enough to distinguish between the two subspecies. In Figure 6.9 we observe a prominent barrier in effective migration that curves through the habitat to separate the west and central regions (the range of forest elephants) from the north, east and south regions (the range of savanna elephants).
Our model estimates migration rates to explain the overall sample structure. However, each genotyped site has its own genealogy and thus observed genetic distances vary across sites. With microsatellites, mutation rates are higher, and since more mutations mean more information about relative branch lengths, we can also fit the model at each microsatellite separately [Figure 6.10]. There is great variability in effective migration across microsatellites. The pattern of effective migration at the sixth locus produces most of the overall pattern, except for the relationship between the west and central regions (essentially, the relationship among forest elephants). This suggests that elephants can be categorized with high accuracy as forest or savanna based on just this one microsatellite.
Figure 6.8: PCA analysis of the forest and savanna elephants (FS) data.
Figure 6.9: Effective migration rates for forest and savanna elephants (a) using all 16 microsatellites and (b) excluding the most variable locus. [Color scale: effective migration rate on the log10 scale, from -1.8 to 1.8.]
We also split the sample into only forest and only savanna elephants to explore subtler population structure in each subspecies. The genetic variation in forest elephants is consistent with isolation by resistance with a very small bridge of higher effective migration between the west and central regions. The genetic variation in savanna elephants deviates from isolation by distance considerably: the central region is separated from the rest with a barrier of low effective migration while the south and east regions are more genetically similar than the large area they span would suggest.
6.2.1 STRUCTURE and GENELAND results
As a clustering method, GENELAND [Guillot et al., 2005] looks for distinct clusters and therefore sharp boundaries between them. Removing putative hybrids makes differences in allele frequencies between biogeographic regions easy to detect [Figure 6.12]. However, GENELAND does not explain the relationship between the regions: they are all distinct from each other.
Figure 6.10: Effective migration rates at each of sixteen microsatellites. The 6th locus is most variable, presumably due to highest mutation rate.
Figure 6.11: Effective migration rates for (a) only forest elephants and (b) only savanna elephants, using the same triangular grid as in Figure 6.7 and all 16 microsatellites. [Color scale: effective migration rate on the log10 scale, from -1.8 to 1.8.]
On the other hand, STRUCTURE [Pritchard et al., 2000] with sampling location prior [Hubisz et al., 2009] provides intuition for the relationship between the five biogeographic regions [Figure 6.13]. It clearly detects the difference between forest elephants (west, central) and savanna elephants (north, east, south) as they fall into different clusters. Furthermore, STRUCTURE shows some evidence for isolation by distance, particularly in savanna elephants. Most of these individuals are represented as weighted mixtures of four clusters that do not correspond to distinct geographic areas.
Figure 6.12: GENELAND posterior probabilities for belonging in each of five clusters, which correspond directly to the five biogeographic regions.
Figure 6.13: STRUCTURE membership proportions in six clusters when using sampling locations (not regions) to provide prior information for cluster assignments.
Figure 6.14: Scatterplots of genetic dissimilarities versus distances on the population graph with a) uniform migration and b) estimated effective migration. [Rows: both savanna and forest elephants (without hybrids); only forest elephants; only savanna elephants.]
6.3 Human populations from Europe and Africa
The genetic structure of human populations has been extensively studied since [Menozzi et al., 1978] first used PCA to summarize human genetic variation across continents. Here we analyze two large-scale genome-wide datasets. The European dataset is part of the POPRES collection [Nelson et al., 2008] and consists of 1387 individuals genotyped at 197,146 autosomal SNPs. Most samples were collected in Western Europe, so we analyze a subset of 1208 individuals from 15 countries. The Sub-Saharan African dataset consists of 314 individuals from 21 ethnic groups genotyped at 27,922 autosomal SNPs.
[Novembre et al., 2008, Lao et al., 2008] use PCA analysis to characterize the spatial structure of genetic variation within Europe and find a close correspondence between genetic and geographic distances, and hence, evidence for isolation by distance. In fact, [Novembre et al., 2008] shows that the two leading principal components are strongly correlated with latitude and longitude, respectively. [Wang et al., 2012] use their Procrustes method to analyze the population structure within Africa [as well as within Europe, Asia and world-wide] and also observe similarity between genetic and geographic maps, after excluding hunter-gatherer populations.
The PCA plots reveal that the human population structure in Europe and Africa is continuous: while individuals from the same group tend to cluster together, the overall arrangement qualitatively resembles the configuration of sampling locations [Figure 6.15]. The correspondence between the PCA projections and the geographic map can be improved with a rotation transformation such as Procrustes [Wang et al., 2010] but this cannot improve the PCA analysis, for example, by correcting for biased sampling.
Figure 6.15: Sample configuration and PCA analysis of the a) European and b) African datasets. In the bottom row font size indicates relative sample size.
Figure 6.15 also illustrates the limitations of the sampling scheme. In both cases it is biased but, more importantly, geographic locations are implied as individuals from the same populations are automatically assigned to the same coordinates. [For the European dataset, population membership is determined based on grandparents' country of origin or self-reported country of birth [Novembre et al., 2008].] This is clearly not representative of human spatial distribution in either Europe or Africa and, furthermore, the geographic information might be too coarse to detect substructure within populations. [On the other hand, the stepping-stone model is discrete. Since observations are assigned to the nearest deme, the grid itself implies a limit on how much geographic resolution our method can represent.]
As [Novembre et al., 2008, Wang et al., 2012] have shown, the spatial structure of human genetic variation in both Europe and Africa exhibits broad isolation by distance as genetic differentiation tends to increase gradually with geographic distance [Figure 6.16; top row]. However, while geography explains some patterns of genetic differentiation, a homogeneous habitat (i.e., uniform migration) might not provide the best explanation for the observed data.
We applied our method to estimate the effective human migration in both Europe and Africa and plotted genetic differentiation against the inferred effective distances [Figure 6.16; bottom row]. The linear relationship between genetic dissimilarity and effective distance is stronger [$r^2$ increases from 33% to 85% for the European data, and from 24% to 91% for the African data]. However, since there are so many pairwise comparisons in the scatterplots, it is more instructive to analyze the inferred effective migration surface [Figure 6.17].
Figure 6.16: Genetic differentiation (linearized $F_{ST}$) versus resistance distance ($R_{\alpha\beta}$) with either uniform migration rates or estimated effective migration rates on the population graph $(V, E)$, for the a) European and b) African datasets. The colors, which match those in Figure 6.17, are chosen to emphasize the difference between the populations in red and those in blue.
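The comparison of fit summarized by $r^2$ can be sketched as follows. The data are invented for illustration, the function names are hypothetical, and the linearization $F_{ST}/(1 - F_{ST})$ is the commonly used definition, assumed here rather than taken from the text.

```python
def linearized_fst(fst):
    """Commonly used linearization of F_ST (assumed definition)."""
    return fst / (1.0 - fst)


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


# Invented pairwise values for four population pairs.
fst = [0.01, 0.02, 0.03, 0.04]
differentiation = [linearized_fst(f) for f in fst]
dist_uniform = [1.0, 3.0, 2.0, 2.0]     # distances under uniform migration
dist_effective = [1.0, 2.0, 3.0, 4.0]   # distances under estimated migration

r2_uniform = r_squared(dist_uniform, differentiation)
r2_effective = r_squared(dist_effective, differentiation)
print(round(r2_uniform, 3), round(r2_effective, 3))
```

A higher $r^2$ under the estimated rates indicates that the fitted distances line up more closely with the observed differentiation than distances under uniform migration do.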
Figure 6.17: Inferred effective migration surfaces for the a) West European and b) Sub-Saharan African datasets. [Color scale: effective migration rate on the log10 scale, from -1.8 to 1.8.]
From the inferred effective migration surface in Europe [Figure 6.17 a)] we can make several observations that are difficult to make from the PCA plot or the distance scatterplot. For example, the northern countries (Ireland, the UK, Scotland, Denmark, the Netherlands, Germany; in blue) are more genetically similar than we would expect based on the geographic distance alone. The same is true for the three southern countries (Portugal, Spain and Italy; in red). On the other hand, a barrier to effective migration separates the British Isles and the Iberian peninsula, and another barrier separates Italy and France [roughly where the Alps are]. However, we cannot conclude that the effect is due only to lower migration rates across bodies of water or mountain ranges. The observed patterns can also be influenced by differences in effective population size and other evolutionary processes. Finally, the inferred migration surface also suggests that there is more differentiation in the north/south direction rather than the east/west direction as the north and the south are separated by two areas of lower effective migration. This result is consistent with the hypothesis that a north/south cline is a distinguishing feature of population structure within Europe [Tian et al., 2008].
We can also make interesting observations from the inferred effective migration surface in Africa [Figure 6.17 b)]. There is higher effective migration along the Atlantic coast than in the interior of the continent, and therefore, inland populations (in blue) are more genetically dissimilar than coastal populations (in red). Consequently, there is more differentiation in the east/west direction than in the north/south direction. This pattern can be observed in the PCA plot, as noted by [Wang et al., 2012], where populations along the coast cluster closer together, inland populations form more isolated clusters and the E/W-associated principal component explains twice as much variation as the N/S-associated one. On the other hand, the four Bantu speaking groups at the southern tip cluster together in the PCA plot but not in the effective migration surface. However, this might be the result of lower geographic resolution in that region: Pedi and Sotho/Tswana are assigned to the same deme, and similarly, Nguni and Xhosa are assigned to another deme. The first pair has lower genetic differentiation than the second [$F_{ST}(\text{Pedi}, \text{Sotho/Tswana}) = 0.0012$, $F_{ST}(\text{Nguni}, \text{Xhosa}) = 0.0019$]. [We use the program GENEPOP [Rousset, 2008] to compute pairwise $F_{ST}$s.]
These patterns are present, to some extent, in the PCA plot and the distance scatterplot. But they are easy to observe only if we categorize the locations and color the points appropriately. In contrast, the pattern is clear after the analysis of effective migration.
6.4 Arabidopsis thaliana in Europe and North America
Arabidopsis thaliana is a small flowering plant and a commonly studied model organism in population genetics. It has a broad natural range (Europe, Asia and North Africa) and now grows in North America as well. Although A. thaliana is a selfing plant with low gene flow, its genetic variation has significant spatial structure [Nordborg et al., 2005, Platt et al., 2010]. On the continental scale, in Europe there is broad isolation by distance with an east-west gradient that has been interpreted as evidence for post-glaciation colonization [Nordborg et al., 2005]. On the other hand, in North America there is genome-wide linkage disequilibrium and haplotype sharing that have been interpreted as evidence for recent human introduction from Europe [Nordborg et al., 2005].
A large geographically referenced dataset is available from the Regional Mapping (RegMap) project [Horton et al., 2012]. We split the full dataset (1193 accessions, ~220,000 SNPs) into two subsets, North America (180 plants) and Europe (823 plants), which we analyze both separately and together. [We exclude plants from Asia because the continent is very sparsely sampled.]
Figure 6.18: Sample configuration and PCA analysis of Arabidopsis thaliana data from the RegMap project; a) North America, b) Europe.
First we perform principal components analysis [Figure 6.18]. As we would expect if A. thaliana has a different history in Europe and North America, there are differences in genetic variation on the two continents [Nordborg et al., 2005]. There is little population structure in the North American subset: the samples are separated to some extent in a north/south direction, with no obvious separation between samples from around Lake Michigan and those from the Atlantic coast despite the geographic distance. On the other hand, the population structure in the European subset is continuous, with some correspondence between genetic variation and geographic distribution, as we would expect under isolation by distance [Platt et al., 2010].
Next we apply our method to estimate the effective migration for A. thaliana in North America and Europe [Figure 6.19]. In North America, the two sampled regions (Lake Michigan and the Atlantic coast) are connected by a strip of high effective migration. This indicates that the regions are similar genetically even though they are far apart in space. [This is the opposite of what we expect under isolation by distance.] There is an area of higher effective migration at the south tip of Lake Michigan where most of the North American samples are collected. Therefore, our results are consistent with the observation that there is extensive haplotype sharing (which indicates identity by descent) not only within but also between sampling locations [Nordborg et al., 2005].
Figure 6.19: Inferred effective migration surfaces for Arabidopsis thaliana from two RegMap subsets; a) North America; b) Europe; c) North America and Europe combined.
In continental Europe, the overall pattern is broad isolation by distance, with small variability in effective migration rates. On the other hand, the north of the British Isles is separated from the rest of Britain, which in turn is connected to northern France by an area of high migration. Our results are consistent with previous studies of the population structure of A. thaliana. [Platt et al., 2010] find that in Eurasia there is a strong trend of isolation by distance (at three distance scales) while in North America there is no relationship between geographic distance and allelic similarity (except at fine distance scale). And [Horton et al., 2012] observe that in the PCA plot most accessions from the British Isles are projected closest to France but some plants from Britain cluster with lines from Sweden. Our method summarizes and visualizes these patterns.
Figure 6.20: Genetic differentiation (linearized $F_{ST}$) versus resistance distance ($R_{\alpha\beta}$) with either uniform migration rates or estimated effective migration rates on the population graph $(V, E)$. The samples are assigned to a regular triangular grid [because the designation of nations as populations is not relevant] and the colors are chosen to emphasize interesting patterns [that correspond to regions with strong deviation from isolation by distance].
In the combined analysis, the overall pattern is a prominent corridor of high effective migration across the Atlantic ocean. While effective migration rates are symmetric, this supports the hypothesis that recent directed migration introduced A. thaliana from Europe to the New World. On this larger scale, effective migration rates within North America and Europe are low [as we would expect in a plant species with low gene flow].
[Platt et al., 2010] considers two models of the continuous population structure of A. thaliana (a model with constant uniform migration and another with uniform migration and a single shift in dispersal rate) and concludes that neither version is a good fit for the observed patterns of genetic dissimilarities. That is, even though the overall pattern is consistent with isolation by distance, there might be deviations from uniform migration (or too much noise). We can observe such complex details in the effective migration surface, e.g., the British Isles in Figure 6.19. When we combine the two samples the strongest signal in the data is the genetic similarity between the two continents even though they are separated by the expanse of the Atlantic ocean. This illustrates how we can observe finer details at smaller scales because effective migration rates are parametrized relative to the overall mean log rate.
6.5 Conclusion
Genetic variation in natural populations often exhibits spatial structure as genetic similarity tends to decay with geographic distance. However, this relationship is often not homogeneous and the distribution of similarities across the habitat contains information about the evolutionary and ecological history of the population.
Visualization is an important tool for detecting and understanding patterns of population structure. We have developed a model-based method for estimating and visualizing effective migration to explain observed deviations from homogeneous dispersal and isolation by distance. It represents the population as a triangular grid of discrete components and its effective migration as a colored partition of a two-dimensional map.
Our method is particularly useful for characterizing continuous population structure (even though the underlying stepping-stone model is discrete) because it models a spatially distributed population instead of a collection of distinct and isolated subpopulations. Neither is it necessary to categorize samples into regional groups to compute measures of differentiation such as $F_{ST}$. [Nevertheless, assigning colors and names to groups of genetically similar individuals, in areas of high effective migration, can be helpful for subsequent analysis.] If spatial structure is continuous, clustering samples into biogeographic regions might not be a well-defined problem. In this scenario cluster-based methods such as STRUCTURE and GENELAND may be inappropriate.
Our method also offers some advantages over PCA analysis. PCA produces two-dimensional visual summaries of observed genetic variation and can capture both continuous and discrete population structure at the sample level. In contrast, our method produces a visual representation of geographic and genetic information at the population level. Consequently, it is easier to make qualitative comparisons between populations or between geographic regions, in terms of both geographic and genetic distances. And our method can detect deviations from uniform migration (and hence, isolation by distance) that are not clearly evident in PC projections because PCA is strongly affected by sampling bias and does not estimate relevant demographic parameters.
7
Appendices
7.1 Properties of the stepping-stone model of population subdivision
The goal of this section is to derive the system of linear equations (2.15) for the expected pairwise coalescence times $T = (T_{\alpha\beta})$ as a function of the migration rates $M = (m_{\alpha\beta})$ and the coalescence rates $q = (q_\alpha)$ in the stepping-stone model.
7.1.1 Probabilities of identity by descent
In population genetics, the probability of identity is a measure of relatedness due to shared ancestry. The concept of identity can be defined as either the event that the lineages have the same ancestor in a reference population at a specified time in the past or the event that no mutations have occurred since the lineages diverged from their most recent common ancestor. We use the second definition of identity known as identity by state [versus identity by descent].
Let $\phi_{\alpha\beta}(\theta)$ be the probability of identity by descent for two distinct lineages drawn at random from demes $\alpha$ and $\beta$. The parameter $\theta = 2N_0 u$ is the mutation rate per $2N_0$ generations for a single lineage, or equivalently, the total mutation rate per $N_0$ generations for a pair of lineages.
To derive expressions for $\phi_{\alpha\beta}(\theta)$ for every pair $(\alpha, \beta)$, consider the history of a sample of size 2 backwards in time. Let $x(t) = \{x_\alpha^{(t)}\}$ be the state of the ancestral process $t$ generations ago when the sample has $x_\alpha^{(t)}$ ancestors in deme $\alpha$. It is convenient to consider time in units of $N_0$ generations. On this timescale and under certain assumptions about reproduction and migration, the discrete-time ancestral process $\{x(t) : t = 0, 1, \dots\}$ converges to a continuous-time ancestral process $\{x(t) : t \ge 0\}$, called the structured coalescent [Notohara, 1990, 1993]. Mutations are generated by a Poisson process with intensity $\theta$ such that in $t$ units of time a lineage accumulates $K \sim \mathrm{Po}(\theta t)$ mutations.
To derive the probability of identity for the pair $(\alpha, \beta)$, consider the first event that results in a change of state. The initial state is $x(0) = \{x_\alpha^{(0)} = 1, x_\beta^{(0)} = 1, x_\gamma^{(0)} = 0 : \gamma \ne \alpha, \gamma \ne \beta\}$. If the two lineages are drawn from the same deme, i.e., $\alpha = \beta$, the first event can be a coalescence with rate $q_\alpha$, a mutation with rate $\theta$, or a migration to deme $\gamma$ with rate $2m_{\alpha\gamma}$. [Since the process starts with two lineages in $\alpha$ and the migration rate from $\alpha$ to $\gamma$ is $m_{\alpha\gamma}$ for a single lineage, the total rate of movement is $2m_{\alpha\gamma}$. Similarly, the combined mutation rate is $\theta$.]
If a mutation occurs, the lineages are no longer identical by descent. Therefore, under equilibrium,
$$\phi_{\alpha\alpha}(\theta) = \frac{q_\alpha}{\theta + q_\alpha + 2m_\alpha} + \sum_{\gamma : \gamma \ne \alpha} \frac{2m_{\alpha\gamma}}{\theta + q_\alpha + 2m_\alpha}\, \phi_{\alpha\gamma}(\theta), \tag{7.1}$$
where $m_\alpha = \sum_{\gamma : \gamma \ne \alpha} m_{\alpha\gamma}$ is the total rate of migration out of $\alpha$. More precisely, since the coalescent proceeds backwards in time, $m_{\alpha\gamma}$ is the rate at which offspring in $\alpha$ have parents from $\gamma$ and $m_\alpha$ is the total rate of "outside-deme" parentage.
When the two lineages are drawn from two different demes, i.e., $\alpha \ne \beta$, they cannot coalesce in a single step. In this case, the first event can be a mutation with rate $\theta$, a migration from deme $\alpha$ to deme $\gamma$ with rate $m_{\alpha\gamma}$, or a migration from deme $\beta$ to deme $\gamma$ with rate $m_{\beta\gamma}$. Under equilibrium,
$$\phi_{\alpha\beta}(\theta) = \sum_{\gamma : \gamma \ne \alpha} \frac{m_{\alpha\gamma}}{\theta + m_\alpha + m_\beta}\, \phi_{\gamma\beta}(\theta) + \sum_{\gamma : \gamma \ne \beta} \frac{m_{\beta\gamma}}{\theta + m_\alpha + m_\beta}\, \phi_{\alpha\gamma}(\theta). \tag{7.2}$$
Equations (7.1) and (7.2) represent a system of linear equations for the probabilities of identity by descent in terms of the mutation rate $\theta$, the coalescence rates $q_\alpha$ and the migration rates $m_{\alpha\beta}$. In matrix notation,
$$\operatorname{diag}\{q\}\,[\operatorname{diag}\{\Phi\} - I] = [M - (\theta/2) I]\,\Phi + \Phi\,[M - (\theta/2) I]'. \tag{7.3}$$
Here $M = (m_{\alpha\beta})$ is the infinitesimal generator of the migration process with diagonal entries $-m_\alpha = -\sum_{\gamma : \gamma \ne \alpha} m_{\alpha\gamma}$, $\Phi \equiv \Phi(\theta) = (\phi_{\alpha\beta}(\theta))$ is the matrix of probabilities of identity at fixed mutation rate $\theta$, and $q = (q_\alpha)$ is the vector of coalescence rates.
7.1.2 Expected pairwise coalescence times
A linear system for the expected pairwise coalescence times can be derived correspondingly. Since by definition $\phi$ is the probability that no mutation occurs in either lineage before coalescence at time $t$,
$$\phi(\theta) = P\{K = 0\} = E\{e^{-\theta t}\}. \tag{7.4}$$
That is, the probability of identity $\phi$ is the Laplace transform of the coalescence time $t$ [Hudson, 1990]. Therefore,
$$E\{t\} = -\phi'(0) \quad \text{where} \quad \phi' = \frac{\partial}{\partial\theta}\phi. \tag{7.5}$$
To obtain a system for the expected coalescence times, differentiate equations (7.1) and (7.2) with respect to the mutation rate $\theta$ and evaluate at $\theta = 0$ (using $\phi_{\alpha\beta}(0) = 1$ and $\phi'_{\alpha\beta}(0) = -T_{\alpha\beta}$). The result is
$$1 = (q_\alpha + 2m_\alpha)\,T_{\alpha\alpha} - \sum_{\gamma : \gamma \ne \alpha} 2m_{\alpha\gamma} T_{\alpha\gamma} \quad \text{and} \tag{7.6a}$$
$$1 = (m_\alpha + m_\beta)\,T_{\alpha\beta} - \sum_{\gamma : \gamma \ne \alpha} m_{\alpha\gamma} T_{\gamma\beta} - \sum_{\gamma : \gamma \ne \beta} m_{\beta\gamma} T_{\alpha\gamma}. \tag{7.6b}$$
Equivalently, in matrix notation,
$$\operatorname{diag}\{q\}\operatorname{diag}\{T\} - MT - TM' = 1 1'. \tag{7.7}$$
[Here $1 1'$ is a $d \times d$ matrix of 1s.] This method for deriving equations (7.6a) and (7.6b) is developed in [Bahlo and Griffiths, 2001]. Alternatively, [Hey, 1991] constructs a Markov chain with $d(d+1)/2$ non-absorbing states for each unique pair $(\alpha, \beta)$. The set of states includes $d$ homoallelic states, when the two lineages are in the same deme, and $d(d-1)/2$ heteroallelic states, when two lineages are in different demes. There is also an absorbing state, which corresponds to coalescence. Transition probabilities between all these states reflect the migration rates $m_{\alpha\beta}$ and coalescence rates $q_\alpha$.
Furthermore, since the population evolves under equilibrium, migration is conservative and $M'q^{-1} = 0$ by definition, where $q^{-1}$ is the vector with entries $1/q_\alpha = N_\alpha/N_0$. If we multiply equation (7.7) by $(q^{-1})'$ on the left and by $q^{-1}$ on the right, we obtain
$$1'\operatorname{diag}\{T\}\,q^{-1} = (1'q^{-1})^2 \iff \sum_\alpha T_{\alpha\alpha}(N_\alpha/N_0) = \Big(\sum_\alpha N_\alpha/N_0\Big)^2,$$
$$T_0 \equiv \sum_\alpha T_{\alpha\alpha}(N_\alpha/N_T) = N_T/N_0 = d, \tag{7.8}$$
where $T_0$ is the [weighted] average within-deme coalescence time, $N_T = \sum_\alpha N_\alpha$ is the total population size and $N_0 = N_T/d$ is the coalescent timescale. Therefore, under conservative migration, the average within-deme coalescence time does not depend on the exact pattern and rates of migration [Strobeck, 1987]. If migration is isotropic (a much stronger assumption), the within-deme coalescence times $T_{\alpha\alpha}$ for all demes $\alpha$ do not depend on the migration process.
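The linear system (7.7) and the invariance result (7.8) can be checked numerically. The following is a minimal sketch, not the implementation used in this work: it vectorizes equation (7.7) with Kronecker products and solves for the matrix of expected coalescence times. The function name, the three deme sizes and the symmetric flux matrix `F` are illustrative assumptions; the flux construction is simply one convenient way to obtain conservative migration.

```python
import numpy as np

def expected_coalescence_times(M, q):
    """Solve diag{q} diag{T} - M T - T M' = 11' (equation 7.7) for T.

    M : (d, d) generator of the migration process (rows sum to zero).
    q : (d,) vector of coalescence rates.
    """
    d = len(q)
    I = np.eye(d)
    # Row-major vectorization: vec(M T) = kron(M, I) vec(T)
    # and vec(T M') = kron(I, M) vec(T).
    A = -np.kron(M, I) - np.kron(I, M)
    diag_idx = np.arange(d) * (d + 1)      # flat positions of the entries T[a, a]
    A[diag_idx, diag_idx] += q             # the term q_a * T_aa in the (a, a) equation
    return np.linalg.solve(A, np.ones(d * d)).reshape(d, d)

# Hypothetical example: three demes of unequal size with conservative migration.
N = np.array([1.0, 2.0, 3.0])              # deme sizes (illustrative values)
d = len(N)
N0 = N.sum() / d                           # coalescent timescale N_0 = N_T / d
q = N0 / N                                 # coalescence rates q_a = N_0 / N_a
F = np.array([[0.0, 0.5, 0.2],
              [0.5, 0.0, 0.8],
              [0.2, 0.8, 0.0]])            # symmetric flux, so N_a m_ab = N_b m_ba
M = F / N[:, None]                         # off-diagonal rates m_ab
M -= np.diag(M.sum(axis=1))                # diagonal entries -m_a

T = expected_coalescence_times(M, q)
T0 = np.sum(np.diag(T) * N) / N.sum()      # weighted average within-deme time (7.8)
```

Under these assumptions `T0` should agree with the number of demes regardless of the values placed in `F`, which is the content of equation (7.8).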
7.2 Distance matrices
Here we discuss distance matrices, also called dissimilarity matrices, and review some relevant properties. Two examples of a distance matrix are the matrix of expected pairwise coalescence times, $T$, and the matrix of effective resistance distances, $R$.
First we state two equivalent definitions of a distance matrix.
Definition 7.1 The matrix $D = (d_{ij}^2)$ is a distance matrix if there exist squared lengths $\ell = (\ell_i^2) \in \mathbb{R}^n_+$ such that
$$\ell 1' + 1 \ell' - D \succeq 0. \tag{7.9}$$
Definition 7.2 The matrix $D = (d_{ij}^2)$ is a distance matrix if there exist pairwise similarities $S = (s_{ij}) \in \mathbb{S}^n_+$ such that
$$d_{ij}^2 = s_{ii} + s_{jj} - 2s_{ij}. \tag{7.10}$$
[$\mathbb{S}^n$ is the set of symmetric $n \times n$ matrices; $\mathbb{S}^n_+$ is the set of symmetric $n \times n$ matrices with nonnegative elements.]
Let $X = (x_1, \dots, x_n)' \in \mathbb{R}^{n \times p}$ represent $n$ points in $p$-dimensional inner product space. For example, in the setting of analyzing population structure, $x_i$ is a genotype vector of $p$ polymorphic sites. Then the squared distance between points $i$ and $j$ is given by
$$d_{ij}^2 = \langle x_i - x_j, x_i - x_j \rangle = \langle x_i, x_i \rangle + \langle x_j, x_j \rangle - 2\langle x_i, x_j \rangle \equiv \ell_i^2 + \ell_j^2 - 2s_{ij}, \tag{7.11}$$
where $s_{ij} = \langle x_i, x_j \rangle$ is the inner product between two vectors in $\mathbb{R}^p$, and $S = XX'$ is positive semidefinite as a Gram matrix. In matrix notation,
$$D = \operatorname{diag}\{S\}\,1' + 1\operatorname{diag}\{S\}' - 2S. \tag{7.12}$$
Clearly the similarity matrix $S$ contains more information about $X$ than the dissimilarity matrix $D$: $\operatorname{diag}\{S\} = \ell$ while $\operatorname{diag}\{D\} = 0$. [$D$ is nonnegative with 0s on the main diagonal because the dissimilarity of a point with itself is trivially 0.] That is, $S$ captures the absolute position of each point in the space (the length $\ell_i$ is the distance to the center $o$) while $D$ reflects only the relative difference for each pair of points.
Theorem 7.1 The matrix $D \in \mathbb{D}^n$ is a distance matrix if and only if it is conditionally negative definite.
Sketch of proof.
• Suppose that $D$ is a distance matrix. For every vector $\alpha \in \mathbb{R}^n$ such that $1'\alpha = 0$ (that is, $\alpha$ is a contrast)
$$0 \le 2\alpha' S \alpha = \alpha'(\ell 1' + 1\ell' - D)\alpha \tag{7.13a}$$
$$= \alpha'\ell\,(1'\alpha) + (\alpha' 1)\,\ell'\alpha - \alpha' D \alpha = -\alpha' D \alpha, \tag{7.13b}$$
where $2S = \ell 1' + 1\ell' - D$ by equation (7.12). Therefore, $D$ is conditionally negative definite.
• Suppose that $D$ is conditionally negative definite. Choose a vector $w \in \mathbb{R}^n$ such that $1'w = 1$ and define
$$P = I - 1w', \tag{7.14a}$$
$$S = -\tfrac{1}{2} P D P' = -\tfrac{1}{2}(I - 1w')\,D\,(I - w1'). \tag{7.14b}$$
Then $P$ is a centering matrix such that
$$PP = I - 1w' - 1w' + 1(w'1)w' = I - 1w' = P, \tag{7.15a}$$
$$w'Px = w'x - (w'1)\,w'x = w'x - w'x = 0 \tag{7.15b}$$
for every $x \in \mathbb{R}^n$. That is, $P$ is an orthogonal projection onto the hyperplane $\{w\}^\perp$. Furthermore, $(I - w1')x$ is a contrast [that is, $1'(I - w1')x = 0$] and
$$x'(I - 1w')\,D\,(I - w1')x = -2x'Sx \le 0 \tag{7.16}$$
since $D$ is conditionally negative definite. Therefore, $S$ is a positive semidefinite matrix and it has a decomposition $S = YY'$. It is straightforward to show that
$$D = \operatorname{diag}\{-\tfrac{1}{2}PDP'\}\,1' + 1\operatorname{diag}\{-\tfrac{1}{2}PDP'\}' + PDP'. \tag{7.17}$$
[Using equation (7.12) the $(i,j)$-th element is
$$-\tfrac{1}{2}e_i'(PDP')e_i - \tfrac{1}{2}e_j'(PDP')e_j + e_i'(PDP')e_j$$
$$= -\tfrac{1}{2}[w'Dw - w'De_i - e_i'Dw] - \tfrac{1}{2}[w'Dw - w'De_j - e_j'Dw] + w'Dw - w'De_j - e_i'Dw + D_{ij} = D_{ij},$$
where $D_{ii} = 0$ and $e_i$ is the $i$-th standard basis vector.]
That is, the vectors $Y = (y_1, \dots, y_n)'$ generate the distance matrix $D$. However, note that the similarity matrix $S$ depends on the choice of $w$. It is not surprising that $D$ does not determine $S$ (nor $Y$) uniquely since it contains information only about relative differences. [The vector $w$ determines the position of the origin $o$. The condition $1'w = 1$ implies that $w$ is a vector of weights.] For example, $w = 1/n$ corresponds to placing the origin at the centroid (the center of mass) $1'Y/n = \bar{y}$ and $w = e_i$ corresponds to placing it at the $i$-th point $e_i'Y = y_i$. Different decompositions $S = YY'$ give different orientations about the origin $o$.
□
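The construction in the second half of the proof is the same double-centering step used in classical multidimensional scaling. The sketch below is an illustration under stated assumptions (random points, helper names chosen here for readability): it recovers a similarity matrix from a squared-distance matrix as in (7.14), rebuilds the distances with (7.12), and shows that different weight vectors give different similarity matrices for the same distances.

```python
import numpy as np

def similarity_from_distance(D, w=None):
    """Return S = -1/2 P D P' with P = I - 1 w' (equation 7.14)."""
    n = D.shape[0]
    w = np.full(n, 1.0 / n) if w is None else np.asarray(w, dtype=float)
    P = np.eye(n) - np.outer(np.ones(n), w)
    return -0.5 * P @ D @ P.T

def distance_from_similarity(S):
    """Return D = diag{S} 1' + 1 diag{S}' - 2 S (equation 7.12)."""
    s = np.diag(S)
    return s[:, None] + s[None, :] - 2.0 * S

def embedding(S, tol=1e-10):
    """Return Y with S = Y Y', keeping eigenvalues above a numerical tolerance."""
    vals, vecs = np.linalg.eigh(S)
    keep = vals > tol
    return vecs[:, keep] * np.sqrt(vals[keep])

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))                          # six illustrative points in R^3
D = distance_from_similarity(X @ X.T)                # squared Euclidean distances

S_centroid = similarity_from_distance(D)             # origin at the centroid, w = 1/n
S_first = similarity_from_distance(D, np.eye(6)[0])  # origin at the first point, w = e_1
Y = embedding(S_centroid)                            # coordinates that generate D
```

Both `S_centroid` and `S_first` map back to the same `D`, which illustrates that the distance matrix carries information only about relative differences.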
Now we consider the special case when the lengths $\ell_i$ are all equal to $r$ for some $r > 0$ and thus the points $x_i$ are the same distance from the center $o$. [A covariance matrix is a circumhypersphere with radius $\sigma^2$ and a correlation matrix, with radius 1.] Geometrically, the points lie on the circumference of a sphere with radius $r$ in $\mathbb{R}^p$ [Gower, 1985] and so $D$ is called a spherical distance metric. This puts a constraint on the choice of $w$. In general,
$$\operatorname{diag}\{S\} = \tfrac{1}{2}\operatorname{diag}\{1w'D + Dw1' - 1w'Dw1' - D\} \tag{7.18a}$$
$$= Dw - \tfrac{1}{2}(w'Dw)\,1. \tag{7.18b}$$
[Here $\operatorname{diag}\{D\} = 0$ and $\operatorname{diag}\{w1'\} = w$.]
If $\ell = r^2 1$, then $\operatorname{diag}\{S\} = \ell = r^2 1$. Therefore,
$$Dw - \tfrac{1}{2}(w'Dw)\,1 = r^2 1. \tag{7.19}$$
If $D$ is nonsingular, $D^{-1}$ exists and we can left-multiply by $D^{-1}$. Then
$$w = (\tfrac{1}{2}w'Dw + r^2)\,D^{-1}1. \tag{7.20}$$
Recall that $w$ satisfies $w'1 = 1$, so that
$$1'w = (\tfrac{1}{2}w'Dw + r^2)\,1'D^{-1}1 = 1. \tag{7.21}$$
This implies $1'D^{-1}1 \ne 0$ and
$$w = \frac{D^{-1}1}{1'D^{-1}1}, \qquad r^2 = \frac{1/2}{1'D^{-1}1}. \tag{7.22}$$
[In this case $P = I - 1 1'D^{-1}/(1'D^{-1}1)$ is the orthogonal projection onto the hyperplane $\{D^{-1}1\}^\perp$.]
Thus we have proved the following
Corollary 7.1 Suppose that $D \in \mathbb{D}^n$ is a distance matrix such that $\det\{D\} \ne 0$. Then $1'D^{-1}1 > 0$.
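Equation (7.22) can be read as a recipe for the circumscribed sphere of a point configuration, using only the squared distances. The sketch below illustrates this for three points on a circle; the angles, radius and center are arbitrary illustrative values, and the function name is a label chosen here.

```python
import numpy as np

def circumsphere(D):
    """Weights w and squared radius r^2 from equation (7.22); D must be nonsingular."""
    ones = np.ones(D.shape[0])
    Dinv1 = np.linalg.solve(D, ones)
    total = ones @ Dinv1                   # the scalar 1' D^{-1} 1
    return Dinv1 / total, 0.5 / total

# Three points on a circle of radius 2 centered at (3, -1).
angles = np.array([0.1, 1.7, 4.0])
center = np.array([3.0, -1.0])
X = center + 2.0 * np.column_stack([np.cos(angles), np.sin(angles)])
D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)   # squared distances

w, r2 = circumsphere(D)
origin = w @ X                             # weighted combination of the points
```

In this example `r2` should recover the squared radius and `origin` the center of the circle, even though neither is used to build `D`.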
7.3 Conditional definite matrices
Here we derive a sufficient condition for positive definiteness of the covariance matrix $\Sigma = 1 1' - c\Delta$ where $\Delta \in \mathbb{D}^n$ is a distance matrix [or more generally, a conditionally negative definite matrix] and $c$ is a positive constant. The derivation is based on [Bapat and Raghavan, 1997] and uses the Spectral Theorem which states that $\Sigma \succ 0$ if and only if all its eigenvalues are positive.
Definition 7.3 A matrix $\Delta \in \mathbb{S}^n$ is conditionally negative definite if
$$\alpha'\Delta\alpha \le 0 \tag{7.23}$$
for all $\alpha \in \mathbb{R}^n$ such that $\sum_i \alpha_i = \alpha'1 = 0$. Thus, conditional negative definiteness is equivalent to negative definiteness on the subspace $\{1\}^\perp$.
Theorem 7.2 A conditionally negative definite (c.n.d.) matrix $\Delta \in \mathbb{S}^n$ has at most one positive eigenvalue.
Sketch of proof. We consider the c.n.d. case. Suppose to the contrary that $\Delta$ has two positive eigenvalues $u_1 > 0$ and $u_2 > 0$ with corresponding eigenvectors $x$ and $y$. Without loss of generality, we can assume that the eigenvectors are normalized so that $\sum_i x_i = \sum_i y_i \iff \sum_i (x_i - y_i) = 0$. That is, $(x - y)'1 = 0$ and $x - y$ is a contrast. Furthermore,
$$(x - y)'\Delta(x - y) = x'\Delta x + y'\Delta y - y'\Delta x - x'\Delta y \tag{7.24a}$$
$$= u_1 x'x + u_2 y'y - u_1 y'x - u_2 x'y \tag{7.24b}$$
$$= u_1 x'x + u_2 y'y > 0 \tag{7.24c}$$
since $u_1, u_2 > 0$. [In the last step, $x \perp y \Rightarrow x'y = 0$.] This contradicts the definition of conditionally negative definite matrices. □
A distance matrix $\Delta$ is nonnegative, with main diagonal of 0s, and is also conditionally negative definite by Theorem 7.1.
Corollary 7.2 Suppose that $\Delta$ is a nonnegative, nonzero symmetric matrix. Then $\Delta$ has at least one positive eigenvalue.
Sketch of proof. Since $\Delta$ is symmetric, by the Spectral Theorem it has real eigenvalues $u = \{u_i\}$. Furthermore, $\operatorname{tr}\{\Delta\} = \sum_{i=1}^n u_i \ge 0$. The trace of $\Delta$ is nonnegative because $\Delta$ is nonnegative; its eigenvalues are not all zero because $\Delta$ is nonzero. Since $\sum_i u_i \ge 0$, at least one of the eigenvalues is positive. □
Corollary 7.3 Suppose that $\Delta$ is a nonnegative, nonzero, conditionally negative definite matrix. Then $\Delta$ has exactly one positive eigenvalue.
Sketch of proof. By Theorem 7.2 $\Delta$ has at most one positive eigenvalue while by Corollary 7.2 it has at least one positive eigenvalue. Therefore, it has exactly one positive eigenvalue. If $\Delta$ is strictly c.n.d., its other $n - 1$ eigenvalues are negative. □
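The eigenvalue pattern in Corollary 7.3 is easy to observe numerically. The sketch below builds a squared Euclidean distance matrix from random points (an illustrative assumption, with the points in general position so that the matrix is strictly c.n.d.) and counts the signs of its eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
X = rng.normal(size=(n, n - 1))            # n points in general position (illustrative)
G = X @ X.T
g = np.diag(G)
Delta = g[:, None] + g[None, :] - 2.0 * G  # squared Euclidean distances

eigenvalues = np.linalg.eigvalsh(Delta)    # real eigenvalues in ascending order
n_positive = int(np.sum(eigenvalues > 1e-10))
n_negative = int(np.sum(eigenvalues < -1e-10))

# Conditional negative definiteness: a' Delta a <= 0 for every contrast a.
a = rng.normal(size=n)
a -= a.mean()
quadratic_form = a @ Delta @ a
```

Here `n_positive` counts the single positive eigenvalue and `n_negative` the remaining ones, while `quadratic_form` evaluates definition (7.23) on one random contrast.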
So far we know that $\Delta$ is both conditionally negative definite and nonnegative, and therefore, it has exactly one positive eigenvalue. On the other hand, $\Sigma = 1 1' - c\Delta$ is conditionally positive definite: for all $\alpha$ such that $\alpha'1 = 0$, we have
$$\alpha'\Sigma\alpha = \alpha'(1 1' - c\Delta)\alpha = (\alpha'1)^2 - c\,(\alpha'\Delta\alpha) \ge 0. \tag{7.25}$$
Therefore, $\Sigma$ has at most one negative eigenvalue. Finally, by the matrix-determinant lemma for a rank-one update,
$$\prod_{i=1}^n u_i^\ast = \det\{\Sigma\} = \Big(1 - \frac{1'\Delta^{-1}1}{c}\Big)\det\{-c\Delta\} = \Big(1 - \frac{1'\Delta^{-1}1}{c}\Big)\prod_{i=1}^n (-c)\,u_i, \tag{7.26}$$
where $u = \{u_i\}$ are the eigenvalues of $\Delta$ and $u^\ast = \{u_i^\ast\}$ are the eigenvalues of $\Sigma$.
To ensure that the $u_i^\ast$s are positive, we use the fact that the product on the left in equation (7.26) has at most one negative term while the product on the right has exactly one negative term. Therefore, a necessary and sufficient condition for $\Sigma \succeq 0$ is that $c$ satisfies
$$1 - \frac{1'\Delta^{-1}1}{c} \le 0. \tag{7.27}$$
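Condition (7.27) says that the covariance matrix stays positive definite as long as the positive constant multiplying the distance matrix is below the scalar obtained by summing all entries of the inverse distance matrix. The sketch below checks this threshold and the determinant identity (7.26) on a random example; the constant is named `c` in the code, and the random points are an illustrative assumption.

```python
import numpy as np

def is_positive_definite(Delta, c):
    """Check Sigma = 11' - c * Delta through its smallest eigenvalue."""
    n = Delta.shape[0]
    Sigma = np.ones((n, n)) - c * Delta
    return np.linalg.eigvalsh(Sigma).min() > 0

rng = np.random.default_rng(2)
n = 5
X = rng.normal(size=(n, n - 1))
G = X @ X.T
g = np.diag(G)
Delta = g[:, None] + g[None, :] - 2.0 * G          # a strictly c.n.d. distance matrix

ones = np.ones(n)
threshold = ones @ np.linalg.solve(Delta, ones)    # the scalar 1' Delta^{-1} 1

# Matrix-determinant lemma (7.26) for a value of c below the threshold.
c = 0.5 * threshold
lhs = np.linalg.det(np.ones((n, n)) - c * Delta)
rhs = (1.0 - threshold / c) * np.linalg.det(-c * Delta)
```

Calling `is_positive_definite(Delta, 0.5 * threshold)` and `is_positive_definite(Delta, 1.5 * threshold)` shows the change of behavior on either side of the threshold.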
7.4 Restricted maximum likelihood (REML) in the general case
Consider the model $Y \sim N(X\beta, \Sigma)$ with design matrix $X$ and covariance matrix $\Sigma$. Let $K \in \mathbb{R}^{n \times p}$ be a basis for the mean space and $L \in \mathbb{R}^{(n-p) \times n}$ be a basis for the residual space. For example, $K = X$ if the design matrix has full rank, or otherwise, $K$ is $p$ linearly independent columns of $X$. By construction $LK = 0$ and $\ker\{L\} = \operatorname{span}\{K\}$. Also let $Q[\Sigma]$ be the unique orthogonal projection with kernel $K$ given by
$$Q = I - K(K'\Sigma^{-1}K)^{-1}K'\Sigma^{-1}. \tag{7.28}$$
[McCullagh, 2009] shows that regardless of the choice of $L$, $Q$ has an equivalent characterization given by
$$Q = \Sigma L'(L\Sigma L')^{-1}L. \tag{7.29}$$
To prove this, it is sufficient to show that
• $\Sigma L'(L\Sigma L')^{-1}L$ is a projection: $QQ = (\Sigma L'(L\Sigma L')^{-1}L)(\Sigma L'(L\Sigma L')^{-1}L) = \Sigma L'(L\Sigma L')^{-1}L = Q$
• $\Sigma L'(L\Sigma L')^{-1}L$ is self-adjoint with respect to the inner product $\langle u, v \rangle = u'\Sigma^{-1}v$: $Q'\Sigma^{-1} = (\Sigma L'(L\Sigma L')^{-1}L)'\Sigma^{-1} = L'(L\Sigma L')^{-1}L = \Sigma^{-1}(\Sigma L'(L\Sigma L')^{-1}L) = \Sigma^{-1}Q$
• $\ker\{\Sigma L'(L\Sigma L')^{-1}L\} = \operatorname{span}\{K\}$: $QK = (\Sigma L'(L\Sigma L')^{-1}L)K = \Sigma L'(L\Sigma L')^{-1}(LK) = 0$
The orthogonal projection with kernel $K$ (i.e., the orthogonal projection onto the residual space) is unique, so (7.28) = (7.29).
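The equivalence of (7.28) and (7.29), and the fact that the result does not depend on the particular basis of the residual space, can be confirmed on a small random example. This is only a sketch: the dimensions, the random matrices and the use of a singular value decomposition to build the residual basis are choices made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 7, 2
K = rng.normal(size=(n, p))                # basis for the mean space (illustrative)
A = rng.normal(size=(n, n))
Sigma = A @ A.T + n * np.eye(n)            # a positive definite covariance matrix

# Rows of L span the orthogonal complement of span{K}, so that L K = 0.
U, _, _ = np.linalg.svd(K, full_matrices=True)
L = U[:, p:].T

Sinv = np.linalg.inv(Sigma)
Q_mean = np.eye(n) - K @ np.linalg.solve(K.T @ Sinv @ K, K.T @ Sinv)   # (7.28)
Q_resid = Sigma @ L.T @ np.linalg.solve(L @ Sigma @ L.T, L)            # (7.29)

# Any other basis of the residual space gives the same projection.
L_other = rng.normal(size=(n - p, n - p)) @ L
Q_other = Sigma @ L_other.T @ np.linalg.solve(L_other @ Sigma @ L_other.T, L_other)
```

The three matrices `Q_mean`, `Q_resid` and `Q_other` should coincide up to rounding error.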
To rewrite the Wishart log-likelihood in equation (4.13), we derive the following expressions involving $\Sigma$, $Q$, $L$ and $K$.
$$\det\{L\Sigma L'\}^{-1}\det\{LL'\} = \det\{(L\Sigma L')^{-1}(LL')\} = \operatorname{Det}\{L'(L\Sigma L')^{-1}L\} = \operatorname{Det}\{\Sigma^{-1}\Sigma L'(L\Sigma L')^{-1}L\} = \operatorname{Det}\{\Sigma^{-1}Q\} \tag{7.30}$$
where the standard determinant, denoted by det, is the product of all eigenvalues and the generalized determinant, denoted by Det, is the product of the nonzero eigenvalues. The second equality holds because $(L\Sigma L')^{-1}LL'$ is $(n-p) \times (n-p)$ with $n - p$ nonzero eigenvalues and $L'(L\Sigma L')^{-1}L$ is $n \times n$ with $n - p$ nonzero eigenvalues and the two matrices have the same nonzero eigenvalues:
If $(L\Sigma L')^{-1}LL'u = \lambda u$, then $L'(L\Sigma L')^{-1}L\,(L'u) = \lambda\,(L'u)$.
If $L'(L\Sigma L')^{-1}Lv = \lambda v$, then $LL'(L\Sigma L')^{-1}Lv = \lambda Lv$, so $(L\Sigma L')^{-1}Lv = \lambda\,(LL')^{-1}Lv$ and therefore
$(L\Sigma L')^{-1}(LL')\,(LL')^{-1}Lv = \lambda\,(LL')^{-1}Lv$.
Similarly,
$$\det\{K'\Sigma^{-1}K\}^{-1}\det\{K'K\} = \operatorname{Det}\{\Sigma(I - Q)'\}. \tag{7.31}$$
Following [Verbyla, 1990], let $A = [K, L']$. Using both characterizations of the projection $Q$ and the formula for the determinant of a block matrix,
$$\det\{A'\Sigma A\} = \det\begin{pmatrix} K'\Sigma K & K'\Sigma L' \\ L\Sigma K & L\Sigma L' \end{pmatrix}$$
$$= \det\{L\Sigma L'\}\det\{K'\Sigma K - K'\Sigma L'(L\Sigma L')^{-1}L\Sigma K\}$$
$$= \det\{L\Sigma L'\}\det\{K'[I - \Sigma L'(L\Sigma L')^{-1}L]\Sigma K\}$$
$$= \det\{L\Sigma L'\}\det\{K'[I - Q]\Sigma K\}$$
$$= \det\{L\Sigma L'\}\det\{K'[K(K'\Sigma^{-1}K)^{-1}K'\Sigma^{-1}]\Sigma K\}$$
$$= \det\{L\Sigma L'\}\det\{K'K(K'\Sigma^{-1}K)^{-1}K'K\}$$
$$= \det\{L\Sigma L'\}\det\{K'K\}\det\{K'\Sigma^{-1}K\}^{-1}\det\{K'K\};$$
$$\det\{A'A\} = \det\begin{pmatrix} K'K & K'L' \\ LK & LL' \end{pmatrix} = \det\begin{pmatrix} K'K & 0 \\ 0 & LL' \end{pmatrix} = \det\{K'K\}\det\{LL'\}.$$
Since $A$ is full-rank by construction,
$$\det\{\Sigma\} = \frac{\det\{A'\Sigma A\}}{\det\{A'A\}} = \frac{\det\{K'K\}\det\{L\Sigma L'\}}{\det\{LL'\}\det\{K'\Sigma^{-1}K\}}. \tag{7.32}$$
Finally, by applying first (7.30) and then (7.31),
$$\det\{\Sigma\} = \det\{K'K\}\,[\det\{K'\Sigma^{-1}K\}\operatorname{Det}\{\Sigma^{-1}Q\}]^{-1} \tag{7.33a}$$
$$= \det\{LL'\}^{-1}\det\{L\Sigma L'\}\operatorname{Det}\{\Sigma(I - Q)'\}. \tag{7.33b}$$
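The determinant identities (7.30) and (7.32) can also be verified numerically. In the sketch below the generalized determinant is computed as the product of the eigenvalues whose magnitude exceeds a small tolerance; the tolerance, the dimensions and the random matrices are assumptions made for illustration.

```python
import numpy as np

def generalized_det(A, tol=1e-10):
    """Product of the nonzero eigenvalues of a (numerically) symmetric matrix."""
    vals = np.linalg.eigvalsh((A + A.T) / 2.0)
    return np.prod(vals[np.abs(vals) > tol])

rng = np.random.default_rng(4)
n, p = 6, 2
K = rng.normal(size=(n, p))
U, _, _ = np.linalg.svd(K, full_matrices=True)
L = U[:, p:].T                              # residual basis with L K = 0
A = rng.normal(size=(n, n))
Sigma = A @ A.T + n * np.eye(n)
Sinv = np.linalg.inv(Sigma)
Q = Sigma @ L.T @ np.linalg.solve(L @ Sigma @ L.T, L)

det = np.linalg.det
lhs_730 = det(L @ L.T) / det(L @ Sigma @ L.T)
rhs_730 = generalized_det(Sinv @ Q)
lhs_732 = det(Sigma)
rhs_732 = det(K.T @ K) * det(L @ Sigma @ L.T) / (det(L @ L.T) * det(K.T @ Sinv @ K))
```

The pairs `lhs_730`, `rhs_730` and `lhs_732`, `rhs_732` should agree up to rounding error.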
7.4.1 Restricted maximum likelihood (REML) in a special case
Rather than a general covariance matrix $\Sigma$, our model for population structure in terms of distances on a population graph specifies
$$\Sigma = 1 1' - c\Delta, \tag{7.34}$$
where $\Delta$ is a conditionally negative definite matrix such that $1'\Delta^{-1}1 = 1$. The normalization simplifies notation and defines equivalence classes $\{\Delta_\star : (1'\Delta_\star^{-1}1)\,\Delta_\star = \Delta\}$. It is also convenient because under this parametrization $c \in (0, 1)$ is a sufficient condition for $\Sigma \succ 0$, as we show in Appendix 7.3.
Because the covariance matrix has the form (7.34) we can avoid explicitly constructing $\Sigma$ and instead work with $\Delta$. Using the Sherman-Morrison formula for the inverse of a rank-one update,
$$\Sigma^{-1} = -\frac{1}{c}\Big(\Delta^{-1} - \frac{\Delta^{-1}1 1'\Delta^{-1}}{1 - c}\Big) \tag{7.35a}$$
$$\Sigma^{-1}1 = -\frac{1}{c}\Big(\Delta^{-1}1 - \frac{\Delta^{-1}1}{1 - c}\Big) = \frac{1}{1 - c}\,\Delta^{-1}1 \tag{7.35b}$$
$$1'\Sigma^{-1}1 = \frac{1}{1 - c} \tag{7.35c}$$
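The orthogonal projection $Q = Q[\Sigma]$ with kernel $K = 1$ is given by
$$Q = I - \frac{1 1'\Sigma^{-1}}{1'\Sigma^{-1}1} = I - 1 1'\Delta^{-1} \tag{7.36a}$$
$$\Sigma^{-1}Q = -\frac{1}{c}\,\Delta^{-1}\Big(I - \frac{1}{1 - c}\,1 1'\Delta^{-1}\Big)Q = -\frac{1}{c}\,\Delta^{-1}Q \tag{7.36b}$$
The projection matrix $Q$ is not symmetric in general but for every $c$,
$$Q'\Sigma^{-1} = Q'\Sigma^{-1}Q = \Sigma^{-1}Q \quad \text{and} \quad Q'\Delta^{-1} = Q'\Delta^{-1}Q = \Delta^{-1}Q. \tag{7.37}$$
That is, $\Sigma^{-1}Q$ and $\Delta^{-1}Q$ are symmetric.
Now let's express $\operatorname{Det}\{\Sigma^{-1}Q\}$ as a function of $\operatorname{Det}\{\Delta^{-1}Q\}$ where the generalized determinant Det is the product of the nonzero eigenvalues. Since
$$\Delta^{-1}Qv = \alpha v \iff \Sigma^{-1}Qv = -\frac{1}{c}\,(\alpha v), \tag{7.38}$$
the two generalized eigenvalue problems are equivalent up to a proportionality constant and
$$\operatorname{Det}\{\Sigma^{-1}Q\} = \Big(\frac{1}{c}\Big)^{n-1}\operatorname{Det}\{-\Delta^{-1}Q\} = \operatorname{Det}\{-\Delta^{-1}Q/c\}, \tag{7.39}$$
where $\operatorname{rank}\{\Sigma^{-1}Q\} = \operatorname{rank}\{-\Delta^{-1}Q\} = n - 1$. Finally, we apply equation (7.32) with $\Sigma = S$ and $K = 1$ to obtain
$$\det\{LSL'\} = \det\{S\}\det\{1'S^{-1}1\}\,\frac{\det\{LL'\}}{\det\{1'1\}}, \tag{7.40}$$
and equation (7.30) with $\Sigma = 1 1' - c\Delta$ and $L\Sigma L' = -c\,(L\Delta L')$ to obtain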
det { โ (๐ฟฮ๐ฟโฒ)โ1/๐โ} = Det {ฮฃโ1๐/๐2}/ det {๐ฟ๐ฟโฒ}= Det { โ ฮโ1๐/๐โ}/ det {๐ฟ๐ฟโฒ} (7.41a)
tr { โ (๐ฟฮ๐ฟโฒ)โ1๐ฟ๐๐ฟโฒ/๐โ} = tr { โ (๐ฟโฒ(๐ฟฮ๐ฟโฒ)โ1๐ฟ)๐/๐โ}= tr { โ ฮโ1ฮ(๐ฟโฒ(๐ฟฮ๐ฟโฒ)โ1๐ฟ)๐/๐โ}= tr { โ (ฮโ1๐)๐/๐โ}. (7.41b)
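The closed-form expressions (7.35) and (7.36) for this special covariance structure can be checked against a direct matrix inversion. In the sketch below the scalar that multiplies the distance matrix is named `c`, the distance matrix is built from random points and rescaled so that the sum of the entries of its inverse equals one; all numerical values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5
X = rng.normal(size=(n, n - 1))
G = X @ X.T
g = np.diag(G)
Delta = g[:, None] + g[None, :] - 2.0 * G      # strictly c.n.d. distance matrix
ones = np.ones(n)
Delta *= ones @ np.linalg.solve(Delta, ones)   # normalize so that 1' Delta^{-1} 1 = 1

c = 0.3                                        # any value in (0, 1)
J11 = np.outer(ones, ones)                     # the matrix 11'
Sigma = J11 - c * Delta
Dinv = np.linalg.inv(Delta)

Sigma_inv = -(Dinv - Dinv @ J11 @ Dinv / (1.0 - c)) / c    # (7.35a)
Q = np.eye(n) - J11 @ Dinv                                 # (7.36a)
```

With these definitions `Sigma_inv` should match `np.linalg.inv(Sigma)`, and `Sigma_inv @ Q` should equal `-Dinv @ Q / c` as in (7.36b).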
7.5 Efficient computation
Let $R = (R_{\alpha\beta})$ be the matrix of effective resistances between pairs of observed demes $(\alpha, \beta)$ in the population graph $G = (V, E, M)$. From [McRae, 2006] we know that $T_{\alpha\alpha} \approx d$ and $T_{\alpha\beta} \approx d\,(1 + R_{\alpha\beta}/4)$ where $d$ is the number of demes in the population graph and $o$ is the number of observed demes. With this motivation, let
$$(\Delta_{\alpha\beta}) = d\,(1_o 1_o' + (R_{\alpha\beta})/4 - I_o) \tag{7.42}$$
be the matrix of (expected) pairwise distances between observed demes.
If individuals are exchangeable within demes, we can model distances between individuals in terms of distances between demes. For a pair $(i \in \alpha, j \in \beta)$,
$$\Delta_{ij} = d\,(1 + R_{\alpha\beta}/4 - I\{i = j\}). \tag{7.43}$$
Equivalently, in matrix notation,
$$(\Delta_{ij}) = d\,(1_n 1_n' + JRJ'/4 - I_n) \tag{7.44}$$
where $J = (J_{i\alpha}) \in \mathbb{Z}^{n \times o}$ is an indicator matrix such that
$$J_{i\alpha} = \begin{cases} 1 & \text{if } i \in \alpha \\ 0 & \text{if } i \notin \alpha. \end{cases} \tag{7.45}$$
To simplify the notation, we will drop the subscripts and write plainly $1$ for the vector of ones and $I$ for the identity matrix. The dimension will be clear from the context if we keep in mind that $R = (R_{\alpha\beta})$ is an $o \times o$ matrix and $\Delta = (\Delta_{ij})$ is an $n \times n$ matrix.
To evaluate the Wishart log-likelihood in equation (4.13) we need to compute the terms $\operatorname{tr}\{\Delta^{-1}QS\}$ and $\operatorname{Det}\{-\Delta^{-1}Q\}$ where
$$Q = I - \frac{1 1'\Delta^{-1}}{1'\Delta^{-1}1} \tag{7.46}$$
is a projection matrix, which removes the common mean, and $S$ is the observed similarity matrix. We also standardize the distance matrix $\Delta$ so that $1'\Delta^{-1}1 = 1$. With this normalization, multiplying $\Delta$ by a (positive) constant has no effect on the product $\Delta^{-1}Q$, so we can ignore the scale $d$ in equation (7.44).
The distance matrix $\Delta = 1 1' + JRJ'/4 - I$ has an "almost-block" structure, except for the diagonal of zeros: specifically, $\Delta = JBJ' - I$ where $B = R/4 + 1 1'$ is a known $o \times o$ matrix. [$B$ is a function of the migration rates.] The inverse $\Delta^{-1}$ is also an almost-block matrix:
$$\Delta^{-1} = JXJ' - I, \tag{7.47}$$
where $X$ is an unknown $o \times o$ matrix. Since $\Delta\Delta^{-1} = I$, the solution $X$ must satisfy
$$JBCXJ' - JBJ' - JXJ' + I = I, \tag{7.48a}$$
$$J(BC - I)XJ' = JBJ', \tag{7.48b}$$
where $C = J'J = \operatorname{diag}\{n_\alpha\}$ is the diagonal matrix of sample counts.
Since every term in equation (7.48b) has an exact block structure which depends on the sample configuration through $J$, it is sufficient to solve the lower-dimensional problem
$$(BC - I)X = B \iff (C - B^{-1})X = I. \tag{7.49}$$
This is a system of linear equations for the unknown $X$ in terms of the effective resistances $R$ and the counts $C$, and therefore, it can be solved efficiently without matrix inversions. The diagonal matrix $C$ is invertible because here we consider only observed demes, i.e., locations with at least one observation; the auxiliary matrix $B = R/4 + 1 1'$ is invertible because the matrix of effective resistances $R$ is invertible.
Once we solve for $X$, we could explicitly construct $\Delta^{-1}$ from $X$ according to equation (7.47). However, this is not necessary because we only need to compute $\operatorname{Det}\{-\Delta^{-1}Q\}$ and $\operatorname{tr}\{\Delta^{-1}QS\}$ where $S$ is the (average) observed similarity. Using the definition of the orthogonal projection $Q$ (equation 7.46) and the properties of the trace,
$$\operatorname{tr}\{\Delta^{-1}QS\} = \operatorname{tr}\{\Delta^{-1}S\} - \frac{1}{1'\Delta^{-1}1}\operatorname{tr}\{1 1'\Delta^{-1}S\Delta^{-1}\}. \tag{7.50}$$
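We consider each of these terms in turn:
$$1'\Delta^{-1}1 = 1'(JXJ' - I)1 = \operatorname{tr}\{X(J'1 1'J)\} - n, \tag{7.51a}$$
$$\operatorname{tr}\{\Delta^{-1}S\} = \operatorname{tr}\{(JXJ' - I)S\} = \operatorname{tr}\{X(J'SJ)\} - \operatorname{tr}\{S\}, \tag{7.51b}$$
$$\operatorname{tr}\{1 1'\Delta^{-1}S\Delta^{-1}\} = 1'CX(J'SJ)XC1 + 1'S1 - 2\operatorname{tr}\{X(J'S1 1'J)\}. \tag{7.51c}$$
[$\operatorname{tr}\{YT\} = \sum_{\alpha,\beta} Y_{\alpha\beta}T_{\alpha\beta}$. So the trace can be computed as sum(sum(Y.*T)).]
All the terms that do not involve $X$ (namely $J'1 1'J$, $J'SJ$, $J'S1 1'J$, $\operatorname{tr}\{S\}$, $1'S1$ and $n$) are constants and can be precomputed and stored for easy access. The point is that there is no need to construct the $n \times n$ matrix $\Delta^{-1}$ in order to compute $\operatorname{tr}\{\Delta^{-1}QS\}$; we can work with the $o \times o$ matrix $X$ instead.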
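The computation in equations (7.47) to (7.51) can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used here: the sample counts, the resistance matrix (a path graph with unit conductances, so that the resistance between two demes is the number of edges between them) and the random similarity matrix are all hypothetical inputs.

```python
import numpy as np

counts = np.array([2, 1, 3])                # samples per observed deme (illustrative)
o, n = len(counts), int(counts.sum())
J = np.repeat(np.eye(o), counts, axis=0)    # n x o indicator matrix (7.45)
C = np.diag(counts.astype(float))           # C = J'J

# Hypothetical resistances: a path graph with unit conductances, R_ab = |a - b|.
idx = np.arange(o)
R = np.abs(idx[:, None] - idx[None, :]).astype(float)
B = R / 4.0 + np.ones((o, o))

X = np.linalg.solve(B @ C - np.eye(o), B)   # (7.49): solve (BC - I) X = B

rng = np.random.default_rng(6)
Z = rng.normal(size=(n, 4))
S = Z @ Z.T                                 # stand-in for the observed similarity matrix

# Quantities that do not depend on X can be computed once and reused.
ones_n = np.ones(n)
JSJ = J.T @ S @ J
JS1 = J.T @ S @ ones_n
trS = np.trace(S)
oneSone = ones_n @ S @ ones_n
cvec = counts.astype(float)                 # C 1 = J' 1

c0 = cvec @ X @ cvec - n                                            # (7.51a)
t1 = np.sum(X * JSJ.T) - trS                                        # (7.51b)
t2 = cvec @ X @ JSJ @ X @ cvec + oneSone - 2.0 * (cvec @ X @ JS1)   # (7.51c)
trace_fast = t1 - t2 / c0                                           # (7.50)
```

Only matrices of the size of the number of observed demes are factorized; the full inverse of the individual-level distance matrix is never formed.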
Next we show how to compute efficiently the generalized determinant $\operatorname{Det}\{-\Delta^{-1}Q\}$. Since $\Delta \in \mathbb{D}^n$ is conditionally negative definite (and nonnegative),
$$\operatorname{Det}\{-\Delta^{-1}Q\} = \frac{(1'1)/(1'\Delta^{-1}1)}{-\det\{-\Delta\}}. \tag{7.52}$$
[This follows from $\operatorname{Det}\{\Sigma^{-1}Q\} = \dfrac{(1'1)/(1'\Sigma^{-1}1)}{\det\{\Sigma\}}$ together with $\det\{\Sigma\} = (1 - 1'\Delta^{-1}1/c)\det\{-c\Delta\}$ and $1'\Sigma^{-1}1 = (1 - c/1'\Delta^{-1}1)^{-1}$.]
Furthermore, $\Delta$ has one positive eigenvalue and $n - 1$ negative eigenvalues, as we show in Appendix 7.3. Therefore, $-\det\{-\Delta\}$ is guaranteed to be positive and it is sufficient to compute $|\det\{\Delta\}|$, or equivalently, find the eigenvalues of $\Delta$. Since $\Delta = JBJ' - I$,
$$\operatorname{eig}\{\Delta\} = \operatorname{eig}\{JBJ'\} - 1, \tag{7.53}$$
where $JBJ'$ is a block matrix and thus it has $o$ nontrivial eigenvalues besides 0, which has multiplicity $n - o$. Furthermore, for any vector $v \in \mathbb{R}^o$,
$$\Delta(Jv) = JBCv - Jv = J(BC - I)v. \tag{7.54}$$
Therefore, if $(v, \lambda)$ is an eigenpair for $BC - I$, then $(Jv, \lambda)$ is an eigenpair for $\Delta$. That is, the $o$ nontrivial eigenvalues of $\Delta$ are equal to the eigenvalues of $BC - I$.
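The eigenvalue argument translates into a short computation. The sketch below uses the same hypothetical sample counts and path-graph resistances as an illustration; it obtains the spectrum of the individual-level distance matrix from the small deme-level matrix and evaluates the generalized determinant in (7.52) without forming any matrix of the size of the sample.

```python
import numpy as np

counts = np.array([2, 1, 3])                            # illustrative sample counts
o, n = len(counts), int(counts.sum())
C = np.diag(counts.astype(float))
idx = np.arange(o)
R = np.abs(idx[:, None] - idx[None, :]).astype(float)   # hypothetical resistances
B = R / 4.0 + np.ones((o, o))

# The o nontrivial eigenvalues come from BC - I; the other n - o are equal to -1.
small = np.linalg.eigvals(B @ C - np.eye(o)).real
eig_Delta = np.sort(np.concatenate([small, -np.ones(n - o)]))

X = np.linalg.solve(B @ C - np.eye(o), B)               # (7.49)
c0 = counts @ X @ counts - n                            # 1' Delta^{-1} 1 by (7.51a)
Det_fast = (n / c0) / np.abs(np.prod(small))            # (7.52), |det Delta| = |prod(small)|
```

Because the remaining eigenvalues all equal minus one, the absolute value of the determinant is determined by the deme-level eigenvalues alone.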
7.6 Markov chain Monte Carlo
Expected coalescence times in a stepping-stone model are determined by the migration rates between demes and the coalescence rates within demes according to equation (2.15). Throughout, we assume that the coalescence rate is the same for all demes and migration is symmetric. The approximation in terms of effective resistances on an undirected graph given by equation (2.28) makes the symmetry assumption explicitly and the equal size assumption implicitly. To use either expression for computing effective distances, we need to specify a migration rate for each undirected edge $(\alpha, \beta)$ in the grid $(V, E)$. We assume that the migration rates are piecewise constant and we model them in terms of a colored Voronoi tiling $\mathcal{V}$ of the habitat $\mathcal{H}$. Under the tessellation $\mathcal{V}$, each tile has its own migration log rate and all edges within a tile share this parameter.
Since the spatial structure of the population is unknown, an appropriate Voronoi tessellation of the habitat must be estimated given the data. We use a version of the method based on colored Voronoi tessellations implemented in GENELAND [Guillot et al., 2005]. The main difference is that the "colors" in GENELAND are cluster indices; in our framework the "colors" are log (base 10) migration rates as edges within the same tile share a common rate to encourage locally smooth migration surfaces.
7.6.1 Updating the number of tiles $T$ with birth/death moves
Unlike the log rates and locations of tiles, the number of tiles presents a transdimensional inference problem because adding or removing a tile changes the dimensionality of the parameter space. For such a problem we can use the birth-death Markov chain Monte Carlo algorithm (BD-MCMC) which has been applied to other variable dimension problems such as a mixture model with unknown number of components [Stephens, 2000, van Lieshout, 2000].
Assume that the Markov chain is currently in state $(t, \Theta_t)$ with $t$ Voronoi tiles and parameters $\Theta_t$, and that there are two options for the next move: with probability $\pi(t)$ the proposed move is $(t + 1, \Theta_{t+1})$, i.e., the birth of a tile; with probability $1 - \pi(t)$ the proposed move is $(t - 1, \Theta_{t-1})$, i.e., the death of a tile. Since we consider only these two moves, we assume that they occur with equal probability: $\pi(1) = 1$ and $\pi(t) = 1 - \pi(t) = \tfrac{1}{2}$ for $t > 1$. [A death event is impossible with only one tile.] For a given number of tiles $t$, the model parameters include the migration log rates $\{\ell m_1, \dots, \ell m_t\}$ and locations $\{u_1, \dots, u_t\}$, as well as common parameters $\xi$ that do not depend on the tessellation and are not updated during a birth/death move. Let
$$\Theta_t = (\ell m_1, \dots, \ell m_t, u_1, \dots, u_t, \xi). \tag{7.55}$$
A full Bayesian model for the Voronoi tiling is specified by the likelihood on pairwise distances (4.4) together with the following prior distributions for the number of tiles and tile-specific parameters:
$$T \mid \lambda \sim \mathrm{Po}(\lambda), \tag{7.56a}$$
$$u \mid T \overset{iid}{\sim} U(\mathcal{H}), \tag{7.56b}$$
$$\ell m \mid T, \omega \overset{iid}{\sim} N(\overline{\ell m}, \sigma_m^2), \tag{7.56c}$$
where $T$ is the number of tiles, $(u, \ell m)$ are the tile centers and log rates, respectively, and $\omega = (\overline{\ell m}, \sigma_m^2)$ are hyperparameters: the mean log rate $\overline{\ell m}$ and the variance $\sigma_m^2$. The intensity (Poisson rate) $\lambda$ controls the spatial organization. This prior specification implies that rates and locations are a priori independent.
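To make the prior specification concrete, the sketch below draws one tiling from the three prior distributions in (7.56). It is a hypothetical illustration: the habitat is assumed to be an axis-aligned rectangle (a real habitat would require a point-in-polygon test), and the function name, parameter names and numerical values are choices made here.

```python
import numpy as np

def sample_tiling_prior(rng, rate, mean_log_rate, sd_log_rate, bounds):
    """Draw (number of tiles, tile centers, log rates) from the priors (7.56).

    bounds = (xmin, xmax, ymin, ymax) describes a hypothetical rectangular habitat.
    """
    T = int(rng.poisson(rate))                             # number of tiles
    xmin, xmax, ymin, ymax = bounds
    centers = np.column_stack([rng.uniform(xmin, xmax, size=T),
                               rng.uniform(ymin, ymax, size=T)])   # uniform locations
    log_rates = rng.normal(mean_log_rate, sd_log_rate, size=T)      # normal log rates
    return T, centers, log_rates

rng = np.random.default_rng(7)
T, centers, log_rates = sample_tiling_prior(
    rng, rate=10.0, mean_log_rate=0.0, sd_log_rate=1.0, bounds=(0.0, 12.0, 0.0, 8.0))
```

Locations and log rates are drawn separately, which reflects the a priori independence of rates and locations.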
It is convenient to denote the component parameters (location and log rate) by $\eta_t = (u_t, \ell m_t)$. Since the tiles are not ordered,
$$p(T, u, \ell m \mid \lambda, \omega) \equiv p(T, \eta_1, \dots, \eta_T \mid \lambda, \omega) \tag{7.57a}$$
$$= p(T \mid \lambda) \times T! \times p(\eta_1 \mid \omega) \cdots p(\eta_T \mid \omega). \tag{7.57b}$$
That is, conditional on the number of tiles $T$, the $\eta_t$s are independent and identically distributed from a product distribution with density
$$p(\eta \mid \omega) \propto I\{\eta^{(1)} \in \mathcal{H}\} \cdot N(\eta^{(2)} ; \overline{\ell m}, \sigma_m^2). \tag{7.58}$$
Note that the prior is invariant under relabeling of the tiles, i.e.,

$$p(T, \theta_1, \ldots, \theta_T \mid \lambda, \phi) = p(T, \theta_{\pi(1)}, \ldots, \theta_{\pi(T)} \mid \lambda, \phi) \tag{7.59}$$

for every permutation $\pi$ of the indices $1, \ldots, T$. That is, the tile parameters are exchangeable.
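As a rough illustration of the prior in (7.56), the following Python sketch draws a number of tiles, their centers and their log rates. The function names, the rectangular habitat given by two corners, and the choice to redraw until at least one tile is present are assumptions made for this example; they are not taken from the MATLAB implementation.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw from Po(rate) with Knuth's multiplication method (fine for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_tiling(rate, habitat, mean_log_rate, var_log_rate, rng):
    """Draw (T, centers, log rates) from the prior (7.56).

    habitat is ((x0, y0), (x1, y1)). Redrawing until T >= 1 is an assumption
    of this sketch, made because the sampler always keeps at least one tile.
    """
    (x0, y0), (x1, y1) = habitat
    num_tiles = 0
    while num_tiles < 1:
        num_tiles = sample_poisson(rate, rng)
    centers = [(rng.uniform(x0, x1), rng.uniform(y0, y1)) for _ in range(num_tiles)]
    sd = math.sqrt(var_log_rate)
    log_rates = [rng.gauss(mean_log_rate, sd) for _ in range(num_tiles)]
    return num_tiles, centers, log_rates


rng = random.Random(1)
num_tiles, centers, log_rates = sample_tiling(3.0, ((0.0, 0.0), (5.0, 4.0)), 0.0, 1.0, rng)
```

Because centers and log rates are drawn independently, reordering the tiles leaves the joint density unchanged, which is the exchangeability property stated in (7.59).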
Next we construct a birth-death MCMC that allows only two types of moves: the birth of a new tile and the death of an existing tile (when $T > 1$). Suppose that the current state is $(t, \theta_1, \ldots, \theta_t)$. If the proposal is a birth, we add a new tile with log rate $\ell m_{t+1} \sim \mathrm{N}(\overline{\ell m}, \sigma_m^2)$ and location $u_{t+1} \sim \mathrm{U}(\mathcal{H})$. We denote the birth density by $q(t) = p(\theta_{t+1} \mid \phi)$. If the proposal is a death, we select a tile to remove uniformly at random, i.e., with probability $d(t) = 1/t$.

To guarantee that the birth-death chain is reversible and the stationary distribution is the posterior $p(T, \theta_1, \ldots, \theta_T \mid z, \lambda, \phi, \psi)$ given observed data $z$, we choose the acceptance probabilities $\alpha(\cdot, \cdot)$ so that they satisfy the detailed balance condition:

$$b(t)\, q(t)\, p(t, \theta_1, \ldots, \theta_t \mid z, \lambda, \phi, \psi)\, \alpha(t, t+1) = [1 - b(t+1)]\, d(t+1)\, p(t+1, \theta_1, \ldots, \theta_{t+1} \mid z, \lambda, \phi, \psi)\, \alpha(t+1, t). \tag{7.60}$$

Since $b(t) = b(t+1) = \tfrac{1}{2}$,

$$r(t) = \frac{\alpha(t, t+1)}{\alpha(t+1, t)} = \frac{d(t+1)}{q(t)} \, \frac{p(t+1, \theta_1, \ldots, \theta_{t+1} \mid z, \lambda, \phi, \psi)}{p(t, \theta_1, \ldots, \theta_t \mid z, \lambda, \phi, \psi)} \tag{7.61a}$$
$$= \frac{d(t+1)}{q(t)} \, \frac{p(t+1 \mid \lambda)}{p(t \mid \lambda)} \, \frac{p(\theta_1, \ldots, \theta_{t+1} \mid t+1, \phi)}{p(\theta_1, \ldots, \theta_t \mid t, \phi)} \, \frac{L_{t+1}(z; \boldsymbol{\theta}_{t+1}, \psi)}{L_t(z; \boldsymbol{\theta}_t, \psi)} \tag{7.61b}$$
$$= \frac{\lambda}{t+1} \, \frac{L_{t+1}(z; \boldsymbol{\theta}_{t+1}, \psi)}{L_t(z; \boldsymbol{\theta}_t, \psi)}, \tag{7.61c}$$

where the last step applies equation (7.57) with $p(t \mid \lambda) = \lambda^t e^{-\lambda}/t!$, and $\alpha(t, t+1) = \min\{r(t), 1\}$. Therefore, the following algorithm simulates a Markov chain with stationary distribution $p(T, \boldsymbol{\theta}_T \mid z, \lambda, \phi, \psi)$ where, for simplicity of notation, we write $\boldsymbol{\theta}_t = (\theta_1, \ldots, \theta_t)$.

1. Choose between a birth event and a death event, with equal probability.

2. If a birth is proposed, its location, migration log rate and coalescence log rate are sampled from the priors, and the acceptance probability is

$$\alpha(T, T+1) = \min\left\{ \frac{\lambda}{T+1} \, \frac{L_{T+1}(z; \Theta_{T+1})}{L_T(z; \Theta_T)}, 1 \right\}. \tag{7.62}$$

3. If a death is proposed, a tile to be removed is selected uniformly at random, and the acceptance probability is

$$\alpha(T+1, T) = \min\left\{ \frac{T+1}{\lambda} \, \frac{L_T(z; \Theta_T)}{L_{T+1}(z; \Theta_{T+1})}, 1 \right\} \tag{7.63}$$

because a deletion move is the reverse of an addition move [Byers and Raftery, 2002].
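The two acceptance probabilities (7.62) and (7.63) can be sketched on the log scale as follows. This is a minimal Python illustration, not the thesis implementation: the function names are hypothetical, and the log-likelihood values are assumed to be computed elsewhere and passed in.

```python
import math


def birth_acceptance(rate, num_tiles, loglik_proposed, loglik_current):
    """Acceptance probability (7.62) for moving from T = num_tiles to T + 1 tiles."""
    log_ratio = (math.log(rate) - math.log(num_tiles + 1)
                 + loglik_proposed - loglik_current)
    return math.exp(min(log_ratio, 0.0))


def death_acceptance(rate, num_tiles, loglik_proposed, loglik_current):
    """Acceptance probability (7.63) for moving from T + 1 = num_tiles to T tiles."""
    log_ratio = (math.log(num_tiles) - math.log(rate)
                 + loglik_proposed - loglik_current)
    return math.exp(min(log_ratio, 0.0))


# With equal likelihoods the prior alone drives the move:
# a birth from 3 to 4 tiles under Poisson rate 2 has ratio 2 / 4.
p_birth = birth_acceptance(2.0, 3, 0.0, 0.0)
# The reverse death from 4 to 3 tiles has ratio 4 / 2 > 1, so it is capped at 1.
p_death = death_acceptance(2.0, 4, 0.0, 0.0)
```

Working with log-likelihood differences avoids overflow when the likelihoods themselves are very small or very large.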
7.6.2 Updating the Voronoi centers (for a fixed number of tiles $T$)

This is a Metropolis-Hastings symmetric random-walk update. Sequentially, for each tile $t$, we propose a new center $u_t^{\ast}$. The proposal distribution is bivariate normal centered at the current value $u_t^{c} = (x_t, y_t)$ [with correlation 0]. The proposal is accepted with probability

$$\alpha = \min\left\{ \frac{p(u_t^{\ast} \mid Z, \Theta \setminus u_t)}{p(u_t^{c} \mid Z, \Theta \setminus u_t)}, 1 \right\} = \min\left\{ \frac{L(Z; \Theta^{\ast})\, p(u^{\ast})}{L(Z; \Theta^{c})\, p(u^{c})}, 1 \right\}. \tag{7.64}$$
The prior distribution of the center locations $u = (u_t : t = 1, \ldots, T)$ is uniform over the domain $\mathcal{H}$,

$$p(u) \propto \mathbf{1}\{u_t \in \mathcal{H} : t = 1, \ldots, T\}. \tag{7.65}$$

On the log scale, $\log(\alpha) = -\infty$ if at least one component of $u^{\ast}$ falls outside of the domain $\mathcal{H}$. Otherwise,

$$\log(\alpha) = \min\{\ell(\Theta^{\ast} \mid Z) - \ell(\Theta^{c} \mid Z), 0\}. \tag{7.66}$$
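The center update can be sketched in Python as below. The rectangular habitat, the proposal standard deviation `step_sd` and all function names are assumptions of this example; the log-likelihoods are taken as given.

```python
import math
import random


def propose_center(center, step_sd, rng):
    """Symmetric random-walk proposal: independent normal steps in x and y."""
    x, y = center
    return (x + rng.gauss(0.0, step_sd), y + rng.gauss(0.0, step_sd))


def log_acceptance(proposed_centers, habitat, loglik_proposed, loglik_current):
    """Log acceptance probability (7.66); -inf if any center leaves the habitat."""
    (x0, y0), (x1, y1) = habitat
    inside = all(x0 <= x <= x1 and y0 <= y <= y1 for x, y in proposed_centers)
    if not inside:
        return -math.inf
    return min(loglik_proposed - loglik_current, 0.0)


def accept_move(log_alpha, rng):
    """Accept with probability exp(log_alpha); exp(-inf) is 0, so such moves never pass."""
    return rng.random() < math.exp(log_alpha)


rng = random.Random(0)
habitat = ((0.0, 0.0), (5.0, 4.0))
inside = log_acceptance([(1.0, 1.0), (4.5, 3.0)], habitat, -120.0, -118.5)
outside = log_acceptance([(1.0, 1.0), (5.5, 3.0)], habitat, -100.0, -118.5)
```

Here `inside` equals the log-likelihood difference, while `outside` is rejected regardless of how much the likelihood improves.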
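7.6.3 Updating the log-transformed migration rates $\ell m$

We assume that the migration log (base 10) rates are normally distributed with common mean $\overline{\ell m}$ and variance $\sigma_m^2$,

$$\ell m_t \mid \overline{\ell m}, \sigma_m^2 \overset{\text{iid}}{\sim} \mathrm{N}(\overline{\ell m}, \sigma_m^2), \tag{7.67}$$

or in an equivalent parametrization,

$$\ell m_t = \overline{\ell m} + e_t, \qquad e_t \overset{\text{iid}}{\sim} \mathrm{N}(0, \sigma_m^2), \tag{7.68}$$

where $\overline{\ell m}$ is the mean log rate and $e_t$ is the effect of tile $t$, relative to the mean. The second parametrization is more convenient because it allows scaling all migration rates simultaneously by adjusting $\overline{\ell m}$.

We choose a vague prior for the hyperparameters $\overline{\ell m}, \sigma_m^2$ assuming prior independence of location and scale,

$$\overline{\ell m} \sim \mathrm{U}(lob, upb), \tag{7.69}$$
$$\sigma_m^2 \sim \text{Inv-G}(c/2, d/2). \tag{7.70}$$

That is, the hyperprior on $(\overline{\ell m}, \sigma_m^2)$ is semi-conjugate.

To simulate a Markov chain with stationary distribution $p(T, u, \ell m \mid z, \lambda, \phi)$,

1. Update each error in turn (or all errors at once) with a Metropolis-Hastings step and a random-walk proposal. That is, we draw a new migration log rate parameter $\ell m_t^{\ast}$ from a normal distribution centered at the current value $\ell m_t^{c}$ for each tile in the current Voronoi decomposition and accept the proposal $\ell m^{\ast} = \{\ell m_t^{\ast} : t = 1, \ldots, T\}$ with probability

$$\alpha = \min\left\{ \frac{L(Z; \Theta \setminus \ell m, \ell m^{\ast})\, p(\ell m^{\ast} \mid \overline{\ell m}, \sigma_m^2)}{L(Z; \Theta \setminus \ell m, \ell m^{c})\, p(\ell m^{c} \mid \overline{\ell m}, \sigma_m^2)}, 1 \right\}. \tag{7.71}$$

2. Update the mean migration log rate $\overline{\ell m}$ with a Metropolis-Hastings step and a random-walk proposal.

3. Update the common log rate variance $\sigma_m^2$ with a Gibbs step by sampling from its full conditional distribution:

$$p(\sigma_m^2 \mid Z, \Theta) \propto \text{Inv-G}(c/2, d/2) \prod_{t=1}^{T} \mathrm{N}(e_t; 0, \sigma_m^2) \tag{7.72a}$$
$$\propto \left\{ \frac{1}{\sigma_m^2} \right\}^{c/2+1} \exp\left\{ -\frac{d}{2\sigma_m^2} \right\} \times \prod_{t=1}^{T} \left\{ \frac{1}{\sigma_m^2} \right\}^{1/2} \exp\left\{ -\frac{e_t^2}{2\sigma_m^2} \right\} \tag{7.72b}$$
$$\propto \left\{ \frac{1}{\sigma_m^2} \right\}^{c/2+T/2+1} \exp\left\{ -\frac{1}{2\sigma_m^2} (d + S_m^2) \right\}, \tag{7.72c}$$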
where $S_m^2 = \sum_{t=1}^{T} e_t^2$ is the sum of squares for the relative tile effects on the log scale. Because we conveniently choose the conjugate inverse-gamma prior for $\sigma_m^2$, we can update this parameter by drawing

$$\sigma_m^2 \sim \text{Inv-G}\big((c + T)/2, (d + S_m^2)/2\big). \tag{7.73}$$
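A minimal Python sketch of this Gibbs step follows. The names `prior_c` and `prior_d` stand for the two inverse-gamma hyperparameters, the tile effects are made-up numbers, and the inverse-gamma draw is obtained by inverting a gamma draw from the standard library; none of this is taken from the MATLAB code.

```python
import random


def variance_full_conditional(effects, prior_c, prior_d):
    """Shape and scale of the inverse-gamma full conditional in (7.73)."""
    shape = (prior_c + len(effects)) / 2.0
    scale = (prior_d + sum(e * e for e in effects)) / 2.0
    return shape, scale


def draw_inverse_gamma(shape, scale, rng):
    """If X ~ Gamma(shape, rate=scale), then 1/X ~ Inv-G(shape, scale).

    random.gammavariate takes a scale parameter, hence the 1/scale below.
    """
    return 1.0 / rng.gammavariate(shape, 1.0 / scale)


rng = random.Random(42)
effects = [0.1, -0.2, 0.3]  # illustrative tile effects e_t
shape, scale = variance_full_conditional(effects, prior_c=1.0, prior_d=1.0)
new_variance = draw_inverse_gamma(shape, scale, rng)
```

With three tiles and both hyperparameters equal to 1, the shape is (1 + 3)/2 and the scale is (1 + 0.14)/2, matching the exponents collected in (7.72c).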
7.6.4 Updating the degrees of freedom $k$

Here we consider updating the degrees of freedom $k$. The proposal distribution is

$$k^{\ast} \sim \mathrm{N}(k^{c}, v) \tag{7.74}$$

where $k^{c}$ is the current value and $v$ is the proposal variance.

Since the Wishart degrees of freedom for an $n \times n$ matrix is a real number $k$ that satisfies $k > n - 1$, the support of this parameter is $(n, p)$. If the proposed value $k^{\ast}$ is not valid,

$$\log\left\{ \frac{p(k^{\ast})}{p(k^{c})} \right\} = -\infty \tag{7.75}$$

and the proposal is rejected. Otherwise, it is accepted with probability

$$\alpha = \min\left\{ \frac{p(k^{\ast})\, L(Z; \Theta \setminus k, k^{\ast})}{p(k^{c})\, L(Z; \Theta \setminus k, k^{c})}, 1 \right\}. \tag{7.76}$$

Here $L(Z; \Theta \setminus k, k^{\ast})$ is the likelihood for the given value of $k$ with the rest of the parameters $\Theta \setminus k$ fixed to their current values. The prior on the degrees of freedom is uniform on the log scale, i.e.,

$$p(k) \propto \frac{1}{k}. \tag{7.77}$$

Since $k$ is bounded, the prior is proper with normalizing constant $\log(p) - \log(n)$.
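The following Python sketch illustrates the degrees-of-freedom update on the log scale. The arguments `lower` and `upper` stand for the two ends of the support, and the function names are hypothetical; the log-likelihoods are assumed to be supplied by the caller.

```python
import math


def log_prior_df(k, lower, upper):
    """Log density of the prior p(k) proportional to 1/k on (lower, upper)."""
    if not lower < k < upper:
        return -math.inf
    return -math.log(k) - math.log(math.log(upper) - math.log(lower))


def log_acceptance_df(k_proposed, k_current, lower, upper,
                      loglik_proposed, loglik_current):
    """Log of the acceptance probability (7.76); -inf for an invalid proposal."""
    log_prior_proposed = log_prior_df(k_proposed, lower, upper)
    if log_prior_proposed == -math.inf:
        return -math.inf
    log_ratio = (log_prior_proposed - log_prior_df(k_current, lower, upper)
                 + loglik_proposed - loglik_current)
    return min(log_ratio, 0.0)


# Equal likelihoods: doubling k from 25 to 50 halves the prior density.
log_alpha = log_acceptance_df(50.0, 25.0, 20.0, 1000.0, 0.0, 0.0)
# A proposal below the lower end of the support is always rejected.
log_alpha_invalid = log_acceptance_df(10.0, 25.0, 20.0, 1000.0, 0.0, 0.0)
```

The normalizing constant cancels in the ratio, so only the factor $k^{c}/k^{\ast}$ from the prior enters the acceptance probability.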
7.6.5 Updating the scale nuisance parameter $\sigma^{\ast}$

The nuisance parameter $\sigma^{\ast} = k\sigma^2$ can be efficiently updated with a Gibbs step if we choose the conjugate prior distribution, $\text{Inv-G}(c/2, d/2)$. Then the full conditional is also inverse gamma, with shape and scale parameters given by

$$c^{\ast} = c + k(n - 1), \tag{7.78a}$$
$$d^{\ast} = d + k \operatorname{tr}\{\Delta^{-1} Z\}. \tag{7.78b}$$

For microsatellites, $p(\sigma_1^{\ast}, \ldots, \sigma_p^{\ast} \mid Z, \Theta)$ factorizes into the full conditional of each site-specific scale parameter $\sigma_s^{\ast}$, so there is no loss of efficiency to estimate a small number of microsatellites.
7.7 MATLAB implementation
7.7.1 Triangular (isometric) grid
Suppose that the genotyped individuals are sampled within a rectangular region $\mathcal{H}$ bounded by $(x_0, y_0)$ on the bottom left and $(x_1, y_1)$ on the top right. By convention, $x$ denotes longitudes and $y$ latitudes.

To initialize the program, we specify the dimensions $n_x \times n_y$ of a triangular grid $(V, E)$ to tile the habitat $\mathcal{H}$. By definition, a triangular grid is formed by dividing the plane regularly into equilateral triangles. The resulting grid is regular but not strictly isometric, unless $n_x$ and $n_y$ are chosen to match the size of the habitat.
[Margin figure: the six neighbors of a deme in the triangular grid, labeled W, E, NW, NE, SW and SE.]
7.7.2 Data structures
Here I describe the MATLAB implementation and data structures. The problem is specified in terms of

• $n_x \times n_y$ triangular grid $(V, E)$ which spans the habitat $\mathcal{H}$;

• $(n_x n_y) \times (n_x n_y)$ symmetric matrix $M$ of migration rates.

The order of $(V, E)$ is $|V| = n_x n_y$ and the size is $n_e \equiv |E| = (n_x - 1) n_y + (2 n_x - 1)(n_y - 1)$. Both the grid $(V, E)$ and the migration matrix $M$ are very sparse because each vertex $v \in V$ has at most six neighbors and

$$M = (m_{\alpha\beta}) = \begin{cases} m_{\alpha\beta} & \text{if } (\alpha, \beta) \in E, \\ 0 & \text{otherwise,} \end{cases} \tag{7.79}$$

where $m_{\alpha\beta} = 2 N_0 \tilde{m}_{\alpha\beta}$ and $N_0$ is the coalescent timescale.

That is, $(V, E)$ and $M$ together describe a weighted graph $G = (V, E, M)$. It is not required that $M$ be symmetric; the linear system for $\Delta$ is valid as long as $(V, E)$ is connected: if all demes communicate, the sample will eventually coalesce, i.e., the distance between lineages is finite. This guarantees that

$$\Delta = (d^2_{\alpha\beta}), \qquad d^2_{\alpha\beta} < \infty \text{ for } (\alpha, \beta) \in V \times V. \tag{7.80}$$

Although the two matrices have the same size, $M$ is sparse but $\Delta$ is full and hence might be expensive to compute. With a denser grid $(V, E)$, few of the demes are sampled from and computing the entire distance matrix $\Delta$ is not necessary. To compute the likelihood of the data, we need only the sample distance matrix $\Delta$.
In the rest of this section, let $n_v = n_x n_y$ be the number of demes and $n_p = \binom{n_v}{2} = n_v(n_v - 1)/2$ be the number of unique pairs of demes. The number of unknowns is $n_v + n_p$, the number of within-deme coalescence times plus the number of between-demes coalescence times.

Vertex set representation: The vertices $V$ are stored in a $n_v \times 2$ matrix Vcoord, with the $x$ (longitude) coordinates in the first column and the $y$ (latitude) coordinates in the second column. The locations of the Voronoi sites $S$ are stored similarly in Scoord. The triangular grid $(V, E)$ is fixed, so Vcoord does not change. On the other hand, the Voronoi decomposition of $\mathcal{H}$ is updated regularly, which is reflected by (row) changes in Scoord.

The two matrices are used to update the Voronoi tessellation whenever a tile moves its location. Recall that by definition the Voronoi tile (cell) $V(s)$ consists of the points closer to $s$ than to any other site.

euDist = rdist(Vcoord,Scoord);      % compute all distances between the demes in Vcoord and the sites in Scoord
[temp,Colors] = min(euDist,[],2);   % for each deme v in V, find the closest Voronoi site s in S

The vector Colors indicates which tile each deme falls into.

Edge set representation: The edges $E$ are stored in a $n_v \times 6$ matrix Edges. There is one row for each vertex (deme) and the columns are its six adjacent vertices, in the order W, NW, NE, E, SE, SW (clockwise). Vertices are identified by their row index in Edges. If the deme does not have a neighbor in some positions, the corresponding entries of Edges are set to 0. The number of nonzero entries is twice the number of edges, $2 n_e$.

Rate parameters representation: The backward migration matrix is stored in a $n_v \times n_v$ sparse matrix Mrates with $2 n_e$ nonzero elements.
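To make the edge representation concrete, here is a small Python sketch that builds a neighbor table in the W, NW, NE, E, SE, SW column order for an $n_x \times n_y$ triangular grid. The layout is an assumption inferred from the 5 × 4 examples in this appendix: demes are numbered row by row, rows are taken to run south to north, and every second row is shifted half a step to the east. The function names are hypothetical and the code is not a translation of the MATLAB implementation.

```python
def neighbor_table(nx, ny):
    """Return one row per deme with neighbors (W, NW, NE, E, SE, SW); 0 means none."""
    def index(col, row):
        # 1-based deme index, or 0 if (col, row) falls outside the grid
        if 0 <= col < nx and 0 <= row < ny:
            return row * nx + col + 1
        return 0

    table = []
    for row in range(ny):
        shift = row % 2  # odd rows sit half a step to the east (assumed layout)
        for col in range(nx):
            west = index(col - 1, row)
            east = index(col + 1, row)
            nw = index(col - 1 + shift, row + 1)
            ne = index(col + shift, row + 1)
            sw = index(col - 1 + shift, row - 1)
            se = index(col + shift, row - 1)
            table.append([west, nw, ne, east, se, sw])
    return table


def edge_list(table):
    """Collect each undirected edge once as a sorted pair of deme indices."""
    edges = set()
    for vertex, row in enumerate(table, start=1):
        for neighbor in row:
            if neighbor:
                edges.add((min(vertex, neighbor), max(vertex, neighbor)))
    return sorted(edges)


table = neighbor_table(5, 4)
edges = edge_list(table)
nonzero_entries = sum(1 for row in table for entry in row if entry)
```

For the 5 × 4 grid, the number of edges agrees with the size formula $(n_x - 1) n_y + (2 n_x - 1)(n_y - 1)$, and the number of nonzero table entries is twice that count.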
7.7.3 Computing coalescence distances
Our MCMC implementation requires repeatedly solving a system of linear equations $Ax = b$. The matrix $A = [A_1; A_2]$ is large, sparse, nearly symmetric and positive definite. The regularity of the grid $G$ gives $A$ its structure and sparseness.

$A_1$ represents the $n_v$ within-deme equations

$$(q_\alpha + m_\alpha)\, T_{\alpha\alpha} - \sum_{\gamma \in \mathrm{nbr}(\alpha)} m_{\alpha\gamma} T_{\alpha\gamma} = 1, \tag{7.81}$$

and $A_2$ represents the $n_p$ between-demes equations

$$(m_\alpha + m_\beta)\, T_{\alpha\beta} - \sum_{\gamma \in \mathrm{nbr}(\alpha)} m_{\alpha\gamma} T_{\beta\gamma} - \sum_{\gamma \in \mathrm{nbr}(\beta)} m_{\beta\gamma} T_{\alpha\gamma} = 2. \tag{7.82}$$

Here $\mathrm{nbr}(\alpha) = \{\gamma \in V : (\alpha, \gamma) \in E\}$ is the set of vertices adjacent to $\alpha$ and $m_\alpha = \sum_{\gamma \in \mathrm{nbr}(\alpha)} m_{\alpha\gamma}$ is the rate of migration into $\alpha$. The equations also show that $b = [\mathbf{1}_{n_v}; 2 \cdot \mathbf{1}_{n_p}]$.

The matrix $A$ is positive definite because

$$A_2 = \mathcal{L}_2(\{m_{\alpha\beta}\}), \tag{7.83}$$
$$A_1 = \operatorname{diag}\{q\} + \mathcal{L}_1(\{m_{\alpha\beta}\}). \tag{7.84}$$

The Laplacian matrices $\mathcal{L}_1, \mathcal{L}_2$ are functions of only the migration rates, and $\mathcal{L} = [\mathcal{L}_1; \mathcal{L}_2]$ is also a Laplacian matrix and therefore positive semidefinite; the positive coalescence rates $q$ added on the diagonal of $A_1$ make $A$ positive definite. We note that the matrix $\mathcal{Q} = -\mathcal{L}$ is the infinitesimal generator of the migration process where the lineages move from deme to deme according to $M$. For a continuous-time stochastic process, the infinitesimal generator is the matrix $\mathcal{Q} = (q_{x,y})$ with entries

$$q_{x,y} = \begin{cases} -\lambda_x & \text{if } x = y, \\ \lambda_x p_{x,y} & \text{otherwise,} \end{cases} \tag{7.85}$$

where $\lambda_x$ is the holding rate for state $x$ and $P = (p_{x,y})$ is the transition probability matrix of the embedded jump chain. In this case, the transition probabilities are

$$\frac{m_{\alpha\beta}}{\sum_{\gamma \in \mathrm{nbr}(\alpha)} m_{\alpha\gamma}} = \frac{N_0 \tilde{m}_{\alpha\beta}}{\sum_{\gamma \in \mathrm{nbr}(\alpha)} N_0 \tilde{m}_{\alpha\gamma}} = \frac{\tilde{m}_{\alpha\beta}}{1 - \tilde{m}_{\alpha\alpha}}. \tag{7.86}$$

Solving $Ax = b$, and thus finding all coalescence times at once, has the advantage of reducing numerical errors. Because we use an iterative procedure (preconditioned conjugate gradient), we control how close the approximate solution $\hat{x}$ is to the true solution $x$. If we first solve for $x_2$ and then substitute to find $x_1$, numerical errors in $x_2$ are propagated in $x_1$.
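As a small worked instance of equations (7.81) and (7.82), the Python sketch below assembles and solves the system for two demes joined by one edge, with a common coalescence rate `q` and a common migration rate `m`. These inputs are illustrative, the function names are hypothetical, and the solver is plain Gaussian elimination rather than the preconditioned conjugate gradient used in the implementation.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def two_deme_times(q, m):
    """Coalescence times (T_11, T_22, T_12) for two demes with m_12 = m_21 = m."""
    a = [
        [q + m, 0.0, -m],      # within-deme equation (7.81) for deme 1
        [0.0, q + m, -m],      # within-deme equation (7.81) for deme 2
        [-m, -m, 2.0 * m],     # between-demes equation (7.82) for the pair (1, 2)
    ]
    b = [1.0, 1.0, 2.0]
    return solve(a, b)


t11, t22, t12 = two_deme_times(q=1.0, m=0.5)
```

Solving by hand gives $T_{11} = T_{22} = 2/q$ and $T_{12} = 2/q + 1/m$, so the between-demes time exceeds the within-deme time by $1/m$, and the gap grows as migration becomes rarer.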
7.7.4 Computing resistance distances
Consider again the matrix of migration rates between neighboring demes, $M$. Let $L$ be its Laplacian matrix,

$$L = \operatorname{diag}\{M\mathbf{1}\} - M. \tag{7.87}$$

The effective resistance $R_{\alpha\beta}$ between a pair of demes $(\alpha, \beta)$ is equal to the $\beta$th element of the vector $x$ given by

$$L_{-\alpha}\, x = e_\beta, \tag{7.88}$$

where $L_{-\alpha}$ is the Laplacian $L$ with the $\alpha$th row and column removed, and $e_\beta$ is the standard basis vector with a 1 at the $\beta$th coordinate and 0s elsewhere. This method can be optimized by solving $L_{-\alpha} X = E$ where $E = (e_\beta)$, so that we compute the effective resistance between $\alpha$ and all other demes with a single matrix operation.

There are other methods to compute $R$, e.g., with a single matrix inversion [Babić et al., 2002]:

$$H = (L + n_v^{-1} J)^{-1}, \tag{7.89}$$
$$R_{\alpha\beta} = H_{\alpha\alpha} + H_{\beta\beta} - 2 H_{\alpha\beta}, \tag{7.90}$$

where $J$ denotes the $n_v \times n_v$ matrix of ones. This method is more efficient for sparser grids where most demes are sampled from.

Figure 7.1: This is a 5 × 4 regular triangular grid. If line weight is proportional to migration rate, this pattern corresponds to uniform migration with equal deme sizes.

Figure 7.2: Barrier to migration. If line weights are proportional to migration rates, this pattern corresponds to a barrier across the middle of the habitat.
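Both ways of computing effective resistances can be compared on a toy graph. The Python sketch below uses a path of three demes with unit migration rates, which is an illustrative choice; the function names are hypothetical and a dense Gaussian elimination stands in for whatever sparse solver a real implementation would use.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def laplacian(m):
    """L = diag(M 1) - M, as in (7.87)."""
    n = len(m)
    return [[(sum(m[i]) if i == j else 0.0) - m[i][j] for j in range(n)]
            for i in range(n)]


def resistance_by_deletion(m, alpha, beta):
    """Remove row and column alpha from L, then solve against e_beta (7.88)."""
    lap = laplacian(m)
    keep = [i for i in range(len(m)) if i != alpha]
    reduced = [[lap[i][j] for j in keep] for i in keep]
    rhs = [1.0 if i == beta else 0.0 for i in keep]
    return solve(reduced, rhs)[keep.index(beta)]


def resistance_by_inversion(m, alpha, beta):
    """Use columns of H = (L + J / n)^(-1), as in (7.89) and (7.90)."""
    n = len(m)
    lap = laplacian(m)
    shifted = [[lap[i][j] + 1.0 / n for j in range(n)] for i in range(n)]
    h_alpha = solve(shifted, [1.0 if i == alpha else 0.0 for i in range(n)])
    h_beta = solve(shifted, [1.0 if i == beta else 0.0 for i in range(n)])
    return h_alpha[alpha] + h_beta[beta] - 2.0 * h_alpha[beta]


path = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
r_deletion = resistance_by_deletion(path, 0, 2)
r_inversion = resistance_by_inversion(path, 0, 2)
```

On this path the two end demes are separated by two unit resistors in series, so both methods should agree on a resistance of 2.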
7.8 Simulations with ms
Here we describe how we produce samples from a spatially distributed population that evolves under Kimura's stepping-stone model using the program ms [Hudson, 2002]. For all simulations, we first construct a regular triangular grid $(V, E)$ of $n_x \times n_y$ demes (vertices) with coordinates $(x_\alpha, y_\alpha)$. The spatial information is not explicitly used by ms and instead we set $M_{\alpha\beta} = 0 = M_{\beta\alpha}$ for all pairs of demes such that $(\alpha, \beta) \notin E$. We also specify the size of each deme, $N_\alpha$, and the migration rates across each edge in the two opposite directions, $M_{\alpha\beta}$ and $M_{\beta\alpha}$. Migration is not necessarily symmetric but is conservative [i.e., it preserves deme sizes.] Both the deme sizes $N_\alpha = N_0 s_\alpha$ and the migration rates $M_{\alpha\beta} = 4 N_0 \tilde{m}_{\alpha\beta}$ are relative to the coalescent timescale $N_0$. [That is, $s_\alpha$ is the relative size of deme $\alpha$ and $\tilde{m}_{\alpha\beta}$ is the backward migration fraction from $\alpha$ to $\beta$ per generation.] The input arguments are $s_\alpha$ and $M_{\alpha\beta}$, respectively.
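Commands of the kind listed in the following subsections can be assembled programmatically. The Python sketch below is one hypothetical way to do so; it uses only the ms flags that appear in this section (`-s`, `-I`, `-n`, `-m`), samples one lineage per deme, and emits the `-m` flags in sorted order, which differs from the ordering in the listings. The function names and the three-deme example are assumptions of this illustration.

```python
def symmetric_rates(edges, rate):
    """Assign the same rate to both directions of every undirected edge."""
    rates = {}
    for i, j in edges:
        rates[(i, j)] = rate
        rates[(j, i)] = rate
    return rates


def ms_command(num_demes, rates, sizes=None, samples_per_deme=1):
    """Assemble an ms call from directed rates {(i, j): M_ij} and optional deme sizes."""
    counts = [samples_per_deme] * num_demes
    parts = ["ms", str(sum(counts)), "1", "-s", "1", "-I", str(num_demes)]
    parts += [str(c) for c in counts]
    # Trailing value after the sample counts, copied from the listed commands
    # (the optional default migration rate for pairs that are not listed).
    parts.append("0")
    for deme, size in sorted((sizes or {}).items()):
        parts += ["-n", str(deme), f"{size:.1f}"]
    for (i, j), rate in sorted(rates.items()):
        parts += ["-m", str(i), str(j), f"{rate:.1f}"]
    return " ".join(parts)


# Three demes in a line with uniform migration.
command = ms_command(3, symmetric_rates([(1, 2), (2, 3)], 1.0))
```

Pairs of demes that are not joined by an edge are simply left out, so their rate stays at the default of 0, which is how the grid structure is communicated to ms.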
7.8.1 Spatial structure due to constant migration
Here edges in the middle of the habitat have a migration rate that is an order of magnitude lower than the rate of the edges on either side. [The rates range from 0.3 to 3.] This pattern is a barrier to actual migration and it results in a barrier to effective migration.
ms 20 1 -s 1 -I 20 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
-m 1 2 3.0 -m 2 1 3.0 -m 1 6 3.0 -m 6 1 3.0 -m 2 3 1.7 -m 3 2 1.7
-m 2 6 3.0 -m 6 2 3.0 -m 2 7 1.7 -m 7 2 1.7 -m 3 4 0.3 -m 4 3 0.3
-m 3 7 0.3 -m 7 3 0.3 -m 3 8 0.3 -m 8 3 0.3 -m 4 5 1.7 -m 5 4 1.7
-m 4 8 0.3 -m 8 4 0.3 -m 4 9 1.7 -m 9 4 1.7 -m 5 9 3.0 -m 9 5 3.0
-m 5 10 3.0 -m 10 5 3.0 -m 6 7 1.7 -m 7 6 1.7 -m 6 11 3.0 -m 11 6 3.0
-m 6 12 3.0 -m 12 6 3.0 -m 7 8 0.3 -m 8 7 0.3 -m 7 12 1.7 -m 12 7 1.7
-m 7 13 0.3 -m 13 7 0.3 -m 8 9 1.7 -m 9 8 1.7 -m 8 13 0.3 -m 13 8 0.3
-m 8 14 0.3 -m 14 8 0.3 -m 9 10 3.0 -m 10 9 3.0 -m 9 14 1.7
-m 14 9 1.7 -m 9 15 3.0 -m 15 9 3.0 -m 10 15 3.0 -m 15 10 3.0
-m 11 12 3.0 -m 12 11 3.0 -m 11 16 3.0 -m 16 11 3.0 -m 12 13 1.7
-m 13 12 1.7 -m 12 16 3.0 -m 16 12 3.0 -m 12 17 1.7 -m 17 12 1.7
-m 13 14 0.3 -m 14 13 0.3 -m 13 17 0.3 -m 17 13 0.3 -m 13 18 0.3
-m 18 13 0.3 -m 14 15 1.7 -m 15 14 1.7 -m 14 18 0.3 -m 18 14 0.3
-m 14 19 1.7 -m 19 14 1.7 -m 15 19 3.0 -m 19 15 3.0 -m 15 20 3.0
-m 20 15 3.0 -m 16 17 1.7 -m 17 16 1.7 -m 17 18 0.3 -m 18 17 0.3
-m 18 19 1.7 -m 19 18 1.7 -m 19 20 3.0 -m 20 19 3.0
Figure 7.3: Barrier to effective migration due to differences in effective population size. The demes in bold are 5 times bigger; the edges in red are directed, which is necessary to preserve equilibrium in time.

Figure 7.4: Uniform effective migration even though there are differences in both population size and migration rates. The demes in bold are 4 times bigger; the edges in red are directed, which is necessary to preserve equilibrium in time.
7.8.2 Spatial structure due to variation in diversity
Here some demes have bigger size and thus lower coalescence rate and higher genetic diversity. In the first version, migration rates are constant but there are differences in effective population size. Since demes in the "east" and "west" of the habitat are 5 times bigger than those in the middle, the effect is a barrier to effective migration that is qualitatively very similar to the true barrier in the previous simulation.

[A few edges are directed, with rate $M_{\alpha\beta} = 0.2$ from a big deme to a small deme and rate $M_{\beta\alpha} = 1$ in the other direction. These edges cross the "boundary" between the areas of high and low diversity and their rates are assigned so that migration is conservative: the same number of migrants are exchanged between $\alpha$ and $\beta$ because $N_\alpha M_{\alpha\beta} = N_\beta M_{\beta\alpha}$.]
ms 20 1 -s 1 -I 20 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
-n 1 5.0 -n 2 5.0 -n 3 1.0 -n 4 1.0 -n 5 1.0 -n 6 5.0 -n 7 1.0
-n 8 1.0 -n 9 1.0 -n 10 5.0 -n 11 5.0 -n 12 1.0 -n 13 1.0 -n 14 1.0
-n 15 5.0 -n 16 1.0 -n 17 1.0 -n 18 1.0 -n 19 5.0 -n 20 5.0
-m 1 2 1.0 -m 2 1 1.0 -m 1 6 1.0 -m 6 1 1.0 -m 2 3 0.2 -m 3 2 1.0
-m 2 6 1.0 -m 6 2 1.0 -m 2 7 0.2 -m 7 2 1.0 -m 3 4 1.0 -m 4 3 1.0
-m 3 7 1.0 -m 7 3 1.0 -m 3 8 1.0 -m 8 3 1.0 -m 4 5 1.0 -m 5 4 1.0
-m 4 8 1.0 -m 8 4 1.0 -m 4 9 1.0 -m 9 4 1.0 -m 5 9 1.0 -m 9 5 1.0
-m 5 10 1.0 -m 10 5 0.2 -m 6 7 0.2 -m 7 6 1.0 -m 6 11 1.0 -m 11 6 1.0
-m 6 12 0.2 -m 12 6 1.0 -m 7 8 1.0 -m 8 7 1.0 -m 7 12 1.0 -m 12 7 1.0
-m 7 13 1.0 -m 13 7 1.0 -m 8 9 1.0 -m 9 8 1.0 -m 8 13 1.0 -m 13 8 1.0
-m 8 14 1.0 -m 14 8 1.0 -m 9 10 1.0 -m 10 9 0.2 -m 9 14 1.0 -m 14 9 1.0
-m 9 15 1.0 -m 15 9 0.2 -m 10 15 1.0 -m 15 10 1.0 -m 11 12 0.2
-m 12 11 1.0 -m 11 16 0.2 -m 16 11 1.0 -m 12 13 1.0 -m 13 12 1.0
-m 12 16 1.0 -m 16 12 1.0 -m 12 17 1.0 -m 17 12 1.0 -m 13 14 1.0
-m 14 13 1.0 -m 13 17 1.0 -m 17 13 1.0 -m 13 18 1.0 -m 18 13 1.0
-m 14 15 1.0 -m 15 14 0.2 -m 14 18 1.0 -m 18 14 1.0 -m 14 19 1.0
-m 19 14 0.2 -m 15 19 1.0 -m 19 15 1.0 -m 15 20 1.0 -m 20 15 1.0
-m 16 17 1.0 -m 17 16 1.0 -m 17 18 1.0 -m 18 17 1.0 -m 18 19 1.0
-m 19 18 0.2 -m 19 20 1.0 -m 20 19 1.0
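The conservative-migration condition $N_\alpha M_{\alpha\beta} = N_\beta M_{\beta\alpha}$ can be checked mechanically. The Python sketch below does this for three demes whose sizes and rates are copied from the `-n` and `-m` flags of the command above; the function name and the tolerance are assumptions of this example.

```python
import math


def nonconservative_edges(sizes, rates, rel_tol=1e-9):
    """Return the edges (i, j), i < j, where N_i * M_ij differs from N_j * M_ji."""
    violations = []
    for (i, j), rate in rates.items():
        if i < j:
            forward = sizes[i] * rate
            backward = sizes[j] * rates[(j, i)]
            if not math.isclose(forward, backward, rel_tol=rel_tol):
                violations.append((i, j))
    return violations


# Demes 1 and 2 are big (size 5), deme 3 is small (size 1).
sizes = {1: 5.0, 2: 5.0, 3: 1.0}
rates = {(1, 2): 1.0, (2, 1): 1.0, (2, 3): 0.2, (3, 2): 1.0}
violations = nonconservative_edges(sizes, rates)
```

The edge between demes 2 and 3 crosses the boundary between large and small demes, and the pair of rates 0.2 and 1.0 balances the fivefold difference in size.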
In the second version, differences in migration rates compensate for differences in deme size because $N_\gamma M_{\gamma\eta} = N_\eta M_{\eta\gamma}$ for all edges $(\gamma, \eta) \in E$. The result is no variation in effective migration although both the deme sizes and the migration rates vary across the habitat.
ms 20 1 -s 1 -I 20 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
-n 1 5.0 -n 2 5.0 -n 3 1.0 -n 4 1.0 -n 5 1.0 -n 6 5.0 -n 7 1.0
-n 8 1.0 -n 9 1.0 -n 10 5.0 -n 11 5.0 -n 12 1.0 -n 13 1.0 -n 14 1.0
-n 15 5.0 -n 16 1.0 -n 17 1.0 -n 18 1.0 -n 19 5.0 -n 20 5.0
-m 1 2 0.2 -m 2 1 0.2 -m 1 6 0.2 -m 6 1 0.2 -m 2 3 0.2 -m 3 2 1.0
-m 2 6 0.2 -m 6 2 0.2 -m 2 7 0.2 -m 7 2 1.0 -m 3 4 1.0 -m 4 3 1.0
-m 3 7 1.0 -m 7 3 1.0 -m 3 8 1.0 -m 8 3 1.0 -m 4 5 1.0 -m 5 4 1.0
-m 4 8 1.0 -m 8 4 1.0 -m 4 9 1.0 -m 9 4 1.0 -m 5 9 1.0 -m 9 5 1.0
-m 5 10 1.0 -m 10 5 0.2 -m 6 7 0.2 -m 7 6 1.0 -m 6 11 0.2 -m 11 6 0.2
-m 6 12 0.2 -m 12 6 1.0 -m 7 8 1.0 -m 8 7 1.0 -m 7 12 1.0 -m 12 7 1.0
-m 7 13 1.0 -m 13 7 1.0 -m 8 9 1.0 -m 9 8 1.0 -m 8 13 1.0 -m 13 8 1.0
-m 8 14 1.0 -m 14 8 1.0 -m 9 10 1.0 -m 10 9 0.2 -m 9 14 1.0 -m 14 9 1.0
-m 9 15 1.0 -m 15 9 0.2 -m 10 15 0.2 -m 15 10 0.2 -m 11 12 0.2
-m 12 11 1.0 -m 11 16 0.2 -m 16 11 1.0 -m 12 13 1.0 -m 13 12 1.0
-m 12 16 1.0 -m 16 12 1.0 -m 12 17 1.0 -m 17 12 1.0 -m 13 14 1.0
-m 14 13 1.0 -m 13 17 1.0 -m 17 13 1.0 -m 13 18 1.0 -m 18 13 1.0
-m 14 15 1.0 -m 15 14 0.2 -m 14 18 1.0 -m 18 14 1.0 -m 14 19 1.0
-m 19 14 0.2 -m 15 19 0.2 -m 19 15 0.2 -m 15 20 0.2 -m 20 15 0.2
-m 16 17 1.0 -m 17 16 1.0 -m 17 18 1.0 -m 18 17 1.0 -m 18 19 1.0
-m 19 18 0.2 -m 19 20 0.2 -m 20 19 0.2
7.8.3 Spatial structure due to a split event
Here the effect of a barrier to effective migration is produced by a past event that zeroes out some migration rates and thus disconnects the "east" and "west" regions of the habitat. The split is instantaneous and occurs $3N_0$ generations back in the past. This creates a barrier in time that is detected as a barrier to effective migration.

Figure 7.5: Barrier to effective migration due to a split in time and otherwise uniform migration rates. The dashed edges are disconnected at the same time in the past.
ms 20 1 -s 1 -I 20 4 3 0 0 0 3 0 0 0 0 0 0 0 0 3 0 0 0 3 4 0
-m 1 2 1.0 -m 2 1 1.0 -m 1 6 1.0 -m 6 1 1.0 -m 2 3 1.0 -m 3 2 1.0
-m 2 6 1.0 -m 6 2 1.0 -m 2 7 1.0 -m 7 2 1.0 -m 5 10 1.0 -m 10 5 1.0
-m 6 7 1.0 -m 7 6 1.0 -m 6 11 1.0 -m 11 6 1.0 -m 6 12 1.0 -m 12 6 1.0
-m 9 10 1.0 -m 10 9 1.0 -m 9 15 1.0 -m 15 9 1.0 -m 10 15 1.0
-m 15 10 1.0 -m 11 12 1.0 -m 12 11 1.0 -m 11 16 1.0 -m 16 11 1.0
-m 14 15 1.0 -m 15 14 1.0 -m 14 19 1.0 -m 19 14 1.0 -m 15 19 1.0
-m 19 15 1.0 -m 15 20 1.0 -m 20 15 1.0 -m 18 19 1.0 -m 19 18 1.0
-m 19 20 1.0 -m 20 19 1.0
-em 3.0 3 7 1.0 -em 3.0 3 8 1.0 -em 3.0 3 4 1.0 -em 3.0 4 3 1.0
-em 3.0 4 8 1.0 -em 3.0 4 9 1.0 -em 3.0 4 5 1.0 -em 3.0 5 4 1.0
-em 3.0 5 9 1.0 -em 3.0 7 12 1.0 -em 3.0 7 13 1.0 -em 3.0 7 8 1.0
-em 3.0 7 3 1.0 -em 3.0 8 7 1.0 -em 3.0 8 13 1.0 -em 3.0 8 14 1.0
-em 3.0 8 9 1.0 -em 3.0 8 4 1.0 -em 3.0 8 3 1.0 -em 3.0 9 8 1.0
-em 3.0 9 14 1.0 -em 3.0 9 5 1.0 -em 3.0 9 4 1.0 -em 3.0 12 16 1.0
-em 3.0 12 17 1.0 -em 3.0 12 13 1.0 -em 3.0 12 7 1.0 -em 3.0 13 12 1.0
-em 3.0 13 17 1.0 -em 3.0 13 18 1.0 -em 3.0 13 14 1.0 -em 3.0 13 8 1.0
-em 3.0 13 7 1.0 -em 3.0 14 13 1.0 -em 3.0 14 18 1.0 -em 3.0 14 9 1.0
-em 3.0 14 8 1.0 -em 3.0 16 17 1.0 -em 3.0 16 12 1.0 -em 3.0 17 16 1.0
-em 3.0 17 18 1.0 -em 3.0 17 13 1.0 -em 3.0 17 12 1.0 -em 3.0 18 17 1.0
-em 3.0 18 14 1.0 -em 3.0 18 13 1.0
8 Bibliography
D. Babić, D. J. Klein, I. Lukovits, S. Nikolić, and N. Trinajstić. Resistance-distance matrix: A computational algorithm and its application. International Journal of Quantum Chemistry, 90(1):166–176, 2002.
M. Bahlo and R. C. Griffiths. Coalescence time for two genes from a subdivided population. Journal of Mathematical Biology, 43(5):397–410, 2001.
R. B. Bapat. Resistance matrix of a weighted graph. MATCH: Communications in Mathematical and in Computer Chemistry, 50:73–82, 2004.
R. B. Bapat and T. E. S. Raghavan. Nonnegative matrices and applications. Cambridge University Press, 1997.
P. Beerli and J. Felsenstein. Maximum likelihood estimation of a migration matrix and effective population sizes in n subpopulations by using a coalescent approach. Proceedings of the National Academy of Sciences (PNAS), 98(8):4563–4568, 2001.
C. A. Brewer, G. W. Hatchard, and M. A. Harrower. ColorBrewer in print: a catalog of color schemes for maps. Cartography and Geographic Information Science, 30(1):5–32, 2003.
S. D. Byers and A. E. Raftery. Bayesian estimation and segmentation of spatial point processes using Voronoi tilings. In Andrew B. Lawson and David G. T. Denison, editors, Spatial Cluster Modeling, pages 109–121. Chapman & Hall, 2002.
L. L. Cavalli-Sforza, P. Menozzi, and A. Piazza. The history and geography of human genes. Princeton University Press, 1994.
A. K. Chandra, P. Raghavan, W. L. Ruzzo, R. Smolensky, and P. Tiwari. The electrical resistance of a graph captures its commute and cover times. Computational Complexity, 6(4):312–340, 1996.
A. G. Clark, M. J. Hubisz, C. D. Bustamante, S. H. Williamson, and R. Nielsen. Ascertainment bias in studies of human genome-wide polymorphism. Genome Research, 15(11):1496–1502, 2005.
C. C. Cockerham. Variance of gene frequencies. Evolution, 23(1):72–84, 1969.
J. Felsenstein. A pain in the torus: Some difficulties with models of isolation by distance. The American Naturalist, 109(967):359–368, 1975.
J. C. Gower. Properties of Euclidean and non-Euclidean distance matrices. Linear Algebra and its Applications, 67(1):81–97, 1985.
G. Guillot, A. Estoup, F. Mortier, and J. F. Cosson. A spatial statistical model for landscape genetics. Genetics, 170(3):1261–1280, 2005.
E. M. Hanks and M. B. Hooten. Circuit theory and model-based inference for landscape connectivity. Journal of the American Statistical Association, 108(501):22–33, 2013.
J. Hey. A multi-dimensional coalescent process applied to multi-allelic selection models and migration models. Theoretical Population Biology, 39(1):30–48, 1991.
M. W. Horton, A. M. Hancock, Y. S. Huang, C. Toomajian, S. Atwell, et al. Genome-wide patterns of genetic variation in worldwide Arabidopsis thaliana accessions from the RegMap panel. Nature Genetics, 44(2):212–216, 2012.
M. J. Hubisz, D. Falush, M. Stephens, and J. K. Pritchard. Inferring weak population structure with the assistance of sample group information. Molecular Ecology Resources, 9(5):1322–1332, 2009.
R. R. Hudson. Gene genealogies and the coalescent process. In Douglas Futuyma and Janis Antonovics, editors, Oxford surveys in evolutionary biology, volume 7, pages 1–44. Oxford University Press, 1990.
R. R. Hudson. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics, 18(2):337–338, 2002.
M. Kimura. The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics, 61(4):893–903, 1969.
M. Kimura and G. H. Weiss. The stepping stone model of population structure and the decrease of genetic correlation with distance. Genetics, 49(4):561–576, 1964.
J. F. C. Kingman. On the genealogy of large populations. Journal of Applied Probability, 19(A):27–43, 1982a.
J. F. C. Kingman. The coalescent. Stochastic Processes and their Applications, 13(3):235–248, 1982b.
O. Lao, T. T. Lu, M. Nothnagel, O. Junge, S. Freitag-Wolf, A. Caliebe, et al. Correlation between genetic and geographic structure in Europe. Current Biology, 18(16):1241–1248, 2008.
D. J. Lawson and D. Falush. Population identification using genetic data. Annual Review of Genomics and Human Genetics, 13:337–361, 2012.
J. Y. Lee and S. V. Edwards. Divergence across Australia's Carpentarian barrier: Statistical phylogeography of the red-backed fairy wren Malurus melanocephalus. Evolution, 62(12):3117–3134, 2008.
D. A. Levin, Y. Peres, and E. L. Wilmer. Markov chains and mixing times. American Mathematical Society, 2008.
P. McCullagh. Marginal likelihood for distance matrices. Statistica Sinica, 19:631–649, 2009.
B. H. McRae. Isolation by resistance. Evolution, 60(8):1551–1561, 2006.
B. H. McRae, B. G. Dickson, T. H. Keitt, and V. B. Shah. Using circuit theory to model connectivity in ecology, evolution, and conservation. Ecology, 89(10):2712–2742, 2008.
G. McVean. A genealogical interpretation of principal components analysis. PLoS Genetics, 5(10):e1000686, 2009.
P. Menozzi, A. Piazza, and L. L. Cavalli-Sforza. Synthetic maps of human gene frequencies in Europeans. Science, 201(4358):786–792, 1978.
T. Nagylaki. The strong-migration limit in geographically structured populations. Journal of Mathematical Biology, 9(2):101–114, 1980.
M. Nei. Analysis of gene diversity in subdivided populations. Proceedings of the National Academy of Sciences (PNAS), 70(12):3321–3323, 1973.
M. R. Nelson, K. Bryc, K. S. King, A. Indap, A. R. Boyko, J. Novembre, et al. The population reference sample, POPRES: A resource for population, disease, and pharmacological genetics research. The American Journal of Human Genetics, 83(3):347–358, 2008.
R. Nielsen. Estimation of population parameters and recombination rates from single nucleotide polymorphisms. Genetics, 154(2):931–942, 2000.
M. Nordborg, T. T. Hu, Y. Ishino, J. Jhaveri, C. Toomajian, H. Zheng, E. Bakker, et al. The pattern of polymorphism in Arabidopsis thaliana. PLoS Biology, 3(7):e196, 2005.
M. Notohara. The coalescent and the genealogical process in geographically structured population. Journal of Mathematical Biology, 29(1):59–75, 1990.
M. Notohara. The strong-migration limit for the genealogical process in geographically structured populations. Journal of Mathematical Biology, 31(2):115–122, 1993.
J. Novembre and M. Stephens. Interpreting principal component analyses of spatial population genetic variation. Nature Genetics, 40(5):646–649, 2008.
J. Novembre, T. Johnson, K. Bryc, Z. Kutalik, A. R. Boyko, A. Auton, A. Indap, K. S. King, S. Bergmann, M. R. Nelson, M. Stephens, and C. D. Bustamante. Genes mirror geography within Europe. Nature, 456(7218):98–101, 2008.
A. Okabe, B. Boots, K. Sugihara, and S. N. Chiu. Spatial tessellations: concepts and applications of Voronoi diagrams. Wiley Series in Probability and Statistics. Wiley, 2000.
A. Platt, M. Horton, Y. S. Huang, Y. Li, A. E. Anastasio, et al. The scale of population structure in Arabidopsis thaliana. PLoS Genetics, 6(2):e1000843, 2010.
A. L. Price, N. J. Patterson, R. M. Plenge, M. E. Weinblatt, N. A. Shadick, and D. Reich. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38(8):904–909, 2006.
A. L. Price, N. A. Zaitlen, D. Reich, and N. Patterson. New approaches to population stratification in genome-wide association studies. Nature Reviews Genetics, 11(7):459–463, 2010.
J. K. Pritchard, M. Stephens, and P. Donnelly. Inference of population structure using multilocus genotype data. Genetics, 155(2):945–959, 2000.
N. A. Rosenberg, S. Mahajan, S. Ramachandran, C. Zhao, J. K. Pritchard, and M. W. Feldman. Clines, clusters, and the effect of study design on the inference of human population structure. PLoS Genetics, 1(6):e70, 2005.
F. Rousset. Genetic differentiation and estimation of gene flow from F-statistics under isolation by distance. Genetics, 145(4):1219–1228, 1997.
F. Rousset. Genetic structure and selection in subdivided populations. Princeton University Press, 2004.
F. Rousset. GENEPOP'007: a complete re-implementation of the GENEPOP software for Windows and Linux. Molecular Ecology Resources, 8:103–106, 2008.
D. Serre and S. Pääbo. Evidence for gradients of human genetic diversity within and among continents. Genome Research, 14(9):1679–1685, 2004.
M. Slatkin. Inbreeding coefficients and coalescence times. Genetical Research, 58(2):167–175, 1991.
M. Stephens. Bayesian analysis of mixture models with an unknown number of components: an alternative to reversible jump methods. The Annals of Statistics, 28(1):40–74, 2000.
C. Strobeck. Average number of nucleotide differences in a sample from a single subpopulation: a test for population subdivision. Genetics, 117(1):149–153, 1987.
C. Tian, R. M. Plenge, M. Ransom, A. Lee, P. Villoslada, C. Selmi, et al. Analysis and application of European genetic substructure using 300K SNP information. PLoS Genetics, 4(1):e4, 2008.
M. N. M. van Lieshout. Markov point processes and their applications. Imperial College Press, 2000.
A. P. Verbyla. A conditional derivation of residual maximum likelihood. Australian Journal of Statistics, 32(2):227–230, 1990.
C. Wang, Z. A. Szpiech, J. H. Degnan, M. Jakobsson, T. J. Pemberton, J. A. Hardy, A. B. Singleton, and N. A. Rosenberg. Comparing spatial maps of human population-genetic variation using Procrustes analysis. Statistical Applications in Genetics and Molecular Biology, 9(1):Article 13, 2010.
C. Wang, S. Zöllner, and N. A. Rosenberg. A quantitative comparison of the similarity between genes and geography in worldwide human populations. PLoS Genetics, 8(8):e1002886, 2012.
S. K. Wasser, A. M. Shedlock, K. Comstock, E. A. Ostrander, B. Mutayoba, and M. Stephens. Assigning African elephant DNA to geographic region of origin: Applications to the ivory trade. Proceedings of the National Academy of Sciences (PNAS), 101(41):14847–14852, 2004.
S. K. Wasser, C. Mailand, R. Booth, B. Mutayoba, E. Kisamo, B. Clark, and M. Stephens. Using DNA to track the origin of the largest ivory seizure since the 1989 trade ban. Proceedings of the National Academy of Sciences (PNAS), 104(10):4228–4233, 2007.
G. H. Weiss and M. Kimura. A mathematical analysis of the stepping stone model of genetic correlation. Journal of Applied Probability, 2(1):129–149, 1965.
S. Wright. Isolation by distance. Genetics, 28(2):114–138, 1943.