Transcript
Page 1: INFERRING EFFECTIVE MIGRATION FROM … stephenslab.uchicago.edu/assets/papers/petkova-thesis.pdf

DESISLAVA PETKOVA

INFERRING EFFECTIVE MIGRATION FROM GEOGRAPHICALLY INDEXED GENETIC DATA

THE UNIVERSITY OF CHICAGO


Contents

1 Population Structure in Genetic Variation 6

2 Population Structure due to Migration 8

3 Genetic Dissimilarities and Distance Matrices 20

4 Estimating Effective Rates of Migration 28

5 Simulations of Structured Genetic Data 35

6 Empirical Results 42

7 Appendices 58

8 Bibliography 77


List of Figures

2.1 A genealogy describes the ancestral history of a genotyped sample 8
2.2 A random walk approximates the migration process in a population graph 18
2.3 Effective resistances approximate expected coalescence times: relative error 19

5.1 Population structure under uniform migration 35
5.2 Population structure due to a barrier to migration 36
5.3 Uncertainty in the inferred migration surface 36
5.4 Barrier to migration with ascertainment bias 38
5.5 Population structure due to differences in deme size 39
5.6 A past demographic event results in a barrier to effective migration 39
5.7 Barrier to migration with uneven sampling 40

6.1 Habitat of the red-backed fairywren with the Carpentarian barrier 42
6.2 PCA and STRUCTURE analysis of the red-backed fairywren data 43
6.3 Distance scatterplot for the red-backed fairywren data 44
6.4 Triangular population graph spans the habitat of the red-backed fairywren 44
6.5 Inferred effective migration surface for the red-backed fairywren 45
6.6 Uncertainty in the inferred effective migration of the red-backed fairywren 45
6.7 Triangular population graph spans the habitat of the African elephant 46
6.8 PCA analysis of the elephant data 47
6.9 Inferred effective migration surface for the African elephant 47
6.10 Effective migration rates at each of sixteen microsatellites 48
6.11 Inferred effective migration surface for the savanna and forest elephants 48
6.12 GENELAND analysis of the African elephant data 49
6.13 STRUCTURE analysis of the African elephant data 49
6.14 Distance scatterplots for the African elephant data 50
6.15 Sample configuration and PCA analysis of the European and African data 51
6.16 Distance scatterplots for the European and African data 52
6.17 Inferred effective migration for human populations in Europe and Africa 53
6.18 Sample configuration and PCA analysis of Arabidopsis thaliana data 54
6.19 Inferred effective migration surfaces for Arabidopsis thaliana 55
6.20 Distance scatterplots for the Arabidopsis thaliana data 56

7.1 ms command: uniform migration on a regular triangular grid 74
7.2 ms command: barrier to migration on a regular triangular grid 74
7.3 ms command: barrier to effective migration due to differences in population size 75
7.4 ms command: uniform effective migration despite differences in population size 75
7.5 ms command: barrier to effective migration due to a split in time 76


Genetic data often exhibit patterns that are broadly consistent with "isolation by distance", a phenomenon where genetic similarity tends to decay with geographic distance. In a heterogeneous habitat, decay may occur more quickly in some regions than others: for example, barriers to gene flow such as mountains or deserts could accelerate the genetic differentiation between neighboring groups. In this thesis we present a method to quantify and visualize variation in effective migration across the habitat, and, under further assumptions, to infer the presence or absence of barriers to migration, from geographically indexed large-scale genetic data.


First and foremost, I would like to express my deepest gratitude to my supervisor, Professor Matthew Stephens, for his guidance and encouragement throughout the development of this project. His tremendous support made this formidable journey less complicated, if not easier.

I am also grateful to my colleagues at the Departments of Statistics and Human Genetics for their inspiring companionship, for their collective critical eye, but above all, for their advice, assistance and guidance on many difficult problems.

I am indebted to so many but I especially wish to acknowledge the efforts, love and encouragement of my parents. They watched me from a distance while I worked towards my degree. The completion of this thesis would mean a lot to them, so I dedicate this project to my parents, Ema and Ivo.

And finally, I would like to thank my sister and my friends for allowing me to realize my own potential and without whose love, affection and encouragement this thesis and many other pursuits would not have been successful.


1 Population Structure in Genetic Variation

The term "population structure" is used to describe nonrandom patterns of genetic similarity (or alternatively, dissimilarity) between individuals from the same species. One task is to detect such patterns; this is often done in association studies because systematic ancestry differences between cases and controls that are not genetic risk factors for the disease can bias the results of the study [Price et al., 2010]. A more challenging task is to explain population structure as the outcome of events in the evolutionary history of the species such as splits or admixture (events in time) and/or migration (events in space) [Lawson and Falush, 2012].

[Margin note: Admixture is partial ancestry from two or more distinct subpopulations as the result of interbreeding.]

Two widely used approaches for inferring genetic ancestry are principal component analysis and model-based clustering. In both cases, interpretation of results and inference of demography are founded on the assumption that sample structure is evidence for population structure, to the exclusion of other possible sources such as family structure, cryptic relatedness or sample processing artifacts.

Principal component analysis (PCA) was first used in population genetics to summarize human genetic variation across continents [Menozzi et al., 1978, Cavalli-Sforza et al., 1994]. Their synthetic maps of allele frequency variation show gradients that could support hypotheses for specific migration events such as the spread of Neolithic farming. This interpretation of PCA maps is not universally accepted because PCA can produce similar wave patterns in simulated spatial data, where gradients result from local dispersal and not directed migration [Novembre and Stephens, 2008]. However, even though PCA might not explain what processes generated the structured variation in genetic data, the method has been successfully applied to detect population stratification and infer genetic ancestry. For example, the top principal components of the sample covariance matrix across a large number of (randomly selected) SNPs align well with geographic distribution in some datasets [Novembre et al., 2008, Wang et al., 2012].

Alternatively, population structure can be analyzed with a model-based clustering approach. For example, STRUCTURE [Pritchard et al., 2000] assigns individuals into $K$ genetically homogeneous subpopulations [i.e., random mating and hence under Hardy-Weinberg equilibrium], with individual-specific ancestry proportions. As a clustering algorithm, STRUCTURE assumes the number of clusters is known. Even more importantly, it uses a discrete model of population structure that is most applicable where high levels of divergence have resulted in well-differentiated clusters.

Both PCA and STRUCTURE can produce results that are difficult to interpret. For example, STRUCTURE can fail if the population consists of distinct groups characterized by small differences in allele frequencies, or a single population where the distribution of allele frequencies varies continuously across space. In both cases, it is hard for a clustering algorithm to distinguish between clusters, or find the correct number of clusters. The study design can also influence the extent of observed "clusteredness" as many datasets consist of multiple observations from few locations [Serre and Pääbo, 2004]. If the sample configuration does not represent the geographic distribution of the species, much naturally occurring genetic variation remains unobserved. And indeed the genetic differentiation of a widely distributed species such as humans is likely to exhibit evidence for both clusters, which correspond to discontinuous jumps in allele frequencies across large barriers such as oceans or the Himalayas, and clines, which reflect smooth gradations in allele frequencies across unbroken geographic regions [Rosenberg et al., 2005].

In the case of low differentiation between clusters, STRUCTURE results can be improved with a stronger, more informative prior on cluster membership. For example, [Hubisz et al., 2009] introduces a prior that places more weight on cluster assignments that are correlated with sampling locations (because origin is often informative about ancestry). In another modification of STRUCTURE that incorporates geographic information, [Guillot et al., 2005] explicitly models the distribution of clusters across the habitat to encourage spatially continuous clusters (because subspecies often occupy locally connected areas).

In the case of smoothly varying population structure, it is not appropriate to assign individuals to a fixed number of distinct clusters, even if the clustering method allows fractional membership. PCA is effective in presenting continuous variation and PC projections are related to the underlying genealogical process [McVean, 2009]. However, the algorithm is not based on a population genetics model, so it does not estimate relevant demographic parameters, and its results are strongly affected by uneven sampling.

Genetic data often exhibit patterns that are broadly consistent with "isolation by distance" [Weiss and Kimura, 1965, Rousset, 1997] where genetic similarity tends to decay with geographic distance. That is, a population in which the exchange of migrants is constant in both space and time still has structure as individuals that are close together are, on average, more genetically similar than individuals that are far apart [if reproduction and dispersal tend to occur locally, over small distances, in every generation].

In a heterogeneous habitat, genetic similarity may decrease faster in some regions than others because a barrier to migration could accelerate the genetic differentiation between neighboring groups, thus creating patterns of population structure that are not consistent with uniform migration. Here we develop a method aimed at investigating this kind of scenario. Specifically, we introduce a parametric model for genetic structure that attempts to explain the spatial structure observed in geographically indexed large-scale genetic data in terms of effective rates of migration. We say "effective" because the model's applicability to genetic data is motivated under a series of assumptions [most importantly, equilibrium in time] that mean estimated rates cannot be interpreted as actual rates of migration unless the assumptions are reasonably satisfied. However, even when estimated population parameters are not directly interpretable in terms of demographic history, our method provides an intuitive and informative way to quantify and visualize spatial patterns of population structure.

[Figure 2.1 image: genealogical tree for a sample of size 6 with leaves labeled 1 (i), 1 (j), 1, 0, 0, 0; a single mutation marked •; branch lengths $t_{ij}$ and $t_{mrca} - t_{ij}$ indicated.]

Figure 2.1: This genealogy specifies one possible history for a sample of size 6. Since exactly one mutation occurs, denoted by •, we observe a pattern of both 0s and 1s. Regardless of which branch carries the mutation, the events 'only 0s' and 'only 1s' are excluded.

2 Population Structure due to Migration

In this background chapter we explain how population structure is reflected in observed genetic data via the genealogy of the sample and review briefly a mathematical model for spatially structured populations.

In natural populations, mating is not random due to a complex mixture of evolutionary and ecological factors. Non-random mating creates structure in genetic variation as closely related individuals tend to be more similar genetically than distantly related individuals. Thus shared ancestry leads to genetic similarity [Section 2.1].

An important factor for non-random mating is geographic distance as two individuals located close in space are more likely to reproduce than two individuals far apart. Thus geographic proximity leads to genetic similarity, a phenomenon called isolation by distance [Section 2.2].

A population genetics model that exhibits isolation by distance is Kimura's stepping-stone model [Section 2.3]. It represents a spatially distributed population as a graph where vertices are groups of randomly mating individuals (called demes) and edges are direct routes of migration. Thus demes that are closer together in the graph tend to be more similar.

In fact, the stepping-stone model can capture the effect of both geographic distance and heterogeneous habitat on genetic similarity as edges can have different migration rates to reflect heterogeneity in gene flow. This weighted population graph describes precisely what it means for two demes to be "close together" [Section 2.4].

2.1 Pairwise expected coalescence times explain population structure

The more closely related two individuals are, the more genetically similar they are. Therefore, the genetic similarities observed in a sample contain information about the evolutionary processes undergone by the entire population. In this section we explain the connection between genealogical histories and genetic similarities; the review is largely based on [McVean, 2009].

Let $z_1, \dots, z_n$ be the genotypes of $n$ individuals at a single segregating locus. For simplicity, assume the genetic markers are biallelic (e.g., SNPs): each individual carries either the ancestral allele, labeled '0', or the derived allele, labeled '1'.

Although life occurs forward in time and in discrete generations, it is often more convenient to model the ancestry of a sample backwards in time using a continuous-time process called the coalescent [Kingman, 1982b,a] that traces the lineages backwards in time until their convergence into a single common ancestor. Thus the coalescent constructs the history of the sample, at a single locus, in the form of a genealogical tree [Figure 2.1]. The most important demographic functions of the genealogy are

• the time to the most recent common ancestor $t_{mrca}$ [the height of the tree];

• the total size of the tree $t_{tot}$ [the sum of all branches];

• the pairwise time to coalescence $t_{ij}$ for every pair $(i, j)$ [the length of the path from $i$, or equivalently from $j$, to the most recent common ancestor of $i$ and $j$].

With a slight abuse of notation, let $K_t$ denote the number of mutations that occur on a path with length $t$. If mutations are generated by a Poisson process with intensity [mutation rate] $\theta$, the probability that the path accumulates a mutation depends only on its relative length, not on its position within the genealogy. In particular, $P\{K_t = 0\} = E\{e^{-\theta t}\}$. Similarly, let $K_{t_{tot}}$ denote the number of mutations that occur throughout the genealogy. Thus $\{K_{t_{tot}} > 0\}$ is the event that the site segregates in the sample. For a fixed mutation rate $\theta$, the probability that at least one mutation occurs on a path with length $t$ and none in the rest of the tree is

$$P\{K_t > 0,\ K_{t_{tot}-t} = 0\} = E\{(1 - e^{-\theta t})\, e^{-\theta(t_{tot}-t)}\}. \tag{2.1}$$

Similarly, the probability that the site segregates is

$$P\{K_{tot} > 0\} = E\{1 - e^{-\theta t_{tot}}\}, \tag{2.2}$$

where the expectation is with respect to all possible genealogies of the sample and $K_{tot}$ abbreviates $K_{t_{tot}}$.

[Nielsen, 2000] argues that if we assume the mutation rate is low and condition on the site segregating in the sample, then the mutation rate $\theta$ is of little interest and so it can be treated as a nuisance parameter. Following [Nielsen, 2000] we can eliminate $\theta$ from the analysis by taking the limit $\theta \to 0$.

[Margin note: Alternatively, without explicitly making the infinitely-many-sites assumption, we can ignore the probability of event $\{K_t > 1\}$ if the mutation rate $\theta$ is very low.]

Under the infinitely-many-sites model [Kimura, 1969], the event of at least one mutation is equivalent to the event of exactly one mutation. Therefore, $\{K_t = 0\}$ and $\{K_t = 1\}$ are complementary events. Together with the low mutation limit $\theta \to 0$, this implies

$$P\{K_t = 1 \mid K_{tot} = 1\} = \frac{P\{K_t > 0,\ K_{tot} > 0\}}{P\{K_{tot} > 0\}} = \frac{P\{K_t > 0,\ K_{t_{tot}-t} = 0\}}{P\{K_{tot} > 0\}} \tag{2.3a}$$

$$= \lim_{\theta \to 0} \frac{\theta^{-1} E\{e^{-\theta(t_{tot}-t)} - e^{-\theta t_{tot}}\}}{\theta^{-1} E\{1 - e^{-\theta t_{tot}}\}} \tag{2.3b}$$

$$= \frac{E\{t_{tot} - (t_{tot} - t)\}}{E\{t_{tot}\}} = \frac{E\{t\}}{E\{t_{tot}\}} \equiv \frac{T}{T_{tot}}, \tag{2.3c}$$

where for convenience we denote the expectation of coalescence time $t$ by $T$.

[Margin note: Interchange limit and expectation [valid if $E\{t_{tot}\} < \infty$] and use the Taylor approximation $e^{-x} \sim 1 - x$.]

Therefore, for biallelic markers and under the conditions specified above, there is a relationship between expected coalescence times and the probability that a particular branch in the genealogy carries the derived allele. We will use it to derive the first two moments of the genotype vector $Z = (Z_1, \dots, Z_n)$.

Proposition 2.1 Suppose that a sample of size $n$ is collected from a population that evolves according to the neutral infinitely-many-sites model, where mutations are generated by a Poisson process with low mutation rate. At segregating sites where exactly one mutation occurs in the sample, the allele carried by individual $i$, denoted by $Z_i$, is a binary random variable such that

$$E^*\{Z_i\} = \frac{T_{mrca}}{T_{tot}}. \tag{2.4}$$

Furthermore, for two distinct individuals $i$ and $j$,

$$E^*\{Z_i Z_j\} = \frac{T_{mrca} - T_{ij}}{T_{tot}}. \tag{2.5}$$

Here the symbol $*$ indicates that the expectation is with respect to all possible sample genealogies with exactly one mutation.

Proof. In a genealogical tree with exactly one mutation, the $i$th lineage carries the derived allele if the mutation occurs anywhere on the path from the $i$th external branch to the most recent common ancestor of the entire sample. This path has length $t_{mrca}$ for all $i$; its average length is $T_{mrca}$. Therefore, the conditional probability of observing the derived allele is the same for every individual and

$$E^*\{Z_i\} = P\{K_{t_{mrca}} = 1 \mid K_{tot} = 1\} = \frac{T_{mrca}}{T_{tot}}. \tag{2.6}$$

That is, the genotypes at a biallelic marker are Bernoulli random variables with frequency $T_{mrca}/T_{tot}$. Furthermore, since the genotypes are binary, the event $\{Z_i = 1, Z_j = 1\} \Leftrightarrow \{Z_i Z_j = 1\}$ implies that the mutation occurs on the branch from the pair's most recent common ancestor to the most recent common ancestor of the sample. This ancestral branch has length $t_{mrca} - t_{ij}$. Therefore, the conditional expectation that two individuals $i$ and $j$ carry a common mutation at a biallelic marker is given by

$$E\{Z_i Z_j \mid K_{tot} = 1\} = P\{K_{t_{mrca}-t_{ij}} = 1 \mid K_{tot} = 1\} = \frac{E\{t_{mrca}\} - E\{t_{ij}\}}{E\{t_{tot}\}}. \tag{2.7}$$

□

Thus the individual genotypes have the same marginal distribution: the $Z_i$s are identically distributed but not independent. Finally, in equations (2.4) and (2.5) the expected coalescence times $T_{mrca}$, $T_{tot}$, $T_{ij}$ are marginal expectations with respect to all possible histories [genealogies] of the sample, not only histories that can induce the observed pattern of 0s and 1s.
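The moments in equations (2.4) and (2.5) can be illustrated on a single fixed genealogy. The Python sketch below is only an illustration: the four-sample tree, its branch lengths and the helper names are made up, not taken from the thesis. It places one mutation on the tree with probability proportional to branch length and computes the resulting moments exactly. For one fixed tree the expected times $T$ reduce to the realized times $t$; the proposition itself averages over genealogies.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical genealogy for samples a, b, c, d (times in arbitrary units):
# a and b coalesce at time 1, c and d at time 2, and the two pairs at time 5.
# Each branch is (length, set of samples that descend from it).
BRANCHES = [
    (Fraction(1), {"a"}),
    (Fraction(1), {"b"}),
    (Fraction(2), {"c"}),
    (Fraction(2), {"d"}),
    (Fraction(4), {"a", "b"}),  # from the (a, b) ancestor up to the root
    (Fraction(3), {"c", "d"}),  # from the (c, d) ancestor up to the root
]
T_MRCA = Fraction(5)
T_PAIR = {
    frozenset("ab"): Fraction(1),
    frozenset("cd"): Fraction(2),
    frozenset("ac"): T_MRCA,
    frozenset("ad"): T_MRCA,
    frozenset("bc"): T_MRCA,
    frozenset("bd"): T_MRCA,
}


def prob_all_derived(samples):
    """P(every listed sample carries the derived allele | exactly one mutation).

    The mutation lands on a branch with probability proportional to its length,
    and a sample is derived exactly when it descends from that branch.
    """
    total = sum(length for length, _ in BRANCHES)
    shared = sum(length for length, below in BRANCHES if set(samples) <= below)
    return shared / total


t_tot = sum(length for length, _ in BRANCHES)
first_moment = {s: prob_all_derived(s) for s in "abcd"}
pairs = [frozenset(pair) for pair in combinations("abcd", 2)]
second_moment = {pair: prob_all_derived(pair) for pair in pairs}
```

Every entry of `first_moment` equals `T_MRCA / t_tot`, and each entry of `second_moment` equals `(T_MRCA - T_PAIR[pair]) / t_tot`, so samples that coalesce early (a and b) share more ancestral branch length than samples that only meet at the root.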

The principle behind equation (2.5) states that the more history two individuals share, the more genetically similar they are. Here we should interpret "shared history" precisely as "common ancestral branch" in the genealogy rather than broadly as a "demographic past" in the sense of evolutionary history. Different models can produce the same expected genealogy. For example, a long branch separating two samples could correspond to a split into distinct subpopulations some time in the past or constant migration between two locations at a low rate. Conversely, without further assumptions, patterns of similarities and differentiation observed in genetic data reveal information about the underlying genealogies, and hence, indirectly, about the demographic model that generated them. In this thesis we average observed genetic similarities across markers; thus we ignore information (e.g., the variance) that could in principle improve the ability to distinguish demographic models.

2.1.1 Bias due to SNP ascertainment

Ascertainment bias refers to systematic deviations in the SNP discovery process where a small number of individuals are used to find sites polymorphic in the entire population [Clark et al., 2005]. In particular, rare SNPs are harder to ascertain and more likely to be underrepresented. Furthermore, the genetic variation in a geographic region could be misrepresented in a panel with unbalanced sample configuration. [McVean, 2009] observes that two samples are effectively involved in ascertainment: first a panel to discover SNPs for genotyping on a microchip and then a sample to genotype. We condition on sites that segregate in both samples and this can distort (the average shape of) the observed genealogies and thus produce misleading results. In this thesis we ignore SNP ascertainment as a potential source of sample structure.


2.2 Isolation by distance in a spatially distributed population

Geographic separation can act as a genetic barrier because in a natural population migration tends to be local rather than long-range. If long-distance migration events are rare, a mutation that arises in one area might take a long time to spread throughout the habitat (if at all). Consequently, individuals that are closer together tend to be more similar genetically than those that are far apart. This phenomenon is known as isolation by distance. However, the relationship between geography and genetic similarity also depends on dispersal. If the habitat is homogeneous and migration is characterized by the same dispersal density everywhere, genetic similarity decreases as a function of relative distance.

The effect of subdivision on population structure is often quantified in terms of a statistic called $F_{ST}$ that measures the genetic variation among subpopulations relative to the total genetic variation. Several definitions of $F_{ST}$ have been proposed [Wright, 1943, Cockerham, 1969, Nei, 1973]. We use Nei's definition where $F_{ST}$ is a function of the probabilities of identity within and between subpopulations. [Two lineages are identical, at a given locus, if they carry the same allele.] The $F$-statistic is defined as

$$F_{ST} = \frac{\phi_0 - \phi}{1 - \phi}, \tag{2.8}$$

where $\phi$ is the probability of identity for two individuals chosen at random without reference to geography, and $\phi_0$ is the probability of identity for two individuals chosen at random from the same subpopulation.

[Margin note: [Wright, 1943] introduces $F_{ST}$ as the statistic $\mathrm{var}\{p\}/[\bar{p}(1 - \bar{p})]$ where $\mathrm{var}\{p\}$ is the variance in allele frequency among subpopulations and $\bar{p}$ is the overall mean allele frequency in the population. Intuitively, $F_{ST}$ is high when individuals are similar within subpopulations and different between subpopulations.]
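A small worked example may help connect the identity-based definition (2.8) with Wright's variance form. The sketch below assumes a single biallelic locus, two equally weighted subpopulations with made-up allele frequencies, and sampling with replacement (so identity within a group with frequency $p$ is $p^2 + (1-p)^2$). These simplifications and all function names are illustrative assumptions, not part of the thesis.

```python
from fractions import Fraction


def fst_from_identity(phi0, phi):
    """Equation (2.8): F_ST = (phi0 - phi) / (1 - phi)."""
    return (phi0 - phi) / (1 - phi)


def identity_prob(p):
    """Probability that two independent draws at a biallelic locus match."""
    return p * p + (1 - p) * (1 - p)


def fst_wright(freqs):
    """Wright's form var{p} / [pbar (1 - pbar)], subpopulations equally weighted."""
    pbar = sum(freqs) / len(freqs)
    var = sum((p - pbar) ** 2 for p in freqs) / len(freqs)
    return var / (pbar * (1 - pbar))


# Hypothetical derived-allele frequencies in two equally sized subpopulations.
freqs = [Fraction(1, 5), Fraction(3, 5)]
pbar = sum(freqs) / len(freqs)

# phi0: both individuals come from the same (randomly chosen) subpopulation.
phi0 = sum(identity_prob(p) for p in freqs) / len(freqs)
# phi: both individuals are drawn without reference to geography.
phi = identity_prob(pbar)

fst_nei = fst_from_identity(phi0, phi)
```

Under these assumptions both routes give the same value, `Fraction(1, 6)`, because $\phi_0 - \phi = 2\,\mathrm{var}\{p\}$ and $1 - \phi = 2\bar{p}(1 - \bar{p})$ at a biallelic locus.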

As a measure of genetic differentiation, the $F$-statistic is related to coalescence times because identity means neither lineage accumulates a mutation in the time to most recent common ancestor. If the mutation process is Poisson with low mutation rate $\theta$,

$$\phi(\theta) = E\{e^{-\theta t}\} \approx 1 - \theta E\{t\}. \tag{2.9}$$

In this case, by substituting $\phi_0 \approx 1 - \theta T_0$ and $\phi \approx 1 - \theta T$ in equation (2.8), [Slatkin, 1991] derives the approximation

$$F_{ST} \approx \frac{T - T_0}{T}, \tag{2.10}$$

where $T_0$, $T$ are the expected coalescence times for a pair of distinct lineages sampled at random from the same subpopulation and from the entire population, respectively. The coalescent-based approximation to the $F$-statistic is very general: [Slatkin, 1991] derives it in the low mutation limit but otherwise makes no assumptions about the demographic model. Thus, the approximation holds under a subdivided population at equilibrium, a growing population, or a population that has undergone a split some time in the past.
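To see how quickly the low-mutation approximation (2.10) takes hold, the following sketch compares it with the exact ratio (2.8) in a toy setting. It assumes, purely for illustration, that both coalescence times are exponentially distributed, so that $E\{e^{-\theta t}\} = 1/(1 + \theta T)$ in closed form. Neither that distributional assumption nor the numerical values come from the thesis.

```python
def identity_exponential(theta, mean_time):
    """E{exp(-theta * t)} when t is exponential with mean `mean_time` (toy assumption)."""
    return 1.0 / (1.0 + theta * mean_time)


def fst_exact(theta, t0, t):
    """Equation (2.8) evaluated with the toy identity probabilities."""
    phi0 = identity_exponential(theta, t0)
    phi = identity_exponential(theta, t)
    return (phi0 - phi) / (1.0 - phi)


def fst_low_mutation(t0, t):
    """Equation (2.10): F_ST is approximately (T - T0) / T."""
    return (t - t0) / t


# Hypothetical expected coalescence times within a deme and overall.
T0, T = 1.0, 4.0
thetas = (1e-1, 1e-2, 1e-3, 1e-4)
gaps = [abs(fst_exact(theta, T0, T) - fst_low_mutation(T0, T)) for theta in thetas]
```

In this toy model the exact value simplifies to $(T - T_0)/[T(1 + \theta T_0)]$, so the entries of `gaps` shrink roughly in proportion to $\theta$.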

By analogy, [Rousset, 1997] considers the $F$-statistic for two demes separated by distance $x$,

$$F_{ST}(x) = \frac{\phi_0 - \phi_x}{1 - \phi_x} \approx \frac{T_x - T_0}{T_x}, \tag{2.11}$$

as well as the linearized $F$-statistic given by

$$\frac{F_{ST}(x)}{1 - F_{ST}(x)} \approx \frac{T_x - T_0}{T_0}. \tag{2.12}$$

[Rousset, 1997] analyzes the relationship between genetic differentiation, $F_{ST}$, and geographic distance, $x$, in a spatially-homogeneous stepping-stone model where demes are equally sized and regularly spaced (on a ring in one dimension and a torus in two dimensions), and migration is determined by a symmetric dispersal kernel. The important demographic parameters are the effective population density $D$ per length/area unit and the mean squared dispersal distance $\sigma^2$, which determines the speed at which two lineages with a common ancestor move away from each other in a generation. By symmetry, the probability of identity for two randomly sampled individuals is also a function of the relative distance $x$. [Rousset, 1997, 2004] derives the following large-distance approximations to the linearized $F_{ST}$,

$$\frac{F_{ST}(x)}{1 - F_{ST}(x)} \approx \frac{x}{4D\sigma^2} + C_1 \quad \text{(in one dimension)} \tag{2.13a}$$

$$\frac{F_{ST}(x)}{1 - F_{ST}(x)} \approx \frac{\ln(x/\sigma)}{4\pi D\sigma^2} + C_2 \quad \text{(in two dimensions)} \tag{2.13b}$$

where the constants $C_1$ and $C_2$ depend on the population density and the dispersal distribution but not on the population sizes or the mutation rate.

Therefore, if migration is uniform, the linearized $F_{ST}$ increases with geographic distance. This relationship is appropriate only for homogeneous habitats as it ignores the effect of barriers (or corridors) to migration: two demes separated by a barrier would appear to be more genetically dissimilar than relative distance would suggest. In other words, we need a measure of effective distance to describe the patterns of movement across the habitat.
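The approximations (2.13a) and (2.13b) can be written as small functions, which also makes the homogeneous-habitat assumption concrete: in two dimensions the linearized $F_{ST}$ is linear in $\ln x$ with slope $1/(4\pi D\sigma^2)$. The sketch below generates noise-free values from (2.13b) for made-up parameters and recovers $D\sigma^2$ from the fitted slope. The parameter values, distances and function names are illustrative assumptions only.

```python
import math


def linearized_fst_1d(x, density, sigma2, c1):
    """Equation (2.13a): x / (4 D sigma^2) + C1."""
    return x / (4.0 * density * sigma2) + c1


def linearized_fst_2d(x, density, sigma2, c2):
    """Equation (2.13b): ln(x / sigma) / (4 pi D sigma^2) + C2."""
    sigma = math.sqrt(sigma2)
    return math.log(x / sigma) / (4.0 * math.pi * density * sigma2) + c2


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical parameter values, chosen only for illustration.
D, SIGMA2, C2 = 2.0, 0.5, 0.1
distances = [5.0, 10.0, 20.0, 40.0, 80.0]
values = [linearized_fst_2d(x, D, SIGMA2, C2) for x in distances]

# Regress on ln(distance): the slope is 1 / (4 pi D sigma^2).
slope = ols_slope([math.log(x) for x in distances], values)
d_sigma2_estimate = 1.0 / (4.0 * math.pi * slope)
```

With exact inputs the estimate returns $D\sigma^2 = 1$ up to rounding. With a barrier between some pairs of demes, a single slope would no longer describe all pairs, which is the limitation the paragraph above points out.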

2.3 The stepping-stone model of population subdivision

Section 2.1 explains that coalescence times represent population structure because genetically similar individuals are likely to have a recent common ancestor and thus shorter coalescence time. The relationship between genetic correlation and coalescence times in equation (2.5) is very general. For example, [McVean, 2009] illustrates it with a model of population split in which groups derived from a common ancestor do not exchange migrants and thus develop independently after the split. In this thesis we aim to analyze the spatial structure of genetic variation, and therefore, we need to model [and apply equation (2.5) to] a spatially distributed population.

Kimura's stepping-stone model [Kimura and Weiss, 1964] represents a population across the span of its habitat as a connected grid of panmictic (randomly mating) demes (colonies) which exchange migrants in a fixed pattern. For simplicity, in this chapter we consider a haploid population. To extend the framework, a diploid individual can be represented as the sum of two independent haplotypes, one from each parent.

[Margin note: A haploid organism has a single copy of its genome; a diploid organism has two copies, one inherited from the father and the other from the mother.]

The stepping-stone model makes the following assumptions:

• There are $d$ demes and deme $\alpha$ consists of $N_\alpha$ randomly mating individuals. The total population number is $N_T = \sum_\alpha N_\alpha$ and the average deme size is $N_0 = N_T/d$. The demes remain constant in size and $N_\alpha \sim \mathcal{O}(N_0)$ for all $\alpha$.

• The mutation rate per site per generation is $u$ and the scaled mutation rate for two distinct lineages in $N_0$ generations is $\theta = 2N_0 u$.

• The coalescence rate for a pair of distinct lineages drawn at random from deme $\alpha$ is $q_\alpha = N_0/N_\alpha \sim \mathcal{O}(1)$. Two lineages coalesce when they merge into a common ancestor and in a single generation this event has probability $1/N_\alpha$.

• The migration rate for a lineage to move from deme $\alpha$ to deme $\beta \neq \alpha$ is $m_{\alpha\beta} \sim \mathcal{O}(1)$. The migration matrix $M = (m_{\alpha\beta})$, where $M_{\alpha\alpha} = -\sum_{\beta:\beta\neq\alpha} m_{\alpha\beta}$, describes the transition process of a single lineage backwards in time.

[Margin note: The ancestral process develops backwards in time, from the present towards the past. A coalescence event means that two individuals have the same parent and a migration event means that an individual from $\alpha$ has a parent from $\beta$.]

All rate parameters are constant in time and on the scale of $N_0$ generations. The assumptions $q_\alpha \sim \mathcal{O}(1)$ for every deme $\alpha$ and $m_{\alpha\beta} \sim \mathcal{O}(1)$ for every pair $(\alpha \neq \beta)$ imply that migration is weak. That is, the probability of multiple migration and/or coalescence events occurring in the same generation [before scaling by $N_0$] is $\mathcal{O}(N_0^{-2})$ and can be ignored.

The stepping-stone model describes how a spatially distributed population evolves under equilibrium in time, i.e., under the condition that both migration and coalescence rates are the same in every generation. Therefore, the model can characterize systematic differences between the groups due to gene flow but not due to splits or admixture events. In other words, the stepping-stone model can represent population structure in space but not in time. [As we show through simulations in Chapter 5, temporal structure can be explained as spatial structure, in terms of effective rates of migration.]

If demes of constant size exchange migrants at fixed rates as required under equilibrium, the number of individuals to emigrate is equal to the number of individuals to immigrate, i.e., migration is conservative [Nagylaki, 1980]. Mathematically,

$$\sum_{\beta:\beta\neq\alpha} m_{\alpha\beta}/q_\alpha = \sum_{\beta:\beta\neq\alpha} m_{\beta\alpha}/q_\beta \iff M'q^{-1} = 0 \tag{2.14}$$

where $q^{-1} = (q_\alpha^{-1}) = (N_\alpha/N_0)$ is the vector of inverse coalescence rates, i.e., relative deme sizes.

In a general stepping-stone model, migration is not necessarily symmetric. However, in this thesis we assume that $m_{\alpha\beta} = m_{\beta\alpha}$ for all edges $(\alpha, \beta)$. The condition that migration is both symmetric and conservative implies that all demes have the same size: on one hand, $Mq^{-1} = M'q^{-1} = 0$, and on the other, $M\mathbf{1} = 0$ as $M$ is a Laplacian matrix; since the null space of the Laplacian of a connected graph is spanned by $\mathbf{1}$, it follows that $q \propto \mathbf{1}$. Thus the average deme size $N_0 = N_T/d$ is a convenient choice for the coalescent timescale.
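As a concrete illustration of the symmetric, conservative case, the following minimal sketch builds a migration matrix from an edge list and checks the two conditions used above: rows sum to zero, and $M'q^{-1} = 0$ when all demes have equal size. The helper name `migration_matrix`, the edge-list format and the example rates are assumptions made for illustration; they do not come from the thesis.

```python
import numpy as np

def migration_matrix(d, edges):
    """Symmetric migration matrix for d demes (hypothetical helper).

    edges: iterable of (alpha, beta, rate) with alpha != beta.
    The diagonal is set so that every row sums to zero.
    """
    M = np.zeros((d, d))
    for a, b, rate in edges:
        M[a, b] = rate
        M[b, a] = rate
    np.fill_diagonal(M, -M.sum(axis=1))
    return M

# Three demes on a line with unequal migration rates (illustrative values).
M = migration_matrix(3, [(0, 1, 1.0), (1, 2, 0.5)])
q_inv = np.ones(3)            # equal deme sizes: N_alpha / N_0 = 1

row_sums = M @ np.ones(3)     # M 1 = 0
conservative = M.T @ q_inv    # M' q^{-1} = 0, equation (2.14)
```

Both vectors are zero here because the matrix is symmetric and the demes are equally sized; with unequal sizes and symmetric rates the second check would generally fail, which is the point of the argument above.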

The stepping-stone model characterizes dispersal not in terms of an explicit dispersal density but indirectly through the combined effect of the graph topology and the migration rates. It may not seem natural to represent the geographic distribution of organisms with a graph. However, discrete models for migration are common in population genetics. In fact, a continuous model of isolation by distance (with normal dispersal and continuous spatial distribution) can lead to inconsistencies [Felsenstein, 1975].

2.3.1 Expected coalescence times in a subdivided population

In Section 2.1 we described how the probability that two individuals both carry the derived allele is related to the expected coalescence time to their most recent common ancestor. We will use this connection between genetic similarity and shared ancestry to analyze the spatial structure in genetic variation, and in particular, to estimate migration rates. The inference procedure requires that we express pairwise coalescence times as functions of migration rates.

The coalescent process can be extended to represent the ancestry of a sample from the stepping-stone model [Notohara, 1990, 1993]. This version, called the structured coalescent, describes the movement of lineages between demes as well as their coalescence into common ancestors. We can use the properties of the structured coalescent [as we do in Appendix 7.1] to derive the following system of linear equations for the pairwise expected coalescence times $T = (T_{\alpha\beta})$ as a function of the coalescence rates $q = (q_\alpha)$ and the migration rates $M = (m_{\alpha\beta})$:

$$\operatorname{diag}\{q\}\,\operatorname{diag}\{T\} - MT - TM' = \mathbf{1}\mathbf{1}'. \tag{2.15}$$

Furthermore, if the migration rates are symmetric, as we assume throughout, then there is no variation in coalescence rates across demes, i.e., $q = \mathbf{1}$, $M = M'$ and

$$\operatorname{diag}\{T\} - MT - TM = \mathbf{1}\mathbf{1}'. \tag{2.16}$$

In equation (2.15) $T_{\alpha\beta}$ is the expected coalescence time between two randomly chosen lineages, one from $\alpha$ and the other from $\beta$. In equation (2.5) $T_{ij}$ is the expected coalescence time between two sampled individuals $i \in \alpha_i$ and $j \in \alpha_j$. Crucially, the pairwise coalescence times do not depend on the sample configuration, $\alpha = (\alpha_1, \ldots, \alpha_n)$, because individuals are exchangeable within each deme. Therefore, the expected coalescence time for an observed pair $(i \in \alpha_i, j \in \alpha_j)$ is the same as the expected coalescence time for any pair $(i' \in \alpha_i, j' \in \alpha_j)$ from the subdivided population:

$$T_{ij} = T_{\alpha_i\alpha_j}. \tag{2.17}$$

[Margin note: Individuals are exchangeable within demes but not across demes because the sample location is informative about the alleles an individual carries.]

Notation: We use Greek letters [$\alpha$, $\beta$] to denote subpopulations and Latin letters [$i$, $j$] to denote sampled individuals. We also distinguish between the population matrix $T = (T_{\alpha\beta} : \text{demes } \alpha, \beta)$ and the sample matrix $T = (T_{ij} : \text{individuals } i, j)$, where the sample matrix is $T(\alpha) - \operatorname{diag}\{T(\alpha)\}$. The diagonal is subtracted because the coalescence time with self is always 0.

In any population graph, $T_{\alpha\beta} > T_{\alpha\alpha}$ because coalescence is possible only for lineages in the same deme. However, if $\alpha$ and $\beta$ are separated by a barrier, fewer migrants move between $\alpha$ and $\beta$, and so the pairwise coalescence time $T_{\alpha\beta}$ would be larger than the time expected under isolation by distance, i.e., uniform migration. Thus, the matrix of pairwise coalescence times $T = (T_{\alpha\beta})$ would contain evidence for habitat heterogeneity.

Since longer coalescence times mean less genetic similarity, coalescence times are a natural measure of genetic dissimilarity and hence population structure. For the stepping-stone model we can compute the matrix of expected coalescence times, $T$, given the migration rates $M$ and the coalescence rates $q$ using equation (2.15). Alternatively, there exists a computationally efficient method to approximate $T$, which we discuss next.
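Equation (2.15) is linear in the entries of $T$, so for a small graph it can be solved directly. The sketch below sets up the system with dense Kronecker products; this is only one convenient formulation and is not necessarily how the thesis computes $T$. The function name `expected_coalescence_times` and the example graphs are assumptions made for illustration.

```python
import numpy as np

def expected_coalescence_times(M, q=None):
    """Solve diag{q} diag{T} - M T - T M' = 11' for T (equation 2.15)."""
    M = np.asarray(M, dtype=float)
    d = M.shape[0]
    q = np.ones(d) if q is None else np.asarray(q, dtype=float)
    I = np.eye(d)
    # With row-major flattening t[a*d + b] = T[a, b]:
    #   vec(M T)  = kron(M, I) t   and   vec(T M') = kron(I, M) t
    A = -np.kron(M, I) - np.kron(I, M)
    # diag{q} diag{T} only involves the within-deme entries T[a, a].
    diag_idx = np.arange(d) * (d + 1)
    A[diag_idx, diag_idx] += q
    t = np.linalg.solve(A, np.ones(d * d))
    return t.reshape(d, d)

# Two-island model with migration rate 0.5 (illustrative value).
M_two = np.array([[-0.5, 0.5],
                  [0.5, -0.5]])
T_two = expected_coalescence_times(M_two)

# Three demes on a line, uniform migration rate 1 (illustrative value).
M_line = np.array([[-1.0, 1.0, 0.0],
                   [1.0, -2.0, 1.0],
                   [0.0, 1.0, -1.0]])
T_line = expected_coalescence_times(M_line)
```

In the two-island case the within-deme times equal $d = 2$ and the between-deme time is $2 + 1/(2m)$. On the line, the within-deme times differ between the end demes and the middle deme, but their average is still $d = 3$.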

2.4 Isolation by resistance is a metric for gene flow

Isolation by resistance (IBR) [McRae, 2006, McRae et al., 2008] draws an analogy between a subdivided population in which neighboring demes exchange migrants and an electrical network in which current flows through conductors. [Or in other words, between Kimura's stepping-stone model and an undirected random walk.] To understand the analogy better, concepts in electrical networks can be given a population genetic interpretation [Table 2.1]. Using this correspondence between population genetics and circuit theory, McRae develops IBR to test whether putative barriers to gene flow affect genetic differentiation.

Isolation by resistance predicts effective distances from a raster grid of landscape resistance (or friction): each cell in the grid specifies how difficult it is for an animal to migrate locally and these values are assigned based on expert knowledge about the species and the habitat. If the effective distances agree with the observed genetic dissimilarities, then the hypothesized grid explains the data well. Such a raster map could be hard to produce, especially at fine scales, and if the agreement is low, there is no method to facilitate improving the map of resistances. However, IBR does provide a useful and efficient approximation to expected coalescence times.

Table 2.1: Circuit theory concepts and their ecological interpretation, adapted from [McRae et al., 2008]. McRae specifies the edge conductances as $c_{\alpha\beta} = m_{\alpha\beta}/q_\alpha$. However, it is natural to define conductances only in terms of the migration process because lineages cannot coalesce until they meet.

Electrical term: conductance $c_{xy}$, for all $(x, y) \in E$.
Ecological interpretation: direct migration $m_{\alpha\beta}$: the number of migrants exchanged between two neighboring demes $\alpha$ and $\beta$ in a single generation. (On the coalescent timescale $m_{\alpha\beta} = N_0\dot{m}_{\alpha\beta}$ where $\dot{m}_{\alpha\beta}$ is the probability that a lineage in $\alpha$ has a parent from $\beta$.)

Electrical term: resistance $1/c_{xy}$.
Ecological interpretation: cost $1/m_{\alpha\beta}$: measure of local landscape friction in the direction from $\alpha$ to $\beta$. If migration is symmetric, $m_{\alpha\beta} = m_{\beta\alpha}$. (Since $N_\alpha = N_0$, the $m_{\alpha\beta}$ are comparable across the habitat.)

Electrical term: effective conductance $C_{xy}$, for all $(x, y) \in V \times V$.
Ecological interpretation: effective migration $M_{\alpha\beta}$: the number of migrants that would produce the same level of genetic differentiation between $\alpha$ and $\beta$ if these two demes made up a two-deme system.

Electrical term: effective resistance $R_{xy} = 1/C_{xy}$.
Ecological interpretation: distance metric $R_{xy}$: quantifies the genetic differentiation between a pair of demes $(\alpha, \beta)$ by taking into account the existence of multiple pathways between them.

To begin with, consider a stepping-stone model that has only two demes, $\alpha$ and $\beta$, with equal size and a single edge with migration rate $m_{\alpha\beta}$. In this population, also known as a two-island model,

$$m_{\alpha\beta} = \frac{(T_{\alpha\alpha} + T_{\beta\beta})/8}{T_{\alpha\beta} - (T_{\alpha\alpha} + T_{\beta\beta})/2}. \tag{2.18a}$$

[This follows from the system of linear equations (2.15).] The two-island model is a very special case and equation (2.18a) does not hold more generally. In fact, unless the population graph is fully connected, many pairs of demes might not exchange migrants directly and then $m_{\alpha\beta} = 0$. However, [McRae, 2006] extends the relevance of the relationship (2.18a) to the general stepping-stone model by introducing the concept of effective migration $M_{\alpha\beta}$ between $\alpha$ and $\beta$. It is given by

$$M_{\alpha\beta} \equiv \frac{(T_{\alpha\alpha} + T_{\beta\beta})/8}{T_{\alpha\beta} - (T_{\alpha\alpha} + T_{\beta\beta})/2}. \tag{2.19}$$

That is, the effective migration $M_{\alpha\beta}$ is the number of migrants that would produce the actual genetic differentiation between $\alpha$ and $\beta$ in a hypothetical two-island system. Since two lineages take the same time to reach their common ancestor, $T_{\alpha\beta} = T_{\beta\alpha}$ and the definition (2.19) implies that effective migration is always symmetric even though the underlying true migration patterns might not be symmetric.
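The definition (2.19) can be applied to any matrix of expected coalescence times. The sketch below evaluates it entrywise and checks it on a two-island model, where solving (2.16) by hand for $d = 2$ gives $T_{\alpha\alpha} = 2$ and $T_{\alpha\beta} = 2 + 1/(2m)$. The function name `effective_migration` and the value $m = 0.3$ are assumptions made for illustration.

```python
import numpy as np

def effective_migration(T):
    """Effective migration M_ab from coalescence times, equation (2.19).

    The diagonal is left as NaN because the denominator vanishes there.
    """
    T = np.asarray(T, dtype=float)
    within = np.diag(T)
    avg_within = (within[:, None] + within[None, :]) / 2   # (T_aa + T_bb)/2
    M_eff = np.full(T.shape, np.nan)
    off = ~np.eye(T.shape[0], dtype=bool)
    M_eff[off] = (avg_within[off] / 4) / (T[off] - avg_within[off])
    return M_eff

# Two-island model: closed-form coalescence times for migration rate m.
m = 0.3
T_two_island = np.array([[2.0, 2.0 + 1 / (2 * m)],
                         [2.0 + 1 / (2 * m), 2.0]])
M_eff = effective_migration(T_two_island)
```

For two demes the effective migration recovers the direct rate $m$, which is the sense in which (2.19) generalizes (2.18a).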

It is natural to relate the concept of effective migration in a subdivided population, $M_{\alpha\beta}$, and the concept of effective conductance in an electrical circuit, $C_{\alpha\beta}$. In circuit theory, $C_{\alpha\beta}$ is the conductance in a two-node, single-conductor network required to produce the same amount of current between $\alpha$ and $\beta$ as in the original network.

Proposition 2.2 Consider a population graph $(V, E)$ with symmetric migration rates $\{m_{\alpha\beta} : \forall(\alpha, \beta) \in E\}$. This corresponds to a circuit network $(V, E)$ with conductances $\{c_{\alpha\beta} = m_{\alpha\beta}\}$. For every pair $(\alpha, \beta) \in V \times V$, the effective conductance $C_{\alpha\beta}$ in the circuit is a measure of the effective migration $M_{\alpha\beta}$ in the population:

$$M_{\alpha\beta} \approx C_{\alpha\beta}. \tag{2.20}$$

And thus the resistance distance $R_{\alpha\beta} = 1/C_{\alpha\beta}$ is a measure of genetic differentiation.


Proof. The relationship between effective migration and effective conductance is exact only if migration is isotropic, i.e., demes are equivalent with respect to the size and pattern of movement. Here we assume only that migration is symmetric and conservative.

The migration process can be represented as a continuous-time discrete-space random walk on an undirected graph [Levin et al., 2008]. Then $M = (m_{\alpha\beta})$ is the transition kernel of the embedded jump chain, which determines the sequence of locations occupied by the lineage, and $m_\alpha = \sum_{\beta:\beta\neq\alpha} m_{\alpha\beta}$ are the rates of the holding distributions, which determine the waiting times before jumps. Let $m = (1/d)\sum_\alpha m_\alpha$ be the average holding rate.

Since migration is symmetric and conservative, the demes have the same size $N_0$, which is also a convenient choice for the coalescent timescale. Let $T_0$ be the average within-deme expected coalescence time. Then by Strobeck's theorem [see equation (7.8) in Appendix 7.1],

$$T_0 \equiv \sum_\alpha T_{\alpha\alpha}/d = d. \tag{2.21}$$

Thus $T_0$ does not depend on the migration process.

Furthermore, let $\tau_{\alpha\beta}$ be the expected time for two lineages, one from $\alpha$ and the other from $\beta$, to occupy the same deme. Then

$$(T_{\alpha\alpha} + T_{\beta\beta})/2 \approx T_0, \tag{2.22a}$$

$$T_{\alpha\beta} - (T_{\alpha\alpha} + T_{\beta\beta})/2 \approx \tau_{\alpha\beta}. \tag{2.22b}$$

These two approximations are exact if migration is isotropic: since the demes are equivalent with respect to the migration process, the within-deme coalescence times $T_{\alpha\alpha}$ must be equal by symmetry. Hence, $T_{\alpha\alpha} = T_0$, $T_{\alpha\beta} = \tau_{\alpha\beta} + T_0$ and once the lineages meet for the first time, we can restart the random walk with two lineages in the same deme chosen at random.

Under the coalescent process, two lineages (one from $\alpha$ and another from $\beta$) move simultaneously until they coalesce into a common ancestor. Suppose that they meet for the first time in deme $\gamma$. Together the paths $\alpha \to \gamma$ and $\beta \to \gamma$ have half the length of a commute between $\alpha$ and $\beta$ that passes through $\gamma$. Therefore, the expected time to first meet, $\tau_{\alpha\beta}$, can be related to the expected commute length, $K_{\alpha\beta}$, in the corresponding circuit network:

$$\tau_{\alpha\beta} \approx K_{\alpha\beta}/(4m), \tag{2.23}$$

where $K_{\alpha\beta}$ is the expected number of jumps in a random walk that starts at $\alpha$, visits $\beta$ and returns to $\alpha$, and $1/(2m)$ is the average waiting time before either lineage jumps. The relationship is approximate because the waiting time varies across vertices.

Finally, by [Chandra et al., 1996] for an undirected graph [whether isotropic or not],

$$K_{\alpha\beta} = c_G R_{\alpha\beta} = c_G/C_{\alpha\beta}, \tag{2.24}$$

where $R_{\alpha\beta}$ is the effective resistance between nodes $\alpha$ and $\beta$, $C_{\alpha\beta}$ is the effective conductance, and $c_G$ is the total conductance of the network given by

$$c_G = \sum_\alpha \sum_{\beta:\beta\neq\alpha} m_{\alpha\beta} = \sum_\alpha m_\alpha = dm. \tag{2.25}$$

Therefore,

$$M_{\alpha\beta} = \frac{(T_{\alpha\alpha} + T_{\beta\beta})/8}{T_{\alpha\beta} - (T_{\alpha\alpha} + T_{\beta\beta})/2} \approx \frac{T_0/4}{\tau_{\alpha\beta}} \approx \frac{d/4}{R_{\alpha\beta}(dm)/(4m)} = C_{\alpha\beta}. \tag{2.26}$$

∎


Essentially, McRae's approximation splits the between-deme coalescence time, $T_{\alpha\beta}$, into the time to first meet, $\tau_{\alpha\beta}$, and the average within-deme coalescence time, $T_0$. However, since the population graph is not necessarily symmetric, not every deme $\gamma$ is equally likely to be the deme where two lineages, starting from $\alpha$ and $\beta$, meet for the first time. Furthermore, the within-deme coalescence times are not necessarily equal. Therefore, the effective resistance metric reflects the migration process accurately but ignores the fact that the lineages do not necessarily coalesce on their first opportunity. On the other hand, the coalescence time metric correctly captures the effect of both processes because the structured coalescent models migration and coalescence by explicitly tracking both lineages until their common ancestor. Since higher rates imply faster mixing, we can conclude that the higher the migration rates are, the better McRae's approximation is. See Figures 2.2 and 2.3.
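The isotropic case of Proposition 2.2 can be checked numerically. The sketch below uses a ring of demes with a uniform migration rate, solves (2.16) for $T$, computes effective resistances from the graph Laplacian $L = -M$ with the standard identity $R_{\alpha\beta} = \Gamma_{\alpha\alpha} + \Gamma_{\beta\beta} - 2\Gamma_{\alpha\beta}$, $\Gamma = (L + \mathbf{1}\mathbf{1}'/d)^{-1}$, and compares effective migration with effective conductance. All function names, the ring size and the rate are assumptions made for illustration, and the dense solver is a convenience rather than the method used in the thesis.

```python
import numpy as np

def ring_migration_matrix(d, m):
    """Uniform symmetric migration on a ring of d >= 3 demes."""
    M = np.zeros((d, d))
    for a in range(d):
        b = (a + 1) % d
        M[a, b] = M[b, a] = m
    np.fill_diagonal(M, -M.sum(axis=1))
    return M

def expected_coalescence_times(M):
    """Solve diag{T} - M T - T M = 11' (equation 2.16) by dense linear algebra."""
    d = M.shape[0]
    I = np.eye(d)
    A = -np.kron(M, I) - np.kron(I, M)
    idx = np.arange(d) * (d + 1)
    A[idx, idx] += 1.0
    return np.linalg.solve(A, np.ones(d * d)).reshape(d, d)

def effective_resistance(M):
    """Resistance distances for conductances c_ab = m_ab."""
    d = M.shape[0]
    G = np.linalg.inv(-M + np.ones((d, d)) / d)
    g = np.diag(G)
    return g[:, None] + g[None, :] - 2 * G

d, m = 5, 0.7
M = ring_migration_matrix(d, m)
T = expected_coalescence_times(M)
R = effective_resistance(M)

off = ~np.eye(d, dtype=bool)
within = np.diag(T)
avg_within = (within[:, None] + within[None, :]) / 2
M_eff = (avg_within[off] / 4) / (T[off] - avg_within[off])   # equation (2.19)
C = 1.0 / R[off]                                             # effective conductance
```

On the ring every deme is equivalent, so the within-deme times all equal $d$ and `M_eff` agrees with `C` up to rounding; on a graph that is not isotropic the two would differ.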

2.4.1 Effective resistance approximates expected coalescence time

McRae's method approximates the ancestral process of two lineages evolving simultaneously in terms of one lineage evolving at twice the rate. However, one random walk cannot represent a coalescence event where two lineages merge into their most recent common ancestor. Thus, while effective resistance, $R_{\alpha\beta}$, provides a measure for the genetic differentiation between demes, it does not capture the genetic differentiation between individuals from the same deme [$R_{\alpha\alpha} = 0$ for every deme $\alpha$]. However, it follows directly from McRae's approximation that

$$T_{\alpha\beta} \approx \tau_{\alpha\beta} + T_0 \approx T_0(R_{\alpha\beta}/4 + 1), \tag{2.27}$$

or equivalently in matrix notation,

$$T \approx T_0(R/4 + \mathbf{1}\mathbf{1}'). \tag{2.28}$$

The main advantage of approximating coalescence times in terms of effective resistances is computational efficiency. To compute $T$, we solve a linear system of equations $Ab = x$ with $d(d + 1)/2$ unknowns that corresponds to eq. (2.16). In this problem $A$ is sparse (because the population graph $G$ is sparse) and positive definite, and so we can use an iterative preconditioned gradient method. There are several methods to compute $R$; we use a method that inverts the $d \times d$ matrix $M + \mathbf{1}\mathbf{1}'$ [Babić et al., 2002]. Since $A$ is of higher order than $M$, it is more efficient to compute $R$. Furthermore, $R$ gives a very good approximation to $T$ when migration rates are high and it is more appropriate than other distance metrics such as Euclidean distance and least-cost path. Therefore, effective resistance offers a compromise between accuracy of representation and efficiency of computation.
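To see how the approximation (2.28) behaves when migration is not isotropic, the sketch below compares the exact solution of (2.16) with $d(R/4 + \mathbf{1}\mathbf{1}')$ on three demes arranged in a line, for a low, a moderate and a high migration rate. The function names, the graph and the rates are assumptions made for illustration. The dense solver and the resistance formula based on $(L + \mathbf{1}\mathbf{1}'/d)^{-1}$ with $L = -M$ are convenient choices, not necessarily the ones used in the thesis.

```python
import numpy as np

def line_migration_matrix(d, m):
    """Uniform symmetric migration between neighbors on a line of d demes."""
    M = np.zeros((d, d))
    for a in range(d - 1):
        M[a, a + 1] = M[a + 1, a] = m
    np.fill_diagonal(M, -M.sum(axis=1))
    return M

def expected_coalescence_times(M):
    """Exact T from diag{T} - M T - T M = 11' (equation 2.16)."""
    d = M.shape[0]
    I = np.eye(d)
    A = -np.kron(M, I) - np.kron(I, M)
    idx = np.arange(d) * (d + 1)
    A[idx, idx] += 1.0
    return np.linalg.solve(A, np.ones(d * d)).reshape(d, d)

def ibr_approximation(M):
    """T0 (R/4 + 11') with T0 = d, equations (2.21) and (2.28)."""
    d = M.shape[0]
    G = np.linalg.inv(-M + np.ones((d, d)) / d)
    g = np.diag(G)
    R = g[:, None] + g[None, :] - 2 * G
    return d * (R / 4 + 1)

errors = {}
for m in [0.01, 1.0, 100.0]:
    M = line_migration_matrix(3, m)
    T = expected_coalescence_times(M)
    T_approx = ibr_approximation(M)
    # Largest relative error over all deme pairs, including within-deme entries.
    errors[m] = np.max(np.abs(T_approx - T) / T)
```

In this small example the largest relative error shrinks as the migration rate grows, in line with the statement that the approximation improves under high migration.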

In this chapter we introduced two important components of our method for analyzing spatial population structure: the stepping-stone model and the effective resistance metric. In the next chapters we describe how we can estimate and visualize effective rates of migration from geographically referenced genetic data.

[Figure 2.2: nine scatterplot panels, one row per migration pattern (isotropic on a circle, uniform on a grid, barrier on a grid) and one column per migration rate ($m = 0.01$, $1$, $100$). x-axis: $T_{ab} - (T_{aa} + T_{bb})/2$; y-axis: $R_{ab}(d/4)$.]

Figure 2.2: On the $x$-axis, $T_{\alpha\beta} - (T_{\alpha\alpha} + T_{\beta\beta})/2$ is the expected time to reach the same deme; on the $y$-axis, $R_{\alpha\beta}(d/4)$ is the (appropriately scaled) effective resistance. As the migration rate increases, $R_{\alpha\beta}$ becomes a better approximation of the expected time to first meet, $\tau_{\alpha\beta}$, even if migration is not isotropic. [Results for a 5×4 regular triangular grid with uniform migration rate $m = 0.01$, $1$ or $100$.]

[Figure 2.3: nine scatterplot panels, one row per migration pattern (isotropic on a circle, uniform on a grid, barrier on a grid) and one column per migration rate ($m = 0.01$, $1$, $100$). x-axis: $T_{ab}$; y-axis: $d(R_{ab}/4 + 1)$.]

Figure 2.3: On the $x$-axis, $T_{\alpha\beta}$ is the expected time to coalescence; on the $y$-axis, $d(R_{\alpha\beta}/4 + 1)$ is the IBR approximation. The approximation to the within-deme coalescence times, $T_{\alpha\alpha}$, is always $T_0 = d$; these are the points closest to the origin at $T_0 = 20$ in a 5 × 4 grid. Although the pattern does not change as the migration rate increases, the relative error $\Delta T_{\alpha\beta}/T_{\alpha\beta}$ decreases.

3 Genetic Dissimilarities and Distance Matrices

Habitat heterogeneity can shape genetic variation by reducing or increasing gene flow. The stepping-stone model is a natural representation of a spatially distributed population and the effects of gene flow on its genetic structure. In this thesis a population is a graph $G = (V, E, M)$ composed of vertices $V$ [randomly mating demes of equal size], edges $E$ [symmetric routes of migration between neighboring demes] and a weight function $M : E \to \mathbb{R}_+$ that specifies the rates at which migrants are exchanged.

Throughout, we will assume that the population graph $G$ is embedded in a two-dimensional habitat, with the vertex set $V$ and the edge set $E$ both fixed. In practice, this graph is not known and does not necessarily exist. For example, it might not be possible to split the population into distinct groups that satisfy the random mating assumption. Instead, we cover the habitat with a regular triangular grid in which vertices do not represent actual colonies. This simplification indicates that we should interpret the migration parameters carefully, as effective rather than actual rates of migration.

Thus the topology of the graph is determined by the shape of the habitat [and the somewhat arbitrary choice that the graph is triangular and regularly spaced] and not the sample configuration or the sample "clusteredness". And so we construct the graph differently from methods that aim to subdivide the population into clusters that are similar within and dissimilar between. However, if we make the grid $(V, E)$ sufficiently fine, we can reasonably assume that each vertex represents a randomly mating group without further structure. In this case, individuals would be similar within demes but not necessarily dissimilar between demes.

In a habitat with uniform migration, the genetic differentiation between individuals from the same species is positively correlated with the distance between their origins; in a heterogeneous habitat, landscape features such as barriers or corridors create spatial structure in genetic variation. For example, individuals separated by a barrier are less closely related, and therefore less genetically similar, than if the barrier were absent. The stepping-stone model can represent such effects because some edges in the population graph can have high migration rates and others low. In this thesis we develop a Bayesian procedure to estimate the effective migration rates in a fixed grid $(V, E)$ of equally sized demes, from geographically indexed genetic data. The function $M$ measures the relative rate at which two connected demes exchange migrants; we call $M$ a migration surface.

To analyze population structure, we will assume that all genotyped sites develop under the same evolutionary process which determines the expected structure in genetic correlations (or equivalently, genetic distances). In contrast, many methods for association testing assume that individuals are independent while sites are correlated. (In population genetics, the systematic association between loci is called linkage disequilibrium.) The problem at hand determines which assumption is appropriate to make. To find associations between disease status and genetic makeup, it is reasonable to assume that the disease develops under the same mechanism in all sampled cases but not all sites contribute to the disease and not with equal effect. To analyze population structure, it is reasonable to assume that the same evolutionary process underlies all genotyped sites but not all sampled individuals are genetically similar to an equal degree.

[Margin note: The PCA decomposition of the observed covariance matrix $XX'$ can be used to correct for population stratification [Price et al., 2006]: incorporate the leading eigenvectors in a regression analysis that tests for association between sites and disease.]

In this chapter, let $Z = (z_i : i = 1, \ldots, n)$ be a vector of $n$ genotypes at a single polymorphic site. We will consider multiple sites in the next chapter. Also let $\alpha = (\alpha_1, \ldots, \alpha_n)$ denote the sample configuration, in which $\alpha_i$ is the sampling location of the $i$th haplotype.

3.1 Mean and covariance of genotype vectors: SNPs

First we consider the simplest case: a haploid population from which we have a sample of $n$ individuals genotyped at a single nucleotide polymorphism (SNP). Following [McVean, 2009], we make the following assumptions:

A1. SNPs are identically distributed: Since all sites evolve under the same demographic model, the observed genotype $z_i$ at any SNP is a realization of the same random variable $Z_i$.

A2. SNPs segregate in the sample: Since exactly one mutation occurs in every sampled genealogy, we observe both the ancestral allele '0' and the derived allele '1' at every site.

A3. The scaled mutation rate $\theta$ is low: Since A2 and A3 together imply $\theta$ is a nuisance parameter [Nielsen, 2000], we can take the limit $\theta = 2N_0u \to 0$ and thus ignore small differences in mutation rate across SNPs.

Under these assumptions, the probability that individuals $i$ and $j$ share the derived mutation at a randomly chosen segregating site is given by

$$\mathrm{E}^*\{Z_iZ_j\} = \frac{T_{mrca} - T_{ij}}{T_{tot}}, \tag{3.1}$$

where $T_{mrca}$ and $T_{tot}$ are the height and the size of the expected genealogy of the sample, and $T_{ij}$ is the expected time for $i$ and $j$ to coalesce in a sample of size 2 [McVean, 2009]. The symbol $*$ indicates the condition that both 0s and 1s are observed, i.e., the expectation on the left in equation (3.1) is with respect to all possible genealogies (observed or not) with exactly one mutation. The expectations on the right in equation (3.1) are unconditional. The relevance is that for Kimura's stepping-stone model there is an explicit formula for pairwise coalescence times, $T_{ij}$, and a good approximation in terms of effective resistances, $R_{ij}$.

Furthermore, since the $Z_i$s are binary random variables and the time to coalescence with self is always 0,

$$\mathrm{E}^*\{Z_i\} = \mathrm{E}^*\{Z_i^2\} = \frac{T_{mrca}}{T_{tot}}. \tag{3.2}$$

Therefore, the expected genealogy fully specifies the first two moments of the allele count vector $Z = (Z_i)$ at a particular segregating SNP. In matrix notation,

$$\mathrm{E}^*\{Z\} = \frac{T_{mrca}}{T_{tot}}\mathbf{1} \equiv \mu\mathbf{1}, \tag{3.3a}$$

$$\operatorname{var}^*\{Z\} = \frac{T_{mrca}}{T_{tot}}\left(1 - \frac{T_{mrca}}{T_{tot}}\right)\mathbf{1}\mathbf{1}' - \frac{1}{T_{tot}}T \equiv \sigma^2(\mathbf{1}\mathbf{1}' - \lambda T). \tag{3.3b}$$

For a sample with configuration $\alpha$ from a population with model $G$, the parameters are given by

$$\mu = \frac{T_{mrca}}{T_{tot}}, \quad \sigma^2 = \frac{T_{mrca}}{T_{tot}}\left(1 - \frac{T_{mrca}}{T_{tot}}\right), \quad \lambda\sigma^2 = \frac{1}{T_{tot}}, \tag{3.4}$$

where $T = (T_{ij})$ is the matrix of expected pairwise coalescence times between sampled individuals. That is, $T_{ij}$ is the expected time to coalescence between $i \in \alpha_i$ and $j \in \alpha_j$ in a sample of size 2, regardless of the composition of the entire sample $\alpha$. Since $T_{ij}$ does not depend on the sample configuration or even the sample size $n$, it is completely determined by the population model $G$. However,

• The expected height and size of the sample genealogy, $T_{mrca}$ and $T_{tot}$, depend on both the population model $G$ and the sample configuration $\alpha$. In particular, they are strongly influenced by uneven sampling. Therefore, $T_{mrca}/T_{tot}$ and $1/T_{tot}$ are nuisance parameters because it would be very hard to decouple the effects of population structure from the effects of uneven sampling. The confounding of population and sample-specific information also makes it difficult to interpret PCA projections in terms of a (historic) demographic process [Novembre et al., 2008, McVean, 2009].

• The sample matrix $T = (T_{ij} : \text{individuals } i, j)$ describes the expected genetic differentiation in the sample and has a block structure which depends on how many individuals, if any, we observe from each deme. On the other hand, the population matrix $T = (T_{\alpha\beta} : \text{demes } \alpha, \beta)$ specifies how genetic variation increases with geographic distance for all pairs of demes, whether they are sampled from or not. Thus the population matrix is a dissimilarity matrix that characterizes the entire population. Although the sample matrix is a function of the sample configuration, it depends on $\alpha$ in a straightforward way:

$$T = JTJ' - \operatorname{diag}\{JTJ'\}, \tag{3.5}$$

where $T$ on the left is the sample matrix, $T$ on the right is the population matrix, and $J \equiv J(\alpha) = (J_{i\alpha}) \in \mathbb{Z}^{n \times d}$ is an indicator matrix such that $J_{i\alpha} = 1$ if $i \in \alpha$ and 0 otherwise. We remove the diagonal because the coalescence time with self is always 0.
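Equation (3.5) is a simple indexing operation, which the following minimal sketch makes explicit. The function name `sample_coalescence_matrix`, the deme-level matrix and the sample configuration are assumptions made for illustration.

```python
import numpy as np

def sample_coalescence_matrix(T_demes, alpha):
    """Sample matrix J T J' - diag{J T J'} from equation (3.5).

    T_demes: d x d matrix of expected coalescence times between demes.
    alpha:   length-n sequence, alpha[i] is the deme index of individual i.
    """
    T_demes = np.asarray(T_demes, dtype=float)
    alpha = np.asarray(alpha)
    n, d = len(alpha), T_demes.shape[0]
    J = np.zeros((n, d), dtype=int)
    J[np.arange(n), alpha] = 1            # J[i, a] = 1 if individual i is in deme a
    full = J @ T_demes @ J.T
    return full - np.diag(np.diag(full))  # coalescence time with self is 0

# Two demes; two individuals sampled from deme 0 and one from deme 1.
T_demes = np.array([[2.0, 3.0],
                    [3.0, 2.0]])
T_sample = sample_coalescence_matrix(T_demes, [0, 0, 1])
```

Two distinct individuals from the same deme inherit the within-deme time, so the off-diagonal block for deme 0 is nonzero even though the diagonal is zero.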

The demographic model $G$, which describes the population, determines the coalescent process and hence the expected pairwise coalescence times $T_{\alpha\beta}$ for all deme pairs $(\alpha, \beta)$. On the other hand, both the model $G$ and the configuration $\alpha$ determine the genealogical statistics $T_{mrca}$ and $T_{tot}$, which are generally not of interest as the goal is to estimate population-level features of $G$ (such as the migration rates between pairs of connected demes) while accounting for the sample-specific features of $\alpha$. In this thesis $G = (V, E, M)$ is always a population graph with equally sized demes $V$, undirected edges $E$ and effective migration rates $M : E \to \mathbb{R}_+$.

We have shown that the expected mean and variance of a genotype vector are computable functions of the effective migration rates $M$. Next we derive similar expressions for the mean and the variance as functions of expected coalescence times in the case of diploid SNPs and microsatellites.
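For reference, the entrywise form of the covariance in (3.3b) follows from (3.1) and (3.2) in a few lines. The worked step below only rearranges quantities already defined in (3.4) and is included as a reading aid.

```latex
\begin{align*}
\operatorname{cov}^*\{Z_i, Z_j\}
  &= \mathrm{E}^*\{Z_i Z_j\} - \mathrm{E}^*\{Z_i\}\,\mathrm{E}^*\{Z_j\} \\
  &= \frac{T_{mrca} - T_{ij}}{T_{tot}} - \left(\frac{T_{mrca}}{T_{tot}}\right)^2 \\
  &= \mu(1 - \mu) - \frac{T_{ij}}{T_{tot}}
   = \sigma^2 (1 - \lambda T_{ij}).
\end{align*}
```

For $i = j$ the sample matrix has $T_{ii} = 0$, so the same expression reduces to $\operatorname{var}^*\{Z_i\} = \sigma^2$.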

3.1.1 The case of diploid data

Since a diploid individual is the offspring of a pair of diploid parents, we can represent the genotype of a diploid as the sum of two haploids, each drawn randomly from the same location, i.e., $X_i = Z_i^{(1)} + Z_i^{(2)} \in \{0, 1, 2\}$ where the superscript indicates one of two haplotypes. However, since we do not distinguish between the haplotype inherited from the mother and the haplotype inherited from the father, this assumption is reasonable only for autosomal SNPs (and not for sex-linked ones) in outbred individuals.

A sample $X_1, \ldots, X_n$ of $n$ diploid individuals is polymorphic if

$$\{X_1, \ldots, X_n : \text{at least one } X_i \geq 1\} \iff \{Z_1^{(1)}, Z_1^{(2)}, \ldots, Z_n^{(1)}, Z_n^{(2)} : Z_i^{(1)} = 1 \text{ or } Z_i^{(2)} = 1\}. \tag{3.6}$$

That is, a segregating SNP in a diploid sample of size $n$ is equivalent to exactly one mutation in a haploid sample of size $2n$. [This excludes the possibility that all individuals carry the same allele, either ancestral or derived.]

Furthermore, at a segregating site in a diploid sample, the copies $Z_i^{(1)}$ and $Z_i^{(2)}$, which constitute $X_i$, are not independent: the event that one carries the mutation but not the other is informative for the time to their most recent common ancestor. Therefore,

$$E^*\{X_i\} = E^*\{Z_i^{(1)}\} + E^*\{Z_i^{(2)}\} = 2E^*\{Z_i\} = 2\mu, \tag{3.7a}$$

$$\operatorname{var}^*\{X_i\} = 2\operatorname{var}^*\{Z_i\} + 2\operatorname{cov}^*\{Z_i^{(1)}, Z_i^{(2)}\} = 4\sigma^2 - 2\lambda\sigma^2 T_{ii}, \tag{3.7b}$$

$$\operatorname{cov}^*\{X_i, X_j\} = 4\operatorname{cov}^*\{Z_i, Z_j\} = 4\sigma^2 - 4\lambda\sigma^2 T_{ij}, \tag{3.7c}$$

where the symbol $*$ indicates the condition that there is exactly one mutation in a sample of $2n$ haplotypes [and $T_{ii}$ is the expected coalescence time for two distinct lineages with the same origin as individual $i$]. In matrix notation,

$$E^*\{X\} = 2\mu 1, \qquad \operatorname{var}^*\{X\} = 4\sigma^2(11' - \lambda T_2), \tag{3.8}$$

where

$$T_2 = J T J' - \tfrac{1}{2}\operatorname{diag}\{J T J'\}. \tag{3.9}$$

The subscript 2 indicates that the matrix of pairwise coalescence times corresponds to a diploid population. Here the mean does not depend on the location. (This is the case for haploid data as well.) However, the variance $\operatorname{var}^*\{X_i\}$ can vary with location unless the demographic model implies $T_{\alpha\alpha} = T_0$ for all demes $\alpha$, i.e., isotropic migration.
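As a check that the matrix form (3.8) and (3.9) agrees with the element-wise expressions (3.7b) and (3.7c), the following sketch builds $T_2$ and the implied covariance matrix. The function names and all numerical values are hypothetical.

```python
import numpy as np

def diploid_coalescence_matrix(T_demes, alpha):
    """T_2 = J T J' - (1/2) diag{J T J'} as in eq. (3.9); alpha holds 0-based deme labels."""
    T_demes = np.asarray(T_demes, dtype=float)
    alpha = np.asarray(alpha)
    J = np.zeros((alpha.size, T_demes.shape[0]))
    J[np.arange(alpha.size), alpha] = 1.0
    JTJ = J @ T_demes @ J.T
    return JTJ - 0.5 * np.diag(np.diag(JTJ))

def diploid_covariance(T_demes, alpha, sigma2, lam):
    """var*{X} = 4 sigma^2 (11' - lambda T_2) as in eq. (3.8)."""
    T2 = diploid_coalescence_matrix(T_demes, alpha)
    n = T2.shape[0]
    return 4.0 * sigma2 * (np.ones((n, n)) - lam * T2)

# Hypothetical deme-level times and two individuals from demes 0 and 1.
T_demes = np.array([[1.0, 2.0], [2.0, 1.0]])
V = diploid_covariance(T_demes, alpha=[0, 1], sigma2=0.25, lam=0.1)
print(V)
```

The diagonal keeps half of the within-deme time, which is what produces the factor 2 (rather than 4) in front of $\lambda\sigma^2 T_{ii}$ in (3.7b).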

3.2 Mean and covariance of genotype vectors: microsatellites

Microsatellites (also called short tandem repeats) are repeating sequences of a particular short DNA segment. Mutation can increase or decrease the number of repeats $k$, and each $k$ corresponds to an allele.

To model microsatellites, we assume that a locus $s$ evolves from its ancestral allele $A_s$ according to a symmetric stepwise mechanism where mutations occur with rate $\theta_s$ and each mutation increases or decreases the number of repeats by exactly one, with equal probability. Here we consider the evolution at a particular site, and for simplicity of notation, we omit the subscript $s$ in the rest of this section.

The ancestral allele $A$ and the mutation rate $\theta$ are unknown site-specific parameters while the genealogy $\mathcal{T}$ has a distribution determined by Kingman's coalescent. As we did for SNPs, we assume that the microsatellites are neutral and hence their genealogies are identically distributed. On the other hand, microsatellites are usually highly variable markers (i.e., with high mutation rates), so we cannot take the low-mutation limit.

Conditional on the mutation rate $\theta$ and the genealogical tree $\mathcal{T}$ of the sample, mutations occur independently and the number of mutations on a branch with length $t$ is a Poisson random variable with mean $\theta t$. This follows from the assumption that mutations are generated by a Poisson process with intensity [mutation rate] $\theta$. For example, the total number of mutations is

$$K_{tot} \mid \theta, \mathcal{T} \sim \operatorname{Po}(\theta t_{tot}), \tag{3.10}$$

while the number of mutations carried by individual $i$ is

$$K_i \mid \theta, \mathcal{T} \sim \operatorname{Po}(\theta t_{mrca}). \tag{3.11}$$

All lineages share the same Poisson mean parameter because the path from any lineage to the most recent common ancestor of the entire sample has length $t_{mrca}$.

Let $\mathcal{K}$ denote the set of all mutations that occur in the genealogy, with $|\mathcal{K}| = K_{tot}$. Also, let $\mathcal{K}_i \subset \mathcal{K}$ denote the set of mutations carried by individual $i$, with $|\mathcal{K}_i| = K_i$. Since each mutation is equally likely to decrease or increase the allele length by 1, the $i$th allele is

$$Z_i = A + \sum_{k \in \mathcal{K}_i} S_k, \tag{3.12}$$

where $S_k = \pm 1$ with probability 1/2 and thus $E\{S_k\} = 0$ and $\operatorname{var}\{S_k\} = E\{S_k^2\} = 1$.

First we derive the mean and variance of allele $Z_i$ given the mutation rate, the ancestral allele and the genealogy. The binary variables $S_k$ are independent of the sample history, so $E\{S_k \mid \theta, A, \mathcal{T}\} = E\{S_k\}$ and $\operatorname{var}\{S_k \mid \theta, A, \mathcal{T}\} = \operatorname{var}\{S_k\}$. Furthermore, conditional on the number of mutations, the $S_k$ are mutually independent. Therefore,

$$E\{Z_i \mid \theta, A, \mathcal{T}\} = A + E\Big\{E\Big\{\sum_{k \in \mathcal{K}_i} S_k \,\Big|\, K_i\Big\}\Big\} = A + E\Big\{\sum_{k=1}^{K_i} E\{S_k\}\Big\} = A, \tag{3.13a}$$

$$\operatorname{var}\{Z_i \mid \theta, A, \mathcal{T}\} = E\Big\{\sum_{k=1}^{K_i} E\{S_k^2\}\Big\} + E\Big\{\sum_{k \neq k'} E\{S_k S_{k'}\}\Big\} = E\{K_i\} = \theta t_{mrca}, \tag{3.13b}$$

because $K_i$ is a Poisson random variable with mean $\theta t_{mrca}$ by equation (3.11). [Since the mutations are independent, $E\{S_k S_{k'}\} = E\{S_k\}E\{S_{k'}\} = 0$ for $k \neq k'$.]

Let $\mathcal{K}_{i \oplus j}$ be the set of mutations that occur in one lineage but not the other, with $|\mathcal{K}_{i \oplus j}| = K_{i \oplus j}$. Such mutations occur on the branch from $i$ to $mrca(i, j)$ or on the branch from $j$ to $mrca(i, j)$. Therefore, $K_{i \oplus j}$ has mean $2\theta t_{ij}$. Similarly, let $\mathcal{K}_{i \setminus j}$ be the set of mutations carried by $i$ but not $j$. Then

$$E\{(Z_i - Z_j)^2 \mid \theta, A, \mathcal{T}\} = E\Big\{\Big(\sum_{k \in \mathcal{K}_{i \setminus j}} S_k - \sum_{k \in \mathcal{K}_{j \setminus i}} S_k\Big)^2\Big\} = E\Big\{\sum_{k=1}^{K_{i \oplus j}} E\{S_k^2\}\Big\} = E\{K_{i \oplus j}\} = 2\theta t_{ij}, \tag{3.14a}$$

$$\operatorname{cov}\{Z_i, Z_j \mid \theta, A, \mathcal{T}\} = \operatorname{var}\{Z_i \mid \theta, A, \mathcal{T}\} - \tfrac{1}{2}E\{(Z_i - Z_j)^2 \mid \theta, A, \mathcal{T}\} = \theta t_{mrca} - \theta t_{ij}. \tag{3.14b}$$

[Again, the cross terms are 0 by mutual independence.]
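The conditional moments (3.13) and (3.14) can be checked by simulation. The sketch below fixes a genealogy for one pair of lineages: a shared path of length $t_{mrca} - t_{ij}$ above their common ancestor and two private branches of length $t_{ij}$. The function name, the seed and the parameter values are hypothetical.

```python
import numpy as np

def simulate_pair(theta, A, t_mrca, t_ij, reps, rng):
    """Simulate alleles (Z_i, Z_j) under the symmetric stepwise model on a fixed genealogy."""
    # Poisson numbers of mutations on the shared path and on the two private branches.
    shared = rng.poisson(theta * (t_mrca - t_ij), size=reps)
    private_i = rng.poisson(theta * t_ij, size=reps)
    private_j = rng.poisson(theta * t_ij, size=reps)

    def net_steps(k):
        # Sum of k independent +/-1 steps: (#up) - (#down) = 2 * Binomial(k, 1/2) - k.
        return 2 * rng.binomial(k, 0.5) - k

    common = net_steps(shared)
    Zi = A + common + net_steps(private_i)
    Zj = A + common + net_steps(private_j)
    return Zi, Zj

rng = np.random.default_rng(0)
Zi, Zj = simulate_pair(theta=2.0, A=10, t_mrca=3.0, t_ij=1.0, reps=200_000, rng=rng)
print("mean of Z_i          :", Zi.mean())                   # theory: A
print("variance of Z_i      :", Zi.var())                    # theory: theta * t_mrca
print("mean squared diff    :", np.mean((Zi - Zj) ** 2))     # theory: 2 * theta * t_ij
print("covariance           :", np.cov(Zi, Zj)[0, 1])        # theory: theta * (t_mrca - t_ij)
```

Only mutations on the shared path contribute to the covariance, which is why it shrinks linearly as the pairwise coalescence time $t_{ij}$ grows.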

Now we have expressions for the mean, variance and covariance of the genotypes at a particular microsatellite, given the site-specific mutation rate $\theta$, ancestral allele $A$ and genealogy $\mathcal{T}$. We treat $\theta$ and $A$ as nuisance parameters to be estimated and we marginalize the genealogy out. The goal is to express the model in terms of the expected coalescence times rather than the coalescence times at a particular site. We took the same approach for SNP data but in the former case, $A = 0$ for every segregating site and $\theta$ is eliminated in the small mutation limit $\theta \to 0$. Finally,

$$E\{Z_i \mid \theta, A\} = E\{A \mid \theta, A\} = A, \tag{3.15a}$$

$$\operatorname{var}\{Z_i \mid \theta, A\} = E\{\theta t_{mrca} \mid \theta, A\} + \operatorname{var}\{A \mid \theta, A\} = \theta T_{mrca}, \tag{3.15b}$$

$$\operatorname{cov}\{Z_i, Z_j \mid \theta, A\} = E\{\theta t_{mrca} - \theta t_{ij} \mid \theta, A\} + \operatorname{var}\{A \mid \theta, A\} = \theta(T_{mrca} - T_{ij}). \tag{3.15c}$$

[These steps use $E\{X\} = E\{E\{X \mid Y\}\}$, $\operatorname{var}\{X\} = E\{\operatorname{var}\{X \mid Y\}\} + \operatorname{var}\{E\{X \mid Y\}\}$ and $\operatorname{cov}\{X, Z\} = E\{\operatorname{cov}\{X, Z \mid Y\}\} + \operatorname{cov}\{E\{X \mid Y\}, E\{Z \mid Y\}\}$.]


In the case of microsatellites, we do not condition on observing variability in the sample, i.e., on the event $\{K_{tot} > 0\}$, as microsatellites have higher mutation rates and we can estimate the parameter rather than take its limit to 0. For SNPs such that we observe exactly one mutation at every site, the "variability" condition is explicitly modeled because it modifies the genealogy distribution. Intuitively, it "stretches" the tree and thus changes (proportionally) all branches $t \in \mathcal{T}$.

Therefore, the genotype vector of $n$ sampled individuals at a particular microsatellite has mean and variance

$$E^*\{Z\} = \mu 1, \qquad \operatorname{var}^*\{Z\} = \sigma^2(11' - \lambda T), \tag{3.16}$$

where the symbol $*$ indicates conditioning on the ancestral allele $A$ and the mutation rate $\theta$, and the parameters are given by

$$\mu = A, \qquad \sigma^2 = \theta T_{mrca}, \qquad \lambda = \frac{1}{T_{mrca}}. \tag{3.17}$$

As for SNP data, the mean and the variance of genotypes at a particular locus do not depend on the origin of an individual. However, for microsatellite data, the mean and the variance vary across sites because the ancestral allele $A$ and the mutation rate $\theta$ are both site-specific parameters. On the other hand, the scale $\lambda$ is shared across sites and therefore every site has the same correlation matrix $\Sigma \equiv 11' - \lambda T$.

With this parametrization, the demographic parameters are estimable up to a proportionality constant. If we multiply the migration and coalescence rates by 2, we speed up the structured coalescent process by a factor of 2, and hence we halve the expected coalescence times. However, the correlation matrix $\Sigma$ remains unchanged because the scale $\lambda$ doubles, so the product $\lambda T$ is the same.

3.3 Effective migration can explain spatial structure in genetic variation

In the previous section, we discussed how to specify the mapping from the stepping-stone model $G = (V, E, M)$ to the genetic correlation matrix $\operatorname{cor}\{Z\} = \Sigma$, for both SNP and microsatellite data. Briefly, we followed three steps. First, $G = (V, E, M)$ determines the deme-level matrix $T = (T_{\alpha\beta})$ through the system of linear equations (2.15). Then, in turn, the expected coalescence times between demes determine the expected coalescence times between sampled individuals, $T = (T_{ij})$, through equation (3.5). Finally, the sample distance matrix determines the correlation matrix $\Sigma = 11' - \lambda T$ by equation (3.3b), where $\lambda$ is an appropriately chosen scalar parameter that guarantees $\Sigma$ is positive definite.

Our goal is to estimate the effective migration rates $M$ across the habitat; these are sample-independent (population-level) features of the population graph $G$. The mean $\mu$ and the variance $\sigma^2$ of derived alleles as well as the scale factor $\lambda$ of expected coalescence times can be treated as nuisance parameters because they are sample-dependent and shared by all individuals in the sample. For example, for haploid SNPs the overall mean is $\mu = T_{mrca}/T_{tot}$ [with $\sigma^2 = \mu(1 - \mu)$] and the scale factor is $\lambda = 1/T_{tot}$, so $(\mu, \sigma^2, \lambda)$ contain some information about $G$. Although the scalars $T_{tot}$ and $T_{mrca}$ are, formally, functions of the effective migration rates $M$, they are very difficult to compute.

On the other hand, the matrix $T = (T_{ij})$ of pairwise coalescence times is a computable function of $M$. This matrix is also a pairwise dissimilarity (distance) matrix [and formally, a semivariogram]: the more genetically dissimilar two individuals are, the longer the time to their most recent common ancestor because the probability that the branch $T_{ij}$ accumulates a mutation is proportional to its relative length in the average genealogy tree. The property that $T$ is a distance matrix is important because it can explain genetic dissimilarities (correlations) as a linear function of distances between locations. Expected coalescence time is a particular choice of distance metric motivated by coalescent theory [McVean, 2009]. We can consider other metrics such as effective resistance [McRae, 2006].

$$W \equiv \left\{ \begin{matrix} M = \{\text{migration rates } m(e)\} \\ C = \{\text{conductances } c(e)\} \end{matrix} \;:\; \forall e \in E \right\}$$

[$W \in \mathbb{S}^d$ is a symmetric matrix of weights.]

$$\overset{(1)}{\longrightarrow} \; \Delta \equiv \left\{ \begin{matrix} T = \{\text{coalescence times } T_{\alpha\beta}\} \\ R = \{\text{effective resistances } R_{\alpha\beta}\} \end{matrix} \;:\; \forall (\alpha, \beta) \in V \times V \right\}$$

[$\Delta \in \mathbb{D}^d$ is the population distance matrix.]

$$\overset{(2)}{\longrightarrow} \; \Delta \equiv \left\{ \begin{matrix} T = \{\text{coalescence times } T_{ij}\} \\ R = \{\text{effective resistances } R_{ij}\} \end{matrix} \;:\; \forall (i, j) \in \alpha \right\}$$

[$\Delta \in \mathbb{D}^n$ is the sample distance matrix.]

$$\overset{(3)}{\longrightarrow} \; \Sigma \equiv 11' - \lambda\Delta$$

[$\Sigma \in \mathbb{V}^n$ is the sample covariance matrix.]

The first step, denoted by $\overset{(1)}{\longrightarrow}$, is to compute all $d(d + 1)/2$ pairwise distances between $d$ demes. This operation is expensive even for medium-size grids. However, the covariance matrix $\Sigma$ is a function of the sample distance matrix $\Delta$, not the population distance matrix $\Delta$. That is, in principle, we could avoid computing the full $d \times d$ dissimilarity matrix, especially for sparsely sampled habitats. [This is the advantage of $R$ over $T$.]
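The three steps can be sketched for the effective-resistance metric. The sketch uses the standard identity that expresses effective resistance through the pseudo-inverse of the graph Laplacian; this is one textbook way to obtain $R$, not necessarily the computation used in this thesis, and it forms the full $d \times d$ matrix rather than exploiting sparse sampling. Function names, the toy graph and the value of $\lambda$ are hypothetical.

```python
import numpy as np

def effective_resistance(n_demes, edges):
    """Step (1): deme-level effective resistances from symmetric edge rates.

    edges : iterable of (alpha, beta, rate); each rate is used as a conductance.
    """
    W = np.zeros((n_demes, n_demes))
    for a, b, rate in edges:
        W[a, b] = W[b, a] = rate
    laplacian = np.diag(W.sum(axis=1)) - W
    lap_pinv = np.linalg.pinv(laplacian)
    diag = np.diag(lap_pinv)
    # R_ab = L+_aa + L+_bb - 2 L+_ab
    return diag[:, None] + diag[None, :] - 2.0 * lap_pinv

def sample_covariance(delta_demes, alpha, lam):
    """Steps (2) and (3): restrict distances to sampled demes, then form 11' - lambda * Delta."""
    alpha = np.asarray(alpha)
    delta = delta_demes[np.ix_(alpha, alpha)].copy()
    np.fill_diagonal(delta, 0.0)          # distance of an individual to itself is 0
    n = alpha.size
    return np.ones((n, n)) - lam * delta

# Hypothetical path graph 0 - 1 - 2 with rates 1 and 2; deme 1 is not sampled.
R = effective_resistance(3, [(0, 1, 1.0), (1, 2, 2.0)])
Sigma = sample_covariance(R, alpha=[0, 0, 2], lam=0.2)
print(R)
print(Sigma)
```

On a path the resistances add along the route (1 + 1/2 between the end demes here), and the unsampled middle deme still shapes the distance between the sampled ones.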

In a certain sense, $T$ is an "appropriate" dissimilarity measure for population structure as genetically similar individuals are likely to have a recent common ancestor and thus shorter coalescence time. For the stepping-stone model we can obtain the matrix of pairwise coalescence times $T$ exactly or approximate it with the matrix of effective resistances, $R$. However, the stepping-stone model itself does not represent the true history of the population: the grid is placed arbitrarily and there are underlying assumptions, including equilibrium in time, low mutation rate and no selection. Therefore, in a manner similar to McRae's definition of the effective migration rate, $m_{\alpha\beta}$, for a pair of demes, we should interpret the migration rate function $M = \{m_{\alpha\beta} : (\alpha, \beta) \in V \times V\}$ as an effective migration surface because it would produce the observed patterns of genetic differentiation if the population were evolving under the stepping-stone model.

3.4 Related methods for analyzing population structure

We have shown that genetic correlations can be modeled in terms of a distance matrix. This representation is motivated by the relationship between genetic similarities and expected coalescence times. However, we can consider other distance metrics (on the population graph) as long as they capture relevant features of a spatially heterogeneous habitat, and effective resistance is particularly useful because it approximates the coalescent-based metric and is efficient to compute.

Here we discuss briefly two related methods for analyzing spatially distributed populations.

3.4.1 MIGRATE

[Beerli and Felsenstein, 2001] develop an approach to estimate migration rates among demes, and more generally, to compare and rank structured population models. Their method MIGRATE is also based on the structured coalescent but it makes different assumptions about the spatial distribution and the migration pattern.

In MIGRATE the demes are sampling locations and all demes potentially exchange migrants, so the population graph is constructed without explicit geographic information. [Some edges can be excluded to test and compare various migration patterns.] Every deme in the resulting graph has a size parameter and every edge has two migration parameters. [MIGRATE allows asymmetric gene flow.] Thus for a graph with $d$ demes, the most complex model to test has $d(d - 1)$ migration rates and $d$ deme sizes.

In contrast, our method uses a regular triangular grid constructed independently of the sampling configuration [or an a priori grouping of individuals into subpopulations]. Migration is symmetric and constrained to occur only between neighboring demes but not all demes need to be sampled. Edges are grouped via a Voronoi tessellation of the habitat to encourage parameter sharing and locally constant migration. This representation is flexible and the number of (unique) migration rates varies with the number of tiles.

[A Voronoi tessellation of a Euclidean space is a partition into $T$ convex polygons (tiles) generated by $T$ distinct points (centers). The region associated with the $t$th center $u$ is the set of points closer to $u$ than any other center. Boundary points are equidistant to two centers. [Okabe et al., 2000].]

3.4.2 GENELAND

[Guillot et al., 2005] also use Voronoi tiling to model the spatial structure in genetic variation but their method GENELAND is cluster-based and thus best suited to analyze discrete structure. Since individuals sampled from geographically close locations are more likely to come from the same subpopulation, GENELAND attempts to find clusters that are both genetically and geographically coherent. Compared with a spatial representation in terms of a population graph, such clusters can correspond to single demes in the graph (e.g., if migration is low and even demes close in space are clearly differentiated); or they can correspond to groups of demes where allele frequency distributions are indistinguishable (e.g., if gene flow is high so that a mutation that arises in one deme can quickly "spread" to nearby locations).


4 Estimating Effective Rates of Migration

In this chapter we introduce a likelihood function and prior distributions to perform Bayesian inference for the effective migration surface $M$ based on the similarities observed in georeferenced genetic data. The posterior estimate of $M$ can represent graphically population-level features such as barriers to migration, or more generally, the combined effect of evolutionary processes on genetic differentiation.

Our method assumes that we have data for $n$ individuals sampled from a spatially distributed population at locations $(x_1, y_1), \ldots, (x_n, y_n)$ and genotyped at $p$ loci, either SNPs or microsatellites. The geographic information is used to assign individuals to the closest deme in the population graph $(V, E)$; this defines the sample configuration $\alpha = (\alpha_1, \ldots, \alpha_n)$. Given $G = (V, E, M)$ with symmetric migration rates $M = (m_{\alpha\beta})$ we can compute the pairwise distance matrix for the entire population, $\Delta = (\Delta_{\alpha\beta})$; given $\Delta$ and the deme indicators $\alpha$ we can obtain the expected pairwise distances for the observed sample, $\Delta = (\Delta_{ij})$. [Notation: Here we discuss the likelihood of the sample, so we will write simply $\Delta$ throughout as there is no need to distinguish between the population and the sample distance matrices.]

In the previous chapter we derived expressions for the mean and variance of the allele count vector $Z = (Z_i)$ at a segregating site [eq. (3.3) for single nucleotide polymorphisms; eq. (3.16) for microsatellites]. Recall that

$$E\{Z\} = \mu 1, \qquad \operatorname{var}\{Z\} = \sigma^2(11' - \lambda\Delta), \tag{4.1}$$

where $\mu$ is the allele frequency and $\sigma^2$ is the variance in allele frequency [in the sample, not the population]. It is convenient to normalize $\Delta$ so that $1'\Delta^{-1}1 = 1$; then the correlation matrix $\Sigma = 11' - \lambda\Delta$ is positive definite for $\lambda \in (0, 1)$ [Appendix 7.2].

Recall further that neutral sites (not under selection) develop under the same coalescent process, and therefore, the genotype vectors $Z = (Z_1, \ldots, Z_p) \in \mathbb{Z}^{n \times p}$ at $p$ segregating sites have the same correlation matrix $\Sigma$. The scalar parameters $\mu, \sigma^2$ can vary across sites. For microsatellites $\mu$ is the ancestral allele and $\sigma^2$ depends on the mutation rate $\theta$, and both are site specific. For SNPs $\mu$ is the expected allele frequency if the derived allele is coded as 1; but the labels might not be consistent as usually the minor allele is coded as 1.

Our aim here is to incorporate these expressions for the mean and variance into a likelihood function in order to infer effective migration rates from observed data. Note that every individual has mean $\mu$ regardless of location; intuitively, the shared parameter $\mu$ contains little information about patterns of genetic differentiation between individuals, as we discuss in Section 4.4. So, to simplify, assume that we observe the pairwise differences, $Z_i - Z_j$, rather than the allele counts $Z_i$. Equivalently, assume that we observe $LZ$ where $L \in \mathbb{R}^{(n-1) \times n}$ is a basis for contrasts, e.g., $L = (e_2 - e_1, e_3 - e_1, \ldots, e_n - e_1)'$ where $e_i$ is the standard basis vector with 1 in the $i$th coordinate and 0 otherwise. Note that

$$E\{LZ\} = 0, \qquad \operatorname{var}\{LZ\} = -\sigma^* L\Delta L', \tag{4.2}$$

where we define $\sigma^* = \lambda\sigma^2$ because the variance and the scale are no longer separately identifiable. The matrix $-L\Delta L'$ is positive definite, and thus a valid covariance matrix, because the distance matrix $\Delta$ is negative definite on contrasts and $L'v$ is a contrast for every $v \in \mathbb{R}^{n-1}$.
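A small numerical illustration of equation (4.2), with the same hypothetical line-distance matrix standing in for $\Delta$: the constant term $11'$ drops out under contrasts, and $-L\Delta L'$ is positive definite.

```python
import numpy as np

def contrast_basis(n):
    """Rows e_2 - e_1, ..., e_n - e_1: an (n-1) x n basis for contrasts."""
    L = np.zeros((n - 1, n))
    L[:, 0] = -1.0
    L[np.arange(n - 1), np.arange(1, n)] = 1.0
    return L

# Hypothetical values for illustration only.
x = np.array([0.0, 1.0, 3.0])
delta = np.abs(x[:, None] - x[None, :])
sigma2, lam = 0.3, 0.4
L = contrast_basis(3)

var_Z = sigma2 * (np.ones((3, 3)) - lam * delta)      # eq. (4.1)
var_LZ = L @ var_Z @ L.T                              # covariance of the contrasts
print(np.allclose(var_LZ, -(lam * sigma2) * L @ delta @ L.T))
print(np.linalg.eigvalsh(-L @ delta @ L.T))           # all eigenvalues positive
```

Because $L1 = 0$, only the product $\sigma^* = \lambda\sigma^2$ survives in the covariance of the contrasts, which is why the two scalars cannot be separated.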

Therefore, it might be natural to assume a Normal likelihood for the pairwise differences,

$$LZ \mid \sigma^*, \Delta \sim N_{n-1}(0, -\sigma^* L\Delta L'). \tag{4.3}$$

Suppose further that the genotyped markers are independent; then it is straightforward to extend the Normal likelihood (4.3) for one locus to multiple loci. In particular, for SNP data where usually there are many more SNPs than individuals and mutation rates are low, let $S = ZZ'/p$ be the observed similarity matrix averaged across $p$ SNPs. Then $LSL'$ is a scatter matrix of pairwise differences and

$$LSL' \mid \sigma^*, \Delta \sim W_{n-1}\Big(p, -\frac{\sigma^*}{p}(L\Delta L')\Big), \tag{4.4}$$

where the degrees of freedom are the number of independent SNPs and the scale parameter $\sigma^*$ is shared. Therefore, by considering the pairwise differences, we avoid estimating a nuisance parameter $\mu$ with dimensionality that grows with the number of markers $p$. In practice we also gain efficiency with faster MCMC convergence.

4.1 Effective degrees of freedom for SNP data

So far we have considered the case where the $p$ genotyped markers are independent (unlinked). The assumption of independence between loci is very strong and likely to be violated. In particular, SNPs in close proximity are often associated (in linkage disequilibrium) because individuals inherit long segments of unbroken DNA from their parents. For this reason, SNP data are often "thinned" by removing SNPs in high LD. We propose an alternative method to correct for model mis-specification due to both dependence between SNPs and non-normality of genotypes.

In the Wishart likelihood (4.4) the scatter matrix of contrasts, $LSL'$, has known degrees of freedom $p$. However, instead of fixing the degrees of freedom to the number of genotyped SNPs, we can estimate this parameter. The likelihood for the scatter matrix becomes

$$LSL' \mid k, \sigma^*, \Delta \sim W_{n-1}\Big(k, -\frac{\sigma^*}{k}(L\Delta L')\Big), \tag{4.5}$$

with degrees of freedom $k \in (n, p)$. Both Wishart likelihoods (4.4) and (4.5) imply $E\{LSL'\} = -\sigma^* L\Delta L'$. Therefore, estimating the degrees of freedom does not affect the expected pairwise differences as a function of effective migration. However, the Wishart variance is proportional to $(\sigma^*)^2/k$, so if we infer $k \in (n, p)$ rather than set $k = p$, the model variance increases as we would expect if the data contain less information than the sample size suggests, or more generally, if the model is mis-specified. Under normality, $k = p$ implies that all sites are independent; otherwise, the variance increases by a factor of $p/k$.
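The log density implied by (4.5) can be written out directly. The sketch below evaluates the standard Wishart log density for the scatter matrix of contrasts using only NumPy and `math.lgamma`; the function names and the inputs are hypothetical, and this is an illustration of the formula rather than the thesis implementation.

```python
import math
import numpy as np

def log_multivariate_gamma(a, dim):
    """log Gamma_dim(a) = dim(dim-1)/4 * log(pi) + sum_j lgamma(a + (1 - j)/2)."""
    return (dim * (dim - 1) / 4.0) * math.log(math.pi) + sum(
        math.lgamma(a + (1 - j) / 2.0) for j in range(1, dim + 1))

def wishart_loglik(k, sigma_star, delta, S, L):
    """Log density of L S L' ~ W_{n-1}(k, -(sigma*/k) L Delta L'), as in eq. (4.5)."""
    n = delta.shape[0]
    X = L @ S @ L.T                                     # observed scatter of contrasts
    B = -np.linalg.inv(L @ delta @ L.T) / sigma_star    # -(L Delta L')^{-1} / sigma*
    _, logdet_B = np.linalg.slogdet(B)
    _, logdet_X = np.linalg.slogdet(X)
    return (0.5 * k * logdet_B
            - 0.5 * k * np.trace(B @ X)
            + 0.5 * (k - n) * logdet_X
            + 0.5 * k * (n - 1) * math.log(k / 2.0)
            - log_multivariate_gamma(k / 2.0, n - 1))

# Hypothetical two-individual example: the Wishart reduces to a scaled chi-square.
delta = np.array([[0.0, 1.0], [1.0, 0.0]])
L = np.array([[-1.0, 1.0]])
print(wishart_loglik(k=4.0, sigma_star=0.5, delta=delta, S=np.eye(2), L=L))
```

With two individuals the scatter is a scalar following a Gamma distribution with shape $k/2$, which gives an independent check of the expression.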


4.2 Prior on migration surface represented as a Voronoi tessellation

We have proposed a model for population structure in terms of expected pairwise distances on a population graph $G = (V, E, M)$ where $(V, E)$ is a regular grid and $M$ assigns effective migration rates to edges in the graph. The goal is to estimate the effective migration surface $M$ so that the demographic model $G$ explains the observed genetic dissimilarities. The grid is fixed; the likelihood is defined in the previous section. Here we consider prior specification for $M$.

The regular grid $(V, E)$ is not determined by the sampling locations and it yields a high-dimensional, flexible representation so that fine features in the effective migration surface can emerge if supported by the data. To take advantage of this flexibility, we organize the edges in terms of a Voronoi tessellation of the habitat. Statistically, the Voronoi decomposition offers the advantages of parameter sharing and a locally smooth migration surface. Previous applications of Voronoi tiling in population genetics include [Guillot et al., 2005] and [Wasser et al., 2004].

A Voronoi tessellation of the migration surface $M$ is fully specified by the number of tiles $T$, their locations $u$ and migration rates $m$. Thus $m = \{m_t : t = 1, \ldots, T\}$ is the set of effective migration rates for the $T$ tiles in the partition. Furthermore, let edge $(\alpha, \beta) \in E$ have migration rate

$$m_{\alpha\beta} = \tfrac{1}{2} m_{t_\alpha} + \tfrac{1}{2} m_{t_\beta}, \tag{4.6}$$

where $t_\alpha$ denotes the tile deme $\alpha$ falls into. That is, the rate of an edge is the average rate of the two tiles it connects.

Migration rates are naturally positive and therefore we parametrize them on the log scale as differences from the overall mean rate $\bar{\ell}_m$,

$$\log_{10}(m_t) = \bar{\ell}_m + e_t. \tag{4.7}$$

If the effect of distance on differentiation is space-homogeneous and the tile-specific effects $e_t$ are (close to) 0, the migration pattern thus produced would correspond to isolation by distance.
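Equations (4.6) and (4.7) translate into a short computation: assign each deme to its nearest tile center, convert the log-scale parameters to tile rates, and average over the two endpoints of every edge. The function names, coordinates and parameter values below are hypothetical.

```python
import numpy as np

def tile_of_each_deme(deme_xy, centers_xy):
    """Index of the nearest Voronoi center for every deme (Euclidean distance)."""
    sq_dist = ((deme_xy[:, None, :] - centers_xy[None, :, :]) ** 2).sum(axis=2)
    return sq_dist.argmin(axis=1)

def edge_migration_rates(deme_xy, edges, centers_xy, mean_log_rate, tile_effects):
    """Edge rates from eq. (4.6), with tile rates given by eq. (4.7)."""
    tile_rates = 10.0 ** (mean_log_rate + np.asarray(tile_effects, dtype=float))
    tiles = tile_of_each_deme(deme_xy, centers_xy)
    return {(a, b): 0.5 * tile_rates[tiles[a]] + 0.5 * tile_rates[tiles[b]]
            for a, b in edges}

# Hypothetical habitat: four demes on a line and two tiles, the second with a
# tenfold lower rate (tile effect -1 on the log10 scale).
deme_xy = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
centers_xy = np.array([[0.5, 0.0], [2.5, 0.0]])
rates = edge_migration_rates(deme_xy, [(0, 1), (1, 2), (2, 3)], centers_xy,
                             mean_log_rate=0.0, tile_effects=[0.0, -1.0])
print(rates)
```

The edge that crosses the tile boundary receives the average of the two tile rates, so the surface changes in steps at tile borders.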

Therefore, our model has the following parameters:

1. parameters of interest $\Theta_1$ that determine the effective migration rates $M$ and thus the effective pairwise distances $\Delta$. These are

• $(T, \bar{\ell}_m, \sigma_m^2)$: number of tiles, mean and variance of tile migration rates on the log (base 10) scale.

• $\{(e_t, u_t) : t = 1, \ldots, T\}$: relative effect and center location for each Voronoi tile $t$. The dimensionality of this group of parameters changes with the number of Voronoi tiles $T$.

• $k$: effective degrees of freedom for SNP data where we observe more sites than individuals, i.e., $p > n$.

2. nuisance parameters $\Theta_0$ that do not depend on the demographic model. For SNP data this is the scale parameter $\sigma^*$; for microsatellite data each site has its own scale parameter $\sigma_s^*$ because mutation rates vary across sites and under the stepwise mutation model the scale $\sigma_s^*$ is the mutation rate $\theta_s$.

Using the Voronoi tessellation $\mathcal{V}(T, u, e)$ to represent $M$, we can have fewer than $|E|$ rate parameters to estimate but we do not know how many tiles we need and where their centers are. This depends on the patterns of genetic differentiation across the habitat.


To complete the Bayesian specification we place priors on the model parameters:

$$\text{(number of Voronoi tiles)} \quad T \mid \nu \sim \operatorname{Po}(\nu), \tag{4.8a}$$

$$\text{(tile locations)} \quad u \mid T \overset{iid}{\sim} U(\mathcal{H}), \tag{4.8b}$$

$$\text{(tile effects)} \quad e \mid \sigma_e^2, T \overset{iid}{\sim} N(0, \sigma_e^2). \tag{4.8c}$$

The hyperparameter $\nu$ controls how much spatial heterogeneity the effective migration surface exhibits. The rate hyperparameters are

$$\text{(overall migration rate)} \quad \bar{\ell}_m \sim U(lob, upb), \tag{4.9a}$$

$$\text{(tile variance)} \quad \sigma_m^2 \sim \text{Inv-G}(a/2, b/2). \tag{4.9b}$$

[For all results we report here $a = 6$, $b = 3$.] The lower and upper bounds on the mean log rate are chosen so that the mean migration rate varies in the range $[1/300, 300]$ on the original scale. The bounds are somewhat arbitrary but based on simulations of genetic data with ms [Hudson, 2002]. Restricting the support is necessary because the model is not numerically stable at the two extremes:

• When migration rates are very small (relative to coalescence rates), it takes a very long time on average for two lineages from different demes to coalesce. In the limit, the population is a collection of unrelated subpopulations that evolve independently.

• When migration rates are very large (relative to coalescence rates), the time it takes to move from one deme to another is negligible compared to the coalescence times. In the limit, the population behaves like a panmictic population without any structure.

The prior on the effective degrees of freedom is uniform on the log scale:

$$\text{(degrees of freedom)} \quad \pi(k) \propto \frac{1}{k}. \tag{4.10}$$

The prior is proper because $k$ is bounded: $k > n$ because $k$ is the degrees of freedom parameter in a Wishart distribution, and $k < p$ because $k$ should not exceed the number of observed sites (features). [The normalizing constant is therefore $\log(p) - \log(n)$.]

$$\text{(scale parameter)} \quad \sigma^* \sim \text{Inv-G}(c/2, d/2). \tag{4.11}$$

[For all results we report here $c = 1$, $d = 1$.]

We use reversible-jump MCMC to estimate $T$ as the dimension of both $u$ and $e$ changes as the number of tiles $T$ increases or decreases. Full details about the MCMC implementation are given in Appendix 7.6.
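To make the prior specification concrete, here is a sketch of a single draw from (4.8) to (4.11). Several choices are assumptions of this sketch and not statements about the thesis: the habitat is taken to be a rectangle, a draw of zero tiles is redrawn, Inv-G is read as shape and scale, the variance drawn in (4.9b) is used as the tile-effect variance in (4.8c), and the bounds $lob$, $upb$ are set to $\log_{10}(1/300)$ and $\log_{10}(300)$ to match the stated range.

```python
import math
import numpy as np

def draw_from_prior(rng, nu, habitat, n, p, a=6.0, b=3.0, c=1.0, d=1.0,
                    lob=math.log10(1 / 300), upb=math.log10(300)):
    """One joint draw of the model parameters from the priors (4.8)-(4.11)."""
    (xmin, xmax), (ymin, ymax) = habitat           # assumption: rectangular habitat
    n_tiles = 0
    while n_tiles == 0:                            # assumption: redraw if no tiles
        n_tiles = int(rng.poisson(nu))             # eq. (4.8a)
    centers = np.column_stack([rng.uniform(xmin, xmax, n_tiles),
                               rng.uniform(ymin, ymax, n_tiles)])   # eq. (4.8b)
    tile_var = (b / 2.0) / rng.gamma(a / 2.0)      # eq. (4.9b), inverse gamma draw
    effects = rng.normal(0.0, math.sqrt(tile_var), n_tiles)         # eq. (4.8c)
    mean_log_rate = rng.uniform(lob, upb)          # eq. (4.9a)
    k = math.exp(rng.uniform(math.log(n), math.log(p)))  # eq. (4.10): uniform in log k
    sigma_star = (d / 2.0) / rng.gamma(c / 2.0)    # eq. (4.11)
    return {"n_tiles": n_tiles, "centers": centers, "effects": effects,
            "tile_var": tile_var, "mean_log_rate": mean_log_rate,
            "k": k, "sigma_star": sigma_star}

draw = draw_from_prior(np.random.default_rng(1), nu=5.0,
                       habitat=((0.0, 10.0), (0.0, 5.0)), n=20, p=1000)
print(draw["n_tiles"], round(draw["mean_log_rate"], 3), round(draw["k"], 1))
```

Drawing $\log k$ uniformly on $(\log n, \log p)$ is equivalent to the density $\pi(k) \propto 1/k$ with normalizing constant $\log(p) - \log(n)$.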

4.3 Likelihood for distance matrices

The Wishart likelihood (4.5) is given in terms of the contrast basis $L$ but it does not depend on the choice of $L$. Instead, it can be written in terms of the observed similarities $S = ZZ'/p$, the model distances $\Delta$ and its orthogonal projection $Q$ given by

$$Q = I - \frac{11'\Delta^{-1}}{1'\Delta^{-1}1}. \tag{4.12}$$
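The claim that the likelihood does not depend on the contrast basis rests on the identity $L'(L\Delta L')^{-1}L = \Delta^{-1}Q$, which holds for any full-rank $L$ with $L1 = 0$. The sketch below checks it numerically for two different bases; the distance matrix and the function names are hypothetical.

```python
import numpy as np

def projection_Q(delta):
    """Q = I - 1 1' Delta^{-1} / (1' Delta^{-1} 1), as in eq. (4.12)."""
    n = delta.shape[0]
    ones = np.ones(n)
    w = np.linalg.solve(delta, ones)       # Delta^{-1} 1 (Delta is symmetric)
    return np.eye(n) - np.outer(ones, w) / (ones @ w)

def contrast_basis(n, ref=0):
    """Rows e_i - e_ref for all i != ref."""
    others = [i for i in range(n) if i != ref]
    L = np.zeros((n - 1, n))
    L[np.arange(n - 1), others] = 1.0
    L[:, ref] = -1.0
    return L

def contrast_form(delta, L):
    """L' (L Delta L')^{-1} L."""
    return L.T @ np.linalg.inv(L @ delta @ L.T) @ L

x = np.array([0.0, 1.0, 3.0])
delta = np.abs(x[:, None] - x[None, :])    # hypothetical distance matrix
Q = projection_Q(delta)
target = np.linalg.inv(delta) @ Q

print(np.allclose(Q @ np.ones(3), 0.0))                       # Q annihilates 1
print(np.allclose(contrast_form(delta, contrast_basis(3, 0)), target))
print(np.allclose(contrast_form(delta, contrast_basis(3, 2)), target))
```

Since the trace term of the likelihood depends on $L$ only through this product, changing the reference individual leaves it unchanged.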


In Appendix 7.4 we show that the Wishart log likelihood that corresponds to the model (4.5) can be written as

$$
\begin{aligned}
\ell(k, \sigma^*, \Delta) ={}& \frac{k}{2} \log\det\{-(L\Delta L')^{-1}/\sigma^*\} - \frac{k}{2} \operatorname{tr}\{-(L\Delta L')^{-1} LSL'/\sigma^*\} \\
&+ \frac{k-n}{2} \log\det\{LSL'\} + \frac{k(n-1)}{2} \log(k/2) - \log\Gamma_{n-1}(k/2) \\
={}& \frac{k}{2} \log\operatorname{Det}\{-\Delta^{-1}Q/\sigma^*\} - \frac{k}{2} \operatorname{tr}\{-(\Delta^{-1}Q)S/\sigma^*\} \\
&+ \frac{k-n}{2} \log\det\{S\} + \frac{k(n-1)}{2} \log(k/2) - \log\Gamma_{n-1}(k/2) \\
&+ \frac{k-n}{2} \log\left(\frac{1'S^{-1}1}{1'1}\right) - \frac{n}{2} \log\det\{LL'\}
\end{aligned}
\tag{4.13}
$$

The only term that involves the residual basis $L$ is $(n/2) \log\det\{LL'\}$. Regardless of the choice for $L$, this term does not depend on the parameters $(k, \sigma^*, \Delta)$. Full details about the likelihood computation are given in Appendix 7.5.
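The claim that the choice of contrast basis only shifts the log likelihood by a constant can be checked numerically. The sketch below transcribes the first form of (4.13) and evaluates it for two bases of the same contrast space at two parameter settings. Everything about the setup is an assumption made for illustration: the data are random, $\Delta$ is a squared Euclidean distance matrix of a random point configuration (so that $-L\Delta L'$ is positive definite), and the helper names are hypothetical.

```python
import math
import numpy as np

def log_multigamma(a, d):
    """Log of the multivariate gamma function Gamma_d(a)."""
    return d * (d - 1) / 4.0 * math.log(math.pi) + sum(
        math.lgamma(a - j / 2.0) for j in range(d))

def wishart_loglik(k, sigma_star, Delta, S, L):
    """First form of (4.13), in terms of a contrast basis L with L @ 1 = 0."""
    n = Delta.shape[0]
    A = -(L @ Delta @ L.T)                 # assumed positive definite
    LSL = L @ S @ L.T
    logdet_A = np.linalg.slogdet(A)[1]
    logdet_LSL = np.linalg.slogdet(LSL)[1]
    # tr{-(L Delta L')^{-1} L S L'} / sigma* = tr{A^{-1} L S L'} / sigma*
    trace_term = np.trace(np.linalg.solve(A, LSL)) / sigma_star
    return (0.5 * k * (-logdet_A - (n - 1) * math.log(sigma_star))
            - 0.5 * k * trace_term
            + 0.5 * (k - n) * logdet_LSL
            + 0.5 * k * (n - 1) * math.log(k / 2.0)
            - log_multigamma(k / 2.0, n - 1))

rng = np.random.default_rng(0)
n, p = 5, 40
X = rng.normal(size=(n, n))                       # hypothetical point configuration
G = X @ X.T
g = np.diag(G)
Delta = g[:, None] + g[None, :] - 2.0 * G         # squared Euclidean distances
Z = rng.normal(size=(n, p))                       # stand-in for the data
S = Z @ Z.T / p

L1 = np.hstack([np.eye(n - 1), -np.ones((n - 1, 1))])   # simple contrasts
L2 = rng.normal(size=(n - 1, n - 1)) @ L1               # another basis of the same space

def logdet_gram(L):
    return np.linalg.slogdet(L @ L.T)[1]

# Predicted gap between the two bases: the -(n/2) logdet{LL'} term only.
offset = -0.5 * n * (logdet_gram(L2) - logdet_gram(L1))
gaps = [wishart_loglik(k, s, d * Delta, S, L2) - wishart_loglik(k, s, d * Delta, S, L1)
        for k, s, d in [(10.0, 1.0, 1.0), (25.0, 0.3, 2.5)]]
print(gaps, offset)
```

If the two printed gaps agree with the offset, the basis affects the likelihood only through a term that is free of $(k, \sigma^*, \Delta)$, so inference is unchanged.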

4.3.1 Related model

This is the marginal likelihood for distance matrices introduced in [McCullagh, 2009]. Let $D$ be the $n \times n$ pairwise dissimilarity matrix given by

$$D = 1\,\mathrm{diag}(S)' + \mathrm{diag}(S)\,1' - 2S. \tag{4.14}$$

The orthogonal projection $Q = I - 11'\Sigma^{-1}/(1'\Sigma^{-1}1)$ satisfies

$$Q'\Sigma^{-1} = Q'\Sigma^{-1}Q = -\lambda Q'\Delta^{-1}Q = -\lambda Q\Delta^{-1} \tag{4.15}$$

since $\ker\{Q\} = \{1\}$ and thus $Q1 = 0$. Similarly, $QDQ' = -2QSQ'$. Therefore, for fixed $k = p$ and after we ignore all terms that do not involve $\Delta$ or $\sigma^*$,

$$
\begin{aligned}
\ell(\sigma^*, \Delta ; S) \propto \ell(\sigma^2, \Sigma ; D) &\propto \frac{p}{2} \log\operatorname{Det}\{-\Delta^{-1}Q/\sigma^*\} - \frac{p}{2} \operatorname{tr}\{-\Delta^{-1}QS/\sigma^*\} && (4.16a) \\
&= \frac{p}{2} \log\operatorname{Det}\{\Sigma^{-1}Q/\sigma^2\} + \frac{p}{4} \operatorname{tr}\{\Sigma^{-1}QD/\sigma^2\} && (4.16b)
\end{aligned}
$$

where $\sigma^* = \lambda\sigma^2$.

Recently, [Hanks and Hooten, 2013] build this likelihood into a parametric model for isolation by resistance [McRae, 2006]. Briefly, the genetic data is generated by a Gaussian Markov random field on an undirected graph (circuit). The covariance structure is given by an intrinsic conditional autoregressive model, i.e., the conditional distribution of each node given the rest of the graph is normal with mean and variance that depend on its first-degree neighbors only. [Hanks and Hooten, 2013] specify the model so that the expected square differences between nodes are exactly effective resistance distances on the population graph. In our notation, let $\Delta = R$ be the matrix of effective resistances. [Note that this is slightly different from the IBR-based approximation to expected coalescence times $\Delta = T$.] [Bapat, 2004] shows that

$$R^{-1} = -\tfrac{1}{2}L + \tau\tau' \tag{4.17}$$

where $L$ is the Laplacian of the graph $G = (V, E, M)$ and $\tau\tau'$ is a rank-one update. Then

$$R^{-1}Q = R^{-1} - \frac{R^{-1}11'R^{-1}}{1'R^{-1}1} = \left(-\tfrac{1}{2}L + \tau\tau'\right) - \frac{(\tau(1'\tau))(\tau(1'\tau))'}{(1'\tau)^2} = -\tfrac{1}{2}L \tag{4.18}$$

$$
\begin{aligned}
(R + 11')^{-1} &= R^{-1}Q + \frac{R^{-1}11'R^{-1}}{1'R^{-1}1} - \frac{R^{-1}11'R^{-1}}{1 + 1'R^{-1}1} \\
&= R^{-1}Q + \frac{R^{-1}11'R^{-1}}{(1'R^{-1}1)(1 + 1'R^{-1}1)} \\
&= -\tfrac{1}{2}L + \frac{\tau\tau'}{1 + (1'\tau)^2}
\end{aligned}
\tag{4.19}
$$

$$Q[R + 11'] = I - \frac{1\tau'(1'\tau)}{1 + (1'\tau)^2} \cdot \frac{1 + (1'\tau)^2}{(1'\tau)^2} = I - \frac{1\tau'}{1'\tau} \tag{4.20}$$

$$(R + 11')^{-1}Q = -\tfrac{1}{2}L + \frac{\tau\tau'}{1 + (1'\tau)^2} - \frac{\tau\tau'}{1 + (1'\tau)^2} = -\tfrac{1}{2}L \tag{4.21}$$

That is, $B^{-1}Q[B]/4 = R^{-1}Q[R]$ where $B = R/4 + 11'$. [Hanks and Hooten, 2013] represent conductances between connected nodes as a function of landscape features, e.g., elevation. Instead we represent conductances [i.e., migration rates] through a colored Voronoi tessellation and estimate edge weights without reference to available ecological variables.
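The identities (4.18) and (4.21) can be verified numerically on a small graph. The sketch below is only an illustration: the 4-node graph and its edge weights are made up, and effective resistances are computed from the Moore-Penrose pseudoinverse of the Laplacian, a standard construction that is assumed here rather than taken from the text. The Laplacian is named `lap` to avoid confusion with the contrast basis $L$.

```python
import numpy as np

def effective_resistance(lap):
    """R_ij = L+_ii + L+_jj - 2 L+_ij, with L+ the pseudoinverse of the Laplacian."""
    lap_pinv = np.linalg.pinv(lap)
    d = np.diag(lap_pinv)
    return d[:, None] + d[None, :] - 2.0 * lap_pinv

def projection(delta):
    """Q[delta] = I - 11' delta^{-1} / (1' delta^{-1} 1), as in (4.12)."""
    n = delta.shape[0]
    ones = np.ones((n, 1))
    delta_inv = np.linalg.inv(delta)
    return np.eye(n) - (ones @ ones.T @ delta_inv) / (ones.T @ delta_inv @ ones).item()

# Hypothetical symmetric edge weights (conductances) on a connected 4-node graph.
W = np.array([[0.0, 1.0, 2.0, 0.0],
              [1.0, 0.0, 1.0, 3.0],
              [2.0, 1.0, 0.0, 1.0],
              [0.0, 3.0, 1.0, 0.0]])
lap = np.diag(W.sum(axis=1)) - W          # graph Laplacian
R = effective_resistance(lap)

lhs_R = np.linalg.inv(R) @ projection(R)  # left-hand side of (4.18)
B = R + np.ones_like(R)
lhs_B = np.linalg.inv(B) @ projection(B)  # left-hand side of (4.21)

print(np.allclose(lhs_R, -0.5 * lap), np.allclose(lhs_B, -0.5 * lap))
```

Both products should recover $-\tfrac{1}{2}$ times the Laplacian, which is why adding the constant matrix $11'$ to the resistances leaves the likelihood unchanged.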

Modeling the dissimilarity matrix $D$ instead of the raw allele counts $Z$ is convenient because

• Suppose that $O$ is an orthogonal transformation (rotation or reflection). Then

$$S_O = (ZO)(ZO)' = ZOO'Z' = ZZ' = S$$

• Suppose $T$ is a translation by $\mu = (\mu_1, \ldots, \mu_p)'$. Then

$$D^T_{ij} = \langle (z_i - \mu) - (z_j - \mu) \rangle^2 = \langle z_i - z_j \rangle^2 = D_{ij}$$

Although the transformation from the entire data $Z$ to the summary $D$ is not a one-to-one transformation, we do not lose information about relative distances, i.e., population structure. Instead we lose information about some nuisance parameters. For example, $S \to D$ makes the labeling of the alleles irrelevant.
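The two invariance properties can be seen on toy data. In this sketch the genotype matrix is random 0/1 data with rows as individuals and columns as sites, and the rotation and shift are arbitrary; all of these are assumptions for illustration. The dissimilarity matrix is built from $S$ exactly as in (4.14).

```python
import numpy as np

def similarity(Z):
    """S = ZZ'/p for an n x p data matrix Z."""
    return Z @ Z.T / Z.shape[1]

def dissimilarity(Z):
    """D = 1 diag(S)' + diag(S) 1' - 2S, as in (4.14)."""
    S = similarity(Z)
    s = np.diag(S)
    return s[:, None] + s[None, :] - 2.0 * S

rng = np.random.default_rng(0)
n, p = 6, 50
Z = rng.integers(0, 2, size=(n, p)).astype(float)   # toy 0/1 genotypes

O = np.linalg.qr(rng.normal(size=(p, p)))[0]        # random orthogonal matrix
mu = rng.normal(size=p)                             # arbitrary translation

D = dissimilarity(Z)
print(np.allclose(D, dissimilarity(Z @ O)))         # rotation or reflection
print(np.allclose(D, dissimilarity(Z - mu)))        # translation
print(np.allclose(similarity(Z), similarity(Z - mu)))
```

The last comparison is included because a translation generally changes $S$ even though it leaves $D$ untouched, which is the sense in which $D$ discards nuisance information.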

4.4 What do we lose from ignoring the means?

We can use the marginal likelihood (4.3) because sampled individuals are equally distant from the most recent common ancestor of the sample [the root of the genealogy tree], and therefore, share a common mean. Thus $K = 1$ is a basis for the mean space. [Recall that $L$ is a basis for the residual space of pairwise differences.] Therefore,

$$
\begin{pmatrix} 1' \\ L \end{pmatrix} Z \sim N_n\left( \begin{pmatrix} \mu \\ 0 \end{pmatrix},\ \sigma^2 \begin{pmatrix} 1'\Sigma 1 & 1'\Sigma L' \\ L\Sigma 1 & L\Sigma L' \end{pmatrix} \right)
\tag{4.22}
$$

Let $T = 1'Z = \sum_{i=1}^{n} Z_i$ and $Y = LZ$.

$$Y = LZ \mid \mu, \Sigma \sim N_{n-1}(0, \sigma^2 L\Sigma L') \tag{4.23a}$$

$$
\begin{aligned}
T \mid Y, \Sigma &\sim N\big(\mu + 1'\Sigma L'(L\Sigma L')^{-1}Y,\ \sigma^2[1'\Sigma 1 - 1'\Sigma L'(L\Sigma L')^{-1}L\Sigma 1]\big) \\
&= N\big(\mu + 1'QZ,\ \sigma^2\, 1'1(1'\Sigma^{-1}1)^{-1}1'1\big) \\
&= N\big(\mu + 1'QZ,\ (1 - \lambda)n^2\sigma^2\big)
\end{aligned}
\tag{4.23b}
$$

where $Q = \Sigma L'(L\Sigma L')^{-1}L = I - 1(1'\Sigma^{-1}1)^{-1}1'\Sigma^{-1}$ and $1'\Sigma^{-1}1 = 1'\Delta^{-1}1/(1 - \lambda)$.

The conditional distribution of $T$ depends on $\Delta$ only through the bias term $1'QZ$. Therefore we choose to ignore it and use only the marginal distribution of $Y$ to infer $\Delta$.
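The two expressions for the projection used in (4.23b), and the simplification of the conditional variance (up to the factor $\sigma^2$), can be checked numerically. In this sketch $\Sigma$ is an arbitrary positive definite matrix and $L$ is the simple difference basis; both are assumptions made for illustration, not the matrices of the model.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
M = rng.normal(size=(n, n))
Sigma = M @ M.T + n * np.eye(n)                          # hypothetical covariance
ones = np.ones((n, 1))
L = np.hstack([np.eye(n - 1), -np.ones((n - 1, 1))])     # contrast basis, L @ 1 = 0

Sigma_inv = np.linalg.inv(Sigma)
c = (ones.T @ Sigma_inv @ ones).item()                   # scalar 1' Sigma^{-1} 1

# Q = Sigma L' (L Sigma L')^{-1} L  versus  Q = I - 1 (1' Sigma^{-1} 1)^{-1} 1' Sigma^{-1}
Q_contrast = Sigma @ L.T @ np.linalg.inv(L @ Sigma @ L.T) @ L
Q_closed = np.eye(n) - ones @ ones.T @ Sigma_inv / c

# 1' Sigma 1 - 1' Sigma L' (L Sigma L')^{-1} L Sigma 1  versus  1'1 (1' Sigma^{-1} 1)^{-1} 1'1
cond_var = (ones.T @ Sigma @ ones - ones.T @ Q_contrast @ Sigma @ ones).item()

print(np.allclose(Q_contrast, Q_closed), np.isclose(cond_var, n * n / c))
```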

4.5 Standardizing genotype data

Before performing PCA analysis for population structure it is common to standardize SNPs and to set the missing alleles to the observed average at the corresponding marker [McVean, 2009, Price et al., 2006]. The motivation is that without normalization SNPs with higher variance contribute more to the scatter matrix $ZZ'$. Therefore, the procedure tends to up-weight the influence of rare variants. Here we discuss why neither centering the genotypes to have mean 0 nor standardizing the variance is appropriate when analyzing population structure.

In matrix notation, let $C = I - 11'/n$ be the centering matrix for $n$ observations. Then multiplying by $C$ removes the mean:

$$X = CZ \overset{\text{iid}}{\sim} N_n(\mu C1, \sigma^2 C\Sigma C) = N_n(0, -\sigma^* C\Delta C). \tag{4.24}$$

This operation is convenient because $XX'$ has central Wishart distribution. It also makes the labelling of alleles as ancestral or derived ['0' or '1'] irrelevant, up to a change in sign. Suppose that we "flip" the 0/1 labels at a particular site, i.e., $z^* = 1 - z$. Then $x^* = C(1 - z) = -Cz = -x$ because $C1 = (I - 11'/n)1 = 0$.
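A small numerical illustration of the label-flipping argument follows. The genotype vector is made up for the example; the sketch only shows that centering turns a relabeling into a sign change, so the site's contribution to the scatter matrix is unaffected.

```python
import numpy as np

n = 6
C = np.eye(n) - np.ones((n, n)) / n              # centering matrix C = I - 11'/n

z = np.array([0.0, 1.0, 1.0, 0.0, 1.0, 0.0])     # hypothetical 0/1 labels at one site
x = C @ z                                        # centered genotypes
x_flipped = C @ (1.0 - z)                        # same site with labels swapped

print(np.allclose(x_flipped, -x))
print(np.allclose(np.outer(x, x), np.outer(x_flipped, x_flipped)))
```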

However, centering with $C$ assumes the individuals are independent and identically distributed, i.e., no population structure: if $\Sigma = I$, then the projection $Q$ onto the space of contrasts is $Q = I - 11'\Sigma^{-1}/(1'\Sigma^{-1}1) = C$. Since our model assumes the individuals are coupled with correlation given by $11' - \lambda\Delta$, it is not appropriate to naively center the genotypes to have mean 0 or to substitute the average allele frequency for missing values. For SNP datasets, it is better to impute missing SNP values before analyzing population structure. There are various imputation algorithms but they all would take into account similarities across observed alleles to impute missing ones. For microsatellite datasets, which are usually much smaller and harder to impute, we use the likelihood for the observed pairwise distances only. [So that the sample configuration $\alpha$ is really site-specific.]

Furthermore, it might not be appropriate to standardize SNPs to have the same variance precisely because this up-weights the contribution of rare alleles [McVean, 2009]. A mutation in effect splits sampled individuals into two groups that are slightly different genetically: those that carry the mutation versus those that do not. Intuitively, newer and especially private mutations, which are carried by a single individual, are informative for structure that is too fine to represent with a model at the level of demes.

5 Simulations of Structured Genetic Data

In this chapter we describe several simulated scenarios that we use to evaluate the performance of our method for estimating effective migration as well as to illustrate some of its properties. We use the program ms [Hudson, 2002] to simulate genetic data under the structured coalescent. Given the model parameters (deme sizes and migration rates) and the sample configuration, ms first generates a random genealogy, which describes the history of the sample from the present to its most recent common ancestor, and then places a Poisson number of mutations uniformly (and independently of each other) on the tree.

We use ms to simulate independent and identically distributed genealogies under the stepping-stone model $G = (V, E, M)$ with conservative migration $M = (m_{\alpha\beta})$. Therefore, the iid assumption across sites holds but the normality assumption is violated. [To generate histories with exactly one mutation, we choose a small mutation rate $\theta$ and discard genealogies that carry zero or multiple mutations.] In all examples, we generate $p = 3000$ single nucleotide polymorphisms for $n = 300$ haploid individuals on a $12 \times 8$ regular triangular grid. [The corresponding ms commands, with detailed explanations, are given in Appendix 7.8.]

5.1 Spatial structure due to constant migration

First we generate data under different patterns of migration (either uniform or with a barrier) to confirm that the method performs accurately when the underlying demographic model is correct. That is, the population does evolve on a known grid $(V, E)$ of equally sized demes and unknown migration rates. In these simulations, therefore, effective migration rates are true migration rates [up to a constant of proportionality that depends on the coalescent timescale $N_0$. We set up the simulations so that this constant is 1.]. We report migration rates, as they are parametrized, on the log (base 10) scale, and the blue/brown color scheme is based on [Brewer et al., 2003].

[Figure 5.1 panels: truth (uniform migration); inferred migration surface. Color scale: log10(m) from -1.8 to 1.8.]

Figure 5.1: Uniform migration rates and equal deme sizes: $q_\alpha = 1$ for all $\alpha \in V$ and $m_{\alpha\beta} = 1$ for all $(\alpha, \beta) \in E$. The size of the gray circles indicates the number of individuals sampled from the corresponding deme.

In Figures 5.1 and 5.2 we directly compare the truth (left) with the posterior mean (right). Not every deme in the population graph is necessarily observed but sampling is balanced because areas with higher migration are sampled with higher probability. In all three cases our method correctly captures the qualitative pattern of migration. And in Figure 5.3 we show samples from the posterior distribution on the colored Voronoi tessellation, to illustrate the uncertainty in the estimated effective migration surface.

[Figure 5.2 panels: truth (sharp barrier to migration) and inferred migration surface; truth (smooth barrier to migration) and inferred migration surface. Color scale: log10(m) from -1.8 to 1.8.]

Figure 5.2: Barrier to migration and equal deme sizes: migration rates vary between high, $m_{\alpha\beta} = 3$, and low, $m_{\alpha\beta} = 1/3$, in either a sharp or smooth pattern.

Figure 5.3: Draws from the posterior distribution of effective migration. a) sharp barrier to migration; b) smooth barrier to migration.

5.2 Spatial structure due to variation in diversity

The next set of simulations demonstrates that effective migration reflects the combined effect of demographic processes on genetic differentiation. In particular, we use two examples to show how differences in effective population size can influence effective migration rates. In the first example, lower migration rates cancel the effect of bigger deme sizes, to produce uniform effective migration. In the second example, only deme sizes vary to produce the effect of a barrier to migration.

To describe the simulations, consider the example graph with two groups of demes, $A$ (circles, smaller in size) and $B$ (squares, bigger in size), with deme sizes $N_A$ and $N_B$, respectively. Let $m_{AA}$ be the migration rate of all $A-A$ edges and $m_{BB}$ be the migration rate of all $B-B$ edges. We assign migration rates to the "across" edges $A-B$ and $B-A$ so that migration is conservative and deme sizes are constant over time, as required by the stepping-stone model. Formally,

$$
\sum_{\gamma:\,\gamma\sim\alpha,\,\gamma\in A} m_{AA}N_A + \sum_{\gamma:\,\gamma\sim\alpha,\,\gamma\in B} m_{AB}N_A = \sum_{\gamma:\,\gamma\sim\beta,\,\gamma\in A} m_{BA}N_B + \sum_{\gamma:\,\gamma\sim\beta,\,\gamma\in B} m_{BB}N_B
\tag{5.1}
$$

by definition (2.14) where the coalescence rate is $q_A = 1/N_A$. A sufficient condition for conservative migration is that

$$m_{AB}N_A = m_{BA}N_B. \tag{5.2}$$

This condition preserves the symmetry as much as possible because the number of migrants from $\alpha \in A$ to $\beta \in B$ is equal to the number of migrants from $\beta$ to $\alpha$, i.e., migration is balanced across every edge. Therefore, given the deme sizes $N_A$ and $N_B$, we let $m_{AB} = 1/N_A$, $m_{BA} = 1/N_B$. [Or more generally, we can let $m_{AB} = m_C/N_A$, $m_{BA} = m_C/N_B$ for a given between-group rate $m_C$.]

In the first example, bigger demes exchange the same number of migrants as smaller demes. To achieve this, we set $N_B = 5N_A$, $m_{BB} = m_{AA}/5$ and thus $m_{AA}N_A = m_{BB}N_B$. All demographic parameters are scaled by the coalescent timescale $N_0$, so the effective migration rate of both $A-A$ and $B-B$ edges is approximately $m_{AA}N_A/N_0 = 1/N_0$. That is, differences in population size are canceled by differences in migration rate. Consequently, we expect the migration surface to be uniform, and indeed, this is what we observe in Figure 5.5 a).

In the second example, bigger demes exchange more migrants. To achieve this, we set $N_B = 5N_A$, $m_{BB} = m_{AA}$ and thus $m_{BB}N_B > m_{AA}N_A$. Since migration rates are relative to deme sizes, at the same migration rate bigger demes exchange a higher number of migrants, which results in higher effective migration. Therefore, coalescence times between $B$ demes [on the same side of the barrier but not across it] are shorter than coalescence times between $A$ demes. Genealogies with such topology are consistent with higher migration at equal coalescence rates because lineages that transition more often between demes have fewer chances to coalesce. Consequently, we expect a barrier to effective migration, and indeed, this is what we observe in Figure 5.5 b).
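The arithmetic behind the two examples can be spelled out with exact fractions. The sketch below assumes $N_A = 1$, $m_{AA} = 1$ and $m_C = 1$ in arbitrary units (the text fixes only the ratios and the product $m_{AA}N_A = 1$), and the function name is hypothetical. It checks the balance condition (5.2) for the "across" edges and compares the number of migrants exchanged within each group.

```python
from fractions import Fraction

def migrants(rate, deme_size):
    """Number of migrants sent along an edge: migration rate times deme size."""
    return rate * deme_size

# Assumed units: N_A = 1, m_AA = 1, m_C = 1 (only ratios matter for the comparison).
N_A = Fraction(1)
N_B = 5 * N_A
m_AA = Fraction(1)
m_C = Fraction(1)

# "Across" edges chosen as m_AB = m_C / N_A and m_BA = m_C / N_B.
m_AB, m_BA = m_C / N_A, m_C / N_B
balanced = migrants(m_AB, N_A) == migrants(m_BA, N_B)        # condition (5.2)

# First example: m_BB = m_AA / 5, so both groups exchange the same number of migrants.
first = (migrants(m_AA, N_A), migrants(m_AA / 5, N_B))

# Second example: m_BB = m_AA, so the bigger demes exchange more migrants.
second = (migrants(m_AA, N_A), migrants(m_AA, N_B))

print(balanced, first, second)
```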


5.3 Spatial structure due to a split event

The final sequence of examples produces a barrier effect from a past demographic event that removes edges in the graph and thus splits the habitat into two regions that no longer exchange migrants.

To describe the simulations, consider the example graph with two groups of demes, $A$ (circles) and $B$ (squares), with the same deme size. [The demes in the middle, $C$, are part of the population but we collect no samples from that area, which remains "unobserved".] There are also two types of edges: the solid ones have constant migration rate $m$; the dashed ones have migration rate 0 for $x$ units of time (measured in $N_0$ generations) and migration rate $m$ from then on. Since Kingman's coalescent develops backwards in time, this setup simulates a recent barrier to migration from the present to point $x$ in the past. Beyond time $x$ the population graph is connected and has uniform migration rate $m$.

In Figure 5.6 we increase the time of the split event from $x = 0.3$ to $x = 9$ units of time. If the split is too recent on the relative scale of the other parameters, its effect is hard to detect and the effective migration surface is uniform. Otherwise, the split is detected as a barrier to effective migration. [The truth is a temporary barrier; the method infers a constant barrier.] In these simulations, an equilibrium phase of high migration followed by a recent interval of no migration produces genealogies that are dominated by a long branch between the common ancestor of $A$ lineages and that of $B$ lineages. Such topology is consistent with constant migration at low rates across the area separating $A$ and $B$.

5.4 The effect of SNP ascertainment

In this example we simulate the effect of ascertainment bias due to a very small discovery panel (Figure 5.4). In this case there is a true barrier to migration but the discovery panel comes from a very limited area on one side of the barrier. This skews the observed genealogies as we observe only sites that are polymorphic in both the ascertainment sample (in red) and the analysis sample (in black). This example shows that ascertainment, which is not an evolutionary process, can have an effect on the inferred effective migration, especially if the discovery panel is not representative of the population.

[Figure 5.4 panels: a) and b). Color scale: log10(m) from -1.8 to 1.8.]

Figure 5.4: Barrier to migration with ascertainment bias. a) True migration pattern and the discovery panel in red; b) Estimated effective migration and the sample in black.

[Figure 5.5 panels: a) and b). Color scale: log10(m) from -1.8 to 1.8.]

Figure 5.5: Population structure due to differences in deme size. In a) bigger demes exchange migrants at a lower rate and hence there is no variation in effective migration. In b) smaller demes exchange fewer migrants and hence there is an effective barrier to migration.

[Figure 5.6 panels: a), b) and c). Color scale: log10(m) from -1.8 to 1.8.]

Figure 5.6: A past demographic event results in a barrier to effective migration. Here an ancestral population splits into subpopulations $A$ (east) and $B$ (west) at point $x$ in the past. The further back in time this event occurs, the more differentiated $A$ and $B$ are. a) $x = 0.3$; b) $x = 3$; c) $x = 9$ units of time, which is measured in $N_0$ generations.

5.5 The effect of uneven sampling on PCA projection and effective migration

It is well known that PCA projections are heavily influenced by irregular sampling [McVean, 2009]. To examine the impact of sample composition on effective migration, we simulate genetic data under the same barrier pattern as in Figure 5.2 but with various sampling schemes. We compare our method of estimating effective migration and PCA analysis of the observed covariance matrix in Figure 5.7. Even if sampling is biased towards one area of the habitat, the presence and location of the barrier are correctly detected as long as there are observations on both sides. On the other hand, the overall pattern of the PCA projections changes considerably. [Our method can be sensitive to the placement and coarseness of the population grid.]

Figure 5.7: Barrier to migration with uneven sampling; colors indicate sampling location. The final example illustrates that naturally the method cannot detect variation from uniform migration in areas where no genetic data is observed.


5.6 Summary

The simulations in this chapter illustrate that effective migration can represent the combined effect of various demographic processes and events on genetic similarity and that our method is robust to uneven sampling but not ascertainment bias. However, effective migration does not help us to distinguish among possible histories as in this framework population structure is always explained with a stepping-stone model on a fixed grid of equally sized demes.

The examples also underline why it is difficult to interpret effective migration in terms of actual evolutionary history. As [McVean, 2009] emphasizes, very different demographic processes can produce very similar average genealogies. Our method, just like PCA, uses the information contained in pairwise comparisons averaged across sites and discards the sequential information contained in the ordering of sites along chromosomes, which can be helpful in selecting between possible histories.

6 Empirical Results

In this chapter we apply our method to four empirical datasets [three consist of SNPs and one of microsatellites] and we further demonstrate that effective migration rates can explain the spatial structure in genetic variation.

6.1 Red-backed fairywrens in Australia

First we present results for a sample of red-backed fairywrens (Malurus melanocephalus), a small passerine bird endemic to Australia [Figure 6.1]. The RBFW dataset was collected to study its population structure and demographic history across the Carpentarian barrier. Sampling and genotyping procedures as well as cluster-based analysis of population structure are described in [Lee and Edwards, 2008].

Figure 6.1: Habitat of the red-backed fairywren (Malurus melanocephalus), with the Carpentarian barrier (the black bar), in northern and eastern Australia. The map shows the ranges of two subspecies: M.m. cruentatus in yellow, M.m. melanocephalus in pink, and a broad hybrid zone in orange. The map also shows three major biogeographic regions: Top End (TE), Cape York (CY) and Eastern Forest (EF). The map is modified from [Lee and Edwards, 2008].

The Carpentarian barrier in northern Australia is a semi-arid region, roughly 150 km wide and extremely poor in vegetation [Lee and Edwards, 2008]. It has been argued that this region has had a primary effect on species distribution in northern and eastern Australia by acting as a barrier to migration, with secondary barriers along the coast. [Lee and Edwards, 2008] choose to study the demographic structure of the red-backed fairywren because its taxonomy, which is based mainly on plumage color, is not consistent with the Carpentarian hypothesis. The species has been traditionally categorized into two subspecies but their ranges do not lie on either side of the Carpentarian barrier, as we would expect if it has been the major barrier contributing to their divergence.

The dataset was made available to us by S. Edwards. After initial data processing, the RBFW dataset consists of $n = 27$ diploid individuals genotyped at $p = 1190$ bi-allelic, polymorphic SNPs. [As a reference to the original data, we remove 3 out of 30 individuals because most of their genotypes are missing and we also exclude monomorphic and tri-allelic SNPs.]


Throughout we will refer to three subpopulations of red-backed fairywrens as identified according to location in [Lee and Edwards, 2008]: Top End (TE) in northern Australia to the west of the Carpentarian barrier, Cape York (CY) in northeastern Australia to the east of the Carpentarian barrier and including the hybrid zone, and Eastern Forest (EF) in eastern Australia to the south of the hybrid zone [Figure 6.1].

[Figure 6.2 legend and group labels: Top End (TE), Cape York (CY), Eastern Forest (EF).]

Figure 6.2: PCA and STRUCTURE analysis of the red-backed fairywren (RBFW) data.

First we perform principal components analysis (PCA) and cluster-based analysis (STRUCTURE). In Figure 6.2 (left) we plot the leading two principal components of the genetic covariance matrix, which explain 55% and 10% of the variance, respectively. PCA detects population structure but the results are difficult to interpret in terms of the evolutionary history of the species: there is some differentiation between the three subpopulations [in particular, Top End (TE) is better separated from Cape York (CY) than Eastern Forest (EF)] but there are no clearly delineated clusters. Although the three biogeographic groups are about equally represented, the sample is very small and much of the observed variation is between individuals within groups.

In Figure 6.2 (right) we report STRUCTURE results with two and three clusters, and using the sampling locations as prior information [Pritchard et al., 2000, Hubisz et al., 2009]. As observed in [Lee and Edwards, 2008], if we use STRUCTURE to assign the samples into two distinct clusters, Cape York (CY) and Eastern Forest (EF) are grouped together, which possibly indicates that the Carpentarian barrier has played a role in shaping the genetic differentiation of the red-backed fairywren. [In our analysis the data has a slightly higher likelihood with three clusters rather than two as in [Lee and Edwards, 2008].] When we use STRUCTURE to assign the samples into three distinct clusters, CY and EF individuals are estimated to be admixtures (with different proportions) of two "ancestral" populations. This suggests migration across the hybrid zone.

While both STRUCTURE plots might be interpreted to provide support for the Carpentarian hypothesis, STRUCTURE does not model the geographic distribution of samples across the habitat and thus cannot account for isolation by distance. In a homogeneous habitat, where the population is spatially distributed with uniform migration, genetic differentiation tends to increase with geographic distance. The RBFW data exhibits the isolation by distance property, at least at short to medium distances. The relationship between geographic and genetic distances appears to plateau as the Euclidean distance increases [Figure 6.3].

Cluster-based methods, such as STRUCTURE [Pritchard et al., 2000] and GENELAND [Guillot et al., 2005], attempt to find sharp boundaries between clusters, to maximize similarity within versus between clusters, in terms of allele frequency distributions. [These methods can assign individuals to multiple clusters according to individual-specific fractional membership, but again the differences between clusters must be sharp in order to estimate these proportions with certainty.] Therefore, cluster-based methods are better suited to analyzing discrete population structure. However, genetic variation can exhibit continuous structure as genetic similarities tend to decay with distance and the decay can be gradual rather than sharp as in Figure 6.3. In this case STRUCTURE effectively separates those individuals that are farthest apart in space as Top End (TE) and Eastern Forest (EF) are assigned to different clusters.

[Figure 6.3 axes: Euclidean distance (0 to 25) against genetic distance (0.3 to 0.6). Legend: CY,CY; EF,EF; TE,TE; CY,EF; TE,CY; TE,EF.]

Figure 6.3: Observed genetic distance vs. Euclidean distance for the $\binom{n}{2} = 351$ pairs in the RBFW dataset. Each point is colored according to group membership to emphasize that on average Cape York (CY) is closer to Eastern Forest (EF) than to Top End (TE).

The spatial structure of genetic variation in the RBFW data is continuous and therefore it can be partially explained with isolation by distance. However, since the Carpentarian barrier may reduce gene flow between the TE and CY groups, we estimate the patterns of migration rather than assume genetic differentiation increases as a function of the Euclidean distance between sampling locations [or equivalently, migration is uniform].

Figure 6.4: Irregular triangular grid $(V, E)$ spanning the habitat of the red-backed fairywren. Samples are assigned to the closest deme. The method allows that sampling be both sparse and uneven. If the geographic information is coarse, it is appropriate to choose a coarse grid.

To apply our method for estimating effective migration rates, we first construct an irregular triangular grid $(V, E)$ to cover the known range of the red-backed fairywren [Figure 6.4]. After running the MCMC chain from multiple random starting points to monitor convergence and averaging the runs, we report the posterior mean of the effective migration rates $M = (m_{\alpha\beta})$ in Figure 6.5, on the log base 10 scale, with low migration in blue and high migration in brown. For this small dataset, it is computationally feasible to use the coalescent-based distance matrix (i.e., the expected coalescence times $T_{\alpha\beta}$) as well as its approximation in terms of effective resistances $R_{\alpha\beta}$. The two metrics produce very similar posterior estimates of the effective migration surface, shown in Figure 6.5 a) and b), respectively.

[Figure 6.5 panels: a) $\Delta = T$; b) $\Delta = R$; regions labeled TE, CY and EF. Color scale: log10 from -1.8 to 1.8.]

Figure 6.5: Estimated relative rates of effective migration for the RBFW dataset using two distance metrics on the graph: a) expected coalescence time $T = (T_{\alpha\beta})$; b) effective resistance $R = (R_{\alpha\beta})$.

6.1.1 What is the effect of the Carpentarian barrier on genetic differentiation?

Sincewe plot relativemigration rates, a completelywhitemigration surfacewould corre-spond to uniform migration; the colors indicate deviations from the expectation underuniform migration.

For the RBFW dataset, the most interesting feature is the area of lower effective migration that roughly covers the Cape York (CY) biogeographic region and the Carpentarian barrier. This result is consistent with the hypothesis that the Carpentarian barrier affects genetic differentiation. It is also consistent with the hypothesis that the CY group has a slightly lower effective population size (similar to the simulations in Section 5.2). Furthermore, CY is relatively less similar to TE than it is to EF, as CY and TE are separated by a longer effective distance [i.e., darker blue color]. Although this can also be inferred from the PCA and STRUCTURE analysis, the effective migration plot combines information about genetic dissimilarities and geographic distances and thus is an intuitive representation of spatial patterns in genetic variation.

Finally, we show draws from the posterior distribution of effective migration [Figure 6.6]. Although in most instances the region of the Carpentarian barrier falls inside a tile of lower effective migration [relative to the rest of the habitat], there is a lot of variability in the location, shape and rate of the "barrier". One possible explanation is that the Carpentarian barrier does not have a strong effect on the genetic structure of this species. However, the RBFW dataset is small and it is also possible that our method cannot detect a strong barrier effect with certainty.

Figure 6.6: Draws from the posterior distribution of effective migration in red-backed fairywrens.

6.2 Forest and savanna elephants in Africa

Here we present results for a dataset of African elephants sampled throughout the range of the species in Sub-Saharan Africa. The sample includes both forest elephants (Loxodonta africana cyclotis) and savanna elephants (Loxodonta africana africana). Both subspecies are under threat, partly from poaching, and the data were collected to help assign contraband tusks to their location of origin [Wasser et al., 2004, 2007].

There is observational and genetic evidence that forest and savanna elephants hybridize in the areas where their ranges meet [Wasser et al., 2004]. Therefore, we remove putative hybrids so that the dataset we analyze consists of 223 forest elephants and 896 savanna elephants genotyped at 16 microsatellites. These genetic markers were chosen in part because they can be isolated and amplified in samples of low quality and thus microsatellite DNA can be extracted from a small piece of tusk [Wasser et al., 2004].

Figure 6.7: Irregular triangular grid $(V, E)$ spanning the habitat of the African elephant. The map shows five regions as identified in [Wasser et al., 2004]. The west and central regions comprise the range of the forest elephant. The north, east and south regions comprise the range of the savanna elephant. Samples are assigned to the closest deme in the grid.

[Wasser et al., 2004] show that forest and savanna elephants can be accurately discriminated. This is also evident in the PC scatterplot of the sample covariance matrix [Figure 6.8], where the leading principal component separates forest (West, Central) and savanna (North, East, South) and explains 29% of the observed genetic variation. PCA also indicates that there is more genetic diversity in forest than in savanna elephants and suggests no further population structure within the two subspecies. However, the sample configuration is very uneven, with about four times as many savanna as forest elephants, so the PCA results might be biased [Section 4.5].
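The share of variation attributed to the leading principal component is the largest eigenvalue of the covariance matrix divided by its trace. The following is a minimal sketch of that calculation using power iteration; the function name `leading_variance_fraction` and the toy matrices are hypothetical, and a real analysis would normally rely on a numerical library rather than this hand-written loop.

```python
def leading_variance_fraction(cov, iterations=500):
    """Fraction of total variance explained by the leading principal component.

    cov is a symmetric positive semidefinite matrix (list of lists) with
    at least one nonzero entry. The leading eigenvalue is found by power
    iteration from an arbitrary nonzero starting vector.
    """
    n = len(cov)
    v = [1.0 / (k + 1) for k in range(n)]
    for _ in range(iterations):
        w = [sum(cov[r][c] * v[c] for c in range(n)) for r in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Rayleigh quotient of the converged unit vector.
    cv = [sum(cov[r][c] * v[c] for c in range(n)) for r in range(n)]
    eigenvalue = sum(a * b for a, b in zip(v, cv))
    trace = sum(cov[k][k] for k in range(n))
    return eigenvalue / trace


# Hypothetical 2 x 2 covariance matrix with eigenvalues 3 and 1.
print(leading_variance_fraction([[2.0, 1.0], [1.0, 2.0]]))
```

Because the covariance matrix is computed over samples, an uneven sample configuration changes the eigenvalues themselves, which is one way to see why the reported fraction can be sensitive to sampling.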

We applied our method to the data provided by [Wasser et al., 2004]. The results confirm that forest and savanna elephants are genetically differentiated enough to distinguish between the two subspecies. In Figure 6.9 we observe a prominent barrier in effective migration that curves through the habitat to separate the west and central regions (the range of forest elephants) from the north, east and south regions (the range of savanna elephants).

Our model estimates migration rates to explain the overall sample structure. However, each genotyped site has its own genealogy and thus observed genetic distances vary across sites. With microsatellites, mutation rates are higher, and since more mutations mean more information about relative branch lengths, we can also fit the model at each microsatellite separately [Figure 6.10]. There is great variability in effective migration across microsatellites. The pattern of effective migration at the sixth locus produces most of the overall pattern, except for the relationship between the west and central regions (essentially, the relationship among forest elephants). This suggests that elephants can be categorized with high accuracy as forest or savanna based on just this one microsatellite.

Figure 6.8: PCA analysis of the forest and savanna elephants (FS) data.

Figure 6.9: Effective migration rates for forest and savanna elephants (a) using all 16 microsatellites and (b) excluding the most variable locus. [Color scale: log10 relative migration rate, from -1.8 to 1.8.]

We also split the sample into only forest and only savanna elephants to explore subtler population structure in each subspecies. The genetic variation in forest elephants is consistent with isolation by distance, with a very small bridge of higher effective migration between the west and central regions. The genetic variation in savanna elephants deviates from isolation by distance considerably: the north region is separated from the rest by a barrier of low effective migration, while the south and east regions are more genetically similar than the large area they span would suggest.

6.2.1 STRUCTURE and GENELAND results

As a clustering method, GENELAND [Guillot et al., 2005] looks for distinct clusters and therefore sharp boundaries between them. Removing putative hybrids makes differences in allele frequencies between biogeographic regions easy to detect [Figure 6.12]. However, GENELAND does not explain the relationship between the regions: they are all distinct from each other.


Figure 6.10: Effective migration rates at each of sixteen microsatellites. The 6th locus is most variable, presumably due to the highest mutation rate.

Figure 6.11: Effective migration rates for (a) only forest elephants and (b) only savanna elephants, using the same triangular grid as in Figure 6.7 and all 16 microsatellites. [Color scale: log10 relative migration rate, from -1.8 to 1.8.]


On the other hand, STRUCTURE [Pritchard et al., 2000] with sampling location prior [Hubisz et al., 2009] provides intuition for the relationship between the five biogeographic regions [Figure 6.13]. It clearly detects the difference between forest elephants (west, central) and savanna elephants (north, east, south) as they fall into different clusters. Furthermore, STRUCTURE shows some evidence for isolation by distance, particularly in savanna elephants. Most of these individuals are represented as weighted mixtures of four clusters that do not correspond to distinct geographic areas.

Figure 6.12: GENELAND posterior probabilities for belonging in each of five clusters, which correspond directly to the five biogeographic regions.

Figure 6.13: STRUCTURE membership proportions in six clusters when using sampling locations (not regions) to provide prior information for cluster assignments.


Figure 6.14: Scatterplots of genetic dissimilarities versus distances on the population graph with a) uniform migration and b) estimated effective migration. [Rows: both savanna and forest elephants (without hybrids); only forest elephants; only savanna elephants.]


6.3 Human populations from Europe and Africa

The genetic structure of human populations has been extensively studied since [Menozzi et al., 1978] first used PCA to summarize human genetic variation across continents. Here we analyze two large-scale genome-wide datasets. The European dataset is part of the POPRES collection [Nelson et al., 2008] and consists of 1387 individuals genotyped at 197,146 autosomal SNPs. Most samples were collected in Western Europe, so we analyze a subset of 1208 individuals from 15 countries. The Sub-Saharan African dataset consists of 314 individuals from 21 ethnic groups genotyped at 27,922 autosomal SNPs.

[Novembre et al., 2008, Lao et al., 2008] use PCA to characterize the spatial structure of genetic variation within Europe and find a close correspondence between genetic and geographic distances, and hence, evidence for isolation by distance. In fact, [Novembre et al., 2008] show that the two leading principal components are strongly correlated with latitude and longitude, respectively. [Wang et al., 2012] use their Procrustes method to analyze the population structure within Africa [as well as within Europe, Asia and world-wide] and also observe similarity between genetic and geographic maps, after excluding hunter-gatherer populations.

The PCA plots reveal that the human population structure in Europe and Africa is continuous: while individuals from the same group tend to cluster together, the overall arrangement qualitatively resembles the configuration of sampling locations [Figure 6.15]. The correspondence between the PCA projections and the geographic map can be improved with a rotation transformation such as Procrustes [Wang et al., 2010], but this cannot improve the PCA analysis itself, for example by correcting for biased sampling.

Figure 6.15: Sample configuration and PCA analysis of the European and African datasets. In the bottom row, font size indicates relative sample size.


Figure 6.15 also illustrates the limitations of the sampling scheme. In both cases it is biased but, more importantly, geographic locations are implied, as individuals from the same population are automatically assigned to the same coordinates. [For the European dataset, population membership is determined based on grandparents' country of origin or self-reported country of birth [Novembre et al., 2008].] This is clearly not representative of human spatial distribution in either Europe or Africa and, furthermore, the geographic information might be too coarse to detect substructure within populations. [On the other hand, the stepping-stone model is discrete. Since observations are assigned to the nearest deme, the grid itself implies a limit on how much geographic resolution our method can represent.]
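The assignment of observations to the nearest deme can be sketched in a few lines. The example below is only an illustration: the function name `assign_to_demes` and the coordinates are hypothetical, and plain Euclidean distance is an assumption made for brevity.

```python
import math


def assign_to_demes(samples, demes):
    """Return, for each sample, the index of the closest deme.

    samples and demes are lists of (x, y) coordinates. Euclidean distance
    is an assumption of this sketch; on a continental scale a
    great-circle distance may be more appropriate.
    """
    assignment = []
    for sx, sy in samples:
        nearest = min(
            range(len(demes)),
            key=lambda k: math.hypot(sx - demes[k][0], sy - demes[k][1]),
        )
        assignment.append(nearest)
    return assignment


# Hypothetical coordinates: two demes and three samples.
demes = [(0.0, 0.0), (10.0, 0.0)]
samples = [(1.0, 1.0), (2.0, -1.0), (9.0, 0.5)]
print(assign_to_demes(samples, demes))
```

The first two samples have different coordinates but land in the same deme, so any difference between them is invisible at the resolution of this grid.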

As [Novembre et al., 2008, Wang et al., 2012] have shown, the spatial structure of human genetic variation in both Europe and Africa exhibits broad isolation by distance, as genetic differentiation tends to increase gradually with geographic distance [Figure 6.16; top row]. However, while geography explains some patterns of genetic differentiation, a homogeneous habitat (i.e., uniform migration) might not provide the best explanation for the observed data.

We applied our method to estimate the effective human migration in both Europe and Africa and plotted genetic differentiation against the inferred effective distances [Figure 6.16; bottom row]. The linear relationship between genetic dissimilarity and effective distance is stronger [$r^2$ increases from 33% to 85% for the European data, and from 24% to 91% for the African data]. However, since there are so many pairwise comparisons in the scatterplots, it is more instructive to analyze the inferred effective migration surface [Figure 6.17].
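For a straight-line fit of one variable on another, $r^2$ equals the squared Pearson correlation between the two. The following minimal sketch computes it, assuming the reported values come from such a simple linear fit; the function name `r_squared` and the numbers are hypothetical and are not data from the thesis.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


# Hypothetical pairwise distances and dissimilarities.
distance = [1.0, 2.0, 3.0]
dissimilarity = [1.0, 2.0, 2.0]
print(r_squared(distance, dissimilarity))
```

Applying the same function to the dissimilarities paired first with distances under uniform migration and then with the inferred effective distances gives the kind of before-and-after comparison quoted in the text.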

Figure 6.16: Genetic differentiation (linearized $F_{ST}$) versus resistance distance ($R_{\alpha\beta}$) with either uniform migration rates or estimated effective migration rates on the population graph $(V, E)$. The colors, which match those in Figure 6.17, are chosen to emphasize the difference between the populations in red and those in blue.


Figure 6.17: Inferred effective migration surfaces for a) the West European and b) the Sub-Saharan African datasets. [Color scale: log10 relative migration rate, from -1.8 to 1.8.]

From the inferred effective migration surface in Europe [Figure 6.17 a)] we can make several observations that are difficult to make from the PCA plot or the distance scatterplot. For example, the northern countries (Ireland, the UK, Scotland, Denmark, the Netherlands, Germany; in blue) are more genetically similar than we would expect based on the geographic distance alone. The same is true for the three southern countries (Portugal, Spain and Italy; in red). On the other hand, a barrier to effective migration separates the British Isles and the Iberian peninsula, and another barrier separates Italy and France [roughly where the Alps are]. However, we cannot conclude that the effect is due only to lower migration rates across bodies of water or mountain ranges. The observed patterns can also be influenced by differences in effective population size and other evolutionary processes. Finally, the inferred migration surface also suggests that there is more differentiation in the north/south direction than in the east/west direction, as the north and the south are separated by two areas of lower effective migration. This result is consistent with the hypothesis that a north/south cline is a distinguishing feature of population structure within Europe [Tian et al., 2008].

We can also make interesting observations from the inferred effective migration surface in Africa [Figure 6.17 b)]. There is higher effective migration along the Atlantic coast than in the interior of the continent, and therefore, inland populations (in blue) are more genetically dissimilar than coastal populations (in red). Consequently, there is more differentiation in the east/west direction than in the north/south direction. This pattern can be observed in the PCA plot, as noted by [Wang et al., 2012], where populations along the coast cluster closer together, inland populations form more isolated clusters and the E/W-associated principal component explains twice as much variation as the N/S-associated one. On the other hand, the four Bantu speaking groups at the southern tip cluster together in the PCA plot but not in the effective migration surface. However, this might be the result of lower geographic resolution in that region: Pedi and Sotho/Tswana are assigned to the same deme, and similarly, Nguni and Xhosa are assigned to another deme. The first pair has lower genetic differentiation than the second [$F_{ST}(Pe, ST) = 0.0012$, $F_{ST}(Ng, Xh) = 0.0019$]; we use the program GENEPOP [Rousset, 2008] to compute pairwise $F_{ST}$ values.

These patterns are present, to some extent, in the PCA plot and the distance scatterplot. But they are easy to observe only if we categorize the locations and color the points appropriately. In contrast, the pattern is clear after the analysis of effective migration.

6.4 Arabidopsis thaliana in Europe and North America

Arabidopsis thaliana is a small flowering plant and a commonly studied model organism in population genetics. It has a broad natural range (Europe, Asia and North Africa) and now grows in North America as well. Although A. thaliana is a selfing plant with low gene flow, its genetic variation has significant spatial structure [Nordborg et al., 2005, Platt et al., 2010]. On the continental scale, in Europe there is broad isolation by distance with an east-west gradient that has been interpreted as evidence for post-glaciation colonization [Nordborg et al., 2005]. On the other hand, in North America there is genome-wide linkage disequilibrium and haplotype sharing that have been interpreted as evidence for recent human introduction from Europe [Nordborg et al., 2005].

A large geographically referenced dataset is available from the Regional Mapping (RegMap) project [Horton et al., 2012]. We split the full dataset (1193 accessions, ~220,000 SNPs) into two subsets, North America (180 plants) and Europe (823 plants), which we analyze both separately and together. [We exclude plants from Asia because the continent is very sparsely sampled.]

Figure 6.18: Sample configuration and PCA analysis of Arabidopsis thaliana data from the RegMap project; a) North America, b) Europe.

First we perform principal components analysis [Figure 6.18]. As we would expect if A. thaliana has a different history in Europe and North America, there are differences in genetic variation on the two continents [Nordborg et al., 2005]. There is little population structure in the North American subset: the samples are separated to some extent in a north/south direction, with no obvious separation between samples from around Lake Michigan and those from the Atlantic coast despite the geographic distance. On the other hand, the population structure in the European subset is continuous, with some correspondence between genetic variation and geographic distribution, as we would expect under isolation by distance [Platt et al., 2010].

Next we apply our method to estimate the effective migration for A. thaliana in North America and Europe [Figure 6.19]. In North America, the two sampled regions, Lake Michigan and the Atlantic coast, are connected by a strip of high effective migration. This indicates that the regions are similar genetically even though they are far apart in space. [This is the opposite of what we expect under isolation by distance.] There is an area of higher effective migration at the south tip of Lake Michigan where most of the North American samples are collected. Therefore, our results are consistent with the observation that there is extensive haplotype sharing (which indicates identity by descent) not only within but also between sampling locations [Nordborg et al., 2005].

Figure 6.19: Inferred effective migration surfaces for Arabidopsis thaliana from two RegMap subsets; a) North America; b) Europe; c) North America and Europe combined.

In continental Europe, the overall pattern is broad isolation by distance, with small variability in effective migration rates. On the other hand, the north of the British Isles is separated from the rest of Britain, which in turn is connected to northern France by an area of high migration. Our results are consistent with previous studies of the population structure of A. thaliana. [Platt et al., 2010] find that in Eurasia there is a strong trend of isolation by distance (at three distance scales) while in North America there is no relationship between geographic distance and allelic similarity (except at fine distance scale). And [Horton et al., 2012] observe that in the PCA plot most accessions from the British Isles are projected closest to France but some plants from Britain cluster with lines from Sweden. Our method summarizes and visualizes these patterns.


Figure 6.20: Genetic differentiation (linearized $F_{ST}$) versus resistance distance ($R_{\alpha\beta}$) with either uniform migration rates or estimated effective migration rates on the population graph $(V, E)$. The samples are assigned to a regular triangular grid [because the designation of nations as populations is not relevant] and the colors are chosen to emphasize interesting patterns [that correspond to regions with strong deviation from isolation by distance].


In the combined analysis, the overall pattern is a prominent corridor of high effective migration across the Atlantic ocean. While effective migration rates are symmetric, this supports the hypothesis that recent directed migration introduced A. thaliana from Europe to the New World. On this larger scale, effective migration rates within North America and Europe are low [as we would expect in a plant species with low gene flow].

[Platt et al., 2010] consider two models of the continuous population structure of A. thaliana, a model with constant uniform migration and another with uniform migration and a single shift in dispersal rate, and conclude that neither version is a good fit for the observed patterns of genetic dissimilarities. That is, even though the overall pattern is consistent with isolation by distance, there might be deviations from uniform migration (or too much noise). We can observe such complex details in the effective migration surface, e.g., the British Isles in Figure 6.19. When we combine the two samples, the strongest signal in the data is the genetic similarity between the two continents even though they are separated by the expanse of the Atlantic ocean. This illustrates how we can observe finer details at smaller scales because effective migration rates are parametrized relative to the overall mean log rate.
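The effect of expressing rates relative to the overall mean log rate can be illustrated with a short sketch. It is only an illustration of the centering idea, not the parametrization used in the model itself; the function name `relative_log10_rates` and the rates are hypothetical.

```python
import math


def relative_log10_rates(rates):
    """Express each rate as log10(rate) minus the mean log10 rate."""
    logs = [math.log10(m) for m in rates]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Hypothetical rates: a tenfold contrast at two different overall scales.
print(relative_log10_rates([1.0, 10.0]))
print(relative_log10_rates([0.01, 0.1]))
```

Both calls describe the same relative contrast, so a surface plotted on this scale shows deviations from the mean of whatever region is analyzed. When one strong signal dominates a combined analysis, it sets the mean and the scale, and weaker regional contrasts become harder to see.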

6.5 Conclusion

Genetic variation in natural populations often exhibits spatial structure as genetic similarity tends to decay with geographic distance. However, this relationship is often not homogeneous and the distribution of similarities across the habitat contains information about the evolutionary and ecological history of the population.

Visualization is an important tool for detecting and understanding patterns of population structure. We have developed a model-based method for estimating and visualizing effective migration to explain observed deviations from homogeneous dispersal and isolation by distance. It represents the population as a triangular grid of discrete components and its effective migration as a colored partition of a two-dimensional map.

Our method is particularly useful for characterizing continuous population structure (even though the underlying stepping-stone model is discrete) because it models a spatially distributed population instead of a collection of distinct and isolated subpopulations. Neither is it necessary to categorize samples into regional groups to compute measures of differentiation such as $F_{ST}$. [Nevertheless, assigning colors and names to groups of genetically similar individuals, in areas of high effective migration, can be helpful for subsequent analysis.] If spatial structure is continuous, clustering samples into biogeographic regions might not be a well-defined problem. In this scenario cluster-based methods such as STRUCTURE and GENELAND may be inappropriate.

Our method also offers some advantages over PCA analysis. PCA produces two-dimensional visual summaries of observed genetic variation and can capture both continuous and discrete population structure at the sample level. In contrast, our method produces a visual representation of geographic and genetic information at the population level. Consequently, it is easier to make qualitative comparisons between populations or between geographic regions, in terms of both geographic and genetic distances. And our method can detect deviations from uniform migration (and hence, isolation by distance) that are not clearly evident in PC projections because PCA is strongly affected by sampling bias and does not estimate relevant demographic parameters.


7

Appendices

7.1 Properties of the stepping-stone model of population subdivision

The goal of this section is to derive the system of linear equations (2.15) for the expected pairwise coalescence times $T = (T_{\alpha\beta})$ as a function of the migration rates $M = (m_{\alpha\beta})$ and the coalescence rates $q = (q_\alpha)$ in the stepping-stone model.

7.1.1 Probabilities of identity by descent

In population genetics, the probability of identity is a measure of relatedness due to shared ancestry. The concept of identity can be defined as either the event that the lineages have the same ancestor in a reference population at a specified time in the past or the event that no mutations have occurred since the lineages diverged from their most recent common ancestor. We use the second definition of identity known as identity by state [versus identity by descent].

Let $\phi_{\alpha\beta}(\theta)$ be the probability of identity by descent for two distinct lineages drawn at random from demes $\alpha$ and $\beta$. The parameter $\theta = 2N_0 u$ is the mutation rate per $2N_0$ generations for a single lineage, or equivalently, the total mutation rate per $N_0$ generations for a pair of lineages.

To derive expressions for $\phi_{\alpha\beta}(\theta)$ for every pair $(\alpha, \beta)$, consider the history of a sample of size 2 backwards in time. Let $x^{(t)} = \{x^{(t)}_\alpha\}$ be the state of the ancestral process $t$ generations ago when the sample has $x^{(t)}_\alpha$ ancestors in deme $\alpha$. It is convenient to consider time in units of $N_0$ generations. On this timescale and under certain assumptions about reproduction and migration, the discrete-time ancestral process $\{x^{(t)} : t = 0, 1, \ldots\}$ converges to a continuous-time ancestral process $\{x(t) : t \geq 0\}$, called the structured coalescent [Notohara, 1990, 1993]. Mutations are generated by a Poisson process with intensity $\theta$ such that in $t$ units of time the pair of lineages accumulates $K \sim \mathrm{Po}(\theta t)$ mutations.

To derive the probability of identity for the pair $(\alpha, \beta)$, consider the first event that results in a change of state. The initial state is $x^{(0)} = \{x^{(0)}_\alpha = 1, x^{(0)}_\beta = 1, x^{(0)}_\gamma = 0 : \gamma \neq \alpha, \gamma \neq \beta\}$. If the two lineages are drawn from the same deme, i.e., $\alpha = \beta$, the first event can be a coalescence with rate $q_\alpha$, a mutation with rate $\theta$, or a migration to deme $\gamma$ with rate $2m_{\alpha\gamma}$. [Since the process starts with two lineages in $\alpha$ and the migration rate from $\alpha$ to $\gamma$ is $m_{\alpha\gamma}$ for a single lineage, the total rate of movement is $2m_{\alpha\gamma}$. Similarly, the combined mutation rate is $\theta$.] If a mutation occurs, the lineages are no longer identical by descent. Therefore, under equilibrium,

$$\phi_{\alpha\alpha}(\theta) = \frac{q_\alpha}{\theta + q_\alpha + 2m_\alpha} + \sum_{\gamma : \gamma \neq \alpha} \frac{2m_{\alpha\gamma}}{\theta + q_\alpha + 2m_\alpha}\, \phi_{\alpha\gamma}(\theta), \tag{7.1}$$

where $m_\alpha = \sum_{\gamma : \gamma \neq \alpha} m_{\alpha\gamma}$ is the total rate of migration out of $\alpha$. More precisely, since the coalescent proceeds backwards in time, $m_{\alpha\gamma}$ is the rate at which offspring in $\alpha$ have parents from $\gamma$ and $m_\alpha$ is the total rate of "outside-deme" parentage.

When the two lineages are drawn from two different demes, i.e., $\alpha \neq \beta$, they cannot coalesce in a single step. In this case, the first event can be a mutation with rate $\theta$, a migration from deme $\alpha$ to deme $\gamma$ with rate $m_{\alpha\gamma}$, or a migration from deme $\beta$ to deme $\gamma$ with rate $m_{\beta\gamma}$. Under equilibrium,

$$\phi_{\alpha\beta}(\theta) = \sum_{\gamma : \gamma \neq \alpha} \frac{m_{\alpha\gamma}}{\theta + m_\alpha + m_\beta}\, \phi_{\gamma\beta}(\theta) + \sum_{\gamma : \gamma \neq \beta} \frac{m_{\beta\gamma}}{\theta + m_\alpha + m_\beta}\, \phi_{\alpha\gamma}(\theta). \tag{7.2}$$

Equations (7.1) and (7.2) represent a system of linear equations for the probabilities of identity by descent in terms of the mutation rate $\theta$, the coalescence rates $q_\alpha$ and the migration rates $m_{\alpha\beta}$. In matrix notation,

$$\operatorname{diag}\{q\}\,[\operatorname{diag}\{\Phi\} - I] = [M - (\theta/2) I]\,\Phi + \Phi\,[M - (\theta/2) I]'. \tag{7.3}$$

Here $M = (m_{\alpha\beta})$ is the infinitesimal generator of the migration process with diagonal entries $-m_\alpha = -\sum_{\gamma : \gamma \neq \alpha} m_{\alpha\gamma}$, $\Phi \equiv \Phi(\theta) = (\phi_{\alpha\beta}(\theta))$ is the matrix of probabilities of identity at fixed mutation rate $\theta$, and $q = (q_\alpha)$ is the vector of coalescence rates.

7.1.2 Expected pairwise coalescence times

A linear system for the expected pairwise coalescence times can be derived correspondingly. Since by definition $\phi$ is the probability that no mutation occurs in either lineage before coalescence at time $t$,

$$\phi(\theta) = P\{K = 0\} = E\{e^{-\theta t}\}. \tag{7.4}$$

That is, the probability of identity $\phi$ is the Laplace transform of the coalescence time $t$ [Hudson, 1990]. Therefore,

$$E\{t\} = -\phi'(0) \quad \text{where} \quad \phi' = \frac{\partial}{\partial \theta}\phi. \tag{7.5}$$

To obtain a system for the expected coalescence times, differentiate equations (7.1) and (7.2) with respect to the mutation rate $\theta$ and evaluate at $\theta = 0$. The result is

$$1 = (q_\alpha + 2m_\alpha)\, T_{\alpha\alpha} - \sum_{\gamma : \gamma \neq \alpha} 2m_{\alpha\gamma} T_{\alpha\gamma} \quad \text{and} \tag{7.6a}$$

$$1 = (m_\alpha + m_\beta)\, T_{\alpha\beta} - \sum_{\gamma : \gamma \neq \alpha} m_{\alpha\gamma} T_{\gamma\beta} - \sum_{\gamma : \gamma \neq \beta} m_{\beta\gamma} T_{\alpha\gamma}. \tag{7.6b}$$

Equivalently, in matrix notation,

$$\operatorname{diag}\{q\} \operatorname{diag}\{T\} - MT - TM' = \mathbf{1}\mathbf{1}'. \tag{7.7}$$

[Here $\mathbf{1}\mathbf{1}'$ is a $d \times d$ matrix of 1s.] This method for deriving equations (7.6a) and (7.6b) is developed in [Bahlo and Griffiths, 2001]. Alternatively, [Hey, 1991] constructs a Markov chain with $d(d + 1)/2$ non-absorbing states, one for each unique pair $(\alpha, \beta)$. The set of states includes $d$ homoallelic states, when the two lineages are in the same deme, and $d(d - 1)/2$ heteroallelic states, when the two lineages are in different demes. There is also an absorbing state, which corresponds to coalescence. Transition probabilities between all these states reflect the migration rates $m_{\alpha\beta}$ and coalescence rates $q_\alpha$.
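The differentiation step can be checked in the simplest possible case, a single isolated deme. With no migration, equation (7.1) reduces to $\phi(\theta) = q/(\theta + q)$ and equation (7.6a) reduces to $1 = qT$, so $-\phi'(0)$ should equal $1/q$. The sketch below verifies this numerically; the function names and the value of `q` are hypothetical.

```python
def phi(theta, q):
    """Probability of identity for two lineages in one isolated deme.

    With no migration, equation (7.1) reduces to q / (theta + q).
    """
    return q / (theta + q)


def expected_coalescence_time(q, h=1e-6):
    """Approximate -phi'(0) with a central finite difference of step h."""
    return -(phi(h, q) - phi(-h, q)) / (2.0 * h)


q = 0.5  # hypothetical coalescence rate
print(expected_coalescence_time(q))  # close to 1 / q
```

The same identity, applied term by term to (7.1) and (7.2) with $\phi = 1$ at $\theta = 0$, is what produces the constant 1 on the left-hand side of (7.6a) and (7.6b).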

Furthermore, since the population evolves under equilibrium, migration is conservative and $M'q^{-1} = 0$ by definition. If we multiply equation (7.7) by $(q^{-1})'$ on the left and by $q^{-1}$ on the right, we obtain

$$\mathbf{1}'\operatorname{diag}\{T\}\,q^{-1} = (\mathbf{1}'q^{-1})^2 \iff \sum_\alpha T_{\alpha\alpha}(N_\alpha/N_0) = \Big(\sum_\alpha N_\alpha/N_0\Big)^2,$$

$$T_0 \equiv \sum_\alpha T_{\alpha\alpha}(N_\alpha/N_T) = N_T/N_0 = d, \tag{7.8}$$

where $T_0$ is the [weighted] average within-deme coalescence time, $N_T = \sum_\alpha N_\alpha$ is the total population size and $N_0 = N_T/d$ is the coalescent timescale. Therefore, under conservative migration, the average within-deme coalescence time does not depend on the exact pattern and rates of migration [Strobeck, 1987]. If migration is isotropic (a much stronger assumption), the within-deme coalescence times $T_{\alpha\alpha}$ for all demes $\alpha$ do not depend on the migration process.
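As a quick sanity check of equations (7.6a), (7.6b) and (7.8), the following sketch solves the system for the simplest case of $d = 2$ demes of equal size with symmetric migration. The function name, the choice $q = 1$ (time measured in units of $N_0$) and the migration rate $m = 1/2$ are illustrative assumptions, not values from the text.

```python
from fractions import Fraction

def two_deme_coalescence_times(q, m):
    """Solve (7.6a)-(7.6b) for d = 2 demes with a common coalescence rate q
    and a symmetric migration rate m (hypothetical helper for illustration).

    By symmetry T_11 = T_22 = T_w and T_12 = T_21 = T_b, so the system is
        1 = (q + 2m) T_w - 2m T_b      (7.6a)
        1 = 2m T_b - 2m T_w            (7.6b)
    """
    q, m = Fraction(q), Fraction(m)
    # Adding the two equations eliminates T_b: 2 = q * T_w.
    t_within = 2 / q
    # Substituting back into (7.6b) gives the between-deme time.
    t_between = t_within + 1 / (2 * m)
    return t_within, t_between

t_w, t_b = two_deme_coalescence_times(q=1, m=Fraction(1, 2))

# Residuals of the two original equations; both should be zero.
res_a = (1 + 2 * Fraction(1, 2)) * t_w - 2 * Fraction(1, 2) * t_b - 1
res_b = 2 * Fraction(1, 2) * t_b - 2 * Fraction(1, 2) * t_w - 1
```

In this toy case the within-deme time equals $d = 2$ whatever the value of $m$, in line with (7.8), while the between-deme time grows as migration becomes rarer.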

7.2 Distance matrices

Here we discuss distance matrices, also called dissimilarity matrices, and review some relevant properties. Two examples of a distance matrix are the matrix of expected pairwise coalescence times, $T$, and the matrix of effective resistance distances, $R$.

First we state two equivalent definitions of a distance matrix.

Definition 7.1 The matrix $D = (d^2_{ij})$ is a distance matrix if there exist squared lengths $\ell = (\ell^2_i) \in \mathbb{R}^n_+$ such that

$$\ell\mathbf{1}' + \mathbf{1}\ell' - D \succcurlyeq 0. \tag{7.9}$$

Definition 7.2 The matrix $D = (d^2_{ij})$ is a distance matrix if there exist pairwise similarities $S = (S_{ij}) \in \mathbb{S}^n_+$ such that

$$d^2_{ij} = S_{ii} + S_{jj} - 2S_{ij}. \tag{7.10}$$

[Here $\mathbb{S}^n$ is the set of symmetric $n \times n$ matrices; $\mathbb{S}^n_+$ is the set of symmetric $n \times n$ matrices with nonnegative elements.]

Let $X = (x_1, \ldots, x_n)' \in \mathbb{R}^{n \times p}$ represent $n$ points in $p$-dimensional inner product space. For example, in the setting of analyzing population structure, $x_i$ is a genotype vector of $p$ polymorphic sites. Then the squared distance between points $i$ and $j$ is given by

$$d^2_{ij} = \langle x_i - x_j, x_i - x_j\rangle = \langle x_i, x_i\rangle + \langle x_j, x_j\rangle - 2\langle x_i, x_j\rangle \equiv \ell^2_i + \ell^2_j - 2S_{ij}, \tag{7.11}$$

where $S_{ij} = \langle x_i, x_j\rangle$ is the inner product between two vectors in $\mathbb{R}^p$, and $S = XX'$ is positive semidefinite as a Gram matrix. In matrix notation,

$$D = \operatorname{diag}\{S\}\mathbf{1}' + \mathbf{1}\operatorname{diag}\{S\}' - 2S. \tag{7.12}$$

Clearly the similarity matrix $S$ contains more information about $X$ than the dissimilarity matrix $D$: $\operatorname{diag}\{S\} = \ell$ while $\operatorname{diag}\{D\} = 0$. [$D$ is nonnegative with 0s on the main diagonal because the dissimilarity of a point with itself is trivially 0.] That is, $S$ captures the absolute position of each point in the space (the length $\ell_i$ is the distance to the center $O$) while $D$ reflects only the relative difference for each pair of points.

Theorem 7.1 The matrix $D \in \mathbb{D}^n$ is a distance matrix if and only if it is conditionally negative definite.

Sketch of proof.

• Suppose that $D$ is a distance matrix, so that $2S = \ell\mathbf{1}' + \mathbf{1}\ell' - D \succcurlyeq 0$. For every vector $\alpha \in \mathbb{R}^n$ such that $\mathbf{1}'\alpha = 0$ (that is, $\alpha$ is a contrast)

$$0 \le 2\alpha'S\alpha = \alpha'(\ell\mathbf{1}' + \mathbf{1}\ell' - D)\alpha \tag{7.13a}$$

$$= \alpha'\ell(\mathbf{1}'\alpha) + (\alpha'\mathbf{1})\ell'\alpha - \alpha'D\alpha = -\alpha'D\alpha. \tag{7.13b}$$

Therefore, $D$ is conditionally negative definite.


• Suppose that $D$ is conditionally negative definite. Choose a vector $w \in \mathbb{R}^n$ such that $\mathbf{1}'w = 1$ and define

$$P = I - \mathbf{1}w', \tag{7.14a}$$

$$S = -\tfrac{1}{2}PDP' = -\tfrac{1}{2}(I - \mathbf{1}w')D(I - w\mathbf{1}'). \tag{7.14b}$$

Then $P$ is a centering matrix such that

$$PP = I - \mathbf{1}w' - \mathbf{1}w' + \mathbf{1}(w'\mathbf{1})w' = I - \mathbf{1}w' = P, \tag{7.15a}$$

$$w'Px = w'x - (w'\mathbf{1})w'x = w'x - w'x = 0 \tag{7.15b}$$

for every $x \in \mathbb{R}^n$. That is, $P$ is an orthogonal projection onto the hyperplane $\{w\}^\perp$. Furthermore, $(I - w\mathbf{1}')x$ is a contrast [that is, $\mathbf{1}'(I - w\mathbf{1}')x = 0$] and

$$x'(I - \mathbf{1}w')D(I - w\mathbf{1}')x = -2x'Sx \le 0 \tag{7.16}$$

since $D$ is conditionally negative definite. Therefore, $S$ is a positive semidefinite matrix and it has a decomposition $S = YY'$. It is straightforward to show that

$$D = \operatorname{diag}\{-\tfrac{1}{2}PDP'\}\mathbf{1}' + \mathbf{1}\operatorname{diag}\{-\tfrac{1}{2}PDP'\}' + PDP'. \tag{7.17}$$

[Using equation (7.12), the $(i,j)$-th element of the right-hand side is

$$-\tfrac{1}{2}e_i'(PDP')e_i - \tfrac{1}{2}e_j'(PDP')e_j + e_i'(PDP')e_j$$

$$= -\tfrac{1}{2}[w'Dw - w'De_i - e_i'Dw] - \tfrac{1}{2}[w'Dw - w'De_j - e_j'Dw] + w'Dw - w'De_j - e_i'Dw + D_{ij} = D_{ij},$$

where $D_{ii} = 0$ and $e_i$ is the $i$-th standard basis vector.]

That is, the vectors $Y = (y_1, \ldots, y_n)'$ generate the distance matrix $D$. However, note that the similarity matrix $S$ depends on the choice of $w$. It is not surprising that $D$ does not determine $S$ (nor $Y$) uniquely since it contains information only about relative differences.

[The vector $w$ determines the position of the origin $O$. The condition $\mathbf{1}'w = 1$ implies that $w$ is a vector of weights.] For example, $w = \mathbf{1}/n$ corresponds to placing the origin at the centroid (the center of mass) $\mathbf{1}'Y/n = \bar{y}$, and $w = e_i$ to placing it at the $i$-th point $e_i'Y = y_i$. Different decompositions $S = YY'$ give different orientations about the origin $O$.

∎
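The construction in the proof can be tried on a small example. The sketch below builds $S = -\tfrac{1}{2}PDP'$ from a distance matrix for two choices of $w$ and recovers $D$ through equation (7.10). The helper names and the three points on a line are assumptions made for this illustration.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def similarity_from_distance(dist, w):
    """S = -1/2 P D P' with P = I - 1 w' (equations 7.14a-7.14b)."""
    n = len(dist)
    p = [[(1.0 if i == j else 0.0) - w[j] for j in range(n)] for i in range(n)]
    pdp = matmul(matmul(p, dist), transpose(p))
    return [[-0.5 * v for v in row] for row in pdp]

def distance_from_similarity(s):
    """d2_ij = S_ii + S_jj - 2 S_ij (equation 7.10)."""
    n = len(s)
    return [[s[i][i] + s[j][j] - 2 * s[i][j] for j in range(n)] for i in range(n)]

# Squared distances among three points on a line at positions 0, 1 and 3.
points = [0.0, 1.0, 3.0]
D = [[(a - b) ** 2 for b in points] for a in points]

S_centroid = similarity_from_distance(D, [1 / 3, 1 / 3, 1 / 3])  # origin at the centroid
S_first = similarity_from_distance(D, [1.0, 0.0, 0.0])           # origin at the first point
```

The two similarity matrices differ (with the origin at the first point, $S_{ij}$ is simply the product of the positions), yet both reproduce the same $D$, which illustrates why $D$ does not determine $S$ uniquely.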

Now we consider the special case when the lengths $\ell_i$ are all equal to $r$ for some $r > 0$ and thus the points $x_i$ are the same distance from the center $O$. Geometrically, the points lie on the circumference of a sphere with radius $r$ in $\mathbb{R}^p$ [Gower, 1985] and so $D$ is called a spherical distance metric. [A covariance matrix is a circumhypersphere with radius $\sigma^2$ and a correlation matrix, with radius 1.] This puts a constraint on the choice of $w$. In general,

$$\operatorname{diag}\{S\} = \tfrac{1}{2}\operatorname{diag}\{\mathbf{1}w'D + Dw\mathbf{1}' - \mathbf{1}w'Dw\mathbf{1}' - D\} \tag{7.18a}$$

$$= Dw - \tfrac{1}{2}(w'Dw)\mathbf{1}. \tag{7.18b}$$

[Here $\operatorname{diag}\{D\} = 0$ and $\operatorname{diag}\{w\mathbf{1}'\} = w$.]

If $\ell = r^2\mathbf{1}$, then $\operatorname{diag}\{S\} = \ell = r^2\mathbf{1}$. Therefore,

$$Dw - \tfrac{1}{2}(w'Dw)\mathbf{1} = r^2\mathbf{1}. \tag{7.19}$$

If $D$ is nonsingular, $D^{-1}$ exists and we can left-multiply by $D^{-1}$. Then

$$w = (\tfrac{1}{2}w'Dw + r^2)D^{-1}\mathbf{1}. \tag{7.20}$$

Recall that $w$ satisfies $w'\mathbf{1} = 1$, so that

$$\mathbf{1}'w = (\tfrac{1}{2}w'Dw + r^2)\,\mathbf{1}'D^{-1}\mathbf{1} = 1. \tag{7.21}$$

This implies $\mathbf{1}'D^{-1}\mathbf{1} \neq 0$ and

$$w = \frac{D^{-1}\mathbf{1}}{\mathbf{1}'D^{-1}\mathbf{1}}, \qquad r^2 = \frac{1/2}{\mathbf{1}'D^{-1}\mathbf{1}}. \tag{7.22}$$

[In this case $P = I - \mathbf{1}\mathbf{1}'D^{-1}/\mathbf{1}'D^{-1}\mathbf{1}$ is the orthogonal projection onto the hyperplane $\{D^{-1}\mathbf{1}\}^\perp$.]

Thus we have proved the following

Corollary 7.1 Suppose that $D \in \mathbb{D}^n$ is a distance matrix such that $\det\{D\} \neq 0$. Then $\mathbf{1}'D^{-1}\mathbf{1} > 0$.
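Equation (7.22) can be checked against elementary geometry. The following sketch computes $w$ and $r^2$ from a matrix of squared distances using exact fractions. The helper names and the two triangles are assumptions chosen for this illustration.

```python
from fractions import Fraction

def solve(a, b):
    """Solve a x = b for a vector b by exact Gauss-Jordan elimination."""
    n = len(a)
    aug = [[Fraction(v) for v in row] + [Fraction(b[i])] for i, row in enumerate(a)]
    for col in range(n):
        # Pick any row with a nonzero entry in this column as the pivot.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

def circumsphere(dist):
    """Return (w, r2) from equation (7.22) for a nonsingular distance matrix."""
    n = len(dist)
    d_inv_one = solve(dist, [1] * n)   # D^{-1} 1
    total = sum(d_inv_one)             # 1' D^{-1} 1
    w = [v / total for v in d_inv_one]
    r2 = 1 / (2 * total)
    return w, r2

# Equilateral triangle with unit sides: all squared distances equal 1.
w_eq, r2_eq = circumsphere([[0, 1, 1], [1, 0, 1], [1, 1, 0]])

# Right triangle with vertices (0,0), (2,0), (0,2): squared distances 4, 4, 8.
w_rt, r2_rt = circumsphere([[0, 4, 4], [4, 0, 8], [4, 8, 0]])
```

For the equilateral triangle the weights are uniform and $r^2 = 1/3$, the squared circumradius. For the right triangle the weights place the center at the midpoint of the hypotenuse, with $r^2 = 2$.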

7.3 Conditional definite matrices

Here we derive a sufficient condition for positive definiteness of the covariance matrix $\Sigma = \mathbf{1}\mathbf{1}' - \lambda\Delta$ where $\Delta \in \mathbb{D}^n$ is a distance matrix [or more generally, a conditionally negative definite matrix] and $\lambda$ is a positive constant. The derivation is based on [Bapat and Raghavan, 1997] and uses the Spectral Theorem, which states that $\Sigma \succ 0$ if and only if all its eigenvalues are positive.

Definition 7.3 A matrix $\Delta \in \mathbb{S}^n$ is conditionally negative definite if

$$\alpha'\Delta\alpha \le 0 \tag{7.23}$$

for all $\alpha \in \mathbb{R}^n$ such that $\sum_i \alpha_i = \alpha'\mathbf{1} = 0$. Thus, conditional negative definiteness is equivalent to negative definiteness on the subspace $\{\mathbf{1}\}^\perp$.

Theorem 7.2 A conditionally negative definite (c.n.d.) matrix $\Delta \in \mathbb{S}^n$ has at most one positive eigenvalue.

Sketch of proof. We consider the c.n.d. case. Suppose to the contrary that $\Delta$ has two positive eigenvalues $u_1 > 0$ and $u_2 > 0$ with corresponding eigenvectors $x$ and $y$. Without loss of generality, we can assume that the eigenvectors are normalized so that $\sum_i x_i = \sum_i y_i \iff \sum_i (x_i - y_i) = 0$. That is, $(x - y)'\mathbf{1} = 0$ and $x - y$ is a contrast. Furthermore,

$$(x - y)'\Delta(x - y) = x'\Delta x + y'\Delta y - y'\Delta x - x'\Delta y \tag{7.24a}$$

$$= u_1 x'x + u_2 y'y - u_1 y'x - u_2 x'y \tag{7.24b}$$

$$= u_1 x'x + u_2 y'y > 0 \tag{7.24c}$$

since $x \perp y \iff x'y = 0$ and $u_1, u_2 > 0$. This contradicts the definition of conditionally negative definite matrices. ∎

A distance matrix $\Delta$ is nonnegative, with main diagonal of 0s, and is also conditionally negative definite by Theorem 7.1.

Corollary 7.2 Suppose that $\Delta$ is a nonnegative, nonzero symmetric matrix. Then $\Delta$ has at least one positive eigenvalue.

Sketch of proof. Since $\Delta$ is symmetric, by the Spectral Theorem it has real eigenvalues $u = \{u_i\}$. Furthermore, $\operatorname{tr}\{\Delta\} = \sum_{i=1}^n u_i \ge 0$. The trace of $\Delta$ is nonnegative because $\Delta$ is nonnegative; its eigenvalues are not all zero because $\Delta$ is nonzero. Since $\sum_i u_i \ge 0$, at least one of the eigenvalues is positive. ∎

Corollary 7.3 Suppose that $\Delta$ is a nonnegative, nonzero, conditionally negative definite matrix. Then $\Delta$ has exactly one positive eigenvalue.

Sketch of proof. By Theorem 7.2 $\Delta$ has at most one positive eigenvalue while by Corollary 7.2 it has at least one positive eigenvalue. Therefore, it has exactly one positive eigenvalue. If $\Delta$ is strictly c.n.d., its other $n - 1$ eigenvalues are negative. ∎


So far we know that $\Delta$ is both conditionally negative definite and nonnegative, and therefore, it has exactly one positive eigenvalue. On the other hand, $\Sigma = \mathbf{1}\mathbf{1}' - \lambda\Delta$ is conditionally positive definite: for all $\alpha$ such that $\alpha'\mathbf{1} = 0$, we have

$$\alpha'\Sigma\alpha = \alpha'(\mathbf{1}\mathbf{1}' - \lambda\Delta)\alpha = (\alpha'\mathbf{1})^2 - \lambda(\alpha'\Delta\alpha) \ge 0. \tag{7.25}$$

Therefore, $\Sigma$ has at most one negative eigenvalue. Finally, by the matrix-determinant lemma for a rank-one update,

$$\prod_{i=1}^n u^*_i = \det\{\Sigma\} = \Big(1 - \frac{\mathbf{1}'\Delta^{-1}\mathbf{1}}{\lambda}\Big)\det\{-\lambda\Delta\} = \Big(1 - \frac{\mathbf{1}'\Delta^{-1}\mathbf{1}}{\lambda}\Big)\prod_{i=1}^n (-\lambda)u_i, \tag{7.26}$$

where $u = \{u_i\}$ are the eigenvalues of $\Delta$ and $u^* = \{u^*_i\}$ are the eigenvalues of $\Sigma$.

To ensure that the $u^*_i$ are positive, we use the fact that the product on the left in equation (7.26) has at most one negative term while the product on the right has exactly one negative term. Therefore, a necessary and sufficient condition for $\Sigma \succcurlyeq 0$ is that $\lambda$ satisfies

$$1 - \frac{\mathbf{1}'\Delta^{-1}\mathbf{1}}{\lambda} \le 0. \tag{7.27}$$
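As a small worked check of equations (7.26) and (7.27), consider the following example, which is chosen here for illustration and does not appear in the text: $n = 3$ and $\Delta = \mathbf{1}\mathbf{1}' - I$, the squared distances among the vertices of an equilateral triangle with unit sides.

```latex
\begin{align*}
\Delta &= \mathbf{1}\mathbf{1}' - I, &
\operatorname{eig}\{\Delta\} &= \{2, -1, -1\}, &
\Delta^{-1} &= \tfrac{1}{2}\mathbf{1}\mathbf{1}' - I, &
\mathbf{1}'\Delta^{-1}\mathbf{1} &= \tfrac{3}{2}, \\
\Sigma &= (1 - \lambda)\mathbf{1}\mathbf{1}' + \lambda I, &
\operatorname{eig}\{\Sigma\} &= \{3 - 2\lambda, \lambda, \lambda\}, &
\det\{\Sigma\} &= \lambda^2(3 - 2\lambda)
  = \Big(1 - \frac{3}{2\lambda}\Big)(-\lambda)^3 \det\{\Delta\}.
\end{align*}
```

With $\lambda > 0$, all eigenvalues of $\Sigma$ are nonnegative exactly when $\lambda \le 3/2 = \mathbf{1}'\Delta^{-1}\mathbf{1}$, which is condition (7.27) for this example.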

7.4 Restricted maximum likelihood (REML) in the general case

Consider the model $Y \sim \mathrm{N}(X\beta, \Sigma)$ with design matrix $X$ and covariance matrix $\Sigma$. Let $K \in \mathbb{R}^{n \times p}$ be a basis for the mean space and $L \in \mathbb{R}^{(n-p) \times n}$ be a basis for the residual space. For example, $K = X$ if the design matrix has full rank, or otherwise, $K$ is $p$ linearly independent columns of $X$. By construction $LK = 0$ and $\ker\{L\} = \operatorname{span}\{K\}$. Also let $Q[\Sigma]$ be the unique orthogonal projection with kernel $K$ given by

$$Q = I - K(K'\Sigma^{-1}K)^{-1}K'\Sigma^{-1}. \tag{7.28}$$

[McCullagh, 2009] shows that regardless of the choice for $L$, $Q$ has an equivalent characterization given by

$$Q = \Sigma L'(L\Sigma L')^{-1}L. \tag{7.29}$$

To prove this, it is sufficient to show that

• $\Sigma L'(L\Sigma L')^{-1}L$ is a projection: $QQ = (\Sigma L'(L\Sigma L')^{-1}L)(\Sigma L'(L\Sigma L')^{-1}L) = \Sigma L'(L\Sigma L')^{-1}L = Q$

• $\Sigma L'(L\Sigma L')^{-1}L$ is self-adjoint with respect to the inner product $\langle u, v\rangle = u'\Sigma^{-1}v$: $Q'\Sigma^{-1} = (\Sigma L'(L\Sigma L')^{-1}L)'\Sigma^{-1} = L'(L\Sigma L')^{-1}L = \Sigma^{-1}(\Sigma L'(L\Sigma L')^{-1}L) = \Sigma^{-1}Q$

• $\ker\{\Sigma L'(L\Sigma L')^{-1}L\} = \operatorname{span}\{K\}$: $QK = (\Sigma L'(L\Sigma L')^{-1}L)K = \Sigma L'(L\Sigma L')^{-1}(LK) = 0$

The orthogonal projection with kernel $K$ (i.e., the orthogonal projection onto the residual space) is unique, so (7.28) = (7.29).

To rewrite the Wishart log-likelihood in equation (4.13), we derive the following expressions involving $\Sigma$, $Q$, $L$ and $K$.

$$\det\{L\Sigma L'\}^{-1}\det\{LL'\} = \det\{(L\Sigma L')^{-1}(LL')\} = \operatorname{Det}\{L'(L\Sigma L')^{-1}L\} = \operatorname{Det}\{\Sigma^{-1}\Sigma L'(L\Sigma L')^{-1}L\} = \operatorname{Det}\{\Sigma^{-1}Q\} \tag{7.30}$$

where the standard determinant, denoted by $\det$, is the product of all eigenvalues and the generalized determinant, denoted by $\operatorname{Det}$, is the product of the nonzero eigenvalues. The second equality holds because $(L\Sigma L')^{-1}LL'$ is $(n-p) \times (n-p)$ with $n - p$ nonzero eigenvalues, $L'(L\Sigma L')^{-1}L$ is $n \times n$ with $n - p$ nonzero eigenvalues, and the two matrices have the same nonzero eigenvalues:

If $(L\Sigma L')^{-1}LL'u = \lambda u$, then $L'(L\Sigma L')^{-1}L(L'u) = \lambda(L'u)$.

If $L'(L\Sigma L')^{-1}Lv = \lambda v$, then $LL'(L\Sigma L')^{-1}Lv = \lambda Lv$, so $(L\Sigma L')^{-1}Lv = \lambda(LL')^{-1}Lv$ and $(L\Sigma L')^{-1}(LL')(LL')^{-1}Lv = \lambda(LL')^{-1}Lv$.

Similarly,

$$\det\{K'\Sigma^{-1}K\}^{-1}\det\{K'K\} = \operatorname{Det}\{\Sigma(I - Q)'\}. \tag{7.31}$$

Following [Verbyla, 1990], let $A = [K, L']$. Using both characterizations of the projection $Q$ and the formula for the determinant of a block matrix,

$$\begin{aligned}
\det\{A'\Sigma A\} &= \det\begin{pmatrix} K'\Sigma K & K'\Sigma L' \\ L\Sigma K & L\Sigma L' \end{pmatrix} \\
&= \det\{L\Sigma L'\}\det\{K'\Sigma K - K'\Sigma L'(L\Sigma L')^{-1}L\Sigma K\} \\
&= \det\{L\Sigma L'\}\det\{K'[I - \Sigma L'(L\Sigma L')^{-1}L]\Sigma K\} \\
&= \det\{L\Sigma L'\}\det\{K'[I - Q]\Sigma K\} \\
&= \det\{L\Sigma L'\}\det\{K'[K(K'\Sigma^{-1}K)^{-1}K'\Sigma^{-1}]\Sigma K\} \\
&= \det\{L\Sigma L'\}\det\{K'K(K'\Sigma^{-1}K)^{-1}K'K\} \\
&= \det\{L\Sigma L'\}\det\{K'K\}\det\{K'\Sigma^{-1}K\}^{-1}\det\{K'K\};
\end{aligned}$$

$$\det\{A'A\} = \det\begin{pmatrix} K'K & K'L' \\ LK & LL' \end{pmatrix} = \det\begin{pmatrix} K'K & 0 \\ 0 & LL' \end{pmatrix} = \det\{K'K\}\det\{LL'\}.$$

Since $A$ is full-rank by construction,

$$\det\{\Sigma\} = \frac{\det\{A'\Sigma A\}}{\det\{A'A\}} = \frac{\det\{K'K\}\det\{L\Sigma L'\}}{\det\{LL'\}\det\{K'\Sigma^{-1}K\}}. \tag{7.32}$$

Finally, by applying first (7.30) and then (7.31),

$$\det\{\Sigma\} = \det\{K'K\}\,[\det\{K'\Sigma^{-1}K\}\operatorname{Det}\{\Sigma^{-1}Q\}]^{-1} \tag{7.33a}$$

$$= \det\{LL'\}^{-1}\det\{L\Sigma L'\}\operatorname{Det}\{\Sigma(I - Q)'\}. \tag{7.33b}$$
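The equivalence of the two characterizations (7.28) and (7.29) is easy to verify numerically. The sketch below does so with exact fractions for a common-mean model with $n = 3$. The diagonal covariance, the particular contrast matrix $L$ and the helper names are assumptions chosen to keep the example short.

```python
from fractions import Fraction

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

n = 3
variances = [Fraction(1), Fraction(2), Fraction(4)]   # assumed diagonal covariance
Sigma = [[variances[i] if i == j else Fraction(0) for j in range(n)] for i in range(n)]
Sigma_inv = [[1 / variances[i] if i == j else Fraction(0) for j in range(n)] for i in range(n)]

K = [[Fraction(1)], [Fraction(1)], [Fraction(1)]]      # mean space: a common mean
L = [[Fraction(1), Fraction(-1), Fraction(0)],         # residual space: contrasts,
     [Fraction(0), Fraction(1), Fraction(-1)]]         # so that L K = 0

# Characterization (7.28): Q = I - K (K' Sigma^{-1} K)^{-1} K' Sigma^{-1}.
k_sinv_k = matmul(matmul(transpose(K), Sigma_inv), K)[0][0]   # a scalar because p = 1
kk_sinv = matmul(matmul(K, transpose(K)), Sigma_inv)
Q_mean = [[(1 if i == j else 0) - kk_sinv[i][j] / k_sinv_k for j in range(n)]
          for i in range(n)]

# Characterization (7.29): Q = Sigma L' (L Sigma L')^{-1} L.
lsl_inv = inverse_2x2(matmul(matmul(L, Sigma), transpose(L)))
Q_resid = matmul(matmul(matmul(Sigma, transpose(L)), lsl_inv), L)
```

Both constructions give the same matrix, which is idempotent and annihilates $K$. Replacing $L$ by another basis of the contrasts leaves `Q_resid` unchanged.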

7.4.1 Restricted maximum likelihood (REML) in a special case

Rather than a general covariance matrix $\Sigma$, our model for population structure in terms of distances on a population graph specifies

$$\Sigma = \mathbf{1}\mathbf{1}' - \lambda\Delta, \tag{7.34}$$

where $\Delta$ is a conditionally negative definite matrix such that $\mathbf{1}'\Delta^{-1}\mathbf{1} = 1$. The normalization simplifies notation and defines equivalence classes $\{\Delta_* : (\mathbf{1}'\Delta_*^{-1}\mathbf{1})\Delta_* = \Delta\}$. It is also convenient because under this parametrization $\lambda \in (0, 1)$ is a sufficient condition for $\Sigma \succ 0$, as we show in Appendix 7.3.

Because the covariance matrix has the form (7.34) we can avoid explicitly constructing $\Sigma$ and instead work with $\Delta$. Using the Sherman-Morrison formula for the inverse of a rank-one update,

$$\Sigma^{-1} = -\frac{1}{\lambda}\Big(\Delta^{-1} - \frac{\Delta^{-1}\mathbf{1}\mathbf{1}'\Delta^{-1}}{1 - \lambda}\Big) \tag{7.35a}$$

$$\Sigma^{-1}\mathbf{1} = -\frac{1}{\lambda}\Big(\Delta^{-1}\mathbf{1} - \frac{\Delta^{-1}\mathbf{1}}{1 - \lambda}\Big) = \frac{1}{1 - \lambda}\Delta^{-1}\mathbf{1} \tag{7.35b}$$

$$\mathbf{1}'\Sigma^{-1}\mathbf{1} = \frac{1}{1 - \lambda} \tag{7.35c}$$

The orthogonal projection $Q = Q[\Sigma]$ with kernel $K = \mathbf{1}$ is given by

$$Q = I - \frac{\mathbf{1}\mathbf{1}'\Sigma^{-1}}{\mathbf{1}'\Sigma^{-1}\mathbf{1}} = I - \mathbf{1}\mathbf{1}'\Delta^{-1} \tag{7.36a}$$

$$\Sigma^{-1}Q = -\frac{1}{\lambda}\Delta^{-1}\Big(I - \frac{1}{1 - \lambda}\mathbf{1}\mathbf{1}'\Delta^{-1}\Big)Q = -\frac{1}{\lambda}\Delta^{-1}Q \tag{7.36b}$$

The projection matrix $Q$ is not symmetric in general but for every $Q$,

$$Q'\Sigma^{-1} = Q'\Sigma^{-1}Q = \Sigma^{-1}Q \quad\text{and}\quad Q'\Delta^{-1} = Q'\Delta^{-1}Q = \Delta^{-1}Q. \tag{7.37}$$

That is, $\Sigma^{-1}Q$ and $\Delta^{-1}Q$ are symmetric.

Now let's express $\operatorname{Det}\{\Sigma^{-1}Q\}$ as a function of $\operatorname{Det}\{\Delta^{-1}Q\}$ where the generalized determinant $\operatorname{Det}$ is the product of the nonzero eigenvalues. Since

$$\Delta^{-1}Qv = \alpha v \iff \Sigma^{-1}Qv = -\frac{1}{\lambda}(\alpha v), \tag{7.38}$$

the two generalized eigenvalue problems are equivalent up to a proportionality constant and

$$\operatorname{Det}\{\Sigma^{-1}Q\} = \Big(\frac{1}{\lambda}\Big)^{n-1}\operatorname{Det}\{-\Delta^{-1}Q\} = \operatorname{Det}\{-\Delta^{-1}Q/\lambda\}, \tag{7.39}$$

where $\operatorname{rank}\{\Sigma^{-1}Q\} = \operatorname{rank}\{-\Delta^{-1}Q\} = n - 1$. Finally, we apply equation (7.32) with $\Sigma = S$ and $K = \mathbf{1}$ to obtain

$$\det\{LSL'\} = \det\{S\}\det\{\mathbf{1}'S^{-1}\mathbf{1}\}\frac{\det\{LL'\}}{\det\{\mathbf{1}'\mathbf{1}\}}, \tag{7.40}$$

and equation (7.30) with $\Sigma = \mathbf{1}\mathbf{1}' - \lambda\Delta$ and $L\Sigma L' = -\lambda(L\Delta L')$ to obtain

$$\det\{-(L\Delta L')^{-1}/\sigma^*\} = \operatorname{Det}\{\Sigma^{-1}Q/\sigma^2\}/\det\{LL'\} = \operatorname{Det}\{-\Delta^{-1}Q/\sigma^*\}/\det\{LL'\} \tag{7.41a}$$

$$\operatorname{tr}\{-(L\Delta L')^{-1}LSL'/\sigma^*\} = \operatorname{tr}\{-(L'(L\Delta L')^{-1}L)S/\sigma^*\} = \operatorname{tr}\{-\Delta^{-1}\Delta(L'(L\Delta L')^{-1}L)S/\sigma^*\} = \operatorname{tr}\{-(\Delta^{-1}Q)S/\sigma^*\}. \tag{7.41b}$$

7.5 Efficient computation

Let $R = (R_{\alpha\beta})$ be the matrix of effective resistances between pairs of observed demes $(\alpha, \beta)$ in the population graph $G = (V, E, M)$. From [McRae, 2006] we know that $T_{\alpha\alpha} \approx d$ and $T_{\alpha\beta} \approx d(1 + R_{\alpha\beta}/4)$ where $d$ is the number of demes in the population graph and $o$ is the number of observed demes. With this motivation, let

$$(\Delta_{\alpha\beta}) = d(\mathbf{1}_o\mathbf{1}_o' + (R_{\alpha\beta})/4 - I_o) \tag{7.42}$$

be the matrix of (expected) pairwise distances between observed demes.

If individuals are exchangeable within demes, we can model distances between individuals in terms of distances between demes. For a pair $(i \in \alpha, j \in \beta)$,

$$\Delta_{ij} = d(1 + R_{\alpha\beta}/4 - \mathbb{1}\{i = j\}). \tag{7.43}$$

Equivalently, in matrix notation,

$$(\Delta_{ij}) = d(\mathbf{1}_n\mathbf{1}_n' + JRJ'/4 - I_n) \tag{7.44}$$

where $J = (J_{i\alpha}) \in \mathbb{Z}^{n \times o}$ is an indicator matrix such that

$$J_{i\alpha} = \begin{cases} 1 & \text{if } i \in \alpha \\ 0 & \text{if } i \notin \alpha \end{cases}. \tag{7.45}$$

To simplify the notation, we will drop the subscripts and write plainly $\mathbf{1}$ for the vector of ones and $I$ for the identity matrix. The dimension will be clear from the context if we keep in mind that $R = (R_{\alpha\beta})$ is an $o \times o$ matrix and $\Delta = (\Delta_{ij})$ is an $n \times n$ matrix.

To evaluate the Wishart log-likelihood in equation (4.13) we need to compute the terms $\operatorname{tr}\{\Delta^{-1}QS\}$ and $\operatorname{Det}\{-\Delta^{-1}Q\}$ where

$$Q = I - \frac{\mathbf{1}\mathbf{1}'\Delta^{-1}}{\mathbf{1}'\Delta^{-1}\mathbf{1}} \tag{7.46}$$

is a projection matrix, which removes the common mean, and $S$ is the observed similarity matrix. We also standardize the distance matrix $\Delta$ so that $\mathbf{1}'\Delta^{-1}\mathbf{1} = 1$. With this normalization, multiplying $\Delta$ by a (positive) constant has no effect on the product $\Delta^{-1}Q$, so we can ignore the scale $d$ in equation (7.44).

The distance matrix $\Delta = \mathbf{1}\mathbf{1}' + JRJ'/4 - I$ has an "almost-block" structure, except for the diagonal of zeros: specifically, $\Delta = JBJ' - I$ where $B = R/4 + \mathbf{1}\mathbf{1}'$ is a known $o \times o$ matrix. [$B$ is a function of the migration rates.] The inverse $\Delta^{-1}$ is also an almost-block matrix:

$$\Delta^{-1} = JXJ' - I, \tag{7.47}$$

where $X$ is an unknown $o \times o$ matrix. Since $\Delta\Delta^{-1} = I$, the solution $X$ must satisfy

$$JBCXJ' - JBJ' - JXJ' + I = I, \tag{7.48a}$$

$$J(BC - I)XJ' = JBJ', \tag{7.48b}$$

where $C = J'J = \operatorname{diag}\{n_\alpha\}$ is the diagonal matrix of sample counts.

Since every term in equation (7.48b) has an exact block structure which depends on the sample configuration through $J$, it is sufficient to solve the lower-dimensional problem

$$(BC - I)X = B \iff (C - B^{-1})X = I. \tag{7.49}$$

This is a system of linear equations for the unknown $X$ in terms of the effective resistances $R$ and the counts $C$, and therefore, it can be solved efficiently without matrix inversions. The diagonal matrix $C$ is invertible because here we consider only observed demes, i.e., locations with at least one observation; the auxiliary matrix $B = R/4 + \mathbf{1}\mathbf{1}'$ is invertible because the matrix of effective resistances $R$ is invertible.

Once we solve for $X$, we could explicitly construct $\Delta^{-1}$ from $X$ according to equation (7.47). However, this is not necessary because we only need to compute $\operatorname{Det}\{-\Delta^{-1}Q\}$ and $\operatorname{tr}\{\Delta^{-1}QS\}$ where $S$ is the (average) observed similarity. Using the definition of the orthogonal projection $Q$ (equation 7.46) and the properties of the trace,

$$\operatorname{tr}\{\Delta^{-1}QS\} = \operatorname{tr}\{\Delta^{-1}S\} - \frac{1}{\mathbf{1}'\Delta^{-1}\mathbf{1}}\operatorname{tr}\{\mathbf{1}\mathbf{1}'\Delta^{-1}S\Delta^{-1}\}. \tag{7.50}$$

We consider each of these terms in turn:

$$\mathbf{1}'\Delta^{-1}\mathbf{1} = \mathbf{1}'(JXJ' - I)\mathbf{1} = \operatorname{tr}\{X(J'\mathbf{1}\mathbf{1}'J)\} - n, \tag{7.51a}$$

$$\operatorname{tr}\{\Delta^{-1}S\} = \operatorname{tr}\{(JXJ' - I)S\} = \operatorname{tr}\{X(J'SJ)\} - \operatorname{tr}\{S\}, \tag{7.51b}$$

$$\operatorname{tr}\{\mathbf{1}\mathbf{1}'\Delta^{-1}S\Delta^{-1}\} = \mathbf{1}'CX(J'SJ)XC\mathbf{1} + \mathbf{1}'S\mathbf{1} - 2\operatorname{tr}\{X(J'S\mathbf{1}\mathbf{1}'J)\}. \tag{7.51c}$$

[For symmetric matrices, $\operatorname{tr}\{YT\} = \sum_{\alpha,\beta} Y_{\alpha\beta}T_{\alpha\beta}$, so the trace can be computed as sum(sum(Y.*T)).]

All the terms that do not involve $X$ (namely $J'\mathbf{1}\mathbf{1}'J$, $J'SJ$, $\operatorname{tr}\{S\}$, $\mathbf{1}'S\mathbf{1}$, $J'S\mathbf{1}\mathbf{1}'J$ and $C\mathbf{1}$) are constants and can be precomputed and stored for easy access. The point is that there is no need to construct the $n \times n$ matrix $\Delta^{-1}$ in order to compute $\operatorname{tr}\{\Delta^{-1}QS\}$; we can work with the $o \times o$ matrix $X$ instead.
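The low-dimensional evaluation in equations (7.49) to (7.51) can be compared against a direct $n \times n$ computation on a toy configuration. In the sketch below, the sample configuration `J`, the resistances `R`, the similarities `S` and all helper names are made-up assumptions. Exact fractions are used so that the two routes can be compared with `==`.

```python
from fractions import Fraction

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def trace(a):
    return sum(a[i][i] for i in range(len(a)))

def solve(a, b):
    """Solve a X = b for a matrix right-hand side (exact Gauss-Jordan)."""
    n = len(a)
    aug = [[Fraction(v) for v in ra] + [Fraction(v) for v in rb] for ra, rb in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

# Toy inputs (assumed): two observed demes with 2 and 1 samples.
J = [[1, 0], [1, 0], [0, 1]]             # n x o indicator matrix
R = [[0, 4], [4, 0]]                     # effective resistances
S = [[2, 1, 0], [1, 2, 0], [0, 0, 1]]    # observed similarities
n, o = len(J), len(J[0])

B = [[Fraction(R[a][b], 4) + 1 for b in range(o)] for a in range(o)]
counts = [sum(col) for col in zip(*J)]   # diagonal of C = J'J
BC_minus_I = [[B[a][b] * counts[b] - (1 if a == b else 0) for b in range(o)]
              for a in range(o)]
X = solve(BC_minus_I, B)                 # equation (7.49)

# Low-dimensional route, equations (7.50)-(7.51); X is symmetric.
Jt = transpose(J)
JSJ = matmul(matmul(Jt, S), J)                                        # J'SJ
S1 = [sum(row) for row in S]                                          # S 1
JS1 = [sum(Jt[a][i] * S1[i] for i in range(n)) for a in range(o)]     # J'S1
Xc = [sum(X[a][b] * counts[b] for b in range(o)) for a in range(o)]   # X C 1
one_Dinv_one = sum(counts[a] * Xc[a] for a in range(o)) - n           # (7.51a)
tr_Dinv_S = trace(matmul(X, JSJ)) - trace(S)                          # (7.51b)
quad = (sum(Xc[a] * JSJ[a][b] * Xc[b] for a in range(o) for b in range(o))
        + sum(S1) - 2 * sum(Xc[a] * JS1[a] for a in range(o)))        # (7.51c)
trace_fast = tr_Dinv_S - quad / one_Dinv_one                          # (7.50)

# Direct n x n route, for comparison only.
JXJ = matmul(matmul(J, X), Jt)
Dinv = [[JXJ[i][j] - (1 if i == j else 0) for j in range(n)] for i in range(n)]
col_sums = [sum(Dinv[i][j] for i in range(n)) for j in range(n)]      # 1' Dinv
Q = [[(1 if i == j else 0) - col_sums[j] / sum(col_sums) for j in range(n)]
     for i in range(n)]
trace_direct = trace(matmul(matmul(Dinv, Q), S))
```

Only the $o \times o$ solve for `X` and a handful of small products are needed on the fast route. The direct route is included purely to confirm that both give the same value.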

Next we show how to compute efficiently the generalized determinant $\operatorname{Det}\{-\Delta^{-1}Q\}$. Since $\Delta \in \mathbb{D}^n$ is conditionally negative definite (and nonnegative),

$$\operatorname{Det}\{-\Delta^{-1}Q\} = \frac{(\mathbf{1}'\mathbf{1})/(\mathbf{1}'\Delta^{-1}\mathbf{1})}{-\det\{-\Delta\}}. \tag{7.52}$$

[This follows from $\operatorname{Det}\{\Sigma^{-1}Q\} = \dfrac{(\mathbf{1}'\mathbf{1})/(\mathbf{1}'\Sigma^{-1}\mathbf{1})}{\det\{\Sigma\}}$ together with $\det\{\Sigma\} = (1 - \mathbf{1}'\Delta^{-1}\mathbf{1}/\lambda)\det\{-\lambda\Delta\}$ and $\mathbf{1}'\Sigma^{-1}\mathbf{1} = (1 - \lambda/\mathbf{1}'\Delta^{-1}\mathbf{1})^{-1}$.]

Furthermore, $\Delta$ has one positive eigenvalue and $n - 1$ negative eigenvalues, as we show in Appendix 7.3. Therefore, $-\det\{-\Delta\}$ is guaranteed to be positive and it is sufficient to compute $|\det\{\Delta\}|$, or equivalently, find the eigenvalues of $\Delta$. Since $\Delta = JBJ' - I$,

$$\operatorname{eig}\{\Delta\} = \operatorname{eig}\{JBJ'\} - 1, \tag{7.53}$$

where $JBJ'$ is a block matrix and thus it has $o$ nontrivial eigenvalues besides 0, which has multiplicity $n - o$. Furthermore, for any vector $v \in \mathbb{R}^o$,

$$\Delta(Jv) = JBCv - Jv = J(BC - I)v. \tag{7.54}$$

Therefore, if $(v, \lambda)$ is an eigenpair for $BC - I$, then $(Jv, \lambda)$ is an eigenpair for $\Delta$. That is, the $o$ nontrivial eigenvalues of $\Delta$ are equal to the eigenvalues of $BC - I$.
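Because the remaining $n - o$ eigenvalues of $\Delta$ all equal $-1$, the determinant of $\Delta$ follows from the determinant of the small matrix $BC - I$. The sketch below illustrates this on an assumed toy configuration (two demes with 2 and 1 samples, made-up resistances). The helper `det` and all variable names are hypothetical.

```python
from fractions import Fraction

def det(a):
    """Determinant by exact Gaussian elimination with row swaps."""
    m = [[Fraction(v) for v in row] for row in a]
    n, sign, prod = len(m), 1, Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)          # singular matrix
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            sign = -sign                # a row swap flips the sign
        prod *= m[col][col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return sign * prod

J = [[1, 0], [1, 0], [0, 1]]            # n x o indicator matrix (assumed)
R = [[0, 4], [4, 0]]                    # effective resistances (assumed)
n, o = len(J), len(J[0])

B = [[Fraction(R[a][b], 4) + 1 for b in range(o)] for a in range(o)]
counts = [sum(col) for col in zip(*J)]  # diagonal of C = J'J
BC_minus_I = [[B[a][b] * counts[b] - (1 if a == b else 0) for b in range(o)]
              for a in range(o)]

# Full n x n matrix Delta = J B J' - I, built here only to check the shortcut.
deme = [row.index(1) for row in J]
Delta = [[B[deme[i]][deme[j]] - (1 if i == j else 0) for j in range(n)]
         for i in range(n)]

det_small = det(BC_minus_I)
# The other n - o eigenvalues of Delta are all -1.
det_from_small = det_small * (-1) ** (n - o)
det_full = det(Delta)
```

In particular $|\det\{\Delta\}| = |\det\{BC - I\}|$, so the determinant in equation (7.52) never requires an $n \times n$ factorization.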

7.6 Markov chain Monte Carlo

Expected coalescence times in a stepping-stone model are determined by the migration rates between demes and the coalescence rates within demes according to equation (2.15). Throughout, we assume that the coalescence rate is the same for all demes and migration is symmetric. The approximation in terms of effective resistances on an undirected graph given by equation (2.28) makes the symmetry assumption explicitly and the equal size assumption implicitly. To use either expression for computing effective distances, we need to specify a migration rate for each undirected edge $(\alpha, \beta)$ in the grid $(V, E)$. We assume that the migration rates are piecewise constant and we model them in terms of a colored Voronoi tiling $\mathcal{V}$ of the habitat $\mathcal{H}$. Under the tessellation $\mathcal{V}$, each tile has its own migration log rate and all edges within a tile share this parameter.

Since the spatial structure of the population is unknown, an appropriate Voronoi tessellation of the habitat must be estimated given the data. We use a version of the method based on colored Voronoi tessellations implemented in GENELAND [Guillot et al., 2005]. The main difference is that the "colors" in GENELAND are cluster indices; in our framework the "colors" are log (base 10) migration rates, as edges within the same tile share a common rate to encourage locally smooth migration surfaces.

7.6.1 Updating the number of tiles $T$ with birth/death moves

Unlike the log rates and locations of tiles, the number of tiles presents a transdimensional inference problem because adding or removing a tile changes the dimensionality of the parameter space. For such a problem we can use the birth-death Markov chain Monte Carlo algorithm (BD-MCMC), which has been applied to other variable dimension problems such as a mixture model with unknown number of components [Stephens, 2000, van Lieshout, 2000].

Assume that the Markov chain is currently in state $(t, \Theta_t)$ with $t$ Voronoi tiles and parameters $\Theta_t$, and that there are two options for the next move: with probability $a(t)$ the proposed move is $(t + 1, \Theta_{t+1})$, i.e., the birth of a tile; with probability $1 - a(t)$ the proposed move is $(t - 1, \Theta_{t-1})$, i.e., the death of a tile. Since we consider only these two moves, we assume that they occur with equal probability: $a(1) = 1$ [a death event is impossible with only one tile] and $a(t) = 1 - a(t) = \tfrac{1}{2}$ for $t > 1$. For a given number of tiles $t$, the model parameters include the migration log rates $\{\ell m_1, \ldots, \ell m_t\}$ and locations $\{u_1, \ldots, u_t\}$, as well as common parameters $\theta$ that do not depend on the tessellation and are not updated during a birth/death move. Let

$$\Theta_t = (\ell m_1, \ldots, \ell m_t, u_1, \ldots, u_t, \theta). \tag{7.55}$$

A full Bayesian model for the Voronoi tiling is specified by the likelihood on pairwise distances (4.4) together with the following prior distributions for the number of tiles and tile-specific parameters:

$$T \mid \nu \sim \mathrm{Po}(\nu), \tag{7.56a}$$

$$u \mid T \overset{iid}{\sim} \mathrm{U}(\mathcal{H}), \tag{7.56b}$$

$$\ell m \mid \omega, T \overset{iid}{\sim} \mathrm{N}(\ell\bar{m}, \sigma^2_m), \tag{7.56c}$$

where $T$ is the number of tiles, $(u, \ell m)$ are the tile centers and log rates, respectively, and $\omega = (\ell\bar{m}, \sigma^2_m)$ are hyperparameters: the mean log rate $\ell\bar{m}$ and the variance $\sigma^2_m$. The intensity (Poisson rate) $\nu$ controls the spatial organization. This prior specification implies that rates and locations are a priori independent.

It is convenient to denote the component parameters (location and log rate) by $\phi_t = (u_t, \ell m_t)$. Since the tiles are not ordered,

$$\pi(T, u, \ell m \mid \nu, \omega) \equiv \pi(T, \phi_1, \ldots, \phi_T \mid \nu, \omega) \tag{7.57a}$$

$$= \pi(T \mid \nu) \times T! \times \pi(\phi_1 \mid \omega) \cdots \pi(\phi_T \mid \omega) \tag{7.57b}$$

That is, conditional on the number of tiles $T$, the $\phi_t$ are independent and identically distributed from a product distribution with density

$$\pi(\phi \mid \omega) \propto \mathbb{1}\{\phi^{(1)} \in \mathcal{H}\} \cdot \mathrm{N}(\phi^{(2)}; \ell\bar{m}, \sigma^2_m). \tag{7.58}$$


Note that the prior is invariant under relabeling of the tiles, i.e.,

๐œ‹(๐‘‡, ๐œ™1, โ€ฆ , ๐œ™๐‘‡ | ๐œˆ, ๐œ”) = ๐œ‹(๐‘‡, ๐œ™๐œŽ(1), โ€ฆ , ๐œ™๐œŽ(๐‘‡) | ๐œˆ, ๐œ”) (7.59)

for every permutation ๐œŽ of the indices 1, โ€ฆ , ๐‘‡. at is, the tile parameters are ex-changeable.

Next we construct a birth-death MCMC that allows only two types of moves: thebirth of a new tile and the death of an existing tile (when ๐‘‡ > 1). Suppose that thecurrent state is (๐‘ก, ๐œ™1, โ€ฆ , ๐œ™๐‘ก). If the proposal is a birth, we add a new tile with lograte โ„“๐‘š๐‘ก+1 โˆผ N(โ„“ s๐‘š, ๐œŽ2๐‘š) and location ๐‘ข๐‘ก+1 โˆผ U(โ„‹). We denote the birth density by๐‘(๐‘ก) = ๐œ‹(๐œ™๐‘ก+1 |๐œ”). If the proposal is a death, we select a tile to remove uniformly atrandom, i.e., with probability ๐‘‘(๐‘ก) = 1

๐‘ก .To guarantee that the birth-death chain is reversible and the stationary distribu-

tion is the posterior ๐œ‹(๐‘‡, ๐œ™1, โ€ฆ , ๐œ™๐‘‡ | ๐‘ง, ๐œˆ, ๐œ”, ๐œƒ) given observed data ๐‘ง, we choose theacceptance probabilities ๐›ผ(โ‹…, โ‹…) so that they satisfy the detailed balance condition:

๐‘Ž(๐‘ก)๐‘(๐‘ก)๐œ‹(๐‘ก, ๐œ™1, โ€ฆ , ๐œ™๐‘ก | ๐‘ง, ๐œˆ, ๐œ”, ๐œƒ)๐›ผ(๐‘ก, ๐‘ก + 1)= [1 โˆ’ ๐‘Ž(๐‘ก + 1)]๐‘‘(๐‘ก + 1)๐œ‹(๐‘ก + 1, ๐œ™1, โ€ฆ , ๐œ™๐‘ก+1 | ๐‘ง, ๐œˆ, ๐œ”, ๐œƒ)๐›ผ(๐‘ก + 1, ๐‘ก). (7.60)

Since ๐‘Ž(๐‘ก) = ๐‘Ž(๐‘ก + 1) = 12 ,

๐‘Ÿ(๐‘ก) = ๐›ผ(๐‘ก, ๐‘ก + 1)๐›ผ(๐‘ก + 1, ๐‘ก) = ๐‘‘(๐‘ก + 1)

๐‘(๐‘ก)๐œ‹(๐‘ก + 1, ๐œ™1, โ€ฆ , ๐œ™๐‘ก+1 | ๐‘ง, ๐œˆ, ๐œ”, ๐œƒ)

๐œ‹(๐‘ก, ๐œ™1, โ€ฆ , ๐œ™๐‘ก | ๐‘ง, ๐œˆ, ๐œ”, ๐œƒ) (7.61a)

= ๐‘‘(๐‘ก + 1)๐‘(๐‘ก)

๐œ‹(๐‘ก + 1 | ๐œˆ)๐œ‹(๐‘ก | ๐œˆ)

๐œ‹(๐œ™1, โ€ฆ , ๐œ™๐‘ก+1 | ๐‘ก + 1, ๐œ”)๐œ‹(๐œ™1, โ€ฆ , ๐œ™๐‘ก | ๐‘ก, ๐œ”)

๐‘“๐‘ก+1(๐‘ง ; ๐“๐‘ก+1, ๐œƒ)๐‘“๐‘ก(๐‘ง ; ๐“๐‘ก, ๐œƒ) (7.61b)

= ๐œˆ๐‘ก + 1

๐‘“๐‘ก+1(๐‘ง ; ๐“๐‘ก+1, ๐œƒ)๐‘“๐‘ก(๐‘ง ; ๐“๐‘ก, ๐œƒ) Apply equation (7.57) with ๐œ‹(๐‘ก | ๐œˆ) =

๐œˆ๐‘ก๐‘’โˆ’๐œˆ/๐‘ก!(7.61c)

and ๐›ผ(๐‘ก, ๐‘ก + 1) = min {๐‘Ÿ(๐‘ก), 1}. erefore, the following algorithm simulates a Markovchain with stationary distribution๐œ‹(๐‘‡, ๐“๐‘‡ | ๐‘ง, ๐œˆ, ๐œ”, ๐œƒ)where, for simplicity of notation,we write ๐“๐‘ก = (๐œ™1, โ€ฆ , ๐œ™๐‘ก).1. Choose between a birth event and a death event, with equal probability.

2. If a birth is proposed, its location, migration log rate and coalescence log rate are sampled from the priors, and the acceptance probability is

$$
\alpha(T, T+1) = \min\left\{ \frac{\nu}{T+1}\, \frac{f_{T+1}(z ; \Theta_{T+1})}{f_T(z ; \Theta_T)},\ 1 \right\} \tag{7.62}
$$

3. If a death is proposed, a tile to be removed is selected uniformly at random, and the acceptance probability is

$$
\alpha(T+1, T) = \min\left\{ \frac{T+1}{\nu}\, \frac{f_T(z ; \Theta_T)}{f_{T+1}(z ; \Theta_{T+1})},\ 1 \right\} \tag{7.63}
$$

because a deletion move is the reverse of an addition move [Byers and Raftery, 2002].
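The two acceptance probabilities can be sketched on the log scale as follows. This is only an illustration in Python (the thesis implementation is in MATLAB); the function names and the convention of passing log-likelihood values as plain floats are assumptions made here, and the Poisson rate of the prior on the number of tiles is written `nu`.

```python
import math


def log_accept_birth(num_tiles, nu, loglik_new, loglik_old):
    """Log acceptance probability for adding a tile to a state with num_tiles tiles.

    Implements log min{ nu / (T + 1) * f_{T+1} / f_T, 1 } with T = num_tiles.
    """
    log_ratio = math.log(nu) - math.log(num_tiles + 1) + loglik_new - loglik_old
    return min(log_ratio, 0.0)


def log_accept_death(num_tiles, nu, loglik_new, loglik_old):
    """Log acceptance probability for removing a tile from a state with num_tiles tiles.

    Here num_tiles plays the role of T + 1, so the ratio is
    (T + 1) / nu * f_T / f_{T+1}. A death is never accepted when only
    one tile is left, because the chain requires at least one tile.
    """
    if num_tiles <= 1:
        return -math.inf
    log_ratio = math.log(num_tiles) - math.log(nu) + loglik_new - loglik_old
    return min(log_ratio, 0.0)
```

With equal likelihoods, a birth from three tiles under `nu = 2` has ratio 2/4, while the reverse death from four tiles has ratio 4/2 and is therefore always accepted.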

7.6.2 Updating the Voronoi centers (for a fixed number of tiles $T$)

This is a Metropolis-Hastings symmetric random-walk update. Sequentially, for each tile $t$, we propose a new center $u_t^*$. The proposal distribution is bivariate normal centered at the current value $u_t^c = (x_t, y_t)$ [with correlation 0]. The proposal is accepted with probability

$$
\alpha = \min\left\{ \frac{\pi(u_t^* \mid Z, \Theta \setminus u_t)}{\pi(u_t^c \mid Z, \Theta \setminus u_t)},\ 1 \right\}
= \min\left\{ \frac{f(Z ; \Theta^*)\, \pi(u^*)}{f(Z ; \Theta^c)\, \pi(u^c)},\ 1 \right\}. \tag{7.64}
$$


The prior distribution of the center locations $u = (u_t : t = 1, \ldots, T)$ is uniform over the domain $\mathcal{H}$,

$$
\pi(u) \propto \mathbb{1}\{u_t \in \mathcal{H} : t = 1, \ldots, T\}. \tag{7.65}
$$

On the log scale, $\log(\alpha) = -\infty$ if at least one component of $u^*$ falls outside of the domain $\mathcal{H}$. Otherwise,

$$
\log(\alpha) = \min\{\ell(\Theta^* \mid Z) - \ell(\Theta^c \mid Z),\ 0\}. \tag{7.66}
$$
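The center update can be sketched as below. This is a minimal Python illustration, not the thesis code: the habitat is assumed to be an axis-aligned rectangle given by two corners, `loglik` is a hypothetical callable that returns the log-likelihood for a given center, and `step_sd` is an assumed proposal standard deviation.

```python
import math
import random


def in_habitat(point, habitat):
    """True if point lies inside the rectangle ((x0, y0), (x1, y1))."""
    (x0, y0), (x1, y1) = habitat
    return x0 <= point[0] <= x1 and y0 <= point[1] <= y1


def log_accept_center(proposal, habitat, loglik_new, loglik_old):
    """Log acceptance probability under the uniform prior on the habitat."""
    if not in_habitat(proposal, habitat):
        return -math.inf  # the prior density is zero outside the habitat
    return min(loglik_new - loglik_old, 0.0)


def update_center(center, habitat, loglik, step_sd, rng):
    """One symmetric random-walk update of a single Voronoi center."""
    # Independent normal steps in x and y, i.e. a bivariate normal with correlation 0.
    proposal = (rng.gauss(center[0], step_sd), rng.gauss(center[1], step_sd))
    if not in_habitat(proposal, habitat):
        return center  # rejected without evaluating the likelihood
    log_alpha = log_accept_center(proposal, habitat, loglik(proposal), loglik(center))
    return proposal if rng.random() < math.exp(log_alpha) else center
```

Rejecting an out-of-habitat proposal before calling `loglik` avoids a likelihood evaluation that could never lead to acceptance.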

7.6.3 Updating the log-transformed migration rates $\ell m$

We assume that the migration log (base 10) rates are normally distributed with common mean $\overline{\ell m}$ and variance $\sigma_m^2$,

$$
\ell m_t \mid \overline{\ell m}, \sigma_m^2 \overset{\text{iid}}{\sim} \mathrm{N}(\overline{\ell m}, \sigma_m^2), \tag{7.67}
$$

or in an equivalent parametrization,

$$
\ell m_t = \overline{\ell m} + e_t, \qquad e_t \overset{\text{iid}}{\sim} \mathrm{N}(0, \sigma_m^2), \tag{7.68}
$$

where $\overline{\ell m}$ is the mean log rate and $e_t$ is the effect of tile $t$, relative to the mean. The second parametrization is more convenient because it allows scaling all migration rates simultaneously by adjusting $\overline{\ell m}$.

We choose a vague prior for the hyperparameters $\overline{\ell m}, \sigma_m^2$ assuming prior independence of location and scale,

$$
\overline{\ell m} \sim \mathrm{U}(\mathit{lob}, \mathit{upb}), \tag{7.69}
$$

$$
\sigma_m^2 \sim \text{Inv-G}\left(\frac{a}{2}, \frac{b}{2}\right). \tag{7.70}
$$

That is, the hyperprior on $(\overline{\ell m}, \sigma_m^2)$ is semi-conjugate.

To simulate a Markov chain with stationary distribution $\pi(T, u, \ell m \mid z, \nu, \omega)$,

1. Update each tile effect $e_t$ in turn (or all effects at once) with a Metropolis-Hastings step and a random-walk proposal. That is, we draw a new migration log rate parameter $\ell m_t^* \sim \mathrm{N}(\ell m_t^c,\ \cdot\,)$ for each tile in the current Voronoi decomposition and accept the proposal $\ell m^* = \{\ell m_t^* : t = 1, \ldots, T\}$ with probability

$$
\alpha = \min\left\{ \frac{f(Z ; \Theta \setminus \ell m, \ell m^*)\, \pi(\ell m^* \mid \overline{\ell m}, \sigma_m^2)}{f(Z ; \Theta \setminus \ell m, \ell m^c)\, \pi(\ell m^c \mid \overline{\ell m}, \sigma_m^2)},\ 1 \right\}. \tag{7.71}
$$

2. Update the mean migration log rate $\overline{\ell m}$ with a Metropolis-Hastings step and a random-walk proposal.

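3. Update the common log rate variance $\sigma_m^2$ with a Gibbs step by sampling from its full conditional distribution:

$$
\begin{aligned}
\pi(\sigma_m^2 \mid Z, \Theta)
&\propto \text{Inv-G}\left(\frac{a}{2}, \frac{b}{2}\right) \prod_{t=1}^{T} \mathrm{N}(e_t ; 0, \sigma_m^2) && (7.72a) \\
&\propto \left\{\frac{1}{\sigma_m^2}\right\}^{a/2+1} \exp\left\{-\frac{b}{2\sigma_m^2}\right\}
   \times \prod_{t=1}^{T} \left\{\frac{1}{\sigma_m^2}\right\}^{1/2} \exp\left\{-\frac{e_t^2}{2\sigma_m^2}\right\} && (7.72b) \\
&\propto \left\{\frac{1}{\sigma_m^2}\right\}^{a/2+T/2+1} \exp\left\{-\frac{1}{2\sigma_m^2}\,(b + s_e^2)\right\}, && (7.72c)
\end{aligned}
$$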


where $s_e^2 = \sum_{t=1}^{T} e_t^2$ is the sum of squares for the relative tile effects on the log scale. Because we conveniently choose the conjugate inverse-gamma prior for $\sigma_m^2$, we can update this parameter by drawing

$$
\sigma_m^2 \sim \text{Inv-G}\left(\frac{a + T}{2},\ \frac{b + s_e^2}{2}\right). \tag{7.73}
$$
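The Gibbs draw for the variance can be sketched with the Python standard library. This is an illustration under stated assumptions rather than the thesis code: `effects` is the list of tile effects $e_t$, `a` and `b` are the hyperparameters of the inverse-gamma prior, and the inverse-gamma draw is obtained as the reciprocal of a gamma draw.

```python
import random


def draw_sigma2(effects, a, b, rng):
    """Draw sigma_m^2 from Inv-G((a + T) / 2, (b + s_e^2) / 2)."""
    num_tiles = len(effects)
    sum_sq = sum(e * e for e in effects)  # s_e^2, the sum of squared tile effects
    shape = (a + num_tiles) / 2.0
    scale = (b + sum_sq) / 2.0
    # If G ~ Gamma(shape, rate = scale), then 1 / G ~ Inv-G(shape, scale).
    # random.gammavariate takes the gamma *scale*, which is 1 / rate.
    gamma_draw = rng.gammavariate(shape, 1.0 / scale)
    return 1.0 / gamma_draw
```

For shape greater than 1, the mean of these draws is scale / (shape - 1), which gives a quick sanity check on a long run.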

7.6.4 Updating the degrees of freedom $k$

Here we consider updating the degrees of freedom $k$. The proposal distribution is

$$
k^* \sim \mathrm{N}(k^c, v) \tag{7.74}
$$

where $k^c$ is the current value and $v$ is the proposal variance.

Since the Wishart degrees of freedom for an $n \times n$ matrix is a real number $\nu$ that satisfies $\nu > n - 1$, the support of this parameter is $(n, p)$. If the proposed value $k^*$ is not valid,

$$
\log\left\{\frac{\pi(k^*)}{\pi(k^c)}\right\} = -\infty \tag{7.75}
$$

and the proposal is rejected. Otherwise, it is accepted with probability

$$
\alpha = \min\left\{ \frac{\pi(k^*)\, f(Z ; \Theta \setminus k, k^*)}{\pi(k^c)\, f(Z ; \Theta \setminus k, k^c)},\ 1 \right\}. \tag{7.76}
$$

Here $f(Z ; \Theta \setminus k, k^*)$ is the likelihood for the proposed value $k^*$ with the rest of the parameters $\Theta \setminus k$ fixed to their current values. The prior on the degrees of freedom is uniform on the log scale, i.e.,

$$
\pi(k) \propto \frac{1}{k}. \tag{7.77}
$$

Since $k$ is bounded, the prior is proper with normalizing constant $\log(p) - \log(n)$.
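The bounded log-uniform prior and the resulting acceptance probability can be sketched as follows. The function names are hypothetical and the log-likelihood values are assumed to be supplied by the caller; only the prior and the accept/reject logic described above are shown.

```python
import math


def log_prior_k(k, n, p):
    """Log density of the prior pi(k) = 1 / (k * (log p - log n)) on (n, p)."""
    if not (n < k < p):
        return -math.inf  # outside the support the proposal is always rejected
    return -math.log(k) - math.log(math.log(p) - math.log(n))


def log_accept_k(k_new, k_old, n, p, loglik_new, loglik_old):
    """Log acceptance probability for a symmetric random-walk proposal on k."""
    log_prior_new = log_prior_k(k_new, n, p)
    if log_prior_new == -math.inf:
        return -math.inf
    log_ratio = log_prior_new - log_prior_k(k_old, n, p) + loglik_new - loglik_old
    return min(log_ratio, 0.0)
```

Because the normalizing constant cancels in the ratio, only the $1/k$ factor matters for acceptance; the constant is included so that `log_prior_k` is a proper log density.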

7.6.5 Updating the scale nuisance parameter $\sigma^*$

The nuisance parameter $\sigma^* = \lambda\sigma^2$ can be efficiently updated with a Gibbs step if we choose the conjugate prior distribution, $\text{Inv-G}(c/2, d/2)$. Then the full conditional is also inverse gamma, with shape and scale parameters given by

$$
c^* = c + k(n - 1), \tag{7.78a}
$$

$$
d^* = d + k\, \mathrm{tr}\{\Delta^{-1} Q S\}. \tag{7.78b}
$$

For microsatellites, $\pi(\sigma_1^*, \ldots, \sigma_p^* \mid Z, \Theta)$ factorizes into the full conditionals of the site-specific scale parameters $\sigma_s^*$, so there is no loss of efficiency in estimating them for a small number of microsatellites.

7.7 MATLAB implementation

7.7.1 Triangular (isometric) grid

Suppose that the genotyped individuals are sampled within a rectangular region $\mathcal{H}$ bounded by $(x_0, y_0)$ on the bottom left and $(x_1, y_1)$ on the top right. By convention, $x$ denotes longitudes and $y$ latitudes.

To initialize the program, we specify the dimensions $\ell_x \times \ell_y$ of a triangular grid $(V, E)$ to tile the habitat $\mathcal{H}$. By definition, a triangular grid is formed by dividing the plane regularly into equilateral triangles. The resulting grid is regular but not strictly isometric, unless $\ell_x$ and $\ell_y$ are chosen to match the size of the habitat.
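A triangular grid of this kind can be sketched in a few lines. The layout below is an assumption made for illustration: demes are numbered row by row starting from the south-west corner, every second row is shifted half a step to the east, and coordinates use unit spacing rather than being rescaled to the habitat. Each deme gets a row of six neighbor indices in the order W, NW, NE, E, SE, SW, with 0 marking a missing neighbor.

```python
import math


def build_triangular_grid(lx, ly):
    """Return (coords, neighbors) for an lx-by-ly triangular grid.

    coords[i] is the (x, y) position of deme i + 1; neighbors[i] lists the
    deme numbers at W, NW, NE, E, SE, SW (0 where there is no neighbor).
    """
    def deme(row, col):
        # 1-based deme number, or 0 if (row, col) falls outside the grid.
        if 0 <= row < ly and 0 <= col < lx:
            return row * lx + col + 1
        return 0

    coords, neighbors = [], []
    for row in range(ly):
        shift = row % 2  # odd rows are shifted half a step to the east
        for col in range(lx):
            coords.append((col + 0.5 * shift, row * math.sqrt(3) / 2))
            neighbors.append([
                deme(row, col - 1),              # W
                deme(row + 1, col - 1 + shift),  # NW
                deme(row + 1, col + shift),      # NE
                deme(row, col + 1),              # E
                deme(row - 1, col + shift),      # SE
                deme(row - 1, col - 1 + shift),  # SW
            ])
    return coords, neighbors


def edge_set(neighbors):
    """Undirected edges as a set of (smaller, larger) deme numbers."""
    return {(min(i + 1, j), max(i + 1, j))
            for i, row in enumerate(neighbors) for j in row if j != 0}
```

For a 5 by 4 grid this layout produces 43 undirected edges, for example between demes 2 and 7 but not between demes 1 and 7.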

[Figure: a deme and its six neighbors in the triangular grid, labeled W, NW, NE, E, SE, SW.]

7.7.2 Data structures

Here I describe the MATLAB implementation and data structures. The problem is specified in terms of

• an $\ell_x \times \ell_y$ triangular grid $(V, E)$ which spans the habitat $\mathcal{H}$;

• an $(\ell_x \ell_y) \times (\ell_x \ell_y)$ symmetric matrix $M$ of migration rates.

The order of $(V, E)$ is $|V| = \ell_x \ell_y$ and the size is $n_e \equiv |E| = (\ell_x - 1)\ell_y + (2\ell_x - 1)(\ell_y - 1)$. Both the grid $(V, E)$ and the migration matrix $M$ are very sparse because each vertex $v \in V$ has at most six neighbors and

$$
M = (M_{\alpha\beta}) =
\begin{cases}
m_{\alpha\beta} & \text{if } (\alpha, \beta) \in E, \\
0 & \text{otherwise},
\end{cases} \tag{7.79}
$$

where $m_{\alpha\beta} = 2N_0 \dot{m}_{\alpha\beta}$ and $N_0$ is the coalescent timescale.

That is, $(V, E)$ and $M$ together describe a weighted graph $G = (V, E, M)$. It is not required that $M$ be symmetric; the linear system for $\Delta$ is valid as long as $(V, E)$ is connected: if all demes communicate, the sample will eventually coalesce, i.e., the distance between lineages is finite. This guarantees that

$$
\Delta = (d_{\alpha\beta}^2), \qquad d_{\alpha\beta}^2 < \infty \ \text{ for } (\alpha, \beta) \in V \times V. \tag{7.80}
$$

Although the two matrices have the same size, $M$ is sparse but $\Delta$ is full and hence might be expensive to compute. With a denser grid $(V, E)$, few of the demes are sampled from and computing the entire distance matrix $\Delta$ is not necessary. To compute the likelihood of the data, we need only the sample distance matrix $\Delta$.

In the rest of this section, let $n_v = \ell_x \ell_y$ be the number of demes and $n_p = \binom{n_v}{2} = n_v(n_v - 1)/2$ be the number of unique pairs of demes. The number of unknowns is $n_v + n_p$, the number of within-deme coalescence times plus the number of between-demes coalescence times.

Vertex set representation: The vertices $V$ are stored in an $n_v \times 2$ matrix Vcoord, with the $x$ (longitude) coordinates in the first column and the $y$ (latitude) coordinates in the second column. The locations of the Voronoi sites $S$ are stored similarly in Scoord. The triangular grid $(V, E)$ is fixed, so Vcoord does not change. On the other hand, the Voronoi decomposition of $\mathcal{H}$ is updated regularly, which is reflected by (row) changes in Scoord.

The two matrices are used to update the Voronoi tessellation whenever a tile moves its location. Recall that by definition the Voronoi tile (cell) $T(s)$ consists of the points closer to $s$ than to any other site.

    euDist = rdist(Vcoord,Scoord);       % Compute all distances between the demes in Vcoord and the sites in Scoord.
    [temp,Colors] = min(euDist,[],2);    % For each deme v in V, find the closest Voronoi site s in S.

The vector Colors indicates which tile each deme falls into.

Edge set representation: The edges $E$ are stored in an $n_v \times 6$ matrix Edges. There is one row for each vertex (deme) and the columns are its six adjacent vertices, in the order W, NW, NE, E, SE, SW (clockwise). Vertices are identified by their row index in Edges. If the deme does not have a neighbor in some positions, the corresponding entries of Edges are set to 0. The number of nonzero entries is twice the number of edges, $2n_e$.

Rate parameters representation: The backward migration matrix is stored in an $n_v \times n_v$ sparse matrix Mrates with $2n_e$ nonzero elements.
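The assignment of demes to Voronoi tiles can also be written out explicitly. The following Python sketch mirrors the two MATLAB lines above under the assumption that `rdist` returns plain Euclidean distances; `assign_tiles` is a hypothetical name, and the result is 1-based to match the MATLAB index vector Colors.

```python
import math


def assign_tiles(vcoord, scoord):
    """For each deme, return the 1-based index of the nearest Voronoi site."""
    colors = []
    for vx, vy in vcoord:
        distances = [math.hypot(vx - sx, vy - sy) for sx, sy in scoord]
        # list.index returns the first minimum, so ties go to the lower index.
        colors.append(distances.index(min(distances)) + 1)
    return colors


# Four demes on a line and two sites: the first two demes fall in tile 1.
demes = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
sites = [(0.2, 0.0), (2.9, 0.0)]
tile_of_deme = assign_tiles(demes, sites)
```

Only the row of `scoord` for the moved tile changes between updates, but the sketch recomputes all distances for simplicity.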


7.7.3 Computing coalescence distances

Our MCMC implementation requires repeatedly solving a system of linear equations $Ax = b$. The matrix $A = [A_1; A_2]$ is large, sparse, nearly symmetric and positive definite. The regularity of the grid $G$ gives $A$ its structure and sparseness.

$A_1$ represents the $n_v$ within-deme equations

$$
(q_\alpha + m_\alpha)\, T_{\alpha\alpha} - \sum_{\gamma \in \mathrm{Nei}(\alpha)} m_{\alpha\gamma} T_{\alpha\gamma} = 1, \tag{7.81}
$$

and $A_2$ represents the $n_p$ between-demes equations

$$
(m_\alpha + m_\beta)\, T_{\alpha\beta} - \sum_{\gamma \in \mathrm{Nei}(\alpha)} m_{\alpha\gamma} T_{\beta\gamma} - \sum_{\gamma \in \mathrm{Nei}(\beta)} m_{\beta\gamma} T_{\alpha\gamma} = 2. \tag{7.82}
$$

Here $\mathrm{Nei}(\alpha) = \{\gamma \in V : (\alpha, \gamma) \in E\}$ is the set of vertices adjacent to $\alpha$ and $m_\alpha = \sum_{\gamma \in \mathrm{Nei}(\alpha)} m_{\alpha\gamma}$ is the rate of migration into $\alpha$. The equations also show that $b = [\mathbf{1}_{n_v}; 2 \cdot \mathbf{1}_{n_p}]$.

The matrix $A$ is positive definite because

$$
A_2 = \mathcal{L}_2(\{m_{\alpha\beta}\}), \tag{7.83}
$$

$$
A_1 = \mathrm{diag}\{q\} + \mathcal{L}_1(\{m_{\alpha\beta}\}). \tag{7.84}
$$

The Laplacian matrices $\mathcal{L}_1, \mathcal{L}_2$ are functions of only the migration rates and $\mathcal{L} = [\mathcal{L}_1; \mathcal{L}_2]$ is also a Laplacian matrix, and therefore it is positive semidefinite; the positive coalescence rates $\mathrm{diag}\{q\}$ added in $A_1$ make $A$ positive definite. We note that the matrix $\mathcal{Q} = -\mathcal{L}$ is the infinitesimal generator of the migration process where the lineages move from deme to deme according to $M$. For a continuous-time Markov chain, the infinitesimal generator is the matrix $\mathcal{Q} = (q_{x,y})$ with entries

$$
q_{x,y} =
\begin{cases}
-\lambda_x & \text{if } x = y, \\
\lambda_x a_{x,y} & \text{otherwise},
\end{cases} \tag{7.85}
$$

where $\lambda_x$ is the holding rate for state $x$ and $\mathcal{A} = (a_{x,y})$ is the transition probability matrix of the embedded jump chain. In this case, the transition probabilities are

$$
\frac{m_{\alpha\beta}}{\sum_{\gamma \in \mathrm{Nei}(\alpha)} m_{\alpha\gamma}}
= \frac{N_0 \dot{m}_{\alpha\beta}}{\sum_{\gamma \in \mathrm{Nei}(\alpha)} N_0 \dot{m}_{\alpha\gamma}}
= \frac{\dot{m}_{\alpha\beta}}{1 - \dot{m}_{\alpha\alpha}}. \tag{7.86}
$$

Solving $Ax = b$, and thus finding all coalescence times at once, has the advantage of reducing numerical errors. Because we use an iterative procedure (preconditioned conjugate gradient), we control how close the approximate solution $\hat{x}$ is to the true solution $x$. If we first solve for $x_2$ and then substitute to find $x_1$, numerical errors in $x_2$ are propagated in $x_1$.
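For a handful of demes, the within-deme and between-demes equations can be assembled and solved directly. The sketch below is a small dense illustration in Python, assuming a dense rate matrix `m` with zeros for non-adjacent demes and a list `q` of coalescence rates; it uses Gaussian elimination instead of the preconditioned conjugate gradient solver described above, and all function names are hypothetical.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - known) / aug[i][i]
    return x


def coalescence_times(m, q):
    """Solve the within-deme and between-demes equations for all T[(a, b)], a <= b."""
    n = len(q)
    # Unknowns ordered as in A = [A1; A2]: within-deme pairs first, then between.
    pairs = [(a, a) for a in range(n)] + [(a, b) for a in range(n) for b in range(a + 1, n)]
    index = {pair: i for i, pair in enumerate(pairs)}

    def idx(a, b):
        return index[(min(a, b), max(a, b))]

    total = [sum(m[a][g] for g in range(n) if g != a) for a in range(n)]
    A = [[0.0] * len(pairs) for _ in pairs]
    rhs = [0.0] * len(pairs)
    for (a, b), row in index.items():
        if a == b:  # within-deme equation, right-hand side 1
            A[row][row] += q[a] + total[a]
            for g in range(n):
                if g != a:
                    A[row][idx(a, g)] -= m[a][g]
            rhs[row] = 1.0
        else:  # between-demes equation, right-hand side 2
            A[row][row] += total[a] + total[b]
            for g in range(n):
                if g != a:
                    A[row][idx(b, g)] -= m[a][g]
                if g != b:
                    A[row][idx(a, g)] -= m[b][g]
            rhs[row] = 2.0
    solution = solve_linear(A, rhs)
    return {pair: solution[i] for pair, i in index.items()}
```

For two demes with $q_1 = q_2 = 1$ and symmetric rate $m$, these equations reduce to $T_{11} = T_{22} = 2$ and $T_{12} = 2 + 1/m$, which gives a simple check of the assembly.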

7.7.4 Computing resistance distances

Consider again the matrix of migration rates between neighboring demes, $M$. Let $L$ be its Laplacian matrix,

$$
L = \mathrm{diag}\{M\mathbf{1}\} - M. \tag{7.87}
$$

The effective resistance $R_{\alpha\beta}$ between a pair of demes $(\alpha, \beta)$ is equal to the $\beta$th element of the vector $x$ given by

$$
L_{-\alpha}\, x = e_\beta, \tag{7.88}
$$

where $L_{-\alpha}$ is the Laplacian $L$ with the $\alpha$th row and column removed, and $e_\beta$ is the standard basis vector with a 1 at the $\beta$th coordinate and 0s elsewhere. This method can be optimized by solving $L_{-\alpha} X = E$ where $E = (e_\beta)$, so that we compute the effective resistance between $\alpha$ and all other demes with a single matrix operation.

Figure 7.1: This is a 5×4 regular triangular grid. If line weight is proportional to migration rate, this pattern corresponds to uniform migration with equal deme sizes.

Figure 7.2: Barrier to migration. If line weights are proportional to migration rates, this pattern corresponds to a barrier across the middle of the habitat.

There are other methods to compute $R$, e.g., with a single matrix inversion [Babić et al., 2002],

$$
H = (L + n_v^{-1} J)^{-1}, \tag{7.89}
$$

$$
R_{\alpha\beta} = H_{\alpha\alpha} + H_{\beta\beta} - 2H_{\alpha\beta}, \tag{7.90}
$$

where $J$ is the $n_v \times n_v$ matrix of ones. This method is more efficient for sparser grids where most demes are sampled from.
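Both ways of computing effective resistances can be compared on a tiny graph. The Python sketch below is an illustration only: it assumes a symmetric dense rate matrix, uses a plain Gauss-Jordan inverse in place of the sparse solvers a real implementation would need, and the function names are hypothetical.

```python
def invert(A):
    """Invert a small dense matrix by Gauss-Jordan elimination with pivoting."""
    n = len(A)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def laplacian(M):
    """L = diag{M 1} - M."""
    n = len(M)
    return [[(sum(M[i]) if i == j else 0.0) - M[i][j] for j in range(n)] for i in range(n)]


def resistance_from(M, alpha):
    """Resistances between alpha and every other deme via the reduced Laplacian."""
    L = laplacian(M)
    keep = [i for i in range(len(M)) if i != alpha]
    reduced = [[L[i][j] for j in keep] for i in keep]
    X = invert(reduced)  # solves L_{-alpha} X = E for all basis vectors at once
    return {beta: X[k][k] for k, beta in enumerate(keep)}


def resistance_all_pairs(M):
    """All pairwise resistances via H = (L + J / n_v)^{-1}."""
    n = len(M)
    L = laplacian(M)
    H = invert([[L[i][j] + 1.0 / n for j in range(n)] for i in range(n)])
    return [[H[a][a] + H[b][b] - 2.0 * H[a][b] for b in range(n)] for a in range(n)]
```

On a path of three demes with unit rates, both functions give resistance 1 between adjacent demes and 2 between the two end demes, as expected for resistors in series.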

7.8 Simulations with ms

Here we describe how we produce samples from a spatially distributed population that evolves under Kimura's stepping-stone model using the program ms [Hudson, 2002]. For all simulations, we first construct a regular triangular grid $(V, E)$ of $\ell_x \times \ell_y$ demes (vertices) with coordinates $(x_\alpha, y_\alpha)$. The spatial information is not explicitly used by ms and instead we set $m_{\alpha\beta} = 0 = m_{\beta\alpha}$ for all pairs of demes such that $(\alpha, \beta) \notin E$. We also specify the size of each deme, $N_\alpha$, and the migration rates across each edge in the two opposite directions, $m_{\alpha\beta}$ and $m_{\beta\alpha}$. Migration is not necessarily symmetric but is conservative [i.e., it preserves deme sizes]. Both the deme sizes $N_\alpha = N_0 c_\alpha$ and the migration rates $m_{\alpha\beta} = 4N_0 \dot{m}_{\alpha\beta}$ are relative to the coalescent timescale $N_0$. [That is, $c_\alpha$ is the relative size of deme $\alpha$ and $\dot{m}_{\alpha\beta}$ is the backward migration fraction from $\alpha$ to $\beta$ per generation.] The input arguments are $c_\alpha$ and $m_{\alpha\beta}$, respectively.

7.8.1 Spatial structure due to constant migration

Here edges in the middle of the habitat have a migration rate that is an order of magnitude lower than the rate of the edges on either side. [The rates range from 0.3 to 3.] This pattern is a barrier to actual migration and it results in a barrier to effective migration.

ms 20 1 -s 1 -I 20 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0

-m 1 2 3.0 -m 2 1 3.0 -m 1 6 3.0 -m 6 1 3.0 -m 2 3 1.7 -m 3 2 1.7

-m 2 6 3.0 -m 6 2 3.0 -m 2 7 1.7 -m 7 2 1.7 -m 3 4 0.3 -m 4 3 0.3

-m 3 7 0.3 -m 7 3 0.3 -m 3 8 0.3 -m 8 3 0.3 -m 4 5 1.7 -m 5 4 1.7

-m 4 8 0.3 -m 8 4 0.3 -m 4 9 1.7 -m 9 4 1.7 -m 5 9 3.0 -m 9 5 3.0

-m 5 10 3.0 -m 10 5 3.0 -m 6 7 1.7 -m 7 6 1.7 -m 6 11 3.0 -m 11 6 3.0

-m 6 12 3.0 -m 12 6 3.0 -m 7 8 0.3 -m 8 7 0.3 -m 7 12 1.7 -m 12 7 1.7

-m 7 13 0.3 -m 13 7 0.3 -m 8 9 1.7 -m 9 8 1.7 -m 8 13 0.3 -m 13 8 0.3

-m 8 14 0.3 -m 14 8 0.3 -m 9 10 3.0 -m 10 9 3.0 -m 9 14 1.7

-m 14 9 1.7 -m 9 15 3.0 -m 15 9 3.0 -m 10 15 3.0 -m 15 10 3.0

-m 11 12 3.0 -m 12 11 3.0 -m 11 16 3.0 -m 16 11 3.0 -m 12 13 1.7

-m 13 12 1.7 -m 12 16 3.0 -m 16 12 3.0 -m 12 17 1.7 -m 17 12 1.7

-m 13 14 0.3 -m 14 13 0.3 -m 13 17 0.3 -m 17 13 0.3 -m 13 18 0.3

-m 18 13 0.3 -m 14 15 1.7 -m 15 14 1.7 -m 14 18 0.3 -m 18 14 0.3

-m 14 19 1.7 -m 19 14 1.7 -m 15 19 3.0 -m 19 15 3.0 -m 15 20 3.0

-m 20 15 3.0 -m 16 17 1.7 -m 17 16 1.7 -m 17 18 0.3 -m 18 17 0.3

-m 18 19 1.7 -m 19 18 1.7 -m 19 20 3.0 -m 20 19 3.0

Figure 7.3: Barrier to effective migration due to differences in effective population size. The demes in bold are 5 times bigger; the edges in red are directed, which is necessary to preserve equilibrium in time.

Figure 7.4: Uniform effective migration even though there are differences in both population size and in migration rates. The demes in bold are 4 times bigger; the edges in red are directed, which is necessary to preserve equilibrium in time.

7.8.2 Spatial structure due to variation in diversity

Here some demes have bigger size and thus lower coalescence rate and higher genetic diversity. In the first version, migration rates are constant but there are differences in effective population size. Since demes in the "east" and "west" of the habitat are 5 times bigger than those in the middle, the effect is a barrier to effective migration that is qualitatively very similar to the true barrier in the previous simulation.

[A few edges are directed, with rate $m_{\alpha\beta} = 0.2$ from a big deme to a small deme and rate $m_{\beta\alpha} = 1$ in the other direction. These edges cross the "boundary" between the areas of high and low diversity and their rates are assigned so that migration is conservative: the same number of migrants are exchanged between $\alpha$ and $\beta$ because $N_\alpha m_{\alpha\beta} = N_\beta m_{\beta\alpha}$.]

ms 20 1 -s 1 -I 20 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0

-n 1 5.0 -n 2 5.0 -n 3 1.0 -n 4 1.0 -n 5 1.0 -n 6 5.0 -n 7 1.0

-n 8 1.0 -n 9 1.0 -n 10 5.0 -n 11 5.0 -n 12 1.0 -n 13 1.0 -n 14 1.0

-n 15 5.0 -n 16 1.0 -n 17 1.0 -n 18 1.0 -n 19 5.0 -n 20 5.0

-m 1 2 1.0 -m 2 1 1.0 -m 1 6 1.0 -m 6 1 1.0 -m 2 3 0.2 -m 3 2 1.0

-m 2 6 1.0 -m 6 2 1.0 -m 2 7 0.2 -m 7 2 1.0 -m 3 4 1.0 -m 4 3 1.0

-m 3 7 1.0 -m 7 3 1.0 -m 3 8 1.0 -m 8 3 1.0 -m 4 5 1.0 -m 5 4 1.0

-m 4 8 1.0 -m 8 4 1.0 -m 4 9 1.0 -m 9 4 1.0 -m 5 9 1.0 -m 9 5 1.0

-m 5 10 1.0 -m 10 5 0.2 -m 6 7 0.2 -m 7 6 1.0 -m 6 11 1.0 -m 11 6 1.0

-m 6 12 0.2 -m 12 6 1.0 -m 7 8 1.0 -m 8 7 1.0 -m 7 12 1.0 -m 12 7 1.0

-m 7 13 1.0 -m 13 7 1.0 -m 8 9 1.0 -m 9 8 1.0 -m 8 13 1.0 -m 13 8 1.0

-m 8 14 1.0 -m 14 8 1.0 -m 9 10 1.0 -m 10 9 0.2 -m 9 14 1.0 -m 14 9 1.0

-m 9 15 1.0 -m 15 9 0.2 -m 10 15 1.0 -m 15 10 1.0 -m 11 12 0.2

-m 12 11 1.0 -m 11 16 0.2 -m 16 11 1.0 -m 12 13 1.0 -m 13 12 1.0

-m 12 16 1.0 -m 16 12 1.0 -m 12 17 1.0 -m 17 12 1.0 -m 13 14 1.0

-m 14 13 1.0 -m 13 17 1.0 -m 17 13 1.0 -m 13 18 1.0 -m 18 13 1.0

-m 14 15 1.0 -m 15 14 0.2 -m 14 18 1.0 -m 18 14 1.0 -m 14 19 1.0

-m 19 14 0.2 -m 15 19 1.0 -m 19 15 1.0 -m 15 20 1.0 -m 20 15 1.0

-m 16 17 1.0 -m 17 16 1.0 -m 17 18 1.0 -m 18 17 1.0 -m 18 19 1.0

-m 19 18 0.2 -m 19 20 1.0 -m 20 19 1.0
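The conservative-migration condition $N_\alpha m_{\alpha\beta} = N_\beta m_{\beta\alpha}$ can be checked directly from the `-n` and `-m` flags of a command like the one above. The following Python sketch parses only those two flags; the helper names are hypothetical, demes without an `-n` flag are assumed to have relative size 1, and the example string is a short excerpt of the flags for demes 1, 2, 3 and 7.

```python
import math


def parse_ms_flags(command):
    """Extract relative deme sizes (-n i x) and migration rates (-m i j x)."""
    sizes, rates = {}, {}
    tokens = command.split()
    i = 0
    while i < len(tokens):
        if tokens[i] == "-n":
            sizes[int(tokens[i + 1])] = float(tokens[i + 2])
            i += 3
        elif tokens[i] == "-m":
            rates[(int(tokens[i + 1]), int(tokens[i + 2]))] = float(tokens[i + 3])
            i += 4
        else:
            i += 1
    return sizes, rates


def is_conservative(sizes, rates, default_size=1.0):
    """True if size_i * m_ij equals size_j * m_ji for every listed edge."""
    for (i, j), rate in rates.items():
        forward = sizes.get(i, default_size) * rate
        backward = sizes.get(j, default_size) * rates.get((j, i), 0.0)
        if not math.isclose(forward, backward):
            return False
    return True


excerpt = ("-n 1 5.0 -n 2 5.0 -n 3 1.0 -n 7 1.0 "
           "-m 1 2 1.0 -m 2 1 1.0 -m 2 3 0.2 -m 3 2 1.0 "
           "-m 2 7 0.2 -m 7 2 1.0 -m 3 7 1.0 -m 7 3 1.0")
sizes, rates = parse_ms_flags(excerpt)
balanced = is_conservative(sizes, rates)
```

In this excerpt the edge between demes 2 and 3 is balanced because 5 × 0.2 equals 1 × 1.0.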

In the second version, differences in migration rates compensate for differences in deme size because $N_\gamma m_{\gamma\omega} = N_\omega m_{\omega\gamma}$ for all edges $(\gamma, \omega) \in E$. The result is no variation in effective migration although both the deme sizes and the migration rates vary across the habitat.

ms 20 1 -s 1 -I 20 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0

-n 1 5.0 -n 2 5.0 -n 3 1.0 -n 4 1.0 -n 5 1.0 -n 6 5.0 -n 7 1.0

-n 8 1.0 -n 9 1.0 -n 10 5.0 -n 11 5.0 -n 12 1.0 -n 13 1.0 -n 14 1.0

-n 15 5.0 -n 16 1.0 -n 17 1.0 -n 18 1.0 -n 19 5.0 -n 20 5.0

-m 1 2 0.2 -m 2 1 0.2 -m 1 6 0.2 -m 6 1 0.2 -m 2 3 0.2 -m 3 2 1.0

-m 2 6 0.2 -m 6 2 0.2 -m 2 7 0.2 -m 7 2 1.0 -m 3 4 1.0 -m 4 3 1.0

-m 3 7 1.0 -m 7 3 1.0 -m 3 8 1.0 -m 8 3 1.0 -m 4 5 1.0 -m 5 4 1.0

-m 4 8 1.0 -m 8 4 1.0 -m 4 9 1.0 -m 9 4 1.0 -m 5 9 1.0 -m 9 5 1.0

-m 5 10 1.0 -m 10 5 0.2 -m 6 7 0.2 -m 7 6 1.0 -m 6 11 0.2 -m 11 6 0.2

-m 6 12 0.2 -m 12 6 1.0 -m 7 8 1.0 -m 8 7 1.0 -m 7 12 1.0 -m 12 7 1.0

-m 7 13 1.0 -m 13 7 1.0 -m 8 9 1.0 -m 9 8 1.0 -m 8 13 1.0 -m 13 8 1.0

-m 8 14 1.0 -m 14 8 1.0 -m 9 10 1.0 -m 10 9 0.2 -m 9 14 1.0 -m 14 9 1.0

-m 9 15 1.0 -m 15 9 0.2 -m 10 15 0.2 -m 15 10 0.2 -m 11 12 0.2

-m 12 11 1.0 -m 11 16 0.2 -m 16 11 1.0 -m 12 13 1.0 -m 13 12 1.0

-m 12 16 1.0 -m 16 12 1.0 -m 12 17 1.0 -m 17 12 1.0 -m 13 14 1.0

-m 14 13 1.0 -m 13 17 1.0 -m 17 13 1.0 -m 13 18 1.0 -m 18 13 1.0

-m 14 15 1.0 -m 15 14 0.2 -m 14 18 1.0 -m 18 14 1.0 -m 14 19 1.0

-m 19 14 0.2 -m 15 19 0.2 -m 19 15 0.2 -m 15 20 0.2 -m 20 15 0.2

-m 16 17 1.0 -m 17 16 1.0 -m 17 18 1.0 -m 18 17 1.0 -m 18 19 1.0

-m 19 18 0.2 -m 19 20 0.2 -m 20 19 0.2

Figure 7.5: Barrier to effective migration due to a split in time and otherwise uniform migration rates. The dashed edges are disconnected at the same time in the past.

7.8.3 Spatial structure due to a split event

Here the effect of a barrier to effective migration is produced by a past event that zeroes out some migration rates and thus disconnects the "east" and "west" regions of the habitat. The split is instantaneous and occurs $3N_0$ generations back in the past. This creates a barrier in time that is detected as a barrier to effective migration.

ms 20 1 -s 1 -I 20 4 3 0 0 0 3 0 0 0 0 0 0 0 0 3 0 0 0 3 4 0

-m 1 2 1.0 -m 2 1 1.0 -m 1 6 1.0 -m 6 1 1.0 -m 2 3 1.0 -m 3 2 1.0

-m 2 6 1.0 -m 6 2 1.0 -m 2 7 1.0 -m 7 2 1.0 -m 5 10 1.0 -m 10 5 1.0

-m 6 7 1.0 -m 7 6 1.0 -m 6 11 1.0 -m 11 6 1.0 -m 6 12 1.0 -m 12 6 1.0

-m 9 10 1.0 -m 10 9 1.0 -m 9 15 1.0 -m 15 9 1.0 -m 10 15 1.0

-m 15 10 1.0 -m 11 12 1.0 -m 12 11 1.0 -m 11 16 1.0 -m 16 11 1.0

-m 14 15 1.0 -m 15 14 1.0 -m 14 19 1.0 -m 19 14 1.0 -m 15 19 1.0

-m 19 15 1.0 -m 15 20 1.0 -m 20 15 1.0 -m 18 19 1.0 -m 19 18 1.0

-m 19 20 1.0 -m 20 19 1.0

-em 3.0 3 7 1.0 -em 3.0 3 8 1.0 -em 3.0 3 4 1.0 -em 3.0 4 3 1.0

-em 3.0 4 8 1.0 -em 3.0 4 9 1.0 -em 3.0 4 5 1.0 -em 3.0 5 4 1.0

-em 3.0 5 9 1.0 -em 3.0 7 12 1.0 -em 3.0 7 13 1.0 -em 3.0 7 8 1.0

-em 3.0 7 3 1.0 -em 3.0 8 7 1.0 -em 3.0 8 13 1.0 -em 3.0 8 14 1.0

-em 3.0 8 9 1.0 -em 3.0 8 4 1.0 -em 3.0 8 3 1.0 -em 3.0 9 8 1.0

-em 3.0 9 14 1.0 -em 3.0 9 5 1.0 -em 3.0 9 4 1.0 -em 3.0 12 16 1.0

-em 3.0 12 17 1.0 -em 3.0 12 13 1.0 -em 3.0 12 7 1.0 -em 3.0 13 12 1.0

-em 3.0 13 17 1.0 -em 3.0 13 18 1.0 -em 3.0 13 14 1.0 -em 3.0 13 8 1.0

-em 3.0 13 7 1.0 -em 3.0 14 13 1.0 -em 3.0 14 18 1.0 -em 3.0 14 9 1.0

-em 3.0 14 8 1.0 -em 3.0 16 17 1.0 -em 3.0 16 12 1.0 -em 3.0 17 16 1.0

-em 3.0 17 18 1.0 -em 3.0 17 13 1.0 -em 3.0 17 12 1.0 -em 3.0 18 17 1.0

-em 3.0 18 14 1.0 -em 3.0 18 13 1.0

8 Bibliography

D. Babić, D. J. Klein, I. Lukovits, S. Nikolić, and N. Trinajstić. Resistance-distance matrix: A computational algorithm and its application. International Journal of Quantum Chemistry, 90(1):166–176, 2002.

M. Bahlo and R. C. Griffiths. Coalescence time for two genes from a subdivided population. Journal of Mathematical Biology, 43(5):397–410, 2001.

R. B. Bapat. Resistance matrix of a weighted graph. MATCH: Communications in Mathematical and in Computer Chemistry, 50:73–82, 2004.

R. B. Bapat and T. E. S. Raghavan. Nonnegative matrices and applications. Cambridge University Press, 1997.

P. Beerli and J. Felsenstein. Maximum likelihood estimation of a migration matrix and effective population sizes in n subpopulations by using a coalescent approach. Proceedings of the National Academy of Sciences (PNAS), 98(8):4563–4568, 2001.

C. A. Brewer, G. W. Hatchard, and M. A. Harrower. ColorBrewer in print: a catalog of color schemes for maps. Cartography and Geographic Information Science, 30(1):5–32, 2003.

S. D. Byers and A. E. Raftery. Bayesian estimation and segmentation of spatial point processes using Voronoi tilings. In Andrew B. Lawson and David G. T. Denison, editors, Spatial Cluster Modeling, pages 109–121. Chapman & Hall, 2002.

L. L. Cavalli-Sforza, P. Menozzi, and A. Piazza. The history and geography of human genes. Princeton University Press, 1994.

A. K. Chandra, P. Raghavan, W. L. Ruzzo, R. Smolensky, and P. Tiwari. The electrical resistance of a graph captures its commute and cover times. Computational Complexity, 6(4):312–340, 1996.

A. G. Clark, M. J. Hubisz, C. D. Bustamante, S. H. Williamson, and R. Nielsen. Ascertainment bias in studies of human genome-wide polymorphism. Genome Research, 15(11):1496–1502, 2005.

C. C. Cockerham. Variance of gene frequencies. Evolution, 23(1):72–84, 1969.

J. Felsenstein. A pain in the torus: Some difficulties with models of isolation by distance. The American Naturalist, 109(967):359–368, 1975.

J. C. Gower. Properties of Euclidean and non-Euclidean distance matrices. Linear Algebra and its Applications, 67(1):81–97, 1985.


G. Guillot, A. Estoup, F. Mortier, and J. F. Cosson. A spatial statistical model for landscape genetics. Genetics, 170(3):1261–1280, 2005.

E. M. Hanks and M. B. Hooten. Circuit theory and model-based inference for landscape connectivity. Journal of the American Statistical Association, 108(501):22–33, 2013.

J. Hey. A multi-dimensional coalescent process applied to multi-allelic selection models and migration models. Theoretical Population Biology, 39(1):30–48, 1991.

M. W. Horton, A. M. Hancock, Y. S. Huang, C. Toomajian, S. Atwell, et al. Genome-wide patterns of genetic variation in worldwide Arabidopsis thaliana accessions from the RegMap panel. Nature Genetics, 44(2):212–216, 2012.

M. J. Hubisz, D. Falush, M. Stephens, and J. K. Pritchard. Inferring weak population structure with the assistance of sample group information. Molecular Ecology Resources, 9(5):1322–1332, 2009.

R. R. Hudson. Gene genealogies and the coalescent process. In Douglas Futuyma and Janis Antonovics, editors, Oxford surveys in evolutionary biology, volume 7, pages 1–44. Oxford University Press, 1990.

R. R. Hudson. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics, 18(2):337–338, 2002.

M. Kimura. The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics, 61(4):893–903, 1969.

M. Kimura and G. H. Weiss. The stepping stone model of population structure and the decrease of genetic correlation with distance. Genetics, 49(4):561–576, 1964.

J. F. C. Kingman. On the genealogy of large populations. Journal of Applied Probability, 19(A):27–43, 1982a.

J. F. C. Kingman. The coalescent. Stochastic Processes and their Applications, 13(3):235–248, 1982b.

O. Lao, T. T. Lu, M. Nothnagel, O. Junge, S. Freitag-Wolf, A. Caliebe, et al. Correlation between genetic and geographic structure in Europe. Current Biology, 18(16):1241–1248, 2008.

D. J. Lawson and D. Falush. Population identification using genetic data. Annual Review of Genomics and Human Genetics, 13:337–361, 2012.

J. Y. Lee and S. V. Edwards. Divergence across Australia's Carpentarian barrier: Statistical phylogeography of the red-backed fairy wren Malurus melanocephalus. Evolution, 62(12):3117–3134, 2008.

D. A. Levin, Y. Peres, and E. L. Wilmer. Markov chains and mixing times. American Mathematical Society, 2008.

P. McCullagh. Marginal likelihood for distance matrices. Statistica Sinica, 19:631–649, 2009.

B. H. McRae. Isolation by resistance. Evolution, 60(8):1551–1561, 2006.


B. H. McRae, B. G. Dickson, T. H. Keitt, and V. B. Shah. Using circuit theory to modelconnectivity in ecology, evolution, and conservation. Ecology, 89(10):2712โ€“2742,2008.

G. McVean. A genealogical interpretation of principal components analysis. PLoSGenetics, 5(10):e1000686, 2009.

P.Menozzi, A. Piazza, and L. L. Cavalli-Sforza. Syntheticmaps of human gene frequen-cies in Europeans. Science, 201(4358):786โ€“792, 1978.

T.Nagylaki.e strong-migration limit in geographically structured populations. Jour-nal of Mathematical Biology, 9(2):101โ€“114, 1980.

M.Nei. Analysis of genediversity in subdividedpopulations. Proceedings of theNationalAcademy of Sciences (PNAS), 70(12):3321โ€“3323, 1973.

M. R. Nelson, K. Bryc, K. S. King, A. Indap, A. R. Boyko, J. Novembre, and et al. epopulation reference sample, POPRES: A resource for population, disease, and phar-macological genetics research. e American Journal of Human Genetics, 83(3):347โ€“358, 2008.

R. Nielsen. Estimation of population parameters and recombination rates from singlenucleotide polymorphisms. Genetics, 154(2):931โ€“942, 2000.

M. Nordborg, T. T. Hu, Y. Ishino, J. Jhaveri, C. Toomajian, H. Zheng, E. Bakker, andet al. e pattern of polymorphism in Arabidopsis thaliana. PLoS Biology, 3(7):e196,2005.

M. Notohara. e coalescent and the genealogical process in geographically structuredpopulation. Journal of Mathematical Biology, 29(1):59โ€“75, 1990.

M.Notohara. e strong-migration limit for the genealogical process in geographicallystructured populations. Journal of Mathematical Biology, 31(2):115โ€“122, 1993.

J. Novembre and M. Stephens. Interpreting principal component analyses of spatialpopulation genetic variation. Nature Genetics, 40(5):646โ€“649, 2008.

J. Novembre, T. Johnson, K. Bryc, Z. Kutalik, A. R. Boyko, A. Auton, A. Indap, K. S.King, S. Bergmann, M. R. Nelson, M. Stephens, and C. D. Bustamante. Genes mirrorgeography within Europe. Nature, 456(7218):98โ€“101, 2008.

A. Okabe, B. Boots, K. Sugihara, and S. N. Chiu. Spatial tessellations : concepts andapplications of Voronoi diagrams. Wiley Series in Probability and Statistics. Wiley, 2000.

A. Platt,M.Horton, Y. S.Huang, Y. Li, A. E. Anastasio, and et al. e scale of populationstructure in Arabidopsis thaliana. PLoS Genetics, 6(2):e1000843, 2010.

A. L. Price, N. J. Patterson, R. M. Plenge, M. E. Weinblatt, N. A. Shadick, and D. Reich.Principal components analysis corrects for stratification in genome-wide associationstudies. Nature Genetics, 38(8):904โ€“909, 2006.

A. L. Price, N. A. Zaitlen, D. Reich, and N. Patterson. New approaches to populationstratification in genome-wide association studies. NatureReviewsGenetics, 11(7):459โ€“463, 2010.

J. K. Pritchard, M. Stephens, and P. Donnelly. Inference of population structure using multilocus genotype data. Genetics, 155(2):945–959, 2000.

N. A. Rosenberg, S. Mahajan, S. Ramachandran, C. Zhao, J. K. Pritchard, and M. W. Feldman. Clines, clusters, and the effect of study design on the inference of human population structure. PLoS Genetics, 1(6):e70, 2005.

F. Rousset. Genetic differentiation and estimation of gene flow from F-statistics under isolation by distance. Genetics, 145(4):1219–1228, 1997.

F. Rousset. Genetic structure and selection in subdivided populations. Princeton University Press, 2004.

F. Rousset. GENEPOP'007: a complete re-implementation of the GENEPOP software for Windows and Linux. Molecular Ecology Resources, 8:103–106, 2008.

D. Serre and S. Pääbo. Evidence for gradients of human genetic diversity within and among continents. Genome Research, 14(9):1679–1685, 2004.

M. Slatkin. Inbreeding coefficients and coalescence times. Genetical Research, 58(2):167–175, 1991.

M. Stephens. Bayesian analysis of mixture models with an unknown number of components - an alternative to reversible jump methods. The Annals of Statistics, 28(1):40–74, 2000.

C. Strobeck. Average number of nucleotide differences in a sample from a single subpopulation: a test for population subdivision. Genetics, 117(1):149–153, 1987.

C. Tian, R. M. Plenge, M. Ransom, A. Lee, P. Villoslada, C. Selmi, et al. Analysis and application of European genetic substructure using 300K SNP information. PLoS Genetics, 4(1):e4, 2008.

M. N. M. van Lieshout. Markov point processes and their applications. Imperial College Press, 2000.

A. P. Verbyla. A conditional derivation of residual maximum likelihood. Australian Journal of Statistics, 32(2):227–230, 1990.

C. Wang, Z. A. Szpiech, J. H. Degnan, M. Jakobsson, T. J. Pemberton, J. A. Hardy, A. B. Singleton, and N. A. Rosenberg. Comparing spatial maps of human population-genetic variation using Procrustes analysis. Statistical Applications in Genetics and Molecular Biology, 9(1):Article 13, 2010.

C. Wang, S. Zöllner, and N. A. Rosenberg. A quantitative comparison of the similarity between genes and geography in worldwide human populations. PLoS Genetics, 8(8):e1002886, 2012.

S. K. Wasser, A. M. Shedlock, K. Comstock, E. A. Ostrander, B. Mutayoba, and M. Stephens. Assigning African elephant DNA to geographic region of origin: Applications to the ivory trade. Proceedings of the National Academy of Sciences (PNAS), 101(41):14847–14852, 2004.

S. K. Wasser, C. Mailand, R. Booth, B. Mutayoba, E. Kisamo, B. Clark, and M. Stephens. Using DNA to track the origin of the largest ivory seizure since the 1989 trade ban. Proceedings of the National Academy of Sciences (PNAS), 104(10):4228–4233, 2007.

G. H. Weiss and M. Kimura. A mathematical analysis of the stepping stone model of genetic correlation. Journal of Applied Probability, 2(1):129–149, 1965.

S. Wright. Isolation by distance. Genetics, 28(2):114–138, 1943.

