95
The Coalescent: Inference using trees of individuals Peter Beerli Scientific Computing, Florida State University #MolEvol2019 @peterbeerli

The Coalescent

  • Upload
    others

  • View
    12

  • Download
    0

Embed Size (px)

Citation preview

Page 1: The Coalescent

The Coalescent:Inference using trees of individuals

Peter BeerliScientific Computing, Florida State University

#MolEvol2019@peterbeerli

Page 2: The Coalescent

Phylogenetics

TreesMutation modelsModel selection

The big overview

2 /95 c©2019 Peter Beerli

Page 3: The Coalescent

Population genetics

Allele frequenciesPopulation models

Model selection

The big overview

3 /95 c©2019 Peter Beerli

Page 4: The Coalescent

PhylogeneticsPopulation genetics

TreesMutation modelsModel selection

Coalescencetheory Allele frequencies

Population modelsModel selection

The big overview

4 /95 c©2019 Peter Beerli

Page 5: The Coalescent

Coalescence theory as a tool for population genetics

5 /95 c©2019 Peter Beerli

co•a•lesce |ˌkōəˈles|verb [ intrans. ]come together and form one mass or whole : the puddles had

coalesced into shallow streams | the separate details coalesce toform a single body of scientific thought.• [ trans. ] combine (elements) in a mass or whole : to helpcoalesce the community, they established an office.

Page 6: The Coalescent

Coalescence theory as a tool for population genetics

6 /95 c©2019 Peter Beerli

¡ We have data: for example, microsatellite data, singlelocus DNA sequences, or genomes

we need to decide on a model to connectthe data with parameters of interest.

The coalescent represent the relationship amongindividuals and can be expressed as a genealogy ofindividuals genes (not individuals)

Page 7: The Coalescent

Interaction among individuals

7/95 c©2019 Peter Beerli

Page 8: The Coalescent

“Interaction” among individuals

8 /95 c©2019 Peter Beerli

Adult TadpoleTadpoleAdultWright-Fisher population model

All individuals live one generationand get replaced by their offspring

All have same chance to reproduce, all are equally fit

The number of individuals in the population is constant

As a result the individuals in generation n are a random draw from the previousgeneration n− 1.

Page 9: The Coalescent

Past Present

Population model

Page 10: The Coalescent

Past Present

Population model

Page 11: The Coalescent

Past Present

Population model

Page 12: The Coalescent

Past Present

Population model

12 /95 c©2019 Peter Beerli

Page 13: The Coalescent

Population model

13 /95 c©2019 Peter Beerli

Sewall Wright evaluated the probability that two randomly chosenindividuals in generation t have a common ancestor ingeneration t− 1. If we assume that there are 2N chromosomesthen the probability of sharing a common ancestor in the last generation is

t− 1

t

Page 14: The Coalescent

Population model

14 /95 c©2019 Peter Beerli

Sewall Wright evaluated the probability that two randomly chosenindividuals in generation t have a common ancestor ingeneration t− 1. If we assume that there are 2N chromosomesthen the probability of sharing a common ancestor in the last generation is

1.0

t− 1

t

Page 15: The Coalescent

Population model

15 /95 c©2019 Peter Beerli

Sewall Wright evaluated the probability that two randomly chosenindividuals in generation t have a common ancestor ingeneration t− 1. If we assume that there are 2N chromosomesthen the probability of sharing a common ancestor in the last generation is

1.0× 1

2N

t− 1

t

Page 16: The Coalescent

Population model

Sewall Wright evaluated the probability that two randomly chosenindividuals in generation t have a common ancestor in generationt−1. If we assume that there are 2N chromosomes then theprobability of sharing a common ancestor in the last generation is

1

2N

t− 1

t

t− 1

t

The probability that two randomly picked chromosome do not have a commonancestor is

1− 1

2N

Page 17: The Coalescent

Population model

17 /95 c©2019 Peter Beerli

The probability that two individuals share a common parentafter t generations is

P(t|N) =

(1− 1

N

)×(

1− 1

N

)...×

(1− 1

N

)︸ ︷︷ ︸

t times

(1

2N

)

=

(1− 1

N

)t(1

2N

)where t is the number of generations with no coalescence. This formula isknown as the Geometric Distribution and we can calculate the expectation ofthe waiting time until two random individuals coalesce as

E(t) = 2N

Page 18: The Coalescent

Past Present

Population model

18 /95 c©2019 Peter Beerli

Page 19: The Coalescent

Past Present

Population model

19 /95 c©2019 Peter Beerli

Page 20: The Coalescent

Past Present

Population model

20 /95 c©2019 Peter Beerli

Page 21: The Coalescent

Past Present

Population model

21 /95 c©2019 Peter Beerli

Page 22: The Coalescent

Past Present

Population model

22 /95 c©2019 Peter Beerli

Page 23: The Coalescent

Past Present

Population model

23 /95 c©2019 Peter Beerli

Page 24: The Coalescent

Probability Distribution

24 /95 c©2019 Peter Beerli

50 100

0.05

0.10

0.15

0.20

0.25

20generations

P

10000 random draw from a population with size 2N=20leads to this distribution of times until two randomly chosenindividuals have a common ancestor. The observed meanwaiting time of 2N=20.34

Page 25: The Coalescent

Observations: Coalescence of two lineages

25 /95 c©2019 Peter Beerli

For the time of coalescence in a sample of TWO , we will wait on average

2N generations assuming it is a Wright-Fisher population

The model assumes that the generations are discrete and non-overlapping

Real populations do not necessarily behave like a Wright-Fisher (the ‘ideal’population)

We assume that calculation using Wright-Fisher populations can beextrapolated to real populations.

Page 26: The Coalescent

Other population models

26 /95 c©2019 Peter Beerli

Wright-Fisher Canning Moran

σ2offspring ' 1 σ2

offspring = x σ2offspring = 2

2N

E(t) = 2N E(t) = 2N/x E(t) = 12(2N)2

generation time g = 1 g = 1 g = 2N

You can generate graphs like this using the python program popsim (check out my faculty page for the link)

Page 27: The Coalescent

Other population models

27 /95 c©2019 Peter Beerli

Wright-Fisher Canning Moran

σ2offspring ' 1 σ2

offspring = x σ2offspring = 2

2N

E(t) = 2N E(t) = 2N/x E(t) = 12(2N)2

generation time g = 1 g = 1 g = 2N

You can generate graphs like this using the python program popsim (check out my faculty page for the link)

Page 28: The Coalescent

Past Present

Sample larger than TWO

28/95 c©2019 Peter Beerli

Page 29: The Coalescent

Past Present

Sample larger than TWO

29/95 c©2019 Peter Beerli

Page 30: The Coalescent

Past Present

Sample larger than TWO

30/95 c©2019 Peter Beerli

Page 31: The Coalescent

Past Present

Sample larger than TWO

31/95 c©2019 Peter Beerli

Page 32: The Coalescent

Past Present

Sample larger than TWO

32/95 c©2019 Peter Beerli

Page 33: The Coalescent

Samples larger than two

33 /95 c©2019 Peter Beerli

Sir J. F. C. Kingman described in 1982 the n-coalecent. Heshowed the behavior of a sample of size n, and its probabilitystructure looking backwards in time.

General findings:

coalescence rate =

(n

2

)=n(n− 1)

2

Once a coalescence happened n is reduced to n− 1 becausetwo lineages merged into one. He then imposed a continuousapproximation of the Canning’s exchangeable model to getresults.

Page 34: The Coalescent

Sewall Wrights result on two lineages can be approximated:

In the discrete Wright-Fisher model we calculate the probability of non-coalescent during t generation; By using a suitable timescale τ such thatone unit of scaled time corresponds to 2N generations, we can simplify toan continuous process

(1− 1

2N)t = (1− 1

2N)(2N)τ → e−τ ,

as N goes to infinity. For more than two lineages we use Kingman’s resultand use

e−τ(n2)

for the probability of non-coalescence of n lineages during the time intervalτ ; we will elaborate on τ soon.

Timescale

34/95 c©2019 Peter Beerli

Page 35: The Coalescent

Sewall Wrights result on two lineages can be approximated:

In the discrete Wright-Fisher model we calculate the probability of non-coalescent during t generation; By using a suitable timescale τ such thatone unit of scaled time corresponds to 2N generations, we can simplify toan continuous process

(1− 1

2N)t = (1− 1

2N)(2N)τ → e−τ ,

as N goes to infinity. For more than two lineages we use Kingman’s resultand use

e−τ(n2)

for the probability of non-coalescence of n lineages during the time intervalτ ; we will elaborate on τ soon.

For there curious: thisis Poisson τk

k! e−τ with

k events, here withk=0→ e−τ

Timescale

35/95 c©2019 Peter Beerli

Page 36: The Coalescent

First analogy

36 /95 c©2019 Peter Beerli

Page 37: The Coalescent

First analogy

37 /95 c©2019 Peter Beerli

Time

The time scale here is arbitrary, for example ifthe rate is 2 calls per 10 minutes; we then havea probability of getting no call for 10 minutes as

e−10×2/10 = 0.135

Page 38: The Coalescent

Time

If another type of call has a rate of 4 calls per 10minutes; we then have a probability of gettingno call for 10 minutes as I

e−10×4/10 = 0.018

First analogy

38/95 c©2019 Peter Beerli

Page 39: The Coalescent

First analogy

39 /95 c©2019 Peter Beerli

Time

Having two type of calls with different rates2/10 and 4/10; we then have a probability ofgetting no call for 10 minutes as

e−10×(2/10+4/10) = 0.0024

Page 40: The Coalescent

Second analogy

40/95 c©2019 Peter Beerli

Page 41: The Coalescent

Samples larger than two

41 /95 c©2019 Peter Beerli

Page 42: The Coalescent

Samples larger than two

42 /95 c©2019 Peter Beerli

u0

u1

u3

u4

Looking backward in time, the firstcoalescence between two randomindividuals is the result of a waitingprocess that depends on the sample n andthe total population size N .

Page 43: The Coalescent

Samples larger than two

43 /95 c©2019 Peter Beerli

u0

u1

u3

u4

Looking backward in time, the firstcoalescence between two randomindividuals is the result of a waitingprocess that depends on the sample n andthe total population size N .

Using Kingman’s coalescence rate andimposing a time scale we can approximatethe process with a exponential distribution:

Page 44: The Coalescent

Samples larger than two

44 /95 c©2019 Peter Beerli

u0

u1

u3

u4

Looking backward in time, the first coalescencebetween two random individuals is the result ofa waiting process that depends on the samplen and the total population size N .Using Kingman’s coalescence rate andimposing a time scale we can approximate theprocess with an exponential distribution:

P(uj|N) = e−ujλλ

with the scaled coalescence rate

λ =

(k

2

)1

2N× Prob(others do not coalesce)

Page 45: The Coalescent

Samples larger than two

45 /95 c©2019 Peter Beerli

u0

u1

u3

u4

Looking backward in time, the first coalescencebetween two random individuals is the result ofa waiting process that depends on the samplen and the total population size N .

Using Kingman’s coalescence rate andimposing a time scale we can approximate theprocess with a exponential distribution:

P(uj|N) = e−ujλλwith the scaled coalescence rate

λ =

(k

2

)1

2N=k(k − 1)

2(2N)=k(k − 1)

4N

Page 46: The Coalescent

Samples larger than two

46 /95 c©2019 Peter Beerli

u0

u1

u3

u4

We are now able to calculate the probabilityof a whole relationship tree (GenealogyG). We assume that each coalescence isindependent from any other:

P(G|N)

Page 47: The Coalescent

Samples larger than two

47 /95 c©2019 Peter Beerli

u0

u1

u3

u4

We are now able to calculate the probabilityof a whole relationship tree (GenealogyG). We assume that each coalescence isindependent from any other:

P(G|N) = P(u0|N, i1, i2)

×

Page 48: The Coalescent

Samples larger than two

48 /95 c©2019 Peter Beerli

u0

u1

u3

u4

We are now able to calculate the probabilityof a whole relationship tree (GenealogyG). We assume that each coalescence isindependent from any other:

P(G|N) = P(u0|N, i1, i2)

× P(u1|N, i3, i4)

Page 49: The Coalescent

Samples larger than two

49 /95 c©2019 Peter Beerli

u0

u1

u3

u4

We are now able to calculate the probabilityof a whole relationship tree (GenealogyG). We assume that each coalescence isindependent from any other:

P(G|N) = P(u0|N, i1, i2)

× P(u1|N, i3, i4)

× P(u3|N, i3,4, i5)

Page 50: The Coalescent

Samples larger than two

50 /95 c©2019 Peter Beerli

u0

u1

u3

u4

We are now able to calculate the probabilityof a whole relationship tree (GenealogyG). We assume that each coalescence isindependent from any other:

P(G|N) = P(u0|N, i1, i2)

× P(u1|N, i3, i4)

× P(u3|N, i3,4, i5)

× P(u4|N, i1,2, i3,4,5)

Page 51: The Coalescent

Samples larger than two

51 /95 c©2019 Peter Beerli

u0

u1

u3

u4

We are now able to calculate the probabilityof a whole relationship tree (GenealogyG). We assume that each coalescence isindependent from any other:

P(G|N) = P(u0|N, i1, i2)

× P(u1|N, i3, i4)

× P(u3|N, i3,4, i5)

× P(u4|N, i1,2, i3,4,5)

P(G|N) =T∏j=0

e−ujkj(kj−1)

4Nk(k − 1)

4N

2

k(k − 1)=

T∏j=0

e−ujkj(kj−1)

4N2

4N

Page 52: The Coalescent

Samples larger than two

52 /95 c©2019 Peter Beerli

u0

u1

u3

u4

P(G|N) =T∏j=0

e−ujkj(kj−1)

4N2

4N

The expectations of the total time tocoalescence is the sum of the expectationsfor each interval. Each interval hasexpectation

E(u) =4N

k(k − 1)

this leads to the expectation for the time ofthe most recent common ancestor

E(τMRCA) = Sum of the expectation of each time interval =J∑j=0

4N

kj(kj − 1)

limk→∞

E(τMRCA) = 2N +2

3N +

1

3N +

1

5N +

2

15N + ... = 4N lim

k→∞σ(τMRCA) = 4N

Page 53: The Coalescent

If we know the genealogy G with certaintythen we can calculate the population size N .Finding the maximum probability P(G|N, k) issimple, we evaluate all possible values for Nand pick the value with the highest probability.

What is it good for?

53/95 c©2019 Peter Beerli

Page 54: The Coalescent

If we know the genealogy G with certaintythen we can calculate the population size N .Finding the maximum probability P(G|N, k) issimple, we evaluate all possible values for Nand pick the value with the highest probability.

What is it good for?

54/95 c©2019 Peter Beerli

Page 55: The Coalescent

What is it good for?

55 /95 c©2019 Peter Beerli

If we know the genealogy G with certainty then we can calculate the populationsize N . Finding the maximum probability P(G|N, k) is simple, we evaluate allpossible values for N and pick the value with the highest probability.

Page 56: The Coalescent

Population size estimation

56 /95 c©2019 Peter Beerli

If an oracle gives us the true relationship tree G then we can calculate thepopulation size N .

p(G|N,n) =n∏

k=2

exp

(−uk

k(k− 1)

4N

)2

4N

Page 57: The Coalescent

Population size estimation

57 /95 c©2019 Peter Beerli

If an oracle gives us the true relationship tree G then we can calculate thepopulation size N .

p(G|N,n) =n∏

k=2

exp

(−uk

k(k− 1)

4N

)2

4N

Page 58: The Coalescent

Population size estimation

58 /95 c©2019 Peter Beerli

If an oracle gives us the true relationship tree G then we can calculate thepopulation size N .

p(G|N,n) =n∏

k=2

exp

(−uk

k(k− 1)

4N

)2

4N

Page 59: The Coalescent

Population size estimation

59 /95 c©2019 Peter Beerli

If an oracle gives us the true relationship tree G then we can calculate thepopulation size N .

p(G|N,n) =n∏

k=2

exp

(−uk

k(k− 1)

4N

)2

4N

Page 60: The Coalescent

Population size estimation

60 /95 c©2019 Peter Beerli

If an oracle gives us the true relationship tree G then we can calculate thepopulation size N .

p(G|N,n) =n∏

k=2

exp

(−uk

k(k− 1)

4N

)2

4N

Page 61: The Coalescent

Population size estimation

61 /95 c©2019 Peter Beerli

If an oracle gives us the true relationship tree G then we can calculate thepopulation size N .

p(G|N,n) =n∏

k=2

exp

(−uk

k(k− 1)

4N

)2

4N

Page 62: The Coalescent

Population size estimation

62 /95 c©2019 Peter Beerli

If an oracle gives us the true relationship tree G then we can calculate thepopulation size N .

p(G|N,n) =n∏

k=2

exp

(−uk

k(k− 1)

4N

)2

4N

Page 63: The Coalescent

Population size estimation

63 /95 c©2019 Peter Beerli

If an oracle gives us the true relationship tree G then we can calculate thepopulation size N .

p(G|N,n) =n∏

k=2

exp

(−uk

k(k− 1)

4N

)2

4N

Page 64: The Coalescent

Population size estimation

64 /95 c©2019 Peter Beerli

If an oracle gives us the true relationship tree G then we can calculate thepopulation size N .

p(G|N,n) =n∏

k=2

exp

(−uk

k(k− 1)

4N

)2

4N

Page 65: The Coalescent

Population size estimation

65 /95 c©2019 Peter Beerli

If an oracle gives us the true relationship tree G then we can calculate thepopulation size N .

p(G|N,n) =n∏

k=2

exp

(−uk

k(k− 1)

4N

)2

4N

Page 66: The Coalescent

Population size estimation

66 /95 c©2019 Peter Beerli

There are at least two problems with the oracle-approach:

There is no oracle to gives us clear information!

We do not record genealogies, our data are sequences, microsatellite loci!

What about the variability of the coalescent?

Page 67: The Coalescent

Variability of the coalescent process

67 /95 c©2019 Peter Beerli

All genealogies were simulated with the same population size Ne = 10, 000

Page 68: The Coalescent

Variability of the coalescent process

68 /95 c©2019 Peter Beerli

20 40 60 80 100

5.

10.

15.

20.

25.[10-6]

[103 generations]Time to MRCA

freq.

MRCA = most recent common ancestor (last node in the genealogy)

Page 69: The Coalescent

Kingman’s n-coalescent is an approximation

69 /95 c©2019 Peter Beerli

All individuals have the same fitness (no selection).

All individuals have the same chance to be in the sample (randomsampling).

The coalescent allows only merging two lineages per generation. Thisrestricts us to to have a much smaller sample size than the population size.

n << N

Yun-Xin Fu (2005) described the exact coalescent for the Wright-Fishermodel and derived a maximal sample size n<

√4N for a diploid population.

Although this may look like a severe restriction for the use of thecoalescence in small populations, it turned out that the coalescence is ratherrobust and that even sample sizes close to the effective population size arenot biasing immensely.

Page 70: The Coalescent

Ignoring multimerger coalescences

70 /95 c©2019 Peter Beerli

Here are the exact probabilities of 0, 1, or more coalescences with 10 lineagesin populations of different sizes:

Population size Coalescences within a single generation0 1 >1

100 0.79560747 0.18744678 0.016945751000 0.97771632 0.02209806 0.00018562

10000 0.99775217 0.00224595 0.00000187

Note that increasing the population size by a factor of 10 reduces thecoalescent rate for pairs by about 10-fold, but reduces the rate for triples (ormore) by about 100-fold.

Page 71: The Coalescent

Observations

71 /95 c©2019 Peter Beerli

Large samples coalesce on average in 4N generations.

The time to the most recent common ancestor (TMRCA) has a large variance

Even a sample with few individuals can most often recover the same TMRCAas a large sample.

The sample size should be much smaller than the population size, althoughsevere problems appear only with sample sizes of the same magnitudeas the population size, or with non-random samples because Kingman’scoalescence process assumes that maximally two sample lineages coalescein any generation.

With a known genealogy we can estimate the population size. Unfortunately,the true genealogy of a sample is rarely known.

Page 72: The Coalescent

Genealogy and data

72/95 c©2019 Peter Beerli

Page 73: The Coalescent

Finding the best genealogy from such data is difficult

Genealogy and data

73/95 c©2019 Peter Beerli

Page 74: The Coalescent

Genetic data and the coalescent

74 /95 c©2019 Peter Beerli

Finite populations loose alleles due to genetic drift

Mutation introduces new alleles into a population at rate µ

With 2N chromosomes we can expect to see every generation 2Nµ newmutations. The population size N is positively correlated with the mutationrate µ.

With genetic data sampled from several individuals we can use the mutationalvariability to estimate the population size.

Page 75: The Coalescent

Population size

75 /95 c©2019 Peter Beerli

The observed genetic variabilityS = f(N,µ, n).

Different N and appropriate µ can give the same number of mutations. Forexample, for 100 loci sampled from 20 individuals with 1000bp each, we get :

N µ 4Nµ S σ2S

1250 10−5 0.05 153.95 16.25

12500 10−6 0.05 152.89 16.05

Using genetic variability alone therefore does not allow to disentangle N and µ.With multiple dated samples and known generation time we can estimate N andµ independently.

Page 76: The Coalescent

Mutation-scaled population size

76 /95 c©2019 Peter Beerli

By convention we express most results as the compound Nµ and an inheritancescalar x, for simplicity we call this the mutation-scaled population size

Θ = xNµ,

where µ is the mutation rate per generation and per site. With a mutation rateper locus we use θ.

for diploids: Θ = 4Nµ.

for haploids: Θ = 2Nµ.

Page 77: The Coalescent

Time scale (second)

77 /95 c©2019 Peter Beerli

P (no coalescence with n lineages) = exp

(−τ(n

2

))scaling time τ by the population size 2N and using t in generations we getτ = t 1

2N , this then leads to

P (coalescence at t0 + t) = exp(−tλ)λ with λ =1

2N

n(n− 1)

2

we use DNA data, we assume there is mutation model with a mutation rate µ;we include that in our scaling and use time t as scaled by expected mutationrate per generation and Θ = 4Nµ:

P (a coalescent at time t0 + t|t0) =2

Θexp(−tn(n− 1)

Θ)

Page 78: The Coalescent

Each site has a single coalescent tree

Loci in close proximity share most likely the same coalescent tree,but dependent on recombination rate, many loci may be in linkagedisequilibrium.

We assume that loci on different chromosomes are independent.

We can treat each locus as independent an combine the single lociestimates.

What is an individual for the coalescent

78/95 c©2019 Peter Beerli

Page 79: The Coalescent

Each site can have different coalescents: Recombination

79/95 c©2019 Peter Beerli

Page 80: The Coalescent

How to make inference using genetic data

80 /95 c©2019 Peter Beerli

Sequence data has some sites insome individuals that are differentthan others

=⇒ Mutation model (finite vs infinite)

Population model =⇒ the Coalescent

Analysis method =⇒ Summary statisticsMaximum likelihoodBayesian Inference

Page 81: The Coalescent

How to make inference using genetic data

81 /95 c©2019 Peter Beerli

Sequence data has some sites insome individuals that are differentthan others

=⇒ Mutation model (finite vs infinite)

Population model =⇒ the Coalescent

Analysis method =⇒ Summary statisticsMaximum likelihoodBayesian Inference

I

Page 82: The Coalescent

Genetic data and the coalescent

82 /95 c©2019 Peter Beerli

Using the infinite sites model we use the number of variable sites S per locus tocalculate the mutation-scaled population size:

θW =S

n−1∑k=1

1k

from a sample of n individuals. For a single population the Watterson’s estimatorworks marvelously well, but it is vulnerable to population structure.

Watterson’s θW uses a mutation rate per locus! To compare with other work usemutation rate per site.

Page 83: The Coalescent

Construction of a versatile estimator

83/95 c©2019 Peter Beerli

Page 84: The Coalescent

Construction of a versatile estimator

84 /95 c©2019 Peter Beerli

Wikimedia: Neon sign at Autonomy in Cambridge UK

For Bayesian inference we want to calculate the probability of the modelparameters given the data p(model|D).

Coalescent to describe the population genetic processes.Mutation model to describe the change of genetic material over time.

Page 85: The Coalescent

Construction of a versatile estimator

85 /95 c©2019 Peter Beerli

We calculate the Posterior distribution p(Θ|D) using Bayes’ rule

p(Θ|D) =p(Θ)p(D|Θ)

p(D)

where p(D|Θ) is the likelihood of the parameters.

Page 86: The Coalescent

(almost) Felsenstein equation

86 /95 c©2019 Peter Beerli

p(D|Θ, G) = p(G|Θ)p(D|G)

p(G|Θ) The probability density of a genealogy given parameters.

p(D|G)The probability density of the data for a givengenealogy. Phylogeneticists know this as the tree-likelihood.

Page 87: The Coalescent

Felsenstein equation

87 /95 c©2019 Peter Beerli

p(D|Θ) =

∫G

p(G|Θ)p(D|G)dG

p(G|Θ) The probability density of a genealogy given parameters.

p(D|G)The probability density of the data for a givengenealogy. Phylogeneticists know this as the tree-likelihood.

Page 88: The Coalescent

Felsenstein equation

88 /95 c©2019 Peter Beerli

p(D|Θ) =∑G

p(G|Θ)p(D|G)

p(G|Θ) The probability density of a genealogy given parameters.

p(D|G)The probability density of the data for a givengenealogy. Phylogeneticists know this as the tree-likelihood.

Page 89: The Coalescent

p(D|Θ) =

∫G

p(G|Θ)p(D|G)dG

The number of possible genealogies is verylarge and for realistic data sets, programsneed to use Markov chain Monte Carlomethods.

Problem with integration formula

89/95 c©2019 Peter Beerli

Page 90: The Coalescent

Mode=0.009032.5% percentile=0.007

97.5% percentile=0.0118

Mean=0.00934Median=0.00934

Bayesian inference: Θ = 0.00903

Watterson Estimator ΘW = 0.01003

Inference of population size

90/95 c©2019 Peter Beerli

Page 91: The Coalescent

Mutation-scaled population size, revisited

91 /95 c©2019 Peter Beerli

By convention we express most results as the compound Nµ and an inheritancescalar x, for simplicity we call this the mutation-scaled population size

Θ = xNµ,

where µ is the mutation rate per generation and per site. With a mutation rateper locus we use θ.

for diploids: Θ = 4Nµ.

for haploids: Θ = 2Nµ.

For mtDNA in diploids with strictly maternal inheritance this leads to Θ =2Nfµ, and if the sex ratio is 1 : 1 then Θ = Nµ

Most real populations do not behave exactly like Wright-Fisher populations,therefore we subscriptN and call it the effective population sizeNe, and considerΘ the mutation-scaled EFFECTIVE population size.

Page 92: The Coalescent

Mutation-scaled population size

92 /95 c©2019 Peter Beerli

Gag Grouper starts outas a female and later inlive becomes male.

By convention we express most results as the compound Nµ and an inheritancescalar x, for simplicity we call this the mutation-scaled population size

Θ = xNµ,

where µ is the mutation rate per generation and per site. With a mutation rateper locus we use θ.

for diploids: Θ = 4Nµ.

for haploids: Θ = 2Nµ.

For mtDNA in diploids with strictly maternal inheritance this leads to Θ =2Nfµ, and if the sex ratio is 1 : 1 then Θ = Nµ

Most real populations do not behave exactly like Wright-Fisher populations,therefore we subscriptN and call it the effective population sizeNe, and considerΘ the mutation-scaled EFFECTIVE population size.

[ t]this is a test

Page 93: The Coalescent

Historical humpback whale population size

93/95 c©2019 Peter Beerli

Humpback whales in the North Atlantic: Census population size around 12,000.

Page 94: The Coalescent

Historical humpback whale population size

94/95 c©2019 Peter Beerli

using the data by Joe Roman and Stephen R. Palumbi (Science 2003 301: 508-510)

Θ = 2N~µ 0.01529 Population size of the North Atlanticpopulation, estimated using migrate

N~ = Θ2µ 12,251 with µ = 5.2 × 10−8bp−1year−1 and a

generation time of 12 years

Ne = N~ +N| 24503 Sex ratio is 1:1

NB = 2Ne 49,006 ratio NB/Ne assumed, using other data

NT = NBNjuveniles+Nadults

Nadults78,410 from catch and survey data (used a ratio of 1.6)

using a mutation rate of Alter and Palumbi 2009; for nucDNA: 112, 000(45, 000 − 235, 000)

(Ruegg et al. Conservation Genetics (2013) 14:103–114)

Page 95: The Coalescent

Historical humpback whale population size

95/95 c©2019 Peter Beerli

using the data and original mutation rate by Joe Roman and Stephen R. Palumbi(Science 2003 301: 508-510)

Θ = 2N~µ 0.01529 Population size of the North Atlanticpopulation, estimated using migrate

N~ = Θ2µ 31,854 with µ = 2.0 × 10−8bp−1year−1 and a

generation time of 12 years

Ne = N~ +N| 63,708 Sex ratio is 1:1

NB = 2Ne 127,417 ratio NB/Ne assumed, using other data

NT = NBNjuveniles+Nadults

Nadults203,867 from catch and survey data (used a ratio of 1.6)

More modern estimates for mtDNA: 150, 000 [improved estimation of mutation rate]; fornucDNA: 112, 000(45, 000− 235, 000) [Conservation Genetics (2013) 14:103–114]