Evolutionary Genetics: Part 3 Coalescent 2 –Effective ... · The real physical population is likely not to behave as in the Wright –Fisher model Most populations show some kind

Evolutionary Genetics: Part 3

Coalescent 2 – Effective Population size

S. peruvianum

S. chilense

Winter Semester 2012-2013

Prof Aurélien Tellier, FG Populationsgenetik

Color code:

• Red = important result or definition
• Purple = exercise to do
• Green = some bits of maths

Population genetics: 4 evolutionary forces

• Random genomic processes (mutation, duplication, recombination, gene conversion)
• Natural selection
• Random demographic process (drift)
• Random spatial process (migration)

All four forces act on molecular diversity.

Effective population size

The coalescent

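• We can calculate many aspects of a genealogical (coalescent) tree for a population of size 2N
• Time to MRCA: E[TMRCA] = 4N (1 – 1/n)
• Length of a tree: E[L] ≈ 4N log(n-1)
• Time of coalescence of last two lineages: E[T2] = 2N

[Figure: coalescent tree for a sample of n = 5; the expected durations of the intervals with 2, 3, 4 and 5 lineages are 2N, 2N/3, 2N/6 and 2N/10 generations]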
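The expectations above can be checked numerically. The following is a minimal sketch, assuming only the standard result that the interval with k lineages lasts 2N / C(k, 2) generations on average (which gives the figure labels 2N, 2N/3, 2N/6, 2N/10). The function names and the value of 2N are illustrative and not part of the lecture.

```python
from math import comb


def expected_interval(k, two_n):
    """Expected number of generations during which the sample has k lineages."""
    return two_n / comb(k, 2)


def expected_tmrca(n, two_n):
    """Sum of the interval durations from n lineages down to 2 lineages."""
    return sum(expected_interval(k, two_n) for k in range(2, n + 1))


def expected_tree_length(n, two_n):
    """An interval with k lineages contributes k branches of that duration."""
    return sum(k * expected_interval(k, two_n) for k in range(2, n + 1))


if __name__ == "__main__":
    two_n = 2000  # hypothetical population of 2N = 2000 gene copies
    n = 5         # sample size used in the figure
    print([expected_interval(k, two_n) for k in range(2, n + 1)])
    print(expected_tmrca(n, two_n), expected_tree_length(n, two_n))
```

Summing the intervals reproduces E[TMRCA] = 4N (1 – 1/n); weighting each interval by its number of lineages reproduces the exact tree length 4N (1 + 1/2 + ... + 1/(n-1)), of which the logarithm is an approximation.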

Definition

• The real physical population is likely not to behave as in the Wright–Fisher model
• Most populations show some kind of structure:
  • Geographic proximity of individuals
  • Social constraints…
• The number of descendants per individual may not follow the (approximately Poisson) distribution of the Wright–Fisher model, e.g. its variance may be > 1
• Effective population size = size of a Wright–Fisher population that would produce the same rate of genetic drift as the population of interest
• One consequence of drift: do two randomly picked offspring individuals have a common ancestor in the parent generation?

Definition

• We will use here the inbreeding effective size = Ne
• Also called identity-by-descent population size

Ne = 1 / (2 * P[T2 = 1])

where T2 = time until two lineages coalesce, given in generations.

This depends only on the immediately preceding generation!

• An extension is:

Ne(t) = E[T2] / 2

• This relates to the number of generations until an MRCA is found in the population

Definition

• For the haploid Wright–Fisher model (2N gene copies)

Ne = 1 / (2 * P[T2 = 1])

with P[T2 = 1] = 1 / (2N), so that Ne = N

• The extension is:

Ne(t) = E[T2] / 2

with E[T2] = 2N, so that Ne(t) = N

• For the Wright–Fisher model the two definitions agree

Calculating Ne

• Diploid model with different numbers of males and females
  • Nf = number of females
  • Nm = number of males
  • Nf + Nm = N
• P[T2 = 1] = (1 - 1/(2N)) * N / (8 Nf Nm)
• Ne(t) = Ne = 4 Nf Nm / (Nf + Nm)
• For example: when one male has a harem, Nf = 20 and Nm = 1
• What is Ne?
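As a way to check your answer to the exercise, here is a minimal sketch that simply evaluates Ne = 4 Nf Nm / (Nf + Nm). The function name is illustrative, not from the lecture.

```python
def effective_size_two_sexes(n_females, n_males):
    """Inbreeding effective size for a diploid population with two sexes."""
    return 4 * n_females * n_males / (n_females + n_males)


if __name__ == "__main__":
    # Harem example from the slide: 20 females, 1 male.
    print(effective_size_two_sexes(20, 1))
    # Equal sex ratio: Ne equals the census size N = Nf + Nm.
    print(effective_size_two_sexes(50, 50))
```

With an equal sex ratio the formula returns N; the more skewed the sex ratio, the closer Ne gets to four times the number of the rarer sex.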

Calculating N and Ne

• Example based on the human population: many human genes have an MRCA less than 200,000 years ago
• If one generation = 20 years, this is 200,000 / 20 = 10,000 generations
• Since E[TMRCA] ≈ 4Ne generations, we need 4Ne < 10,000
• Ne < 200,000 / (4*20) => Ne < 2,500 !
• Of course N is bigger in the human population, but Ne may be very small ☺
• We will see how to estimate Ne from sequence data later on

The coalescent – 2 role of mutations

Coalescent tree + mutations

“Coalescent theory” John Wakeley, 2009

• The distribution of mutations amongst individuals can be summarized as a tree (on a genealogy)



• How to add mutations on a coalescent tree?
• In a Wright–Fisher model: see drawing
  • An offspring changes its genotype (mutates) with probability µ
  • And P[no mutation] = 1 - µ
  • This means for example, for a two-allele model with alleles A and a: mutation goes from a to A, and vice versa
• The classical model for DNA sequences is the so-called infinite-sites model
  • Definition: each new mutation hits a new site in the genome
  • So it cannot be masked by back mutation
  • Not affected by recurrent mutation
  • Every mutation is visible except if lost by drift

Models of mutation

• There are other models of sequence evolution, but these will not be used for now.
• Infinite-alleles model
  • Definition: each mutation creates a new allele
  • Example on a tree
• Finite-sites model
  • Definition: mutations fall on a finite number of sites
  • Example on a tree



• An offspring changes its genotype (mutates) with probability µ
• And P[no mutation] = 1 - µ
• Do you see where this is going?
• After t generations, what is the probability that there was no mutation?
• P[X > t] = (1 - µ)^t ≈ e^(-µt), where X is the time (in generations) until the first mutation
• So we can again draw the time until a new mutation occurs from an exponential distribution
• And put this on a tree, drawing for each branch the time to a new mutation
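To make the exponential approximation concrete, here is a minimal sketch that compares (1 - µ)^t with e^(-µt) and then drops mutations on a single branch by drawing successive exponential waiting times. The function names, the seed, and the numerical values of µ and the branch length are illustrative assumptions, not values from the lecture.

```python
import math
import random


def prob_no_mutation_exact(mu, t):
    """Probability of no mutation in t discrete generations."""
    return (1.0 - mu) ** t


def prob_no_mutation_approx(mu, t):
    """Continuous-time (exponential) approximation of the same probability."""
    return math.exp(-mu * t)


def mutation_times_on_branch(branch_length, mu, rng):
    """Times (generations from the top of the branch) at which mutations occur.

    Waiting times between mutations are drawn from an exponential
    distribution with rate mu, until the end of the branch is passed.
    """
    times = []
    t = rng.expovariate(mu)
    while t < branch_length:
        times.append(t)
        t += rng.expovariate(mu)
    return times


if __name__ == "__main__":
    mu, t = 1e-4, 1000  # hypothetical mutation probability and time span
    print(prob_no_mutation_exact(mu, t), prob_no_mutation_approx(mu, t))
    rng = random.Random(1)
    print(mutation_times_on_branch(20000, mu, rng))
```

The number of mutations produced this way on a branch of length l is Poisson distributed with mean l µ, which is why the approximation is convenient for placing mutations on a tree.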






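• The mutation will be visible in all descendants from that branch

[Figure: tree with 4 sites. Ancestral sequence AAAA; sampled sequences AAAA, TTAA, TTTT]

[Figure: the same tree after one more mutation, now with 5 sites. Ancestral sequence AAAAA; sampled sequences AAAAG, TTAAA, TTTTA]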

Mutations on a tree

• For neutral mutations we can do this process without changing the shape of the tree or the size of the tree
  • Tree topology = shape and branching of the tree
  • Branch lengths = length of branches, usually in units of 2N generations
• BECAUSE
  • Forward in time: a neutral mutation does not change the offspring distribution of an individual
  • Backward in time: mutation does not change the probability of being picked as a parent

Tree topology


• Definitions: external branches (leading directly to a sampled sequence) and internal branches (all other branches)

Tree topology and mutation

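• We classify mutations (SNPs) by their frequency in the sample
  • Mutation a is found in two sequences = doubleton
  • Mutation b is found in one sequence = singleton

[Figure: tree of four sequences (1 to 4) with mutation a on an internal branch and mutation b on an external branch]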

Mutations on a tree

• We are now interested in the number of mutations on each branch of the tree
• For a branch of length l
  • The number of mutations follows a Poisson distribution with parameter (l µ)
• So for the total tree: Poisson (Lµ)
• Remember

$$E[L] = 4N \sum_{i=1}^{n-1} \frac{1}{i}$$

• So we define S as the total number of mutations on a tree (on a set of sequences)

$$E[S] = \mu \, E[L]$$

$$E[S] = 4N\mu \sum_{i=1}^{n-1} \frac{1}{i}$$

$$E[S] = \theta \sum_{i=1}^{n-1} \frac{1}{i}$$

With θ = 4Neµ
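The relation E[S] = θ (1 + 1/2 + ... + 1/(n-1)) is easy to evaluate for any sample size. A minimal sketch follows; the function names and the example values of θ and n are illustrative assumptions.

```python
def harmonic_sum(n):
    """Sum of 1/i for i = 1 .. n-1, the factor that scales theta to E[S]."""
    return sum(1.0 / i for i in range(1, n))


def expected_segregating_sites(theta, n):
    """Expected number of mutations on the tree of a sample of size n."""
    return theta * harmonic_sum(n)


if __name__ == "__main__":
    theta = 2.0  # hypothetical value of 4 * Ne * mu for the whole locus
    for n in (2, 5, 10, 50):
        print(n, expected_segregating_sites(theta, n))
```

Because the harmonic sum grows only logarithmically with n, adding more sequences to a sample reveals new segregating sites more and more slowly.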

The population mutation rate

• This is the crucial parameter: it combines mutation and Ne
• θ is called the population mutation rate or scaled mutation rate
• We can estimate θ based on sequence data
• Two estimators have been derived:
  • θπ derived by Tajima (1983)
  • θS (or θW) derived by Watterson (1975)

θ = 4Neµ

Watterson estimator

• θS = θW is based on the number of segregating sites S in a tree, compared to the average branch length for a sample of size n
• Defined as

$$\theta_S = \frac{S}{\sum_{i=1}^{n-1} \frac{1}{i}}$$

• This is the expected average number of segregating sites per given length of tree branch
• Remember:

$$E[L] = 4N \sum_{i=1}^{n-1} \frac{1}{i}$$

Tajima estimator

• θπ is defined as the average number of differences over all pairs of sequences in a sample
• Based on πij, which is the number of differences between two sequences i and j
• Defined as (each pair of sequences counted once)

$$\theta_\pi = \frac{1}{n(n-1)/2} \sum_{i<j} \pi_{ij} = \frac{2}{n(n-1)} \sum_{i<j} \pi_{ij}$$

• Because there are n(n-1)/2 pairs of sequences
• So take all sequences, count the number of differences for all pairs,
• And then take the average

Tajima estimator

• Based on πij, which is the number of differences between two sequences i and j
• Different mutations count differently
  • Mutation a is counted in four pairwise comparisons
  • Mutation b is counted in three comparisons
• πij and thus θπ depend on how many mutations fall on internal or external branches

[Figure: tree of four sequences (1 to 4) with mutation a (doubleton) and mutation b (singleton)]
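The counts "four" and "three" follow from a simple rule: a mutation carried by k of the n sequences shows up in every comparison between a carrier and a non-carrier, i.e. in k (n - k) pairs. A minimal sketch, with an illustrative function name:

```python
def pairs_showing_mutation(k, n):
    """Pairwise comparisons in which a mutation carried by k of n sequences appears."""
    return k * (n - k)


if __name__ == "__main__":
    n = 4
    print(pairs_showing_mutation(2, n))  # doubleton (mutation a)
    print(pairs_showing_mutation(1, n))  # singleton (mutation b)
```

Mutations at intermediate frequency therefore weigh more in θπ than rare ones, whereas every segregating site counts exactly once in θS.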


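• Example of calculation

[Figure: tree with 4 sites. Ancestral sequence AAAA; sampled sequences ATAA, TAAT, TATA]

$$\theta_S = \frac{S}{\sum_{i=1}^{n-1} \frac{1}{i}} = \frac{4}{1 + \frac{1}{2}} = \frac{8}{3}$$

$$\theta_\pi = \frac{2}{n(n-1)} \sum_{i<j} \pi_{ij} = \frac{2 + 3 + 3}{3} = \frac{8}{3}$$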
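The worked example can be reproduced directly from the three sampled sequences. The sketch below is one possible way to do it, using exact fractions so that the result can be compared with 8/3; the function names are illustrative and this is not how DnaSP computes the statistics.

```python
from fractions import Fraction
from itertools import combinations


def segregating_sites(seqs):
    """Number of alignment columns in which at least two sequences differ."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def watterson_theta(seqs):
    """theta_S = S divided by the sum of 1/i for i = 1 .. n-1."""
    n = len(seqs)
    a_n = sum(Fraction(1, i) for i in range(1, n))
    return Fraction(segregating_sites(seqs)) / a_n


def pairwise_differences(a, b):
    """Number of sites at which sequences a and b differ (pi_ij)."""
    return sum(1 for x, y in zip(a, b) if x != y)


def tajima_theta(seqs):
    """theta_pi = average of pi_ij over all n(n-1)/2 pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    total = sum(pairwise_differences(a, b) for a, b in pairs)
    return Fraction(total, len(pairs))


if __name__ == "__main__":
    sample = ["ATAA", "TAAT", "TATA"]  # sequences from the slide example
    print(segregating_sites(sample))
    print(watterson_theta(sample), tajima_theta(sample))
```

Both estimators are computed per locus here; dividing by the number of sites gives per-site values.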

Watterson estimator

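• θS = θW is based on the number of segregating sites S in a tree, compared to the average branch length for a sample of size n
• Defined as

$$\theta_S = \frac{S}{\sum_{i=1}^{n-1} \frac{1}{i}}$$

• Remember:

$$E[L] = 4N \sum_{i=1}^{n-1} \frac{1}{i}$$

[Figure: coalescent tree for n = 5; each interval contributes (number of lineages) × (expected duration in units of 2N generations): 5 × 1/10, 4 × 1/6, 3 × 1/3, 2 × 1]

$$E[L] = 2N\left(2 + 1 + \frac{2}{3} + \frac{1}{2}\right) = 2N \cdot 2\left(1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4}\right) = 4N \sum_{i=1}^{n-1} \frac{1}{i}$$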

Neutral model of coalescent

• Very important result:

θS = θπ (in expectation, both estimate θ)

• If the population follows a neutral coalescent model with constant population size!

θ = 4Neµ

Estimating Ne

• It is possible to estimate Ne based on the two estimators
• IF and only IF you have independent data on the mutation rate

Ne = θπ / (4µ) = θS / (4µ)

• This assumes:
  • Infinite-sites model
  • Constant Ne over time
  • Homogeneous population (equal coalescent probability for all pairs)
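As a minimal sketch of the calculation Ne = θ / (4µ): θ and µ must be expressed in the same units (both per site, or both per locus). The function name and the θ value below are hypothetical and chosen only for illustration; the mutation rate is the human value quoted in the exercise that follows.

```python
def estimate_ne(theta, mu):
    """Effective population size from theta = 4 * Ne * mu.

    theta and mu must use the same unit (per site or per locus).
    """
    return theta / (4.0 * mu)


if __name__ == "__main__":
    theta_per_site = 0.001  # hypothetical estimate of theta per base
    mu_per_site = 1.2e-8    # mutation rate per base per generation
    print(estimate_ne(theta_per_site, mu_per_site))
```

If θπ and θS give clearly different values of Ne, at least one of the assumptions listed above is probably violated.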

Estimating Ne

• Exercise: calculate θπ, θS and estimate Ne
• For two datasets:
  • In human populations: TNFSF-5-Humans.fas
  • In Drosophila populations: 055-Droso.nex
• Define populations in DnaSP using: data => define sequence sets
• Then => Polymorphism analysis
• For Drosophila: Europe and Africa
• Mutation rate in humans = 1.2 * 10^-8 per base per generation (Scally and Durbin, Nat Rev Genetics, October 2012)
• Mutation rate in Drosophila = 10^-8 per base per generation
• What are the differences?

Heterozygosity

Heterozygosity

• Definition: heterozygosity H is the probability that two alleles taken at random from a population are different at a random site or locus.
• It is a key measure of diversity in populations
• If H0 is the heterozygosity at generation 0, then at generation 1, assuming no new mutations:

$$H_1 = \frac{1}{2N_e} \cdot 0 + \left(1 - \frac{1}{2N_e}\right) H_0$$

• First term: with probability 1/(2Ne) the two alleles have the same parent at generation 0, and then the probability of being different is 0
• Second term: with probability 1 - 1/(2Ne) the two alleles have different parents, and these parents have probability H0 (by definition) of being different

Heterozygosity

• By iteration we get at generation t

$$H_t = \left(1 - \frac{1}{2N_e}\right)^t H_0$$

• This means that in the absence of mutation, heterozygosity is lost at a rate of 1/(2Ne) every generation
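To get a feeling for how fast drift alone erodes diversity, the decay H_t = (1 - 1/(2Ne))^t H0 can be evaluated directly. This is a minimal sketch; the function names and the values of H0 and Ne are illustrative assumptions.

```python
import math


def heterozygosity_after(h0, ne, t):
    """Heterozygosity after t generations of drift without mutation."""
    return h0 * (1.0 - 1.0 / (2.0 * ne)) ** t


def half_life(ne):
    """Number of generations after which heterozygosity has halved."""
    return math.log(0.5) / math.log(1.0 - 1.0 / (2.0 * ne))


if __name__ == "__main__":
    h0, ne = 0.5, 100  # hypothetical starting heterozygosity and Ne
    for t in (0, 1, 10, 100, 1000):
        print(t, heterozygosity_after(h0, ne, t))
    print(half_life(ne))
```

The half-life is close to 2 Ne ln 2, roughly 1.39 Ne generations, so small populations lose diversity much faster than large ones.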

Heterozygosity + mutation

• With the infinite-alleles model assumption that every new mutation creates a new allele:
• Two opposing mechanisms drive the evolution of diversity in a population: genetic drift and mutation
• If they have the same strength and balance each other = mutation-drift balance
• The change in heterozygosity between two generations is:

$$\Delta H = H_{t+1} - H_t = -\frac{1}{2N_e} H_t + 2\mu \left(1 - H_t\right)$$

Heterozygosity + mutation

• At equilibrium the value of heterozygosity is Ĥ:

$$\Delta H = H_{t+1} - H_t = -\frac{1}{2N_e} H_t + 2\mu \left(1 - H_t\right)$$

• First term: change of heterozygosity due to random drift (always negative)
• Second term: change of heterozygosity due to new mutations (always positive)

$$\Delta H = 0 \Rightarrow \hat{H} = \frac{4N_e\mu}{1 + 4N_e\mu}$$

Ĥ = θ / (1 + θ)

• The value at equilibrium increases with increasing µ and Ne. WHY?
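The equilibrium can also be reached numerically by iterating the recurrence ΔH = -H/(2Ne) + 2µ(1 - H) from any starting value. The sketch below does this and compares the result with Ĥ = θ / (1 + θ); the function names and parameter values are illustrative assumptions.

```python
def delta_h(h, ne, mu):
    """Change in heterozygosity in one generation: drift loss plus mutation gain."""
    return -h / (2.0 * ne) + 2.0 * mu * (1.0 - h)


def equilibrium_heterozygosity(ne, mu):
    """Closed-form equilibrium theta / (1 + theta) with theta = 4 * Ne * mu."""
    theta = 4.0 * ne * mu
    return theta / (1.0 + theta)


def iterate_heterozygosity(h0, ne, mu, generations):
    """Apply the recurrence generation by generation, starting from h0."""
    h = h0
    for _ in range(generations):
        h += delta_h(h, ne, mu)
    return h


if __name__ == "__main__":
    ne, mu = 1000, 1e-4  # hypothetical values giving theta = 0.4
    print(equilibrium_heterozygosity(ne, mu))
    print(iterate_heterozygosity(0.0, ne, mu, 20000))
    print(iterate_heterozygosity(1.0, ne, mu, 20000))
```

Starting from H = 0 or from H = 1 leads to the same value, which illustrates that the balance point depends only on the product Ne µ.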

Mutation – Drift balance

• In the case of such a model, we are interested in:
  • What is the probability that a new mutation gets fixed?
  • How long does it take to get fixed?
• Using a coalescent argument: a new mutation gets fixed if and only if the mutant gene copy is the ancestor of the whole future population; this probability = 1/(2N)
• The expected time to fixation is equal to the expected time to the MRCA, so it is = 4N generations
• What do we expect for selected loci?
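The fixation probability 1/(2N) can be checked with a small forward simulation. The sketch below follows one new neutral mutation in a Wright–Fisher population of 2N gene copies until it is lost or fixed, and repeats this many times. The function names, the seed, the population size and the number of replicates are illustrative assumptions; the result is a stochastic estimate, not an exact value.

```python
import random


def mutant_gets_fixed(two_n, rng):
    """Follow one new neutral mutation (1 copy among two_n) until loss or fixation."""
    count = 1
    while 0 < count < two_n:
        freq = count / two_n
        # Each gene copy of the next generation picks its parent at random,
        # so it descends from a mutant parent with probability freq.
        count = sum(1 for _ in range(two_n) if rng.random() < freq)
    return count == two_n


def fixation_probability(two_n, replicates, seed=1):
    """Fraction of independent replicates in which the mutation reaches fixation."""
    rng = random.Random(seed)
    fixed = sum(mutant_gets_fixed(two_n, rng) for _ in range(replicates))
    return fixed / replicates


if __name__ == "__main__":
    two_n = 20  # hypothetical small population so the simulation runs quickly
    print(fixation_probability(two_n, 10000), 1 / two_n)
```

Most replicates lose the mutation within a few generations; only about one in 2N reaches fixation, in line with the coalescent argument.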

Mutation – Drift balance

• Substitution rate = rate at which mutations get fixed in a population/species
• It is called k
• A new mutation starts with frequency 1/(2N) in a population
• The substitution rate is obtained by multiplying the number of new mutations per generation in a population = 2Nµ
• by the probability that one mutation gets fixed = 1/(2N)
• So k = 2Nµ * (1/(2N)) = µ (Kimura)
• Most striking result: k does not depend on the effective population size
