Evolutionary Genetics: Part 3
Coalescent 2 – Effective Population size
S. peruvianum
S. chilense
Winter Semester 2012-2013
Prof Aurélien Tellier, FG Populationsgenetik
Color code
Red = Important result or definition
Purple = Exercise to do
Green = Some bits of maths
Population genetics: 4 evolutionary forces
• Random genomic processes (mutation, duplication, recombination, gene conversion)
• Natural selection
• Random demographic process (drift)
• Random spatial process (migration)
(Diagram: the four forces act on molecular diversity.)
Effective population size
The coalescent
• We can calculate many aspects of a genealogical (coalescent) tree for a population of size 2N
• Time to MRCA: E[TMRCA] = 4N (1 – 1/n)
• Length of a tree: E[L] ≈ 4N log(n-1)
• Time of coalescence of last two lineages: E[T2] = 2N
(Figure: expected time spent with 2, 3, 4 and 5 lineages = 2N, 2N/3, 2N/6 and 2N/10 generations.)
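The expectations above can be checked numerically. The following is a minimal sketch, assuming the standard result that the time spent with k lineages has expectation 2N divided by the number of pairs k(k-1)/2; the function names are illustrative and not part of the lecture.

```python
from math import comb

def expected_coalescence_times(n, N):
    """Expected time (in generations) spent with k lineages, k = 2..n,
    in a population of 2N gene copies: E[T_k] = 2N / C(k, 2)."""
    return {k: 2 * N / comb(k, 2) for k in range(2, n + 1)}

def expected_tmrca(n, N):
    """Sum of all intervals; equals 4N (1 - 1/n)."""
    return sum(expected_coalescence_times(n, N).values())

def expected_tree_length(n, N):
    """Each interval with k lineages contributes k branches of length T_k."""
    return sum(k * t for k, t in expected_coalescence_times(n, N).items())

times = expected_coalescence_times(5, 1000)
```

With n = 5 the intervals are 2N, 2N/3, 2N/6 and 2N/10, matching the values shown on the tree.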
Definition
• A real physical population is unlikely to behave as in the Wright-Fisher model
• Most populations show some kind of structure:
  • Geographic proximity of individuals,
  • Social constraints…
• The variance in the number of descendants may be > 1, the value expected under the Poisson distribution
• Effective population size = size of a Wright-Fisher population that would produce the same rate of genetic drift as the population of interest
• One consequence of drift: do two randomly picked offspring individuals have a common ancestor in the parent generation?
Definition
• We will use here the inbreeding effective size = Ne
• Also called identity-by-descent population size
Ne = 1 / (2 * P[T2 = 1])
where T2 = time until two lineages coalesce, given in generations
This depends only on the immediately preceding generation!
• An extension is:
Ne(t) = E[T2] / 2
• This relates to the number of generations until an MRCA is found in the population
Definition
• For the haploid Wright-Fisher model (2N gene copies)
Ne = 1 / (2 * P[T2 = 1])
With P[T2 = 1] = 1 / (2N)
So that Ne = N
• The extension is:
Ne(t) = E[T2] / 2
With E[T2] = 2N
So that Ne(t) = N
• For the Wright-Fisher model the two definitions agree
Calculating Ne
• Diploid model with different numbers of males and females
  • Nf = number of females
  • Nm = number of males
  • Nf + Nm = N
• P[T2 = 1] = (1 - 1/(2N)) * N / (8 Nf Nm)
• For large N the first factor is close to 1, so Ne(t) = Ne ≈ 4 Nf Nm / (Nf + Nm)
• For example: when one male has a harem, Nf = 20 and Nm = 1
• What is Ne?
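To explore how the sex ratio affects Ne, the formula can be wrapped in a small function. This is only a sketch of the approximation Ne = 4 Nf Nm / (Nf + Nm); the function name is hypothetical.

```python
def sex_ratio_ne(n_females, n_males):
    """Effective size for a diploid population with unequal sex ratio,
    using the large-N approximation Ne = 4 * Nf * Nm / (Nf + Nm)."""
    if n_females <= 0 or n_males <= 0:
        raise ValueError("both sexes must be present")
    return 4 * n_females * n_males / (n_females + n_males)

harem = sex_ratio_ne(20, 1)      # the harem example
balanced = sex_ratio_ne(50, 50)  # equal sex ratio gives Ne = N
```

Comparing the two calls shows that Ne is dominated by the rarer sex: it can never exceed four times the number of individuals of that sex.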
Calculating N and Ne
• Example based on the human population: many human genes have an MRCA less than 200,000 years ago
• If one generation = 20 years, this is less than 10,000 generations
• Since E[TMRCA] ≈ 4Ne generations, we need 4Ne < 10,000
• Ne < 200,000 / (4*20) => Ne < 2,500 !
• Of course N is much bigger in the human population, but Ne may be very small ☺
• We will see how to estimate Ne from sequence data later on
The coalescent – 2 role of mutations
Coalescent tree + mutations
“Coalescent theory” John Wakeley, 2009
• The distribution of mutations amongst individuals can be summarized as a tree (on a genealogy)
Coalescent tree + mutations
• How to add mutations on a coalescent tree?
• In a Wright-Fisher model: see drawing
• µ = probability that an offspring changes its genotype by mutation
• And P[no mutation] = 1 - µ
• For example, in a two-allele model with alleles A and a: mutation from a to A, and vice versa
• The classical model for DNA sequences is the so-called infinite site model
  • Definition: each new mutation hits a new site in the genome
  • So it cannot be masked by back mutation
  • Not affected by recurrent mutation
  • Every mutation is visible except if lost by drift
Models of mutation
• There are other models of sequence evolution, but these will not be used for now.
• Infinite allele model
  • Definition: each mutation creates a new allele
  • Example on a tree
• Finite site model
  • Definition: mutations fall on a finite number of sites
  • Example on a tree
Coalescent tree + mutations
• How to add mutations on a coalescent tree?
• µ = probability that an offspring changes its genotype by mutation
• And P[no mutation] = 1 - µ
• Do you see where this is going?
• After t generations, what is the probability that there has been no mutation?
• P[X > t] = (1 - µ)^t ≈ e^(-µt)
• So we can again draw the time until a new mutation occurs from an exponential distribution
• And put this on a tree, drawing for each branch the time to a new mutation
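The exponential approximation and the placement of mutations along one branch can be sketched as follows. This is an illustrative sketch only: the branch length is measured in generations, µ is per generation, and the function names are made up for this example.

```python
import math
import random

def prob_no_mutation_exact(mu, t):
    """Probability of no mutation in t generations: (1 - mu)^t."""
    return (1.0 - mu) ** t

def prob_no_mutation_approx(mu, t):
    """Exponential approximation exp(-mu * t), good when mu is small."""
    return math.exp(-mu * t)

def mutation_times_on_branch(branch_length, mu, rng):
    """Draw successive exponential waiting times (rate mu) and keep the
    mutations that fall before the end of the branch."""
    times = []
    t = rng.expovariate(mu)
    while t < branch_length:
        times.append(t)
        t += rng.expovariate(mu)
    return times

rng = random.Random(1)  # fixed seed so the draw is reproducible
times = mutation_times_on_branch(10_000, 1e-3, rng)
```

The number of mutations returned for a branch of length l then follows a Poisson distribution with mean lµ, which is why the same draw can be repeated independently on every branch.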
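Coalescent tree + mutations
• How to add mutations on a coalescent tree?
• The mutation will be visible in all descendants from that branch
(Figure, 4 sites: ancestral sequence AAAA; sampled sequences AAAA, TTAA, TTTT.)
(Figure, 5 sites after one more mutation: ancestral sequence AAAAA; sampled sequences AAAAG, TTAAA, TTTTA.)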
Mutations on a tree
• For neutral mutations we can do this process without changing the shape or the size of the tree
  • Tree topology = shape and branching of the tree
  • Branch lengths = length of branches, usually in units of 2N generations
• BECAUSE
  • Forward in time: a neutral mutation does not change the offspring distribution of an individual
  • Backward in time: a neutral mutation does not change the probability of being picked as a parent
Tree topology
• Tree topology = shape and branching of the tree
• Branch lengths = length of branches, usually in units of 2N generations
• Definitions: external branches (leading to a single sampled sequence) and internal branches (ancestral to two or more sampled sequences)
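Tree topology and mutation
• We classify mutations (SNPs) by their frequency in the sample
• Mutation a is found in two sequences = doubleton
• Mutation b is found in one sequence = singleton
(Figure: tree of four sequences 1 to 4, with mutation a on an internal branch and mutation b on an external branch.)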
Mutations on a tree
• We are now interested in the number of mutations on each branch of the tree
• For a branch of length l
• The number of mutations follows a Poisson distribution with parameter lµ
• So for the total tree: Poisson(Lµ)
• Remember:
$$E[L] = 4N \sum_{i=1}^{n-1} \frac{1}{i}$$
• So we define S as the total number of mutations on a tree (on a set of sequences):
$$E[S] = \mu\, E[L] = 4N\mu \sum_{i=1}^{n-1} \frac{1}{i} = \theta \sum_{i=1}^{n-1} \frac{1}{i}$$
With θ = 4Neµ
The population mutation rate
• This is the crucial parameter: it combines mutation and Ne
• θ is called the population mutation rate or scaled mutation rate
• We can estimate θ based on sequence data
• Two estimators have been derived:
  • θπ derived by Tajima (1983)
  • θS (or θW) derived by Watterson (1975)
θ = 4Neµ
Watterson estimator
• θS = θW is based on the number of segregating sites S in a sample of size n, compared to the expected total branch length of its tree
• Defined as:
$$\theta_S = \frac{S}{\sum_{i=1}^{n-1} \frac{1}{i}}$$
• Remember:
$$E[L] = 4N \sum_{i=1}^{n-1} \frac{1}{i}$$
• This is the number of segregating sites per unit of expected tree length (in units of 4N generations)
Tajima estimator
• θπ is defined as the average number of differences over all pairs of sequences in a sample
• Based on πij, which is the number of differences between two sequences i and j
• Defined as
$$\theta_\pi = \frac{1}{n(n-1)/2} \sum_{i<j} \pi_{ij} = \frac{2}{n(n-1)} \sum_{i<j} \pi_{ij}$$
• Because there are n(n-1)/2 pairs of sequences
• So take all sequences, and count the number of differences for every pair,
• And then take the average
Tajima estimator
• Based on πij, which is the number of differences between two sequences i and j
• Different mutations count differently
  • Mutation a is counted in four pairwise comparisons
  • Mutation b is counted in three comparisons
• πij and thus θπ depend on how many mutations fall on internal or external branches
(Figure: tree of four sequences 1 to 4, with mutation a on an internal branch and mutation b on an external branch.)
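Coalescent tree + mutations
• Example of calculation
(Figure, 4 sites: ancestral sequence AAAA; sampled sequences ATAA, TAAT, TATA; so n = 3 and S = 4.)
$$\theta_S = \frac{S}{\sum_{i=1}^{n-1} \frac{1}{i}} = \frac{4}{1 + \frac{1}{2}} = \frac{8}{3}$$
$$\theta_\pi = \frac{2}{n(n-1)} \sum_{i<j} \pi_{ij} = \frac{3 + 3 + 2}{3} = \frac{8}{3}$$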
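The worked example can be reproduced with a few lines of code. This is a minimal sketch for aligned sequences of equal length under the infinite site assumption; the function names are illustrative, and this is not how DnaSP computes these statistics.

```python
from itertools import combinations

def segregating_sites(seqs):
    """Count alignment columns that carry more than one nucleotide."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)

def watterson_theta(seqs):
    """theta_S = S / sum_{i=1}^{n-1} 1/i."""
    n = len(seqs)
    a_n = sum(1 / i for i in range(1, n))
    return segregating_sites(seqs) / a_n

def pairwise_differences(a, b):
    """pi_ij: number of sites at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def tajima_theta(seqs):
    """theta_pi = average of pi_ij over all n(n-1)/2 unordered pairs."""
    pairs = list(combinations(seqs, 2))
    return sum(pairwise_differences(a, b) for a, b in pairs) / len(pairs)

sample = ["ATAA", "TAAT", "TATA"]
theta_s = watterson_theta(sample)
theta_pi = tajima_theta(sample)
```

For this sample both estimators give 8/3. That they coincide exactly here is a property of this small example; in real data they usually differ somewhat.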
Watterson estimator
• Example for n = 5: the expected total tree length adds up the time spent with 5, 4, 3 and 2 lineages
(Figure: contributions of each interval, in units of 2N generations: 5 * 1/10, 4 * 1/6, 3 * 1/3 and 2 * 1.)
$$E[L] = 2N\left(2 + 1 + \frac{2}{3} + \frac{1}{2}\right) = 4N\left(1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4}\right) = 4N \sum_{i=1}^{4} \frac{1}{i}$$
Neutral model of coalescent
• Very important result:
θS = θπ (in expectation: both estimate θ)
• If the population follows a neutral coalescent model with constant population size!
θ = 4Neµ
Estimating Ne
• It is possible to estimate Ne based on the two estimators
• IF and only IF you have independent data on the mutation rate
Ne = θπ / (4µ) = θS / (4µ)
• This assumes:
  • Infinite site model
  • Constant Ne over time
  • Homogeneous population (equal coalescent probability for all pairs)
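As a sketch of the last step, the estimate of Ne is a single division once θ and µ are on the same scale (both per site, or both per locus). The θ value below is a made-up illustration, not a result from any dataset, and the function name is hypothetical.

```python
def estimate_ne(theta, mu):
    """Ne = theta / (4 * mu); theta and mu must share the same scale
    (for example both per site per generation)."""
    if mu <= 0:
        raise ValueError("mutation rate must be positive")
    return theta / (4 * mu)

# Hypothetical per-site theta of 0.001 with a per-site rate of 1.2e-8
ne_example = estimate_ne(0.001, 1.2e-8)
```

If θ was computed for a whole locus, divide it by the locus length in bases before using a per-base mutation rate.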
Estimating Ne
• Exercise: calculate θπ and θS, and estimate Ne
• For two datasets:
  • In human populations: TNFSF-5-Humans.fas
  • In Drosophila populations: 055-Droso.nex
• Define populations in DnaSP using: data => define sequence sets
• Then => Polymorphism analysis
• For Drosophila: Europe and Africa
• Mutation rate in humans = 1.2 * 10^-8 per base per generation (Scally and Durbin, Nat Rev Genetics, October 2012)
• Mutation rate in Drosophila = 10^-8 per base per generation
• What are the differences?
Heterozygosity
Heterozygosity
• Definition: Heterozygosity H is the probability that two alleles taken at random from a population are different at a random site or locus.
• It is a key measure of diversity in populations
• If H0 is the heterozygosity at generation 0, then at generation 1:
• Assuming no new mutations
$$H_1 = \frac{1}{2N_e} \cdot 0 + \left(1 - \frac{1}{2N_e}\right) H_0$$
First term: with probability 1/(2Ne) the two alleles have the same parent at generation 0, and then they differ with probability 0
Second term: with probability 1 - 1/(2Ne) the two alleles have different parents, and these parents have probability H0 (by definition) to be different
Heterozygosity
• By iteration we get at generation t
$$H_t = \left(1 - \frac{1}{2N_e}\right)^t H_0$$
• This means that in the absence of mutation, heterozygosity is lost at a rate of 1/(2Ne) every generation
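One way to get a feeling for this rate is to ask how many generations it takes for heterozygosity to halve. The sketch below solves (1 - 1/(2Ne))^t = 1/2 for t; the function names are illustrative.

```python
import math

def heterozygosity(h0, ne, t):
    """H_t = H_0 * (1 - 1/(2 Ne))^t, with no new mutations."""
    return h0 * (1 - 1 / (2 * ne)) ** t

def half_life(ne):
    """Generations until H_t = H_0 / 2; close to 2 * Ne * ln(2)."""
    return math.log(0.5) / math.log(1 - 1 / (2 * ne))

t_half = half_life(1000)
```

The half-life is roughly 1.39 Ne generations, so drift alone erodes diversity quickly in small populations and very slowly in large ones.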
Heterozygosity + mutation
• With the infinite allele model assumption that every new mutation creates a new allele:
• Two opposing mechanisms drive the evolution of diversity in a population: genetic drift and mutation
• If they have the same strength and balance each other = mutation-drift balance
• The change in heterozygosity between two generations is:
$$\Delta H = H_{t+1} - H_t = -\frac{1}{2N_e} H_t + 2\mu \left(1 - H_t\right)$$
Heterozygosity + mutation
• At equilibrium the value of heterozygosity is Ĥ:
$$\Delta H = H_{t+1} - H_t = -\frac{1}{2N_e} H_t + 2\mu \left(1 - H_t\right)$$
First term: change of heterozygosity due to random drift (always negative)
Second term: change of heterozygosity due to new mutations (always positive)
$$\Delta H = 0 \Rightarrow \hat{H} = \frac{4N_e\mu}{1 + 4N_e\mu}$$
Ĥ = θ / (1 + θ)
• The value at equilibrium increases with increasing µ and Ne. WHY?
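The approach to mutation-drift balance can be illustrated by iterating the recursion for ΔH directly. This is a sketch with arbitrary parameter values (Ne = 1000, µ = 1e-4, chosen only so that convergence is fast); the function names are not from the lecture.

```python
def next_heterozygosity(h, ne, mu):
    """One generation: H_{t+1} = H_t - H_t / (2 Ne) + 2 mu (1 - H_t)."""
    return h - h / (2 * ne) + 2 * mu * (1 - h)

def equilibrium_heterozygosity(ne, mu):
    """H_hat = theta / (1 + theta) with theta = 4 Ne mu."""
    theta = 4 * ne * mu
    return theta / (1 + theta)

def iterate(h0, ne, mu, generations):
    """Apply the recursion repeatedly, starting from H_0 = h0."""
    h = h0
    for _ in range(generations):
        h = next_heterozygosity(h, ne, mu)
    return h

h_hat = equilibrium_heterozygosity(1000, 1e-4)
from_zero = iterate(0.0, 1000, 1e-4, 50_000)  # start with no diversity
from_one = iterate(1.0, 1000, 1e-4, 50_000)   # start with maximal diversity
```

Both starting points converge to the same Ĥ, which shows that the equilibrium depends only on θ and not on the initial diversity.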
Mutation – Drift balance
• In such a model, we are interested in:
  • What is the probability that a new mutation gets fixed?
  • How long does it take to get fixed?
• Using a coalescent argument: a mutation gets fixed if and only if the mutant copy is the ancestor of the whole future population; this probability = 1/(2N)
• The expected time to fixation is equal to the expected time to the MRCA, so it is = 4N generations
• What do we expect for selected loci?
Mutation – Drift balance
• Substitution rate = rate at which mutations get fixed in a population/species
• It is called k
• A new mutation starts with frequency 1/(2N) in a population
• The substitution rate is obtained by multiplying the number of new mutations per generation in a population = 2Nµ
• by the probability that one mutation gets fixed = 1/(2N)
• So k = 2Nµ * (1/(2N)) = µ (Kimura)
• Most striking result: k does not depend on the effective population size
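The cancellation of N can be made explicit with exact arithmetic. The sketch below uses rational numbers so that no rounding hides the result; the population sizes and the function name are arbitrary illustrations.

```python
from fractions import Fraction

def substitution_rate(n, mu):
    """k = (new mutations per generation) * (fixation probability)
         = 2 N mu * 1 / (2 N)."""
    new_mutations = 2 * n * mu
    fixation_probability = Fraction(1, 2 * n)
    return new_mutations * fixation_probability

mu = Fraction(1, 10**8)  # neutral mutation rate per generation
rates = [substitution_rate(n, mu) for n in (10, 1000, 10**6)]
```

Every entry of `rates` equals µ: a large population produces more new mutations, but each one is proportionally less likely to be fixed.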