Evolutionary Genetics: Part 5
Inbreeding – Structured populations
S. peruvianum
S. chilense
Winter Semester 2012-2013
Prof. Aurélien Tellier, FG Populationsgenetik
Color code:
Red = Important result or definition
Purple: exercise to do
Green: some bits of maths
Population genetics: 4 evolutionary forces
• random genomic processes (mutation, duplication, recombination, gene conversion)
• natural selection
• random demographic process (drift)
• random spatial process (migration)
Together, these forces shape molecular diversity
Deviations from panmixia
• So far, the standard neutral model has assumed a Wright-Fisher population
• All individuals have the same probability of picking any given parent, and all parents have the same probability of being picked
• In reality this is not true:
  • Inbreeding (including selfing) = mating between closely related individuals
  • Population structure = a species consists of several populations, and panmixia occurs only within populations
Hardy-Weinberg equilibrium
• The first model of panmixia = Hardy-Weinberg equilibrium
• Formulated independently by G. H. Hardy and W. Weinberg in 1908
• The assumptions of the model are:
  • organisms are diploid
  • only sexual reproduction occurs
  • generations are non-overlapping
  • mating is random (panmixia)
  • population size is large
  • allele frequencies are equal in the sexes
  • there is no migration, mutation or selection
• With two alleles at a locus, A and a, the allele frequencies are:
  • A has frequency p
  • a has frequency q = 1 - p

                       Homozygote   Heterozygote   Homozygote
Parent genotypes       AA           Aa             aa
Gametes                A and a
Offspring genotypes    AA           Aa             aa
Frequency              $p^2$        $2pq$          $q^2$
• Demonstration of the equilibrium:
  • A has frequency p, a has frequency q = 1 - p
  • In the parents:
    • AA in frequency P
    • Aa in frequency H
    • aa in frequency Q
$$p = P + \tfrac{1}{2}H \quad \text{and} \quad q = Q + \tfrac{1}{2}H$$
1) Frequency of the matings (rows: male genotypes, columns: female genotypes)

            AA (P)    Aa (H)    aa (Q)
AA (P)      $P^2$     $PH$      $PQ$
Aa (H)      $PH$      $H^2$     $HQ$
aa (Q)      $PQ$      $HQ$      $Q^2$
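2) Then, for each mating, what are the progeny?

Mating      Freq. of mating   Progeny AA     Progeny Aa     Progeny aa
AA × AA     $P^2$             $P^2$          0              0
AA × Aa     $2PH$             $PH$           $PH$           0
AA × aa     $2PQ$             0              $2PQ$          0
Aa × Aa     $H^2$             $(1/4)H^2$     $(1/2)H^2$     $(1/4)H^2$
Aa × aa     $2HQ$             0              $HQ$           $HQ$
aa × aa     $Q^2$             0              0              $Q^2$

Summing each column gives the progeny genotype frequencies:
$$AA: \left(P + \tfrac{1}{2}H\right)^2 = p^2, \qquad Aa: 2\left(P + \tfrac{1}{2}H\right)\left(Q + \tfrac{1}{2}H\right) = 2pq, \qquad aa: \left(Q + \tfrac{1}{2}H\right)^2 = q^2$$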
• Numerical example, starting from parental genotype frequencies that are not at equilibrium:
  Genotypes:   AA         Aa         aa
  Parents:     P = 0.2    H = 0.4    Q = 0.4
  So that p = P + H/2 = 0.4 and q = Q + H/2 = 0.6
• In the next generation:
  • $p^2 = (0.4)^2 = 0.16$
  • $2pq = 2 \times 0.4 \times 0.6 = 0.48$
  • $q^2 = (0.6)^2 = 0.36$
• These are the equilibrium frequencies
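The mating-table argument and the numerical example can be checked with a few lines of Python. This is only a sketch: the function name `next_generation` is made up for illustration, and it sums the progeny columns of the mating table under the slide's assumptions (random mating, equal frequencies in both sexes).

```python
def next_generation(P, H, Q):
    """Genotype frequencies (AA, Aa, aa) after one generation of random mating.

    Each term is (frequency of a mating) x (fraction of its progeny with
    that genotype), summed over the six mating types of the table.
    """
    AA = P * P + P * H + 0.25 * H * H
    Aa = P * H + 2 * P * Q + 0.5 * H * H + H * Q
    aa = 0.25 * H * H + H * Q + Q * Q
    return AA, Aa, aa


# Parents not at equilibrium: P = 0.2, H = 0.4, Q = 0.4 (p = 0.4, q = 0.6)
gen1 = next_generation(0.2, 0.4, 0.4)
gen2 = next_generation(*gen1)
print([round(x, 4) for x in gen1])  # [0.16, 0.48, 0.36]
print([round(x, 4) for x in gen2])  # unchanged: equilibrium after one generation
```

Applying the function a second time leaves the frequencies unchanged, which is the "equilibrium in one generation" property.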
• Numerical example: when the frequencies are unknown
Example: 50 individuals (n = 50, 2n = 100 alleles)
p = (2 × 15 + 25) / 100 = 0.55
q = (2 × 10 + 25) / 100 = 0.45

                     AA                 Aa                      aa
Observed number      15                 25                      10
Expected formula     $np^2$             $2pqn$                  $nq^2$
                     50 × (0.55)^2      2(0.55)(0.45) × 50      50 × (0.45)^2
Expected number      15.125             24.75                   10.125
• HW equilibrium means that if genotype frequencies are perturbed, they go back to the equilibrium values within ONE generation of random mating
• Drift, for example, affects the allele frequencies if the population is small
• It is a useful neutral model: it predicts the genotype frequencies expected if a population follows the HW equilibrium
• If the data do not fit, WHY?
• One can use a $\chi^2$ test to test statistically for deviation from equilibrium
  Data = frequencies or numbers (please look at a textbook)
$$\chi^2 = \sum \frac{(\text{Observed data} - \text{Expected data})^2}{\text{Expected data}}$$
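As an illustration, the $\chi^2$ statistic can be computed for the 50-individual example (15 AA, 25 Aa, 10 aa). The function name `hw_chi_square` is hypothetical. The comparison threshold 3.84 is the standard 5% critical value for 1 degree of freedom (3 genotype classes, minus 1, minus 1 estimated allele frequency); it is not given on the slides.

```python
def hw_chi_square(n_AA, n_Aa, n_aa):
    """Return (p, expected counts, chi-square) for a biallelic locus."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)   # allele frequency estimated from the data
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return p, expected, chi2


p, expected, chi2 = hw_chi_square(15, 25, 10)
print(p, expected, chi2)
# chi2 is far below 3.84, so this sample shows no significant deviation from HW
print("deviation from HW at 5%:", chi2 > 3.84)
```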
• Inbreeding and population structure create deviations from the HW equilibrium values
• BUT only factors that act at present can create observable deviations from the equilibrium values (= NO memory), as frequencies go back to equilibrium within one generation of random mating
• The software packages STRUCTURE and InStruct detect population structure based on HW equilibrium
Heterozygosity
• We are interested in heterozygosity in populations
• How can it be detected in data?
• When sequencing one diploid individual, SNPs are found at the heterozygous positions
  • Ex: chr1 AAA
        chr2 AAT
  • Problem: the phase is not known when sequencing
• For microsatellites: alleles of different sizes within one individual
• Heterozygosity H = 1 - proportion of homozygotes (Nei):
$$H = 1 - \sum_{i=1}^{k} p_i^2$$
• k is the number of alleles, and i indexes the alleles
• The maximum of H is reached when all alleles have the same frequency 1/k
• For k = 2, Hmax = 0.5
• Note that this definition is close to that from the WF model or the coalescent
• H = 1 - probability that two alleles are identical by state
• Identical by descent and no mutation ⇒ identical by state
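A minimal sketch of Nei's heterozygosity as a Python function (the name `heterozygosity` is an illustrative choice). It also shows that the maximum for k equally frequent alleles is 1 - 1/k, which gives 0.5 for k = 2.

```python
def heterozygosity(freqs):
    """H = 1 - sum of squared allele frequencies (frequencies must sum to 1)."""
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in freqs)


print(heterozygosity([0.5, 0.5]))   # 0.5, the maximum for k = 2
print(heterozygosity([0.9, 0.1]))   # lower: one allele dominates
print(heterozygosity([0.25] * 4))   # 0.75 = 1 - 1/k for k = 4
```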
Inbreeding
• Inbreeding = a form of non-random mating
  = occurs in populations where mating individuals are more closely related than individuals drawn by chance from the population
• Extreme form = self-fertilization (= selfing), which occurs in plants, fungi, crustaceans, aphids, …
• Inbreeding decreases heterozygosity
• A simple model, in a diploid population:
  • Random mating
  • Selfing with probability $p_s$
• Calculate the probability that the two alleles in a single offspring individual derive from the same allele in the parental generation
• This is the inbreeding coefficient f:
$$f = \frac{p_s}{2}$$
  • because both alleles come from the same diploid parent (with probability $p_s$)
  • AND both are copies of the same parental allele (with probability 1/2)
• What is the probability of identity by descent of two homologous alleles picked in the offspring generation?
• = average single-generation probability of coalescence ($p_{c,1}$)
$$p_{c,1} = \frac{1}{2N-1}\, f + \frac{2N-2}{2N-1} \cdot \frac{1}{2N}$$
• First term: the two alleles are in one individual offspring (probability 1/(2N-1)), and they have probability f of being the same (from above)
• Second term: the two alleles are in different individual offspring (probability 1 - 1/(2N-1)), and they have probability 1/(2N) of coalescing
• For large N, the single-generation probability of coalescence simplifies to:
$$p_{c,1} \simeq \frac{f}{2N} + \frac{1}{2N}$$
• Remember that the identity-by-descent effective population size $N_e$ is defined by:
$$\frac{1}{2N_e} = P[T_2 = 1] = p_{c,1}$$
$$\frac{1}{2N_e} = \frac{f}{2N} + \frac{1}{2N} \;\Rightarrow\; N_e = \frac{N}{1+f}$$
• As 0 < f < 1, inbreeding reduces the effective population size!
• If f = 0, Ne = N, as in the Wright-Fisher model
• Because $N_e = \frac{N}{1+f}$ is smaller than N, inbreeding enhances the effect of genetic drift
• Under inbreeding (selfing), alleles get fixed or lost faster
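The effect of the selfing rate on the effective size can be tabulated with a short sketch. It assumes the simple model above (f = ps/2 and Ne = N/(1+f)); the function names are invented for illustration.

```python
def inbreeding_coefficient(p_s):
    """f = p_s / 2 for a population that selfs with probability p_s."""
    return p_s / 2.0


def effective_size(N, f):
    """Identity-by-descent effective size Ne = N / (1 + f)."""
    return N / (1.0 + f)


N = 1000
for p_s in (0.0, 0.5, 1.0):
    f = inbreeding_coefficient(p_s)
    print(p_s, f, round(effective_size(N, f), 1))
# With complete selfing (p_s = 1), f = 0.5 and Ne drops to two thirds of N.
```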
• The inbreeding coefficient is defined as the probability of identity by descent, but from data we can only infer identity by state (alleles are identical or not)
• Identity by state = homozygosity
• We can then define heterozygosity at two levels (S. Wright):
  • Total heterozygosity $H_T$ = probability that two randomly chosen alleles in the population are different
  • Individual heterozygosity $H_I$ = probability that the two alleles of a single randomly chosen individual are different
• One obtains:
$$f = \frac{H_T - H_I}{H_T}$$
• f measures the proportion of variation between alleles that is due to differences among individuals
Inbreeding coefficient – Proof
• In a population with two alleles A and a, at frequencies p and q respectively
• Expectation from HW equilibrium: $H_T = 2pq$ and $H_I = 2pq$
• If not, the inbreeding coefficient f measures the difference
• Pick one individual and check its genotype (G):
  • with probability f, its 2 alleles are identical by descent (and then also by state, if we assume no recent mutation)
  • with probability 1 - f, its 2 alleles are not identical by descent, and they are then different by state with probability 2pq
$$P[G = AA] = pf + p^2(1-f) = p^2 + fpq$$
$$P[G = aa] = q^2 + fpq$$
$$P[G = Aa] = 1 - p^2 - q^2 - 2fpq = 2pq(1-f)$$
$$\Rightarrow f = \frac{2pq - P[G = Aa]}{2pq} = \frac{H_T - H_I}{H_T}$$
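The proof can be checked numerically: generate genotype frequencies for a given p and f, then recover f from the heterozygote deficit. This is a sketch with hypothetical function names, assuming a single biallelic locus.

```python
def genotype_frequencies(p, f):
    """Frequencies of (AA, Aa, aa) with allele frequency p and inbreeding f."""
    q = 1.0 - p
    return p * p + f * p * q, 2 * p * q * (1 - f), q * q + f * p * q


def estimate_f(p, H_I):
    """f = (H_T - H_I) / H_T, with H_T = 2pq the HW expectation."""
    H_T = 2 * p * (1.0 - p)
    return (H_T - H_I) / H_T


AA, Aa, aa = genotype_frequencies(p=0.3, f=0.2)
print(AA, Aa, aa)            # the three frequencies sum to 1
print(estimate_f(0.3, Aa))   # recovers f = 0.2 (up to rounding)
```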
Inbreeding – exercise
• In the absence of new genetic variation contributed by mutation or migration, f increases over time in finite populations.
  Drift causes some alleles to be lost while others increase in frequency. As a result, individuals in later generations have a greater and greater probability of carrying copies of the same ancestral allele.
• Using Populus (Mendelian genetics – Inbreeding model), look at the evolution of the coefficient f in finite populations of various sizes
• After how many generations does f = 1? How does it depend on N?
• Ft = theoretical expectation for f
• From simulations of drift:
  • Fa = actual homozygosity of individuals = 1 - HI
  • Ff = homozygosity of the entire population = 1 - HT
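For comparison with the Populus output, the theoretical expectation can be sketched with the standard textbook recursion $F_t = \frac{1}{2N} + \left(1 - \frac{1}{2N}\right)F_{t-1}$, i.e. $F_t = 1 - \left(1 - \frac{1}{2N}\right)^t$. This formula is not written on the slides, and the exact model implemented in Populus may differ, so treat the code as an assumption-laden illustration. Note that f approaches 1 only asymptotically, so the sketch reports the time to reach a target such as 0.99.

```python
import math


def expected_f(N, t, f0=0.0):
    """Expected inbreeding coefficient after t generations in a population of size N."""
    f = f0
    for _ in range(t):
        f = 1 / (2 * N) + (1 - 1 / (2 * N)) * f
    return f


def generations_to_reach(N, target=0.99):
    """Smallest t with F_t >= target, starting from F_0 = 0."""
    return math.ceil(math.log(1 - target) / math.log(1 - 1 / (2 * N)))


for N in (10, 100, 1000):
    print(N, generations_to_reach(N))
# The time needed grows roughly in proportion to N.
```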
Structured populations
Population structure - definitions
• Most species consist of several populations, also called demes or subpopulations
• A population is defined as a panmictic unit
• Gene flow = migration = movement of individuals or gametes between populations that results in genetic exchange between populations
• Counts of real migrants that can be observed (birds, insects, mammals, …) overestimate gene flow, because most migrants do not reproduce!
• A migrant must reproduce in its new population to generate gene flow
• We are interested in this so-called effective migration
Population structure – look at data
• Exercise 4.2
• Using DnaSP and the data set TNFSF5-humans.nex:
  • 1) Define the populations: Data => define sequence sets
  • 2) Analysis => DNA divergence between populations
• This gives you a basic idea of the differences between populations
• Compare Africa to the three others. Which one is the most different from Africa?
• How do we learn more?
Population structure – fixation indices
• Wright and later Nei derived three fixation indices, here based on two alleles per locus (biallelic)
• $H_S$ = probability that two random alleles from the same subpopulation are different by state
• $H_T$ = probability that two random alleles from the total population are different by state
• $H_I$ = probability that the two alleles of one random individual are different by state
• One can partition the variance in allele frequencies at these different levels:
$$F_{IS} = \frac{H_S - H_I}{H_S}, \qquad F_{IT} = \frac{H_T - H_I}{H_T}, \qquad F_{ST} = \frac{H_T - H_S}{H_T}$$
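The three indices follow directly from the three heterozygosities. The sketch below uses a hypothetical function name and made-up input values. It also checks the identity $(1 - F_{IT}) = (1 - F_{IS})(1 - F_{ST})$, which follows algebraically from the three definitions although it is not stated on the slides.

```python
def fixation_indices(H_I, H_S, H_T):
    """Return (F_IS, F_IT, F_ST) from individual, subpopulation and total heterozygosity."""
    F_IS = (H_S - H_I) / H_S
    F_IT = (H_T - H_I) / H_T
    F_ST = (H_T - H_S) / H_T
    return F_IS, F_IT, F_ST


# Illustrative values only (not taken from the lecture data)
F_IS, F_IT, F_ST = fixation_indices(H_I=0.30, H_S=0.40, H_T=0.48)
print(F_IS, F_IT, F_ST)
print((1 - F_IT), (1 - F_IS) * (1 - F_ST))  # the two numbers agree
```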
• $F_{IS} = \frac{H_S - H_I}{H_S}$ measures the departure from HW expectations within a deme; values between -1 and 1
• $F_{IT} = \frac{H_T - H_I}{H_T}$ measures the departure from HW expectations between individuals and the total population
• $F_{ST} = \frac{H_T - H_S}{H_T}$ measures the departure from HW expectations when comparing subpopulations and the total population; values between 0 and 1
• Drift generates fixation of alleles in populations
• FST measures the difference between subpopulations in the process of allele fixation under the influence of drift
[Figure: schematic of a population subdivision and the effect of drift in fixing alleles in populations]
Population structure – data analysis statistics
• These can be extended to different types of data
• For allelic variants (microsatellites): $G_{ST}$
• Based on gene (or allelic) diversity measures ($H_T$ and $H_S$):
$$G_{ST} = \frac{H_T - H_S}{H_T}$$
• This measure does not calculate whether the alleles are the same in the subpopulations and the total population
• It compares the amount of diversity at the subpopulation level to that of the total population (see later)
• For sequence data: $K_{ST}$, an analog of $F_{ST}$
  • It can be calculated per locus
  • But you can also calculate FST per SNP, as a SNP is a biallelic locus
$$F_{ST,Hudson} = K_{ST} = \frac{\pi_T - \pi_S}{\pi_T}$$
  • $\pi_S$ = average number of differences between sequences within a population (pairwise comparisons intrapopulation)
  • $\pi_T$ = average number of differences between all sequences (pairwise comparisons interpopulation)
• Compare with the heterozygosity-based version:
$$H = 1 - \sum_{i=1}^{k} p_i^2, \qquad F_{ST} = \frac{H_T - H_S}{H_T}$$
Population structure – look at data
• Exercise 4.3
• Using DnaSP and the data set TNFSF5-humans.nex:
  • Analysis => Gene flow and Genetic differentiation
• This gives you a basic idea of the differences between populations
• Compare Africa to all three others
• Look at the values of FST and GST
• Does this fit with your previous observations?
• Exercise 4.3
• Hand calculations for one locus = 1 SNP, with different formulae for FST:
$$F_{ST} = \frac{H_T - H_S}{H_T}, \qquad F_{ST,Hudson} = 1 - \frac{\pi_S}{\pi_T}$$

Pop 1    Pop 2
A        A
T        T
T        A
A        A
T        A
• Exercise 4.3: results of the hand calculations for the different formulae of GST, FST
$$F_{ST} = \frac{H_T - H_S}{H_T} = \frac{8}{48}, \qquad F_{ST,Hudson} = 1 - \frac{\pi_S}{\pi_T} = \frac{3}{48}$$
• The two values differ because π assumes that the two sequences compared come from different individuals (sampling without replacement), whereas heterozygosity does not assume that (sampling with replacement)
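The two values can be reproduced in Python with exact fractions. The assignment of alleles to populations is an assumption: Pop 1 = A, T, T, A, T and Pop 2 = A, T, A, A, A is a reading of the flattened slide table, chosen because it reproduces the slide's values 8/48 and 3/48. Function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations


def heterozygosity(seqs):
    """1 - sum p_i^2: two alleles drawn WITH replacement."""
    n = len(seqs)
    return 1 - sum(Fraction(seqs.count(a), n) ** 2 for a in set(seqs))


def pi(seqs):
    """Mean pairwise difference: two distinct sequences, WITHOUT replacement."""
    pairs = list(combinations(seqs, 2))
    return Fraction(sum(a != b for a, b in pairs), len(pairs))


pop1 = list("ATTAT")  # assumed reading of the slide
pop2 = list("ATAAA")  # assumed reading of the slide
total = pop1 + pop2

H_S = (heterozygosity(pop1) + heterozygosity(pop2)) / 2
H_T = heterozygosity(total)
fst_h = (H_T - H_S) / H_T

pi_S = (pi(pop1) + pi(pop2)) / 2
pi_T = pi(total)
fst_hudson = 1 - pi_S / pi_T

print(fst_h, fst_hudson)  # 1/6 1/16, i.e. 8/48 and 3/48
```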
• Exercise: hand calculations of different formulae of GST, FST
• Which species have populations that are genetically differentiated?
• Can you calculate GST? What do you see?
(Data from Jost, Molecular Ecology 2008)
• Be careful: GST does not measure differentiation when the diversity within populations (HS) is too high
• It can take low values despite the presence of private alleles
• Jost has proposed another measure, D, to capture differentiation
• The same caveat applies to SNP data when GST is calculated from haplotypes
• One should be cautious when interpreting values of FST, as it depends on the effective population size
  => FST = 0.2 can be low in one species and high in another one
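A small made-up example shows the problem. Two populations share no alleles at all, yet GST is close to 0 because diversity within each population is high. Jost's D is computed here as $D = \frac{H_T - H_S}{1 - H_S} \cdot \frac{n}{n-1}$ for n populations; this formula comes from Jost (2008), not from the slides, and the data and function names are invented for illustration.

```python
def heterozygosity(freqs):
    return 1.0 - sum(p * p for p in freqs)


def gst(H_S, H_T):
    return (H_T - H_S) / H_T


def jost_d(H_S, H_T, n_pops):
    return (H_T - H_S) / (1.0 - H_S) * n_pops / (n_pops - 1)


# Each population carries 10 private alleles at frequency 0.1 each,
# so the pooled population has 20 alleles at frequency 0.05 each.
H_S = heterozygosity([0.1] * 10)    # same value in both populations
H_T = heterozygosity([0.05] * 20)

print(round(gst(H_S, H_T), 3))        # about 0.053: looks undifferentiated
print(round(jost_d(H_S, H_T, 2), 3))  # 1.0: complete differentiation
```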
Population structure
• Even if every subpopulation is in HW equilibrium, there is always an excess of homozygotes in the subpopulations compared to the HW expectation for the total population (FST > 0)
• UNLESS allele frequencies are exactly the same in all subpopulations
• This is the Wahlund effect
• With $H_S$ and $H_T$ defined as for the fixation indices:
$$F_{ST} = \frac{H_T - H_S}{H_T}$$
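The Wahlund effect can be illustrated for two equally sized subpopulations, each in HW equilibrium at one biallelic locus. This is a sketch with an invented function name; equal subpopulation sizes are an assumption made to keep the averages simple.

```python
def wahlund(p1, p2):
    """Return (H_S, H_T, F_ST) for two equally sized HW subpopulations."""
    H_S = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-deme heterozygosity
    p_bar = (p1 + p2) / 2                              # pooled allele frequency
    H_T = 2 * p_bar * (1 - p_bar)                      # HW expectation for the total
    return H_S, H_T, (H_T - H_S) / H_T


print(wahlund(0.2, 0.8))  # H_S < H_T: heterozygote deficit, F_ST > 0
print(wahlund(0.5, 0.5))  # identical frequencies: H_S = H_T, F_ST = 0
```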
Coalescence in structured populations
If coalescence and migration both have small probabilities, they cannot occur at the same time (in the same generation) = low migration limit
Using msms to look at the frequency spectrum for different migration rates
Population structure models
• Island model (S. Wright): symmetric migration between all subpopulations
  • d demes, all demes exchangeable and equivalent, each of size N
  • Symmetric migration rate m (proportion of migrants in each deme, 0 < m < 1)
  • A migrating allele comes from any given other deme with probability 1/(d-1)
• For low migration, we can calculate:
  • $T_S$ = mean coalescence time of a pair of alleles sampled from the same deme
  • $T_B$ = mean coalescence time of a pair of alleles sampled from different demes
• We find (Charlesworth):
$$T_S \approx 2dN, \qquad T_B \approx 2dN + \frac{d-1}{2m}$$
• Which one is bigger? Why?
• We can also calculate $T_T$ = mean coalescence time of two alleles drawn randomly from the whole population:
$$T_T \approx 2dN + \frac{(d-1)^2}{2dm}$$
• Why does the coalescence time in the whole population depend on m?
• We can define the effective size of the total population:
$$N_e = \frac{T_T}{2} \approx dN + \frac{d}{4m}$$
• Let us compare this to a single panmictic population with dN individuals: WHY is it bigger?
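The island-model coalescence times can be evaluated numerically. The sketch below uses the approximations as read from the slides ($T_S \approx 2dN$, $T_B \approx 2dN + \frac{d-1}{2m}$, $T_T \approx 2dN + \frac{(d-1)^2}{2dm}$); the function name and the parameter values are illustrative assumptions.

```python
def island_times(d, N, m):
    """Approximate mean pairwise coalescence times in the island model (low migration)."""
    T_S = 2 * d * N                                # two alleles from the same deme
    T_B = 2 * d * N + (d - 1) / (2 * m)            # two alleles from different demes
    T_T = 2 * d * N + (d - 1) ** 2 / (2 * d * m)   # two alleles from the whole population
    return T_S, T_B, T_T


d, N, m = 10, 100, 0.001
T_S, T_B, T_T = island_times(d, N, m)
print(T_S, T_B, T_T)

# T_T is the average over "same deme" (prob. 1/d) and "different demes" (prob. (d-1)/d)
print(T_S / d + (d - 1) / d * T_B)

# Effective size of the whole population versus the census size d*N
print(T_T / 2, d * N)
```

With low migration, alleles in different demes must wait for a migration event before they can coalesce, which is why the effective size exceeds dN.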
Population structure
• Connect FST with migration (Wright, Slatkin, Takahata)
• We can define a coalescent version of FST:
$$F_{ST} = \frac{T_T - T_S}{T_T}$$
• In an island model (with all the restrictions imposed by its assumptions):
$$F_{ST} \approx \frac{1}{1 + 4Nm\,\dfrac{d^2}{(d-1)^2}} \approx \frac{1}{1 + 4Nm}$$
• FST depends only on Nm, the number of effective migrants per generation, and decreases as Nm increases!
• If Nm < 1 then FST > 0.2 and there is some difference between populations
• If Nm > 2 then FST < 0.11 and there is little difference between populations
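The relation between FST and Nm can be tabulated with a short sketch (the function name is hypothetical; the formulas are the island-model approximations given above, with the many-deme limit used when d is not supplied).

```python
def fst_island(N, m, d=None):
    """Approximate F_ST in the island model; uses the large-d limit if d is None."""
    Nm = N * m
    if d is None:
        return 1.0 / (1.0 + 4.0 * Nm)
    return 1.0 / (1.0 + 4.0 * Nm * d ** 2 / (d - 1) ** 2)


for Nm in (0.1, 0.5, 1, 2, 10):
    print(Nm, round(fst_island(N=100, m=Nm / 100), 3))
# Nm = 1 gives 0.2 and Nm = 2 gives about 0.111: a handful of effective
# migrants per generation is enough to keep F_ST low.
```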
Coalescence in structured populations
If migration is high, the population behaves almost like a single big panmictic population of size dN = strong migration limit
Other population structure models
• Continental island model
  • One main deme (the big island), with no drift in it
• Stepping-stone model
  • Demes connected only to their neighbours
Coalescence in metapopulation
Most (all?) plant species exhibit a metapopulation structure
• Metapopulation = “a population that is subdivided into a large number of demes, each of which is subject to extinction and recolonization and receives migrants from other demes” (Wakeley and Aliacar 2001 Genetics)
[Figure credits: Hanski, Laine; T. Städler ©]
Population structure models
• Metapopulation model (Charlesworth, Wakeley)
  • Many demes d, with extinction-recolonization and migration
  • The sample size n << d
  • Extinction-recolonization affects the genetic diversity and the coalescence time
  • Complex formulae (see the books by Charlesworth and by Wakeley)
Coalescence in metapopulation: more complex
Two phases: collecting (long) and scattering (short) (Wakeley and Aliacar 2001 Genetics)
The genealogy depends on the number of demes (n) and the migration rate (M)
[Figure: genealogy across Deme 1 to Deme 4, with time running from present to past; scattering phase and collecting phase indicated]
Species-wide sampling
1 individual per deme, over the species range = reflects the species-wide evolution
(Wakeley and Aliacar 2001 Genetics, Pannell 2003 Evolution, Städler et al. 2009 Genetics)
Population sample
Several individuals per deme, few populations = reflects the local evolution
Study of structured populations
When studying structured populations, be careful about the sampling and the markers used:
• How was the sampling done? Why? Which information will you get from these samples?
• Calculate FST but also Jost's D to compare. Can you infer migration from FST values?
• Remember: each locus has its own coalescent tree!