Evolutionary Genetics: Part 5 Inbreeding – Structured ... · Evolutionary Genetics: Part 5 Inbreeding – Structured populations S. peruvianum S. chilense Winter Semester 2012-2013

Evolutionary Genetics: Part 5

Inbreeding – Structured populations

S. peruvianum

S. chilense

Winter Semester 2012-2013

Prof Aurélien Tellier, FG Populationsgenetik

Color code

• Red: important result or definition
• Purple: exercise to do
• Green: some bits of maths

Population genetics: 4 evolutionary forces

• Random genomic processes (mutation, duplication, recombination, gene conversion)
• Natural selection
• Random demographic process (drift)
• Random spatial process (migration)

[Diagram: the four forces acting on molecular diversity]

Deviations from panmixia

• So far, the standard neutral model has assumed a Wright-Fisher population
• All individuals have the same probability of picking any given parent, and all parents have the same probability of being picked
• In reality this is not true
• Inbreeding (including selfing) = mating between closely related individuals
• Population structure = a species consists of several populations, and panmixia occurs only within populations

Hardy-Weinberg equilibrium

• The first model of panmixia = Hardy-Weinberg (HW) equilibrium
• Formulated independently by G. H. Hardy and W. Weinberg in 1908
• The assumptions of the model are:
  - organisms are diploid
  - only sexual reproduction occurs
  - generations are non-overlapping
  - mating is random (panmixia)
  - population size is large
  - allele frequencies are equal in the sexes
  - there is no migration, mutation or selection
• With two alleles at a locus, A and a, the allele frequencies are:
  - A has frequency $p$
  - a has frequency $q = 1 - p$

                      Homozygote   Heterozygote   Homozygote
Parent genotypes      AA           Aa             aa
Gametes               A and a
Offspring genotypes   AA           Aa             aa
Frequency             $p^2$        $2pq$          $q^2$


• Demonstration of the equilibrium:
• A has frequency $p$, a has frequency $q = 1 - p$
• In the parents:
  - AA has frequency $P$
  - Aa has frequency $H$
  - aa has frequency $Q$

$$p = P + \tfrac{1}{2}H \quad \text{and} \quad q = Q + \tfrac{1}{2}H$$


1) Frequency of the matings (rows: male genotypes, columns: female genotypes)

           AA (P)    Aa (H)    aa (Q)
AA (P)     P^2       PH        PQ
Aa (H)     PH        H^2       HQ
aa (Q)     PQ        HQ        Q^2


2) Then, for each mating, what are the progeny?

Mating     Freq. of mating   Progeny AA   Progeny Aa   Progeny aa
AA × AA    P^2               P^2          -            -
AA × Aa    2PH               PH           PH           -
AA × aa    2PQ               -            2PQ          -
Aa × Aa    H^2               (1/4)H^2     (1/2)H^2     (1/4)H^2
Aa × aa    2HQ               -            HQ           HQ
aa × aa    Q^2               -            -            Q^2

Column sums (progeny genotypes):

$$AA:\ (P + \tfrac{1}{2}H)^2 = p^2 \qquad Aa:\ 2(P + \tfrac{1}{2}H)(Q + \tfrac{1}{2}H) = 2pq \qquad aa:\ (Q + \tfrac{1}{2}H)^2 = q^2$$


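• Numerical example, starting from parental genotype frequencies that are not at equilibrium:

Genotypes:   AA        Aa        aa
Frequency:   P = 0.2   H = 0.4   Q = 0.4

So that $p = P + \tfrac{1}{2}H = 0.4$ and $q = Q + \tfrac{1}{2}H = 0.6$

• At the next generation:
  - $p^2 = (0.4)^2 = 0.16$
  - $2pq = 2 \times 0.4 \times 0.6 = 0.48$
  - $q^2 = (0.6)^2 = 0.36$
• These are the equilibrium frequencies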
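
The numerical example can be checked with a few lines of Python. This is only a sketch: the function name `next_generation` is made up for illustration, and it assumes all HW conditions listed earlier hold.

```python
def next_generation(P, H, Q):
    """Genotype frequencies (AA, Aa, aa) after one generation of random mating."""
    p = P + H / 2  # frequency of allele A among the parents
    q = Q + H / 2  # frequency of allele a among the parents
    return p * p, 2 * p * q, q * q


first = next_generation(0.2, 0.4, 0.4)   # parents from the example above
second = next_generation(*first)         # one more generation of random mating

print("generation 1:", first)
print("generation 2:", second)
```

Applying the function a second time leaves the frequencies unchanged (up to floating-point rounding), which is the sense in which equilibrium is reached in one generation.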


• Numerical example: when frequencies are unknown…

Example: 50 individuals (n = 50, 2n = 100 alleles)

$p = (2 \times 15 + 25) / 100 = 0.55$
$q = (2 \times 10 + 25) / 100 = 0.45$

                                   AA               Aa                     aa
Observed number of individuals     15               25                     10
Expected (formula)                 n p^2            2pq n                  n q^2
Expected (calculation)             50 × (0.55)^2    2(0.55)(0.45) × 50     50 × (0.45)^2
Expected number of individuals     15.125           24.75                  10.125


• HW equilibrium means that if genotype frequencies change, they go back to the equilibrium values within ONE generation of random mating
• Drift, for example, affects the allele frequencies if the population is small
• HW is a useful neutral model: it predicts the genotype frequencies expected in a population, which can be compared with the observed ones
• If the population does not follow the HW equilibrium frequencies, WHY???
• One can use a $\chi^2$ test to test statistically for deviation from equilibrium

Data = frequencies or numbers (please look at the textbook)

$$\chi^2 = \sum \frac{(\text{Observed data} - \text{Expected data})^2}{\text{Expected data}}$$
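
As an illustration, the $\chi^2$ statistic can be computed for the 50-individual example (15 AA, 25 Aa, 10 aa). The sketch below uses a hypothetical helper `hw_chi_square`; it estimates p from the data, so the test has one degree of freedom (three genotype classes, minus one, minus one estimated parameter).

```python
def hw_chi_square(n_AA, n_Aa, n_aa):
    """Return (p, expected counts, chi-square) for a biallelic locus."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)   # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * p * q * n, n * q * q)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return p, expected, chi2


p, expected, chi2 = hw_chi_square(15, 25, 10)
print(p, expected, round(chi2, 4))
```

The statistic here is about 0.005, far below 3.84 (the 5% critical value for one degree of freedom), so this sample shows no detectable deviation from HW proportions.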


• Inbreeding and population structure create deviations from the HW equilibrium values
• BUT only factors that act at present can create observable deviations from the equilibrium values (= NO memory), as frequencies go back to equilibrium in one generation of random mating
• Software such as STRUCTURE (and INSTRUCT) is used to detect population structure based on HW equilibrium


Heterozygosity

• We are interested in heterozygosity in populations
• How do we detect it in data?
• When sequencing one individual, SNPs are found at the heterozygous positions
  Ex: chr1 AAA
      chr2 AAT
• Problem: the phase is not known when sequencing
• For microsatellites, heterozygosity appears as alleles of different sizes in one individual

Heterozygosity

• Heterozygosity H = 1 – proportion of homozygotes (Nei):

$$H = 1 - \sum_{i=1}^{k} p_i^2$$

• k is the number of alleles, i is the allele index, and $p_i$ is the frequency of allele i
• The maximum of H is observed when all alleles have the same frequency = 1/k
• For k = 2, $H_{max} = 0.5$
• Note that this definition is close to that from the WF model or the coalescent
• H = 1 – probability that two randomly chosen alleles are identical by state

[Figure: identical by descent (and no mutation) versus identical by state]
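
A short sketch of Nei's formula follows; the function name `heterozygosity` is chosen here for illustration only. It shows that H is largest when the k alleles are equally frequent, where it equals 1 − 1/k.

```python
def heterozygosity(freqs):
    """Nei's expected heterozygosity H = 1 - sum(p_i^2) for allele frequencies."""
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in freqs)


print(heterozygosity([0.5, 0.5]))      # two equally frequent alleles
print(heterozygosity([0.9, 0.1]))      # two alleles, unequal frequencies
print(heterozygosity([0.25] * 4))      # four equally frequent alleles
```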

Inbreeding

Inbreeding

• Inbreeding = a form of non-random mating
= occurs in populations where mating individuals are more closely related than individuals drawn by chance from the population
• Extreme form = self-fertilization (= selfing), which occurs in plants, fungi, crustaceans, aphids,…
• Inbreeding decreases heterozygosity

Inbreeding

• A simple model, in a diploid population:
  - Random mating with probability $1 - p_s$
  - Selfing with probability $p_s$
• Calculate the probability that the two alleles in a single offspring individual derive from the same allele in the parent generation
• This is the inbreeding coefficient f
• It requires that both alleles come from the same diploid parent (with probability $p_s$)
• AND that the two copies derive from the same parental allele (with probability ½)

$$f = \frac{p_s}{2}$$

Inbreeding

• What is the probability of identity by descent of two homologous alleles picked in the offspring generation?
• = average single-generation probability of coalescence ($p_{c,1}$)

$$p_{c,1} = \frac{1}{2N-1}\, f + \frac{2N-2}{2N-1} \cdot \frac{1}{2N}$$

First term: with probability $1/(2N-1)$ the two alleles are in one individual offspring, and they then have probability f of being the same (from above).

Second term: with probability $1 - 1/(2N-1) = (2N-2)/(2N-1)$ the two alleles are in different individual offspring, and they then have probability $1/(2N)$ of coalescing.

Inbreeding


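• What is the probability of identity by descent of two homologous alleles picked in the offspring generation?
• = average single-generation probability of coalescence ($p_{c,1}$)

$$p_{c,1} = \frac{1}{2N-1}\, f + \frac{2N-2}{2N-1} \cdot \frac{1}{2N} \simeq \frac{f}{2N} + \frac{1}{2N}$$

Remember that the identity-by-descent effective population size $N_e$ is defined by:

$$\frac{1}{2N_e} = P[T_2 = 1] = p_{c,1}$$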

Inbreeding


$$\frac{1}{2N_e} = \frac{f}{2N} + \frac{1}{2N} \;\Rightarrow\; N_e = \frac{N}{1+f}$$

• As 0 < f < 1, inbreeding reduces the effective population size!
• If f = 0, $N_e = N$, as in the Wright-Fisher model

Inbreeding

$$N_e = \frac{N}{1+f}$$

• This means that inbreeding enhances the effect of genetic drift
• Under inbreeding (selfing), alleles get fixed or lost faster
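
The effect of selfing on $N_e$ can be sketched numerically. The helper names below are hypothetical; the code simply evaluates the slide formulas $f = p_s/2$, the exact and approximate $p_{c,1}$, and $N_e = N/(1+f)$.

```python
def coalescence_probability(N, f):
    """Exact single-generation coalescence probability p_c,1 from the slides."""
    same_individual = 1 / (2 * N - 1)
    return same_individual * f + (1 - same_individual) / (2 * N)


def effective_size(N, selfing_rate):
    """Ne = N / (1 + f), with f = p_s / 2 in the simple selfing model."""
    f = selfing_rate / 2
    return N / (1 + f)


N = 1000
for selfing_rate in (0.0, 0.5, 1.0):
    f = selfing_rate / 2
    exact = coalescence_probability(N, f)
    approx = (1 + f) / (2 * N)
    print(selfing_rate, round(effective_size(N, selfing_rate), 1), exact, approx)
```

With complete selfing ($p_s = 1$, so f = 0.5), the effective size drops to two thirds of N.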

Inbreeding

• The inbreeding coefficient is defined as the probability of identity by descent, but from data we can only infer identity by state (alleles are identical or not)
• Identity by state = homozygosity
• We can then define heterozygosity at two levels (S. Wright):
  - Total heterozygosity $H_T$ = probability that two randomly chosen alleles in the population are different
  - Individual heterozygosity $H_I$ = probability that two alleles from a single randomly chosen individual are different
• One obtains:

$$f = \frac{H_T - H_I}{H_T}$$

• f measures the proportion of variation between alleles that is due to differences among individuals

Inbreeding coefficient- Proof

• In a population with two alleles A and a, with frequencies p and q respectively
• Expectation: $H_T = 2pq$ and $H_I = 2pq$ under HW equilibrium
• If not, the inbreeding coefficient f measures the difference
• Pick one individual and check its genotype (G):
  - with probability f, its 2 alleles are identical by descent (and then also by state if we assume no recent mutation)
  - with probability 1 − f, its 2 alleles are not identical by descent, and they are then different by state with probability 2pq

$$P[G = AA] = pf + p^2(1-f) = p^2 + fpq$$

$$P[G = aa] = q^2 + fpq$$

$$P[G = Aa] = 1 - p^2 - q^2 - 2fpq = 2pq(1-f)$$

$$\Rightarrow f = \frac{2pq - P[G = Aa]}{2pq} = \frac{H_T - H_I}{H_T}$$
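
The proof can be checked numerically: generate genotype frequencies for a chosen f, then recover f from the heterozygote deficit. The function names are illustrative only, and the sketch assumes a single biallelic locus.

```python
def genotype_frequencies(p, f):
    """Frequencies of (AA, Aa, aa) for allele frequency p and inbreeding coefficient f."""
    q = 1 - p
    return p * p + f * p * q, 2 * p * q * (1 - f), q * q + f * p * q


def inbreeding_from_genotypes(freq_AA, freq_Aa, freq_aa):
    """Recover f = (H_T - H_I) / H_T from genotype frequencies."""
    p = freq_AA + freq_Aa / 2
    H_T = 2 * p * (1 - p)   # heterozygosity expected under HW
    H_I = freq_Aa           # observed individual heterozygosity
    return (H_T - H_I) / H_T


freqs = genotype_frequencies(p=0.3, f=0.25)
print(freqs, inbreeding_from_genotypes(*freqs))
```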

Inbreeding - exercise

• In the absence of new genetic variation contributed by mutation or migration, f increases over time in finite populations.

Drift causes some alleles to be lost while others increase in frequency. As a result, individuals in later generations have a greater and greater probability of carrying copies of the same ancestral allele.

• Using Populus (Mendelian genetics – Inbreeding model), look at the evolution of the coefficient f in finite populations of various sizes
• After how many generations does f = 1? How does it depend on N?
• Ft = theoretical expectation for f
• From simulations of drift:
  - Fa = actual homozygosity of individuals = $1 - H_I$
  - Ff = homozygosity for the entire population = $1 - H_T$
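
For readers without Populus, the theoretical curve can be sketched in Python. The slides do not give a formula for Ft; the code below assumes the standard textbook expectation for an ideal population of N diploid individuals, $F_t = 1 - (1 - F_0)(1 - 1/(2N))^t$, and the function names are invented for this illustration.

```python
def expected_f(N, t, f0=0.0):
    """Assumed textbook expectation of f after t generations of drift."""
    return 1 - (1 - f0) * (1 - 1 / (2 * N)) ** t


def generations_to_reach(N, target=0.99):
    """First generation at which the expected f reaches the target value."""
    t = 0
    while expected_f(N, t) < target:
        t += 1
    return t


for N in (10, 100, 1000):
    print(N, generations_to_reach(N))
```

Under this assumption f only approaches 1 asymptotically, so the sketch reports the time to reach 0.99; that time scales roughly linearly with N.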

Structured populations

Population structure - definitions

• Most species are found in several populations = also called demes or subpopulations
• A population is defined as a panmictic unit
• Gene flow = migration = movement of individuals or gametes between populations that results in genetic exchange between populations
• Real migrants that can be observed (birds, insects, mammals,…) overestimate the gene flow, because most migrants do not reproduce!!!
• A migrant must reproduce in the population to generate gene flow
• We are interested in so-called effective migration

Population structure – look at data

• Exercise 4.2
• Using DnaSP and the data set TNFSF5-humans.nex
  1) Define the populations: Data => define sequence sets
  2) Analysis => DNA divergence between populations
• This gives you a basic idea of the differences between populations
• Compare Africa to the three others. Which one is the most different from Africa?
• How do we learn more?

Population structure – fixation indices

• Wright and later Nei derived three fixation indices, based on two alleles per locus (biallelic)
• $H_S$ = probability for two random alleles from the same subpopulation to be different by state
• $H_T$ = probability for two random alleles from the total population to be different by state
• $H_I$ = probability for two random alleles from one random individual to be different by state
• One can partition the variance in allele frequencies at these different levels:

$$F_{IS} = \frac{H_S - H_I}{H_S} \qquad F_{IT} = \frac{H_T - H_I}{H_T} \qquad F_{ST} = \frac{H_T - H_S}{H_T}$$


$$F_{IS} = \frac{H_S - H_I}{H_S}$$

Measures departure from HW expectations within a deme. Values between -1 and 1.

$$F_{IT} = \frac{H_T - H_I}{H_T}$$

Measures departure from HW expectations between individuals and the total population.

$$F_{ST} = \frac{H_T - H_S}{H_T}$$

Measures departure from HW expectations when comparing subpopulations and the total population. Values between 0 and 1.

• Drift generates fixation of alleles in populations
• $F_{ST}$ measures the difference between subpopulations in the process of allele fixation under the influence of drift

[Figure: schematic of a population subdivision and the effect of drift in fixing alleles in populations]
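
The three indices can be computed from genotype counts. The sketch below is illustrative only: the function name and the example counts are made up, the locus is biallelic, and demes are given equal weight (an assumption the slides do not state).

```python
def fixation_indices(demes):
    """demes: list of (n_AA, n_Aa, n_aa) counts, one tuple per deme.

    Returns (F_IS, F_IT, F_ST), weighting all demes equally.
    """
    p_values, observed_het = [], []
    for n_AA, n_Aa, n_aa in demes:
        n = n_AA + n_Aa + n_aa
        p_values.append((2 * n_AA + n_Aa) / (2 * n))
        observed_het.append(n_Aa / n)
    d = len(demes)
    H_I = sum(observed_het) / d
    H_S = sum(2 * p * (1 - p) for p in p_values) / d
    p_bar = sum(p_values) / d
    H_T = 2 * p_bar * (1 - p_bar)
    return (H_S - H_I) / H_S, (H_T - H_I) / H_T, (H_T - H_S) / H_T


# Two hypothetical demes, each in HW proportions, with p = 0.8 and p = 0.2
F_IS, F_IT, F_ST = fixation_indices([(64, 32, 4), (4, 32, 64)])
print(F_IS, F_IT, F_ST)
```

In this made-up example each deme is in HW proportions, so $F_{IS}$ is zero and the whole heterozygote deficit is due to differences between demes. The indices always satisfy $(1 - F_{IT}) = (1 - F_{IS})(1 - F_{ST})$.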

Population structure – data analysis statistics

• These can be extended to different types of data
• For allelic variants (microsatellites): $G_{ST}$
• Based on a gene (or allelic) diversity measure ($H_T$ and $H_S$)

$$G_{ST} = \frac{H_T - H_S}{H_T}$$

• This measure does not check whether the same alleles are present in the subpopulations and in the total population
• It compares the amount of diversity at the subpopulation level to that of the total population (see later)

Population structure – data analysis statistics

• These can be extended to different types of data
• For sequence data: $K_{ST}$, an analog of $F_{ST}$
• It can be calculated per locus
• But you can also calculate $F_{ST}$ per SNP, as a SNP is a biallelic locus

$$F_{ST\_Hudson} = K_{ST} = \frac{\pi_T - \pi_S}{\pi_T}$$

$\pi_S$ = average number of differences between sequences within a population (pairwise comparison intrapop)

$\pi_T$ = average number of differences between all sequences (pairwise comparison interpop)

Compare with:

$$H = 1 - \sum_{i=1}^{k} p_i^2 \qquad F_{ST} = \frac{H_T - H_S}{H_T}$$


• Exercise 4.3
• Using DnaSP and the data set TNFSF5-humans.nex
• Analysis => Gene flow and Genetic differentiation
• This gives you a basic idea of the differences between populations
• Compare Africa to all three others.
• Look at the values of $F_{ST}$ and $G_{ST}$
• Does this fit with your previous observations?


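• Exercise 4.3
• Hand calculations for one locus = 1 SNP, using different formulae of $F_{ST}$

[Figure: alleles sampled at one SNP in Pop 1 and Pop 2; ten alleles in total, six A and four T]

$$F_{ST} = \frac{H_T - H_S}{H_T} \qquad F_{ST\_Hudson} = 1 - \frac{\pi_S}{\pi_T}$$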


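• Exercise 4.3
• Hand calculations of different formulae of $G_{ST}$, $F_{ST}$

[Figure: alleles sampled at one SNP in Pop 1 and Pop 2; ten alleles in total, six A and four T]

$$F_{ST} = \frac{H_T - H_S}{H_T} = \frac{8}{48}$$

$$F_{ST\_Hudson} = 1 - \frac{\pi_S}{\pi_T} = \frac{3}{48}$$

• The two values differ because $\pi$ assumes that both sequences are from different individuals, but heterozygosity does not assume that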
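
The hand calculation can be reproduced in Python. The exact assignment of alleles to populations is not recoverable from the figure, so the split used below (Pop 1 = four A and one T, Pop 2 = two A and three T) is an assumption; it is chosen because it matches the allele totals in the figure and gives the two values quoted above.

```python
from itertools import combinations


def heterozygosity(alleles):
    """H = 1 - sum(p_i^2), sampling two alleles with replacement."""
    n = len(alleles)
    return 1 - sum((alleles.count(a) / n) ** 2 for a in set(alleles))


def pairwise_diversity(alleles):
    """pi = mean number of differences over all pairs of distinct sequences."""
    pairs = list(combinations(alleles, 2))
    return sum(a != b for a, b in pairs) / len(pairs)


pop1 = list("AAAAT")   # ASSUMED composition of Pop 1
pop2 = list("AATTT")   # ASSUMED composition of Pop 2
total = pop1 + pop2

H_S = (heterozygosity(pop1) + heterozygosity(pop2)) / 2
H_T = heterozygosity(total)
F_ST = (H_T - H_S) / H_T

pi_S = (pairwise_diversity(pop1) + pairwise_diversity(pop2)) / 2
pi_T = pairwise_diversity(total)
F_ST_hudson = 1 - pi_S / pi_T

print(F_ST, F_ST_hudson)
```

Here $H_S = 0.40$ and $H_T = 0.48$, while $\pi_S = 0.5$ and $\pi_T = 24/45$: the pairwise measure never compares a sequence with itself, which is why the two estimates differ.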


• Exercise: hand calculations of different formulae of $G_{ST}$, $F_{ST}$
• Which species have populations that are genetically differentiated?
• Can you calculate $G_{ST}$? What do you see?

(Data from Jost 2008, Molecular Ecology)


• Be careful: $G_{ST}$ does not measure differentiation when diversity within populations ($H_S$) is too high
• It can take low values despite the presence of private alleles
• Jost has proposed another measure, D, to capture differentiation
• This is also true for SNP data when calculating $G_{ST}$ based on haplotypes
• One should be cautious when interpreting values of $F_{ST}$, as it depends on the effective population size
=> $F_{ST}$ = 0.2 can be low in one species and high in another one
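
A small made-up example shows the problem. The slides do not give the formula for D; the sketch assumes the definition from Jost (2008), $D = \frac{H_T - H_S}{1 - H_S} \cdot \frac{n}{n-1}$ for n equally weighted demes. The data are hypothetical: two demes that share no alleles at all, each carrying ten private alleles at frequency 0.1.

```python
def gst_and_jost_d(deme_freqs):
    """deme_freqs: one list of allele frequencies per deme (same allele order).

    Returns (G_ST, D), weighting all demes equally.
    """
    n = len(deme_freqs)
    k = len(deme_freqs[0])
    H_S = sum(1 - sum(p * p for p in deme) for deme in deme_freqs) / n
    mean_freqs = [sum(deme[i] for deme in deme_freqs) / n for i in range(k)]
    H_T = 1 - sum(p * p for p in mean_freqs)
    G_ST = (H_T - H_S) / H_T
    D = (H_T - H_S) / (1 - H_S) * n / (n - 1)   # assumed Jost (2008) definition
    return G_ST, D


deme1 = [0.1] * 10 + [0.0] * 10   # ten alleles private to deme 1
deme2 = [0.0] * 10 + [0.1] * 10   # ten alleles private to deme 2
G_ST, D = gst_and_jost_d([deme1, deme2])
print(round(G_ST, 4), round(D, 4))
```

Although the demes are completely differentiated, $G_{ST}$ is only about 0.05 because $H_S = 0.9$ is already close to 1, whereas D equals 1.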

Population structure

• Even if each subpopulation is in HW equilibrium, the pooled sample always shows an excess of homozygotes compared with the HW expectation for the total population ($F_{ST} > 0$)
• UNLESS allele frequencies are exactly the same in all subpopulations
• This is the Wahlund effect
• $H_S$ = probability for two random alleles from the same subpopulation to be different by state
• $H_T$ = probability for two random alleles from the total population to be different by state
• $H_I$ = probability for two random alleles from one random individual to be different by state

$$F_{ST} = \frac{H_T - H_S}{H_T}$$
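
The Wahlund effect can be made explicit for a biallelic locus. The derivation below is a sketch under assumptions not stated in the slides: d equally weighted subpopulations, each in HW equilibrium, where $p_i$ is the frequency of allele A in subpopulation i, $\bar{p}$ is the mean of the $p_i$, and $\mathrm{Var}(p)$ is their variance across subpopulations.

```latex
\begin{align*}
H_S &= \frac{1}{d}\sum_{i=1}^{d} 2p_i(1-p_i)
     = 2\bar{p} - \frac{2}{d}\sum_{i=1}^{d} p_i^2 \\
    &= 2\bar{p}(1-\bar{p}) - 2\,\mathrm{Var}(p)
     = H_T - 2\,\mathrm{Var}(p) \\[4pt]
F_{ST} &= \frac{H_T - H_S}{H_T}
        = \frac{\mathrm{Var}(p)}{\bar{p}(1-\bar{p})} \;\ge\; 0
\end{align*}
```

The second line uses $\frac{1}{d}\sum p_i^2 = \bar{p}^2 + \mathrm{Var}(p)$. The deficit of heterozygotes vanishes only when the variance is zero, that is, when all subpopulations have the same allele frequency.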

Coalescence in structured populations




If coalescence and migration both have small probabilities, they are assumed never to occur in the same generation = low migration limit

Using msms to look at the frequency spectrum for different migration rates

Population structure models

• Island model (S. Wright): symmetric migration between all subpopulations
• d demes, all demes exchangeable and equivalent, each of size N
• Symmetric migration rate m (proportion of migrants in each deme, 0 < m < 1)
• We can calculate for low migration:
  - $T_S$ = mean time of coalescence of a pair of alleles sampled from the same deme
  - $T_B$ = mean time of coalescence of a pair of alleles sampled from different demes
• There is a probability 1/(d-1) that a migrating allele comes from a given deme


• We find (Charlesworth):

$$T_S \approx 2dN \qquad T_B \approx 2dN + \frac{d-1}{2m}$$

• Which one is bigger? Why?
• We can also calculate:
• $T_T$ = mean time of coalescence of two alleles drawn randomly from the whole population

$$T_T \approx 2dN + \frac{(d-1)^2}{2dm}$$

Why does the time of coalescence in the whole population depend on m?


• We find (Charlesworth):

$$T_S \approx 2dN \qquad T_B \approx 2dN + \frac{d-1}{2m} \qquad T_T \approx 2dN + \frac{(d-1)^2}{2dm}$$

• We can define the effective size of the total population (using $(d-1)^2/d \approx d$ when d is large):

$$N_e = \frac{T_T}{2} \approx dN + \frac{d}{4m}$$

Let's compare this to a single panmictic population with dN individuals.

WHY is it bigger?
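
The three coalescence times are easy to tabulate. The sketch below only evaluates the low-migration approximations given above; the function name and the parameter values are chosen for illustration.

```python
def island_coalescence_times(d, N, m):
    """Low-migration approximations for (T_S, T_B, T_T) in the island model."""
    T_S = 2 * d * N
    T_B = 2 * d * N + (d - 1) / (2 * m)
    T_T = 2 * d * N + (d - 1) ** 2 / (2 * d * m)
    return T_S, T_B, T_T


d, N = 10, 1000
for m in (0.0001, 0.001, 0.01):
    T_S, T_B, T_T = island_coalescence_times(d, N, m)
    print(m, T_S, T_B, T_T, "Ne =", T_T / 2, "vs dN =", d * N)
```

$T_T$ is the average of $T_S$ (weight 1/d) and $T_B$ (weight (d-1)/d), so it always lies between them, and $N_e = T_T/2$ exceeds dN by an amount that grows as m decreases.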

Population structure

• Connect $F_{ST}$ with migration (Wright, Slatkin, Takahata)
• We can define a coalescent version of $F_{ST}$:

$$F_{ST} = \frac{T_T - T_S}{T_T}$$

$$F_{ST} \approx \frac{1}{1 + \dfrac{4Nm\,d^2}{(d-1)^2}} \approx \frac{1}{1 + 4Nm}$$

• In an island model (with all the restrictions of its assumptions)
• $F_{ST}$ depends only on Nm, the number of effective migrants, and decreases as Nm increases!!!
• If Nm < 1 then $F_{ST}$ > 0.2 and there is some difference between populations
• If Nm > 2 then $F_{ST}$ < 0.11 (roughly 0.1) and there is little difference between populations
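
The relation between $F_{ST}$ and Nm can be tabulated as follows. This is a sketch that evaluates the island-model approximation only; `fst_island` is a hypothetical helper, and passing `d` switches to the finite-d form of the formula.

```python
def fst_island(N, m, d=None):
    """Island-model F_ST; uses the large-d limit 1 / (1 + 4Nm) when d is None."""
    Nm = N * m
    if d is None:
        return 1 / (1 + 4 * Nm)
    return 1 / (1 + 4 * Nm * d ** 2 / (d - 1) ** 2)


N = 1000
for m in (0.0001, 0.001, 0.002, 0.01):
    print("Nm =", N * m, "F_ST =", round(fst_island(N, m), 3),
          "F_ST (d = 5) =", round(fst_island(N, m, d=5), 3))
```

Since only the product Nm enters, a large population with a low migration rate and a small population with a high migration rate can show the same $F_{ST}$.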


If migration is high, the whole population behaves almost like a single big panmictic population of size dN = strong migration limit


Other population structure models

• Continental island
  - One main deme, no drift in the big island
• Stepping-stone model
  - Demes connected only to neighbours

Coalescence in metapopulation

Most (all?) plant species exhibit a metapopulation structure

• Metapopulation = “a population that is subdivided into a large number of demes, each of which is subject to extinction and recolonization and receives migrants from other demes” (Wakeley and Aliacar 2001 Genetics)

[Photos: Hanski, Laine; T. Städler ©]


• Metapopulation model (Charlesworth, Wakeley)
• Many demes d, with extinction – recolonization and migration
• The sample size n << d
• Extinction – recolonization affects the genetic diversity and the time of coalescence
• Complex formulae (see the Charlesworth and Wakeley books)

Coalescence in metapopulation: more complex

Two phases: collecting (long) and scattering (short) (Wakeley and Aliacar 2001 Genetics)

Genealogy depends on the number of demes (n) and migration rate (M)

[Figure: genealogy across Deme 1 to Deme 4, time axis from present to past, showing the scattering phase and the collecting phase]

Species wide sampling

1 individual per deme, over the species range = reflects the species-wide evolution

(Wakeley and Aliacar 2001 Genetics, Pannell 2003 Evolution, Städler et al. 2009 Genetics)

[Figure: genealogy of a species-wide sample, time axis from present to past, showing the scattering phase and the collecting phase]

Population sample

Several individuals per deme, few populations = reflects the local evolution

[Figure: genealogy of a population sample, time axis from present to past, showing the scattering phase and the collecting phase]

Study of structured populations


When studying structured populations: be careful about the sampling scheme and the markers used

How was the sampling done? Why? Which information will you get from these samples?

Calculate $F_{ST}$ but also Jost's D to compare. Can you interpret migration from $F_{ST}$ values?

Remember: each locus has its own coalescent tree!!!
