PCB5065 Advanced GeneticsPopulation Genetics and Quantitative Genetics
Instructor: Rongling Wu, Department of Statistics, 420A Genetics InstituteTel: 2-3806, Email: [email protected]
Tues Nov 14 Population genetics - population structureWed Nov 15 Population genetics - Hardy-Weinberg equilibrium Thurs Nov 16 Population genetics - effective population size Mon Nov 20 Population genetics - linkage disequilibriumTues Nov 21 Population genetics - evolutionary forcesWed Nov 22 Genetic Parameters: MeansThurs Nov 24 Thanksgiving Holiday - No ClassMon Nov 27 Genetic Parameters: (Co)VariancesTues Nov 28 Mating Designs for Parameter EstimationWed Nov 29 Experimental Designs for Parameter EstimationThurs Nov 30 Heritability, Genetic Correlation and Gain from SelectionMon Dec4 Molecular Dissection of Quant. Variation – Linkage AnalysisTues Dec 5 Molecular Dissection of Quant. Variation – Association StudiesWed Dec 6 Take-home exam on population and quantitative genetics
given- due in electronic format ([email protected]) by 5 PM Tuesday Dec 12
Teosinte and Maize
Teosinte branched 1(tb1) is found to affect the differentiation in branch architecture from teosinte to maize (John Doebley 2001)
Approaches used to support the view that modern maize cultivars are domesticated
from the wild type teosinte
Population genetics
• Study the evolutionary or phylogenetic relationships between maize and its wild relative
• Study evolutionary forces that have shaped the structure of and diversity in the maize genome
Quantitative genetics
• Identify the genetic architecture of the differences in morphology between maize and teosinte
• Estimate the number of genes required for the evolution of a new morphological trait from teosinte to maize: few genes of large effect or many genes of small effect?
• Doebley pioneered the use of quantitative trait locus (QTL) mapping approaches to successfully identify genomic regions that are responsible for the separation of maize from its undomesticated relatives.
• Doebley has cloned genes identified through QTL mapping, teosinte branched1 (tb1), which governs kernel structure and plant architecture.
• Ancient Mexicans used several thousand years ago to transform the wild grass teosinte into modern maize through rounds of selective breeding for large ears of corn.
• With genetic information, ‘‘I think in as few as 25 years I can move teosinte fairly far along the road to becoming maize,’’ Doebley predicts (Brownlee, 2004 PNAS vol. 101: 697–699)
Toward biomedical breakthroughs?
Single Nucleotide Polymorphisms (SNPs)
no cancer
cancer
• According to The International HapMap Consortium (2003), the statistical analysis and modeling of the links between DNA sequence variants and phenotypes will play a pivotal role in the characterization of specific genes for various diseases and, ultimately, the design of personalized medications that are optimal for individual patients.
• What knowledge is needed to perform such statistical analyses?
• Population genetics and quantitative genetics, and others…
• The International HapMap Consortium, 2003 The International HapMap Project. Nature 426: 789-94.
• Liu, T., J. A. Johnson, G. Casella and R. L. Wu, 2004 Sequencing complex diseases with HapMap. Genetics 168: 503-511.
Basic Genetics(1) Mendelian genetics
How does a gene transmit from a parent to its progeny (individual)?
(2) Population geneticsHow is a gene segregating in a population (a group of individuals)?
(3) Quantitative geneticsHow is gene segregation related with the phenotype of a character?
(4) Molecular geneticsWhat is the molecular basis of gene segregation and transmission?
(5) Developmental genetics(6) Epigenetics
Mendelian Genetics Probability
Population Genetics Statistics
Quantitative genetics Molecular Genetics
Statistical Genetics Mathematics with biology (our view)
Cutting-edge research at the interface among genetics, evolution and development (Evo-Devo)
Wu, R. L. Functional mapping – how to map and study the genetic architecture of dynamic complex traits. Nature Reviews Genetics (accepted)
Mendel’s Laws
Mendel’s first law• There is a gene with two alleles on a chromosome location
(locus)• These alleles segregate during the formation of the reproductive
cells, thus passing into different gametes
Mendel’s second law• There are two or more pairs of genes on different chromosomes• They segregate independently (partially correct)
Linkage (exception to Mendel’s second law)• There are two or more pairs of genes located on the same
chromosome• They can be linked or associated (the degree of association is
described by the recombination fraction)
Population Genetics
• Different copies of a gene are called alleles; for example A and a at gene A;
• These alleles form three genotypes, AA, Aa and aa;
• The allele (or gene) frequency of an allele is defined as the proportion of this allele among a group of individuals;
• Accordingly, the genotype frequency is the proportion of a genotype among a group of individuals
Calculations of allele frequencies and genotype frequencies
Genotypes Counts Estimates genotype frequenciesAA 224 PAA = 224/294 = 0.762Aa 64 PAa = 64/294 = 0.218aa 6 Paa = 6/294 = 0.020
Total 294 PAA + PAa + Paa = 1
Allele frequenciespA = (2214+64)/(2294)=0.871, pa = (26+64)/(2294)=0.129, pA + pa = 0.871 + 0.129 = 1
Expected genotype frequenciesAA pA
2 = 0.8712 = 0.769Aa 2pApa = 2 0.871 0.129 = 0.224aa pa
2 = 0.1292 = 0.017
Genotypes Counts Estimates of genotype freq.
AA nAA PAA = nAA/n
Aa nAa PAa = nAa/n
aa naa Paa = naa/n
Total n PAA + PAa + Paa = 1
Allele frequencies
pA = (2nAA + nAa)/2n
pa = (2naa + nAa)/2n
Standard error of the estimate of the allele frequency
Var(pA) = pA(1 - pA)/2n
The Hardy-Weinberg Law
• In the Hardy-Weinberg equilibrium (HWE), the relative frequencies of the genotypes will remain unchanged from generation to generation;
• As long as a population is randomly mating, the population can reach HWE from the second generation;
• The deviation from HWE, called Hardy-Weinberg disequilibrium (HWD), results from many factors, such as selection, mutation, admixture and population structure…
Mendelian inheritance at the individual level(1) Make a cross between two individual parents(2) Consider one gene (A) with two alleles A and a AA, Aa, aa
Thus, we have a total of nine possible cross combinations:
Cross Mendelian segregation ratio1. AA AA AA2. AA Aa ½AA + ½Aa3. AA aa Aa4. Aa AA ½AA + ½Aa5. Aa Aa ¼AA + ½Aa + ¼aa 6. Aa aa ½Aa + ½aa7. aa AA Aa8. aa Aa ½Aa + ½aa9. aa aa aa
Mendelian inheritance at the population level• A population, a group of individuals, may contain all these nine
combinations, weighted by the mating frequencies. • Genotype frequencies: AA, PAA(t); Aa, PAa(t); aa, Paa(t)
Cross Mating freq. (t) Mendelian segreg. ratio (t+1)AA Aa aa
1. AA AA PAA(t)PAA(t) 1 0 0
2. AA Aa PAA(t)PAa(t) ½ ½ 0
3. AA aa PAA(t)Paa(t) 0 1 0
4. Aa AA PAa(t)PAA(t) ½ ½ 0
5. Aa Aa PAa(t)PAa(t) ¼ ½ ¼
6. Aa aa PAa(t)Paa(t) 0 ½ ½
7. aa AA Paa(t)PAA(t) 0 1 0
8. aa Aa Paa(t)PAa(t) 0 ½ ½
9. aa aa Paa(t)Paa(t) 0 0 1
PAA(t+1) = 1[PAA(t)]2 + ½ 2[PAA(t)PAa(t)] + ¼[PAa(t)]2 = [PAA(t) + ½PAa(t)]2
Similarly, we havePaa(t+1) = [Paa(t) + ½PAa(t)]2
PAa(t+1) = 2[PAA(t) + ½PAa(t)][Paa(t) + ½PAa(t)]
Therefore, we have[PAa(t+1)]2 = 4PAA(t+1)Paa(t+1)
Furthermore, if random mating continues, we havePAA(t+2) = [PAA(t+1) + ½PAa(t+1)]2 = PAA(t+1)PAa(t+2) = 2[PAA(t+1) + ½PAa(t+1)][Paa(t+1) + ½PAa(t+1)] = PAa(t+1)Paa(t+2) = [Paa(t+1) + ½PAa(t+1)]2 = Paa(t+1)
(1) Genotype (and allele) frequencies are constant from generation to generation,
(2) Genotype frequencies = the product of the allele frequencies, i.e., PAA = pA
2, PAa = 2pApa, Paa = pa2
For a population at Hardy-Weinberg disequilibrium (HWD), we have• PAA = pA
2 + D• PAa = 2pApa – 2D• Paa = pa
2 + D
The magnitude of D determines the degree of HWD.• D = 0 means that there is no HWD.• D has a range of max(-pA
2 , -pa2) D pApa
Concluding remarks
A population with [PAa(t+1)]2 = 4PAA(t+1)Paa(t+1) is said to be in Hardy-Weinberg equilibrium (HWE). The HWE population has the following properties:
Chi-square test for HWE
• Whether or not the population deviates from HWE at a particular locus can be tested using a chi-square test.
• If the population deviates from HWE (i.e., Hardy-Weinberg disequilibrium, HWD), this implies that the population is not randomly mating. Many evolutionary forces, such as mutation, genetic drift and population structure, may operate.
Example 1AA Aa aa Total
Obs 224 64 6 294
Exp n(pA2) = 222.9 n(2pApa) = 66.2 n(pa
2) = 4.9 294
Test statisticsx2 = (obs – exp)2/exp = (224-222.9)2/222.9 + (64-66.2)2/66.2 + (6-
4.9)2/4.9 = 0.32is less than
x2df=1 ( = 0.05) = 3.841
Therefore, the population does not deviate from HWE at this locus.
Why the degree of freedom = 1? Degree of freedom = the number of parameters contained in the alternative hypothesis – the number of parameters contained in the null hypothesis. In this case, df = 2 (pA or pa and D) – 1 (pA or pa) = 1
Example 2AA Aa aa
Total Obs 234 36 6 276
Exp n(pA2) n(2pApa) n(pa
2) = 230.1 = 43.8 = 2.1 276
Test statisticsx2 = (obs – exp)2/exp = (234-230.1)2/230.1 + (36-
43.8)2/43.8 + (6-2.1)2/2.1 = 8.8
is greater than x2df=1 ( = 0.05) = 3.841
Therefore, the population deviates from HWE at this locus.
Linkage disequilibrium• Consider two loci, A and B, with alleles A, a and B,
b, respectively, in a population• Assume that the population is at HWE• If the population is at Hardy-Weinberg equilibrium,
we have
Gene A Gene B
AA: PAA = pA2
BB: PBB = pB2
Aa: PAa = 2pApa Bb: PBb = 2pBpb
Aa: Paa = pa2 bb: Pbb = pb
2
PAA+PAa+Paa = 1 PBB+PBb+Pbb=1
pA + pa = 1 pB + pb = 1
But the population is at Linkage Disequilibrium (for a pair of loci). Then we have
• Two-gene haplotype AB: pAB = pApB + DAB
• Two-gene haplotype Ab: pAb = pApb + DAb
• Two-gene haplotype aB: paB = papB + DaB
• Two-gene haplotype ab: pab = papb + Dab
pAB+pAb+paB+pab = 1
Dij is the coefficient of linkage disequilibrium (LD) between the two genes in the population. The magnitude of D reflects the degree of LD. The larger D, the stronger LD.
pA = pAB+pAb = pApB + DAB + pApb + DAb = pA+DAB+DAb DAB = -DAb
pB = pAB+paB = pB+DAB+DaB DAB = -DaB
pb = pAb+pab = pb+DaB+Dab Dab = -DaB
Finally, we have DAB = -DAb = -DaB = Dab = D.
Re-write four two-gene haplotype frequncies• AB: pAB = pApB + D• Ab: pAb = pApb – D• aB: paB = papB – D• ab: pab = papb + D
D = pABpab - pAbpaB
D = 0 the population is at the linkage equilibrium
How does D transmit from one generation (1) to the next (2)?
D(2) = (1-r)1 D(1)
…
D(t+1) = (1-r)t D(1)
t, D(t+1) r
Conclusions:
- D tends to be zero at the rate depending on the recombination fraction.
- Linkage equilibrium PAB = pApB is approached gradually and without oscillation.
- The larger r, the faster is the rate of convergence, the most rapid being (½)t for unlinked loci (r=0.5).
D(t) = (1-r)tD(0)D(t)/D(0) = (1-r)t
The ratio D(t)/D(0) describes the degree with which LD decays with generation.
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0 0.1 0.2 0.3 0.4 0.5
r
D(t)
/D(0
)
t=2
t=20
t=200
The plot of the ratio D(t)/D(0) against r tells us the evolutionary history of a population – implications for population and evolutionary genetics.
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0 2 4 6 8 10 12
t
D(t
)/D
(0)
r=0.001
r=0.01
r=0.1
r=0.2
r=0.5
The plot of the ratio D(t)/D(0) against t tells us the degree of linkage – Implications for high-resolution mapping of human diseases and other complex traits
Proof to D(t+1) = (1-r)1 D(t)
• The four gametes randomly unite to form a zygote. The proportion 1-r of the gametes produced by this zygote are parental (or nonrecombinant) gametes and fraction r are nonparental (or recombinant) gametes. A particular gamete, say AB, has a proportion (1-r) in generation t+1 produced without recombination. The frequency with which this gamete is produced in this way is (1-r)pAB(t).
• Also this gamete is generated as a recombinant from the genotypes formed by the gametes containing allele A and the gametes containing allele B. The frequencies of the gametes containing alleles A or B are pA(t) and pB(t), respectively. So the frequency with which AB arises in this way is rpA(t)pB(t).
• Therefore the frequency of AB in the generation t+1 ispAB(t+1) = (1-r)pAB(t) + rpA(t)pB(t)
By subtracting is pA(t)pB(t) from both sides of the above equation, we have
D(t+1) = (1-r)1 D(t)
WhenceD(t+1) = (1-r)t D(1)
Estimate and test for LDAssuming random mating in the population, we have joint probabilities of the two genes
BB (PBB) Bb (PBb) bb (Pbb)_______________________________________________________________________________________AA (PAA) pAB
2 2pABpAb pAb2
n22 n21 n20
Aa (PAa) 2pABpaB 2(pABpab+pAbpaB) 2pAbpab
n12 n11 n10
aa (Paa) paB2 2pAbpab pab
2
n02 n01 n00
________________________________________________________________________________________
Multinomial pdfH1: D 0 log f(pij|n)= log n!/(n22!…n00!) + n22 log pAB
2 + n21log (2pABpAb) + n20 log pAb2
+ …Estimate pAB, pAb, paB (pab = 1-pAB-pAb-paB) pA, pB, D
H0: D = 0log f(pi,pj|n)= log n!/(n22!…n00!) + n22log(pApB)2 + n21log(2pA
2pBpb)+n20log(pApb)2
+ …
Estimate pA and pB.
Chi-square Test of Linkage Disequilibrium (D)
Test statistic
x2 = 2nD2/(pApapBpb) is compared with the critical threshold value obtained
from the chi-square table x2df=1 (0.05). n is the number
of individuals in the population.
If x2 < x2df=1 (0.05), this means that D is not significantly
different from zero and that the population under study is in linkage equilibrium.
If x2 > x2df=1 (0.05), this means that D is significantly
different from zero and that the population under study is in linkage disequilibrium.
Example
(1) Two genes A with allele A and a, B with alleles B and b, whose population
frequencies are denoted by pA, pa (=1- pA) and pB, pb (=1- pb), respectively
(2) These two genes are associated with each other, having the coefficient of linkage disequilibrium D
Four gametes are observed as follows:
Gamete AB Ab aB ab Total
Obs 474 611 142 773 2n=2000Gamete frequency pAB pAb paB pab
=474/2000 =611/2000 =142/2000 =773/2000
=0.237 =0.305 =0.071 =0.386 1
Estimates of allele frequencies
pA = pAB + pAb = 0.237 + 0.305 = 0.542
pa = paB + pab = 0.071 + 0.386 = 0.458
pB = pAB + paB = 0.237 + 0.071 = 0.308
pb = pAb + pab = 0.305 + 0.386 = 0.692The estimate of DD = pABpab – pAbpaB = 0.237 0.386 – 0.305 0.071 = 0.0699
Test statistics
x2 = 2nD2/ (pApapBpb) =210000.06992/(0.5420.4580.3080.692) = 184.78 is greater than x2
df=1 (0.05) = 3.841.Therefore, the population is in linkage disequilibrium at these two genes under consideration.
A second approach for calculating x2: Gamete AB Ab aB ab TotalObs 474 611 142 773 2n=2000
Exp 2n(pApB) 2n(pApb) 2n(papB) 2n(papb)
=334.2 =750.8 =281.8 =633.2 2000
x2 = (obs – exp)2 /exp = (474-334.2)2/334.2 + (611-750.8)2/750.8 + (142-281.8)2/281.8 + (773-633.2)2/633.2
= 184.78
= 2nD2/ (pApapBpb)
Measures of linkage disequilibrium
(1) D, which has a limitation that its value depends on
the allele frequencies
D = 0.02 is considered to be• large for two genes each with diverse allele
frequencies, e.g., pA = pB = 0.9 vs. pa = pb = 0.1
• small for two genes each with similar allele frequencies, e.g., pA = pB = 0.5 vs. pa = pb = 0.5
(2) To make a comparison between gene pairs with different allele frequencies, we need a new normalized measure.
The range of LD is
max(-pApB, -papb) D min(pApb, papB) The normalized LD (Lewontin 1964) is defined as
D' = D/ Dmax,
where Dmax is the maximum that D can have, which is
Dmax = max(-pApB, -papb) if D < 0,
or min(pApb, papB) if D > 0.
For the above example, we have D' = 0.0699/min(pApb, papB) = 0.0699/min(0.375, 0.141) = 0.496
(3) Linkage disequilibrium measured as the correlation between the A and B alleles
R = D/(pApapBpb), r: [-1, 1] Note: x2= 2nR2 follows the chi-square distribution
with df = 1 under the null hypothesis of D = 0. For the above example, we have
R = 0.0699/(pApbpapB) = 0.3040.
Application of LD analysis D(t+1) = (1-r)tD(t),
This means that when the population undergoes random mating, the LD decays exponentially in a proportion related to the recombination fraction.
(1) Population structure and evolution Estimating D, D' and R the mating history of population
The larger the D’ and R estimates, the more likely the population in nonrandom mating, the more likely the population to have a small size, the more likely the population to be affected by evolutionary forces.
Human origin studies based on LD analysis
Reich, D. E., M. Cargill, S. Bolk, J. Ireland, P. C. Sabeti, D. J. Richter, T. Lavery,
R. Kouyoumjian, S. F. Farhadian, R. Ward and E. S. Lander, 2001 Linkage disequilibrium in the human genome. Nature 411: 199-204.
Dawson, E., G. R. Abecasis, S. Bumpstead, Y. Chen et al. 2002 A first-generation linkage disequilibrium map of human chromosome 22. Nature 418: 544-548.
LD curve for Swedish and Yoruban samples. To minimize ascertainment bias, data are only shown for marker comparisons involving the core SNP. Alleles are paired such that D' > 0 in the Utah population. D' > 0 in the other populations indicates the same direction of allelic association and D' < 0 indicates the opposite association. a, In Sweden, average D' is nearly identical to the average |D'| values up to 40-kb distances, and the overall curve has a similar shape to that of the Utah population (thin line in a and b). b, LD extends less far in the Yoruban sample, with most of the long-range LD coming from a single region, HCF2. Even at 5 kb, the average values of |D'| and D' diverge substantially. To make the comparisons between populations appropriate, the Utah LD curves are calculated solely on the basis of SNPs that had been successfully genotyped and met the minimum frequency criterion in both populations (Swedish and Yoruban) (Reich,te al. 2001)
(2) Fine mapping of disease genes
The detection of LD may imply that the recombination fraction between two genes is small and therefore closer (given the assumption that t is large).
Inbreeding • Individuals that are related to each other by ancestry
are called relatives;• Mating between relatives is called inbreeding;• The consequence of inbreeding is to increase the
frequency of homozygous genotypes in a population, relative to the frequency that would be expected with random mating (Hartl 1999).
The closed degree of inbreeding -- In most human societies: first-cousin mating In many plants: self-fertilization
Genotype frequencies with inbreeding Gene A, with two alleles A and a, in a self-fertilizing () population of plants,
for example, rice or Arabdopsis
AA Aa aaGeneration 1 1/4 1/2 1/4
Generation 2 PAA=1/4×1 +1/2×1/4 PAa=1/2×1/2 Paa=1/2×1/4+1/4×1
= 3/8 =2/8 = 3/8
Randomly mating P0AA =1/4 P0Aa =1/2 P0aa =1/4
The effect of inbreeding is to increase the frequency of homozygous genotypes
AA and aa, but reduce the frequency of heterozygous genotype Aa.
We define F = (P0Aa – PAa)/ P0Aa
as the inbreeding coefficient. Biologically, F measures the degree with which heterozygosity is reduced due to inbreeding, measured as a fraction relative to heterozygosity expected in a random-mating population.
Controlled cross (self fertilization): F = (PAa – P0Aa)/ P0Aa= (1/2 – 1/4)/½ = 1
Natural population: AA Aa aa TotalExample 1 Obs 224 64 6 294
Freq .762 .218 .020 pA=.871, pa=.129 Exp (HWE) .769 .224 .017
F = (P0Aa – PAa)/ P0Aa = (.224-.218)/.224=.027
Example 2 Obs 234 36 6 276Freq .848 .130 .022 pA=.913,
pa=..087Exp (HWE) .719 .159 0
F = (P0Aa – PAa)/ P0Aa = (.159-.130)/.159=.101
Consider an inbred population, in which the actual frequency of heterozygote is written as
PAa = P0Aa – P0AaF = 2pApa – 2pApaF,
with P0Aa = 2pApa at random mating. Because pA = PAA +
1/2PAa and pa = Paa + 1/2PAa, we have
PAA = pA – 1/2PAa = pA – 1/2(2pApa – 2pApaF)
= pA2 + pApaF,
Paa = pa – 1/2PAa = pa – 1/2(2pApa – 2pApaF)
= pa2 + pApaF
Further, we have
PAA = pA2(1-F) + pAF
PAa = 2pApa(1-F),
Paa = pa2(1-F) + paF,
Concluding remarks (1) The genotype frequencies equal the HWE frequencies
multiplied by the factor 1 – F, plus a correction term for the homozygous genotype frequencies multiplied by the factor F;(2) When F = 0 (no inbreeding), the genotype frequencies are the HWE. When F = 1 (complete inbreeding), the population consists entirely of homozygotes AA and aa.
Identical by descent (IBD) Identical by descent (IBD) means two genes that have originated from the replication of one single gene in a previous population. The coefficient of inbreeding is the probability that
the two alleles at any locus in an individual are identical by descent (it expresses the degree of relationship between the individual’s parents). If the two alleles in an individual are IBD, the genotype at the locus is said to be autozygous If they are not IBD, the genotype is said to be allozygous.
AA Aa
Aa Aa AA aa AA Aa Aa aa Aa Aa
AA AA Aa Allozygous Autozygous Autozygous homozygote homozygote heterozygote
pA2(1-F) + pAF
In general
Allozygous Autozygous
PAA = pA2(1-F) + pAF
PAa = 2pApa(1-F) + 0
Paa = pa2(1-F) + paF
Calculation of the inbreeding coefficient from pedigree• A pedigree initiated with a common ancestor A through B, C and D, E to I• How to calculate the coefficient of inbreeding for individual I (FI)?
1/2(1+FA)
A
B C
pB→D pC→E
D E
pD→I pE→I
I
The common ancestor A generates two gametes G1 and G2 during meiosis, but only transmits one gamete for its first offspring B and one gamete for its second offspring C.
A pair of gametes contributed to offspring B and C by A may be G1G1, G1G2, G2G1, G2G2, each with a probability of 1/4 because of Mendelian segregation.
For G1G1 and G2G2, the alleles are clearly IBD, For G1G2 and G2G1, the alleles are IBD only if G1 and
G2 are IBD, and G1 and G2 are IBD only if individual A is
autozygous, which has probability FA (the inbreeding
coefficient of A)
The probability for A to generate IBD alleles for B and D is therefore 1/4 + 1/4 + 1/4FA + 1/4FA = 1/2(1 + FA).
The transmission probability of an allele from other parents, B, C, D, E to their own specified offspring is, based on Mendelian segregation,
pB→D = pC→E = pD→I = pE→I =1/2
Finally, the probability that the two alleles at any locus in individual I are identical by descent is
FI = 1/2 (1 + FA) × pB→D × pC→E × pD→I × pE→I
= (1/2)5(1 + FA)
Evolutionary Forces – The Causes of Evolution
For a Hardy-Weinberg equilibrium (HWE) population, the genotype frequencies will remain unchanged from generation to generation. Two questions may arise that concern HWE.
(1) Do such HWE populations exist in nature?(2) More importantly, if a population had
unchanged genotype frequencies over time, it should be in a stationary status. Thus, wild type teosinte would always be teosinte and never change. But what have made teosinte become cultivar maize (see the figure above)?
First of all, no HWE population exists in nature because many evolutionary forces may operate in a population, which cause the genotype frequencies in the population to change.
Secondly, even if a population is at HWE, this
equilibrium may be quickly violated because of some particular evolutionary forces.
These so-called evolutionary forces that cause the
structure and organization of a population to change include mutation, selection, admixture, division, migration, genetic drift… Next, we will talk about the roles of some of these evolutionary forces in shaping a population.
Mutation Mutation is a change in genetic material, including nucleotides substitution, insertions and deletions, and chromosome rearrangements Mutation has different types, forward mutation and reversible mutation Forward mutation Consider a gene A with two alleles A and a, with allele
frequencies pA(t) and pa(t) in generation t Allele A is mutating to allele a, with the mutation rate per generation denoted by u Forward mutation is a process in which the mutating allele is
the prevalent wild type allele
With the definition of mutation rate u (a fraction u of A alleles undergo mutation and become a alleles, whereas a fraction 1-u of A alleles escape mutation and remain A), we have allele frequency in the next generation t+1
pA(t+1) = pA(t) – pA(t)u = (1-u) pA(t). In general, we have
pA(t+1) = (1-u) pA(t) = (1-u)2pA(t-1) = …
= (1-u)t+1pA(0).
Assuming that the initial population is nearly fixed for A, i.e., pA(0) ≈ 1, and that t+1 is not too large relative to 1/u, we can approximate the allele frequencies by
pA(t+1) ≈ pA(0) – (t+1)u,
pa(t+1) ≈ pa(0) + (t+1)u. • The frequency of the mutant a allele increases
linearly with time and the slope of the line equals u.• Because u is small, the linear increase in pa is
difficult to detect unless a very large population size is used.
Reversible mutation
Reversible mutation allows the mutation from A to a (at the rate u per generation) and from a to A (at the rate v per generation).
Thus, allele A can have two origins in any generation:
One being allele A in the previous generation that escaped mutation to allele a
The second being reversibly mutated from allele a in the previous generation
The allele frequency in the current generation is therefore expressed as
pA(t+1) = (1-u)pA(t) + vpa(t) = (1-u-v)pA(t) + v
pA(t+1) – v/(u+v) = (1-u-v)pA(t) + v - v/(u+v)
= (1-u-v)pA(t) + (uv+v2-v)/(u+v)
= [pA(t) – v/(u+v)](1-u-v)
= [(1-u)tpA(0) – v/(u+v)](1-u-v) = [pA(0) – v/(u+v)](1-u-v)t+1
If pA(0) = v/(u+v), we have
pA(1) = pA(2) = … = pA(t+1) = v/(u+v)
We define
pA = v/(u+v) as an equilibrium frequency (irrespective of the starting
frequencies). To reach this equilibrium, it needs to take a long time for
realistic values of the mutation rates.
Admixture
• Admixture is an evolutionary process in which two or more HWE populations with differing allele frequencies are mixed to produce a new population.
• The consequence of admixture is the
deficiency of heterozygous genotypes relative to the frequency expected with HWE for the average allele frequencies
Consider gene A with two alternative alleles A and a
Subpopulation 1 (HWE) Subpopulation 2 (HWE)
AA Aa aa AA Aa aa
pA2 2pApa pa
2 p’A2 2p’Ap’a p’a
2
Admixture
Admixed population, mixed population, metapopulation, aggregate population (HWD)
AA Aa aa
(pA2 + p′A
2)/2 (2pApa + 2p’Ap’a)/2 (pa2 + p’a
2)/2
Random mating
Fused population, total population (HWE)
AA Aa aa
2pˉApˉa
2ap2
Ap
After admixture, the allele frequencies are changed as
We find
(pA2 + p’A
2)/2 (metapopulation) (pA
2 + p’A2)/2 - (pA- p’A)2/4
= (pA2 + p’A
2)/2 + 2pAp’A/4 - (pA2 + p’A
2)/4
= (pA2 + p’A
2)/4 + 2pAp’A/4
= (pA + p’A)2/4
= p-A
2 (HWE)
)/2p'(pp
)/2p'(pp
aaa
AAA
(pa2 + p’a
2)/2 (metapopulation) (pa
2 + p’a2)/2 - (pa – p’a)
2/4
= (pa2 + p’a
2)/2 + 2pap’a/4 - (pa2 + p’a
2)/4
= (pa2 + p’a
2)/4 + 2pap’a/4
= (pa + p’a)2/4
= p-a2 (HWE)
pApa + p’Ap’a (metapopulation) pApa + p’Ap’a + (pA – p’A)(p’a - pa)/2
= pApa + p’Ap’a + (pAp’a + p’Apa - pApa – p’Ap’a)/2
= (pApa + p’Ap’a + pAp’a + p’Apa)/2
= (pA + p’A)(pa + p’a)/2
= 2q-Aq-
a (HWE)
Discovery 1It can be seen that genotype frequencies are not equal to the products of the allele frequencies for the admixed population so that the mixed population is not in HWE.
Discovery 2
Relative to an HWE population, the aggregate population contains too few heterozygous genotypes and too many homozygous genotypes.
Define the variance in allele frequency (in terms of recessive alleles) among the subpopulation by 2.
Value Frequncy
Supopulation 1pa n
Supopulation 2p’a n’ = n
Mean p-a
Based on the definition of variance, we have
2 = [(pa - p-a)
2 + (p’a - p-a)
2]/2
= (pa2 + p’a
2)/2 + p-a2 - pap
-a – p’ap
-a
= (pa2 + p’a
2)/2 + p-a2 – 2p-
a[(pa+p’a)/2]
= (pa2 + p’a
2)/2 - p-a2
2 is actually the difference between the genotype frequencies (RS) in the
metapopulation (equal to the average genotype frequencies among the subpopulations) and the genotype frequencies (RT) that would be expected
in a total population in HWE., i.e.,
2 = RS - RT 0, so RS = RT + 2 RT
Discovery 3
The average frequency of homozygous recessive genotypes among a group of subpopulations is always greater than the frequency of homozygous recessive genotypes that would be expected with random mating, and excess is numerically equal to the variance in the recessive allele frequency.
The relationship RS = RT + 2 RT is called Wahlund’s principle
Example: Two subpopulations of gray squirrels
For the recessive allele, we have pa = 0.16, p’a = 0 The genotype frequency in the metapopulation is
(0.16 + 0)/2 = 0.08The allele frequency in the metapopulation is
(0.16 + 0)/2 = 0.2The frequency of the homozygous recessive genotype in the
HWE total population is0.22 = 0.04 < 0.08
The variance in allele frequency is(0.16 – 0.2)2 + (0 – 0.2)2 = 0.04, which equals the reduction in the frequency of the homozygous recessive.
Population structure
Similar to 2 = RS – RT = (pa2 + p’a
2)/2 - p-a2 for
homozygous recessive genotypes, we have
2 = DS – DT = (pA2 + p’A
2)/2 - p-A
2
for homozygous dominant genotypes. For heterozygous genotypes, we have
HS – HT = -22
Recall the definition of the inbreeding coefficient
F = (P0AA - PAA)/ P0AA (describe the deficiency of heterozygous genotypes in an inbred population, relative to a population in HWE).
We define
FST = (HT – HS)/HT, as the fixation index in the metapopultion.
Metapopulation ≈ inbred population
Redefine
FST = 2/ p-Ap-
a.
This is a fundamental relation in population genetics that connects the fixation index in a metapopulation with the variance in allele frequencies among the subpopulations. The fixation index can be interpreted in terms of the inbreeding coefficient. Thus, the genotype frequencies in a metapopulation are expressed as
AA: p-A
2 + p-Ap-
aFST = p-A
2(1-FST) + p-AFST
Aa: 2p-Ap-
a - 2p-Ap-
a FST = 2p-Ap-
a(1-FST)
aa: p-a2 + p-
Ap-aFST = p-
a2(1-FST) + p-
aFST
Remarks
• Even though each subpopulation itself is undergoing random mating and is in HWE, there is inbreeding in the metapopulation composed of the aggregate of subpopulations.
• A metapopulation may be composed of many smaller subpopulations each of which may be in HWE (theory for population structure).
Natural Selection
• Selection is the principal process that results in greater adaptation of organisms to their environment
• Through selection the genotypes that are superior in survival and reproduction increase in frequency in the population
Haploid selection: selection at the gamete level
Two alleles A and a, with initial frequencies pA and pa
Haploid progeny (reproduction) 10 A (pA=1/2) 10 a (pa=1/2)
Maturation
Survival (Adults) 9 A 6 aViability (or Absolute fitness) 9/10=0.90 6/10=0.60
Relative fitness wA=0.90/0.90=1 wa=0.60/0.90= 0.67
Selection coefficient 0 s=1–0.67=0.33
New frequencies p’A= 9/15 p’a=6/15
Haploid progeny (reproduction) 12 A 8 a
• Viability or survivorship: the probability of survival, which is also called fitness.
• Fitness has two types: Absolute fitness separately for each genotype and relative fitness (the ability of one genotype to survive relative to another genotype taken as a standard)
• It is impossible to measure absolute fitness because it requires knowing the absolute number of each genotype, whereas relative fitness can be measured by the sampling approach
• Selection coefficient: 1 – relative fitness
In general, the new frequency for allele A is expressed as
In the above example, pA = pa = ½, wA = 1, wa = 2/3, and s =1/3, we have p’A = 1/2/(1-1/21/3) = 3/5 = 9/15.
sp-1
p
s)-(1pp
p
/wwpp
p
wpwp
wpp
a
A
aA
A
AaaA
A
aaAA
AAA
sp-1
s)-(1pp
a
aa
(0)s)p(1(0)p
(0)p)1(p
aA
AA
(0)ps)(1(0)p
(0)p
(1)s)p(1(1)p
(1)p)2(p
a2
A
A
aA
AA
(0)ps)(1(0)p
(0)p(t)p
at
A
AA
…
.
By the method of successive substitutions, we have
tss )1(
1
(0)p
(0)p...
1
1
1)-(tp
1)-(tp
(t)p
(t)p
a
A
a
A
a
A
Taking the natural logarithm at both sides of the above equation, we have
(for a not-too-large s)
• If s is not too large, ln(pA/pa) should be linear with time with a slope equal to the value of s.
• This is one approach by which the selection coefficient can be estimated
st(0)p
(0)pln s)ln(1t
(0)p
(0)pln
(t)p
(t)pln
a
A
a
A
a
A
Example: E. coliGeneration ln(pA/pa)0 0.345 0.5310 1.0120 1.4725 1.4730 1.1035 1.50
Using the linear regression model ln[pA(t)/pa(t)] = ln[pA(0)/pa(0)] + st, we estimate
ln(pA/pa) = 0.52 + 0.0323t (Hartl and Dykhuizen 1981).
Diploid selection: selection at the zygote levelTwo alleles A and a, with initial frequencies pA = ½ and pa = ½
Zygote 5 AA 10 Aa 5 aa
Maturation
Survival (Adults) 5 AA 8 Aa 3 aaAbsolute fitness 5/5= 1 8/10=0.8 3/5=0.6Relative fitness wAA=1 wAa=0.8/1=0.80 waa=0.6/1=0.6Selection coefficient 0 hs=1–0.80=0.20 s=1-0.60=0.40
New frequencies p’A= (25+8)/[2(5+8+3)]=18/32 p’a=(32+8)/[2(5+8+3)]=14/32
Random mating with HWE leads toAA: PAA = (18/32)220 = 6Aa: PAa = 2(18/32)(14/32)20 = 10Aa: Paa = (14/32)220 = 4
Define h = hs/s as the degree of dominance of allele a. We have
• h = 0 means that a is recessive to A,
• h = ½ means that the heterozygous fitness is the arithmetic average of the homozygous fitnesses; in this case, the effects of the alleles are said to be additive effects
• h = 1 means that allele a is dominant to allele A.
• It is possible that h < 0 or h > 1.
In general, the allele frequencies in the next generation after diploid selection are expressed as
where the dominator is the average fitness in the population, symbolized by
aa2aAaaAAA
2A
AaaAAA2A
A wpwpp2wp
wppwpp
aa2aAaaAAA
2A wpwpp2wp w
This equation has no analytical solution, and for this reason it is more useful to calculate the difference
w
])w(wp)w(w[pppppΔp aaAaaAaAAAaA
AAA
Example
• In the initial population, PAA = 0, PAa = 2/3, Paa = 1/3, so we have pA = 1/3 and pa = 2/3. The fitness is measured, wAA = 0, wAa = 0.50 and waa = 1.
• In the second generation, we expect
p’A = [(1/3)20 + (1/3)(2/3)0.50]/
[(1/3)20+2(1/3)(2/3)0.50+(2/3)21]
=1/6.
Time required for changes in gene frequency
With the selection coefficient (s), the degree of dominance (h) and 1 (if selection is weak), the difference in allele frequency can be expressed as
pA = pApas[pAh + pa(1-h)].
w
The time t required for the allele frequency of A to change from pA(0) to pA(t) can be determined in each of the three following special cases:
1. Allele A is a favored dominant, in which case h = 0 and pA = pApa
2s, i.e.,
,
In the special case, pa(0) = pa(t) = 1, we have
t (1/s)ln[pA(t)/pa(t)].
whose integral is
st(0)p
1
(0)p
(0)pln
(t)p
1
(t)p
(t)pln
aa
A
aa
A
(t)s(t)ppdt
dp 2aA
A
2. Allele A is a favored and the alleles are additive, in which case h = 1/2 and pA
= pApas/2, i.e.,
whose integral is
In the special case, pa(0) = pa(t) = 1, we have
t (2/s)ln[pA(t)/pa(t)].
2
spp
dt
dp aAA
t2
s
(0)p
(0)pln
(t)p
(t)pln
a
A
a
A
3. Allele A is a favored recessive, in which case h = 1 and pA = pA
2pas, i.e.,
whose integral is
sppdt
dp 2Aa
A
st(0)p
1
(0)p
(0)pln
(t)p
1
(t)p
(t)pln
Aa
A
Aa
A
ImplicationIf selection is operating on a rare harmful recessive allele (say a), what is the consequence?
• This is the case when allele A is a favored dominant, pA = pApa
2s and pa 0, pa2 0.
• Even if the selection coefficient s is very large, pA still change little.
• In other words, the change in allele frequency of a rare harmful recessive is slow whatever the value of the selection coefficient.
• In humans, the forced sterilization of rare homozygous recessive individuals is not genetically sound, although it is also not morally accepted.
Other evolutionary forces
• Migration: The movement of individuals among subpopulations
• Random genetic drift: Fluctuations in allele frequency that happen by chance, particularly in small populations, as a result of random sampling among gametes
• Mutation-selection balance: Selection and mutation affect a population at the same time
Overviews
• HWE (estimate and test)• LD (test)• Inbreeding coefficient (evolutionary significance)• IBD• Evolutionary forces
Mutation
Admixture
Population structure
Selection