HST.508/Biophysics 170: Quantitative genomics
Module 1: Evolutionary and population genetics
Lecture 3: natural selection
Professor Robert C. Berwick
Topics for this module
1. The basic forces of evolution; neutral evolution and drift
2. Computing ‘gene genealogies’ forwards and backwards; the coalescent
3. Coalescent extensions; Natural selection and its discontents
4. Detecting selection: Molecular evolution; from classical methods to modern statistical inference techniques
Harvard-MIT Division of Health Sciences and Technology
HST.508: Quantitative Genomics, Fall 2005
Instructors: Leonid Mirny, Robert Berwick, Alvin Kho, Isaac Kohane
Agenda for today
1. Coalescing the coalescent: the Great Obsession; adding complications like demographics and recombination; how you can use the coalescent (simulation, estimation, testing)
2. Natural selection: from the basic dynamical system equation to the diffusion approximation: how can genes survive?
Coalescent Summary
1. Coalescent theory describes the genealogical relationships among individuals in a Wright-Fisher population
2. Sample, rather than population.
3. Retrospective (how did things get to be the way they are?) rather than prospective (what happens if?): better for our situation of sampling from data.
4. That is: the coalescent model differs from the ‘classical’ random sampling gene pool model in that it lets us start with polymorphism data and work backwards. Start with the simplest model; if it doesn’t work, change the model.
5. Separates demography (coalescent) from genetics (mutation), which gives us basic test statistics for diversity/variation (θ, π)
The Great Obsession: variation (polymorphism) entangled with descent
[Figure: gene tree; the vertical axis is time, with coalescence about 2N generations back]
Observed sample variation: is it from descent (tree), or from biology?
Two ‘competing’ stochastic processes intertwined
1. Gene trees: How long until sample sequences have a common ancestor (coalesce)?
Answer: the coalescent models the genealogy of a sample of n individuals drawn from a (putative) population of size N as a random bifurcating tree. The n−1 coalescent times T(n), T(n−1), …, T(2) are (to an approximation) mutually independent, exponentially distributed random variables.
With time scaled in units of 2N generations, the rate of coalescence for two lineages is 1; the total rate for k lineages is ‘k choose 2’.
2. Genetics: sprinkle in a Poisson mutation process with rate λ = ut; then what is the expected distribution of variation?
Expected time to coalescence
Time to coalescence for n sequences or genes
The structure of the basic coalescent
Expected time to coalescence for 2 genes is 2N generations; variance 2N(2N−1).
For n sequences or genes, measure time in units of 2N generations, t′ = t/2N. Then
$E[T_{k,k-1}] = 1/\binom{k}{2}$; the variance is the square of this.
Time to MRCA for all genes is the sum of these times, $2(1 - 1/n)$ in units of 2N generations, i.e., $4N(1 - 1/n)$ generations in unscaled time.
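The expected time to the MRCA can be checked by simulation. The sketch below is only an illustration under the basic coalescent assumptions stated on the slide (independent exponential waiting times with rate ‘k choose 2’, time in units of 2N generations); the function names, seed, and replicate count are hypothetical choices, not part of the lecture.

```python
import random

def simulate_tmrca(n, rng):
    """One draw of the time to the MRCA of n lineages, in units of 2N generations."""
    t = 0.0
    for k in range(n, 1, -1):
        rate = k * (k - 1) / 2        # 'k choose 2' pairs can coalesce
        t += rng.expovariate(rate)    # waiting time while k lineages remain
    return t

def expected_tmrca(n):
    """Closed form from the slide: 2(1 - 1/n) in units of 2N generations."""
    return 2.0 * (1.0 - 1.0 / n)

rng = random.Random(1)                # fixed seed so the run is repeatable
n, reps = 10, 20000
mean_tmrca = sum(simulate_tmrca(n, rng) for _ in range(reps)) / reps
print(mean_tmrca, expected_tmrca(n))
```

The simulated mean should sit close to 2(1 − 1/n); multiplying by 2N converts it back to generations.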
Estimating nucleotide divergence as θ
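Notation: T_i = time to collapse of i genes, sequences, …
[Figure: two lineages descending from a common ancestor in generation t−1; each branch has length T and mutation rate u, so each carries uT expected mutations]
2uT = 2u · 2N = 4Nu = θ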
Expected # mutations, n allele or sequence case
So this gives us the expected amount of sequence diversity
Summary results for basic coalescent
• Expected time to coalesce, for 2 alleles: 2N
• Expected time to coalesce, all k alleles (hence avg fixation time)
• Expected # of segregating sites
• Expected amount of sequence diversity
$$E[S_n] = u\,E[T_C] = \theta \sum_{i=2}^{n} \frac{1}{i-1}$$
$$E[T_C] = \sum_{i=2}^{n} i\,E[T_i] = 4N \sum_{i=2}^{n} \frac{1}{i-1}$$
Estimators for theta = 4Nu
What’s this stuff good for?
1. Estimation
2. Simulation
3. Rejecting null model
Basic estimation idea: find coalescents that are ‘improbable’ to detect interesting (i.e., unlikely) patterns of mutations
This helps us untangle two sources of variation: gene/sequence tree divergence from polymorphism
ΘT estimated from pairwise differences (heterozygosity)
ΘT = average pairwise distance = (1+3+3+3+2+2+2+2+2+0)/10 = 2 (just the average heterozygosity)
A mutation on an interior branch will have higher weight
Sequence alignment (segregating sites A–D marked with *):
1  ACCTGAACGTAGTTCGAAG
2  ACCTGAACGTAGTTCGAAT
3  ACCTGACCGTAGTACGAAT
4  ACATGAACGTAGTACGAAT
5  ACATGAACGTAGTACGAAT
     *   *      *    *
     A   B      C    D
[Figure: gene tree for sequences 1–5 with mutations A–D placed on its branches]
ΘW=4Νµ estimated from # segregating sites
Watterson, 1975
Expected number of segregating sites: E[S] = θ(1 + 1/2 + … + 1/(n−1)). With S = 4 and n = 5 in the same five-sequence alignment:
ΘW = 4/(1+1/2+1/3+1/4) = 48/25 = 1.92
Different coalescent patterns (relative branch lengths) yield different estimates for theta even though total branch length is the same and # segregating sites remains the same
Second type of mutation is counted more times when calculating the average pairwise distance; typical when there’s a ‘burst’ after a population bottleneck
Use the difference between the two estimates to construct a statistical measure that can pick out these two patterns
ΘE estimated from external branches
ΘE = 2
No weight to internal branches
(Same five-sequence alignment and tree as for ΘT.)
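The three estimators can be recomputed directly from the five-sequence alignment. This is a minimal sketch, not the lecture's own code: the function names are hypothetical, and ΘE is approximated here by counting sites whose rarer base occurs in exactly one sequence (a ‘folded’ singleton count, since no outgroup is given to polarise the mutations).

```python
from itertools import combinations

seqs = [
    "ACCTGAACGTAGTTCGAAG",
    "ACCTGAACGTAGTTCGAAT",
    "ACCTGACCGTAGTACGAAT",
    "ACATGAACGTAGTACGAAT",
    "ACATGAACGTAGTACGAAT",
]

def pairwise_diversity(seqs):
    """Theta_T: mean number of differences over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    diffs = [sum(a != b for a, b in zip(x, y)) for x, y in pairs]
    return sum(diffs) / len(pairs)

def segregating_columns(seqs):
    """Alignment columns in which more than one base occurs."""
    return [col for col in zip(*seqs) if len(set(col)) > 1]

def watterson_theta(seqs):
    """Theta_W: S divided by 1 + 1/2 + ... + 1/(n-1)."""
    a_n = sum(1.0 / i for i in range(1, len(seqs)))
    return len(segregating_columns(seqs)) / a_n

def singleton_count(seqs):
    """Assumed stand-in for Theta_E: sites whose rarer base is seen exactly once."""
    return sum(1 for col in segregating_columns(seqs)
               if min(col.count(b) for b in set(col)) == 1)

theta_t = pairwise_diversity(seqs)
theta_w = watterson_theta(seqs)
theta_e = singleton_count(seqs)
print(theta_t, theta_w, theta_e)
```

For this alignment the pairwise estimate and the external-branch estimate are both 2, while the segregating-sites estimate is 4 divided by the harmonic sum for n = 5.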
Which should we use?????
Consider these coalescent pattern differences and what they imply about possible patterns of variation (heterozygosity) if there are neutral mutations sprinkled on these patterns…
Note that S = # segregating sites remains the same…
Expect: more mutations on interior branches, sample heterozygosity higher
Expect: fewer mutations on interior branches, sample heterozygosity lower
Two estimates of theta
Use of Tajima’s D
Human mitochondrial DNA
Factors affecting test power
Example 1 – human mitochondrial DNA
52 complete molecules
521 segregating sites
ΘT = 44.2 ΘW = 115.3
Std(ΘT−ΘW)=31.8
D = -2.23 (P<0.01)
Ingman et al. 2000
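The numbers in Example 1 can be reproduced from n, S, and ΘT alone. The sketch below assumes the standard variance constants from Tajima (1989), which the slide does not spell out; the function name and return layout are hypothetical.

```python
import math

def tajimas_d(n, S, pi):
    """Return (theta_W, std of pi - theta_W, D) using Tajima's (1989) constants."""
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    theta_w = S / a1                               # Watterson's estimator
    std = math.sqrt(e1 * S + e2 * S * (S - 1))     # std of (pi - theta_W)
    return theta_w, std, (pi - theta_w) / std

# Values from the mtDNA example: 52 molecules, 521 segregating sites, Theta_T = 44.2
theta_w, std, d = tajimas_d(n=52, S=521, pi=44.2)
print(round(theta_w, 1), round(std, 1), round(d, 2))
```

A strongly negative D, as here, means the pairwise estimate is much smaller than the segregating-sites estimate, i.e., an excess of rare variants.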
Example 2 – human Y-chromosome
3 Y-chromosome genes, 40 kb of sequence in 53 males
47 polymorphic sites
Tajima’s D: −2.3, −2.0 and −1.8, highly significant
TMRCA: No growth: 84,000 (55,000-149,000); Exponential: 59,000 (40,000-140,000)
With exponential growth more mutations are recent and therefore estimated TMRCA is smaller
Thomson et al. 2000, Shen et al. 2000
Applications: Simulation for model testing
• Ex: ~1400 bp at Sod locus in Drosophila, 10 taxa
5 were identical. The other 5 had 55 mutations
Q: Is this a chance event, or is there selection for this haplotype?
Simulation results
1. 10,000 coalescent simulations were performed on 10 taxa
2. 55 mutations placed on the coalescent branches
3. Count the number of times 5 lineages are identical
4. This event happened in only 1.1% of the cases
Conclusion: selection, or some other mechanism, explains this data; neutral mutation alone does not
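The simulation recipe above can be sketched as follows. This is an illustrative reconstruction, not the procedure used for the published result: it assumes a standard neutral coalescent, drops exactly 55 infinite-sites mutations on branches in proportion to branch length, and counts replicates in which at least 5 sampled sequences are identical. All names, the seed, and the replicate count are hypothetical, so the fraction it prints need not match the 1.1% quoted on the slide.

```python
import random
from collections import Counter

def simulate_haplotypes(n, n_mut, rng):
    """Simulate one coalescent tree for n samples and drop n_mut mutations on it."""
    lineages = [frozenset([i]) for i in range(n)]   # each lineage = set of sampled leaves
    born = {lin: 0.0 for lin in lineages}           # time at which each lineage starts
    branches = []                                   # (branch length, leaves below it)
    t = 0.0
    while len(lineages) > 1:
        k = len(lineages)
        t += rng.expovariate(k * (k - 1) / 2)       # time in units of 2N generations
        a, b = rng.sample(lineages, 2)              # random pair coalesces
        for lin in (a, b):
            branches.append((t - born.pop(lin), lin))
            lineages.remove(lin)
        merged = a | b
        lineages.append(merged)
        born[merged] = t
    weights = [length for length, _ in branches]
    haplotypes = [set() for _ in range(n)]
    # each mutation lands on a branch with probability proportional to its length
    for m, (_, leaves) in enumerate(rng.choices(branches, weights=weights, k=n_mut)):
        for leaf in leaves:
            haplotypes[leaf].add(m)
    return [frozenset(h) for h in haplotypes]

def fraction_with_identical(n, n_mut, min_identical, reps, seed=0):
    """Fraction of replicates in which some haplotype occurs >= min_identical times."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        counts = Counter(simulate_haplotypes(n, n_mut, rng))
        if max(counts.values()) >= min_identical:
            hits += 1
    return hits / reps

frac = fraction_with_identical(n=10, n_mut=55, min_identical=5, reps=2000)
print(frac)
```

A small fraction means the observed pattern is rare under neutrality, which is the logic of rejecting the null model by simulation.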
Extensions to the mathematical/computational model
1. Effective population size, not census population size
2. Demographic changes generally: population flux, migration, gene flow
3. Recombination: turns the trees into general networks.
4. Selection: gene copies no longer act ‘independently’
5. Statistical: to get confidence limits, etc., must simulate over many generated ‘trees’; use likelihood methods (Computer packages: Lamarc; Simcoal2; …)
Complications make the simple coalescent look more complicated! Population flow, Recombination, Migration
Demographic corrections
Part 1: Effective population size, Ne
Ne = # individuals in a theoretical population that, subjected to the same magnitude of drift, would present an equivalent level of diversity
Department of Corrections: does this actually work?
Estimating N from polymorphism data: must use effective population size for theta!
Patching the model; demographics matters
Effective population size, Ne
Working definition: the size of an ideal population that has the same properties with respect to genetic drift as the actual population does
Lots of ways to define what’s in italics…
1. Variance adjustment
2. Inbreeding adjustment
(first related to # of individuals in offspring generation; second related to # of individuals in parental generation)
Fluctuating population size
But why do we use the harmonic mean?
Example: if population size is 1000 w/ pr 0.9 and 10 w/ pr 0.1, the arithmetic mean is 901, but the harmonic mean is (0.9 × 1/1000 + 0.1 × 1/10)^−1 ≈ 91.7, an order of magnitude less!
Thus, if we have a population (like humans, cheetahs) going through a ‘squeeze’, this changes the population sizes, hence θ
In general: Variance effective population size
Effective population size must be used to ‘patch’ the Wright-Fisher model to keep the variance the same
Variance for N1 is p(1–p)/2N1 with probability r
Variance for N2 is p(1–p)/2N2 with probability 1–r
Average these 2 populations together to get the mean variance, then ‘solve’ for Ne:
$$\mathrm{Var}[p'] = p(1-p)\left[\frac{r}{2N_1} + \frac{1-r}{2N_2}\right] \quad\text{or}\quad N_e = \frac{1}{r\,\frac{1}{N_1} + (1-r)\,\frac{1}{N_2}}$$
i.e., the harmonic mean of the population sizes (the reciprocal of the average of the reciprocals) is used because it averages the variation properly!
Never larger than the arithmetic mean; much more sensitive to small numbers
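The harmonic-mean formula is easy to check numerically. The sketch below uses sizes 1000 and 10 with probabilities 0.9 and 0.1, the values that reproduce the arithmetic mean of 901 in the example; the function names are hypothetical.

```python
def arithmetic_mean_size(sizes, probs):
    """Probability-weighted arithmetic mean of the population sizes."""
    return sum(p * n for n, p in zip(sizes, probs))

def harmonic_mean_size(sizes, probs):
    """Ne as the reciprocal of the probability-weighted mean of 1/N."""
    return 1.0 / sum(p / n for n, p in zip(sizes, probs))

sizes = [1000, 10]
probs = [0.9, 0.1]
n_arith = arithmetic_mean_size(sizes, probs)
n_e = harmonic_mean_size(sizes, probs)
print(n_arith, n_e)
```

The small population dominates the reciprocal average, which is why a brief bottleneck can pull Ne far below the typical census size.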
Standard variance is pq/N
Demographic Corrections, part 2: effects on coalescent
The effect of population growth
Try different simulations…which matches data best?
Why is modeling selection hard with the coalescent?
Problem: Genealogical and mutation processes are no longer independent!
Two alleles, A and a; A has an advantage of s
Mutation rate between types = u
Krone and Neuhauser 1997
Summary so far…
A strong bottleneck resembles population growth
A weaker bottleneck resembles directional selection for some loci and balancing selection for other loci
This is where current computer packages take us!
http://evolve.zoo.ox.ac.uk/software.html?id=treevolve
Screenshots removed due to copyright reasons. Please see: University of Oxford, Department of Zoology, Evolutionary Biology Group: http://evolve.zoo.ox.ac.uk/software.html
Modeling natural selection: from the simple auto mechanics or algebra of selection to the diffusion approximation
Evolution by natural selection
• Natural selection is the process by which individuals contribute more or fewer offspring to the next generation due to fitness differences, which can be caused by differential viability, mating success, …
• The selection coefficient is the fitness effect of a mutation across genetic backgrounds & environments. In a haploid population with two alleles A and a, with fitness values w1 and w2, the selection coefficient is w1–w2. Fitness values take on arbitrary units since they are measured relative to a population mean fitness, w-bar, which is set to 1
• If w11, w12, w22 are the fitness values associated with AA, Aa, and aa, then:
1. If w11 > w12 > w22 there is positive, directional selection for AA and negative, directional selection against aa
Let’s do the basic algebra, and then the general case…
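Sewall Wright’s adaptive landscape: Understanding the formula
$$\hat\theta = \frac{S_n}{1 + \frac{1}{2} + \frac{1}{3} + \dots + \frac{1}{n-1}} = \frac{14}{2.93} = 4.78 \quad (4Nu)\ \text{for locus}$$
$$\hat\theta\ \text{for nucleotide site} = \frac{4.78}{768} = 0.0062$$
(The denominator 2.93 is the harmonic sum for n = 11 sequences.)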
Some dissection…
Variance component of allele A within genotype
Why variance? Draw from the pool of A, a gametes many, many times: binomial sampling. The frequency of A within a genotype is either 1, 1/2, or 0; the variance is p(1−p)/2 (“heterozygosity”)
Slope of the mean fitness function divided by mean population fitness (this sets the direction): a potential function?
The new reality game show: “Survivor”
1 gene in 2 different forms (alleles)
genotype                       AA         Aa          aa
frequency                      p²         2pq         q²
viability                      w11        w12         w22
after selection (survivors)    w11 p²     w12 2pq     w22 q²
Intuitively, w is a ‘growth rate’
Note that if N_t = # before selection, the total # after selection is N_t (w11 p² + w12 2pq + w22 q²)
The same table in terms of relative fitness:
genotype                       AA         Aa          aa
frequency                      p²         2pq         q²
relative fitness               w11        w12         w22
after selection                w11 p²     w12 2pq     w22 q²
What is the average (marginal) fitness of A’s?
This is the expectation that A will survive
Two allele case: we can now calculate p′ – p, i.e., the change in allele frequency, or evolution
Think about what this means: what if w1 is greater than average fitness? Less?
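One generation of viability selection can be written out explicitly. The sketch below assumes random mating and the standard marginal fitness w1 = p·w11 + q·w12 (the slide's own equations did not survive extraction, so this is a reconstruction), and checks that the direct recursion agrees with the adaptive-landscape form Δp = [p(1−p)/2]·(dw̄/dp)/w̄ dissected earlier. Function names are hypothetical; the fitness values 1.0, 0.95, 0.90 are the ones explored later in the lecture.

```python
def mean_fitness(p, w11, w12, w22):
    """Population mean fitness under random mating."""
    q = 1.0 - p
    return w11 * p * p + w12 * 2 * p * q + w22 * q * q

def marginal_fitness_A(p, w11, w12):
    """Average fitness of an A allele: paired with A (prob p) or a (prob q)."""
    return p * w11 + (1.0 - p) * w12

def next_p(p, w11, w12, w22):
    """Allele frequency after one round of selection."""
    return p * marginal_fitness_A(p, w11, w12) / mean_fitness(p, w11, w12, w22)

def wright_delta_p(p, w11, w12, w22):
    """Change in p from the adaptive-landscape form: (pq/2) * slope / mean fitness."""
    q = 1.0 - p
    slope = 2 * p * w11 + (2 - 4 * p) * w12 - 2 * q * w22   # d(mean fitness)/dp
    return (p * q / 2.0) * slope / mean_fitness(p, w11, w12, w22)

p0 = 0.2
w = (1.0, 0.95, 0.90)
delta_direct = next_p(p0, *w) - p0
delta_wright = wright_delta_p(p0, *w)
print(delta_direct, delta_wright)
```

If the marginal fitness of A exceeds the mean fitness, p increases; if it is below the mean, p decreases.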
To derive the rest of the ‘jet fuel’ formula
The ‘jet fuel’ formula
Rate proportional to difference in relative fitnesses
Adaptation is not instantaneous: the ratio of p to (1−p) changes by w1/w2 every generation
After t generations, p_t/(1−p_t) = (w1/w2)^t · p_0/(1−p_0)
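The statement that the ratio p/(1−p) is multiplied by w1/w2 each generation can be checked by iterating the haploid recursion and comparing with the resulting closed form. A minimal sketch; the function names and the example values (p0 = 0.01, a 1% advantage, 500 generations) are hypothetical.

```python
def haploid_step(p, w1, w2):
    """One generation of haploid selection with fitnesses w1 (A) and w2 (a)."""
    return p * w1 / (p * w1 + (1.0 - p) * w2)

def haploid_closed_form(p0, w1, w2, t):
    """Frequency after t generations from odds_t = (w1/w2)^t * odds_0."""
    odds = (w1 / w2) ** t * p0 / (1.0 - p0)
    return odds / (1.0 + odds)

p = 0.01
for _ in range(500):
    p = haploid_step(p, 1.01, 1.0)
p_closed = haploid_closed_form(0.01, 1.01, 1.0, 500)
print(p, p_closed)
```

Even with a 1% advantage, hundreds of generations pass before the allele becomes common, which is the sense in which adaptation is not instantaneous.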
Getting a feel for the dynamics
Genotype:           AA      Aa       aa
Relative fitness:   1       1−hs     1−s
1−hs = w12/w11
1−s = w22/w11
s = selection coefficient. Measure of fitness of AA relative to aa. If positive, aa is less fit than AA; if negative, aa is more fit
h = heterozygous effect. Measure of fitness of the heterozygote relative to the selective difference between the two homozygotes; a measure of dominance:
h = 0: A dominant, a recessive
h = 1: a dominant, A recessive
0 < h < 1: incomplete dominance
h < 0: overdominance
h > 1: underdominance
Dynamical system analysis of ‘adaptive topography’ or mean fitness vs. p: nondegenerate case
Equilibrium value of p
The delta p equation in these terms (relative fitnesses)
where
h determines where allele frequency ends up; s determines how quickly it gets there
There turn out to be three kinds of selection:
directional (AA > Aa > aa);
overdominant (Aa > AA, aa);
underdominant (AA, aa > Aa)
Some exploration, fitness AA is 1.0; Aa = 0.95, aa= 0.90
Screenshots removed due to copyright reasons.
Plot avg fitness vs. p to get a feel for the dynamics…
Note that avg fitness is a quadratic function of p, so it can have at most 1 minimum or maximum…
‘Degenerate’ case: quadratic mean fitness, with w12 = (w11 + w22)/2 (the quadratic term then vanishes, so mean fitness is linear in p)
One locus, 2 allele case: graphs, p vs. w
[Figure: three panels of average fitness w vs. frequency of A (p from 0 to 1): directional selection, directional selection, zip selection]
The four nonlinear cases: selection at one locus, 2 alleles; adaptive topography
Equilibrium when w11 = w12:
$$\hat p = \frac{w_{22} - w_{12}}{(w_{11} - w_{12}) + (w_{22} - w_{12})} = \frac{w_{22} - w_{12}}{w_{22} - w_{12}} = 1$$
[Figure: panels of average fitness w vs. frequency of A (p from 0 to 1): underdominance; overdominance; directional selection with w11 < w12 = w22; directional selection with w11 = w12 > w22; balancing selection; disruptive selection]
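The equilibrium formula p̂ = (w22 − w12)/((w11 − w12) + (w22 − w12)) can be checked against the recursion in the overdominant (balancing) case. A minimal sketch assuming random mating and the 1, 1−hs, 1−s parameterisation; the values s = 0.1, h = −0.5 and the function names are hypothetical.

```python
def next_p(p, w11, w12, w22):
    """Allele frequency of A after one generation of viability selection."""
    q = 1.0 - p
    w_bar = w11 * p * p + 2 * w12 * p * q + w22 * q * q
    return p * (p * w11 + q * w12) / w_bar

def interior_equilibrium(w11, w12, w22):
    """Interior equilibrium frequency of A from the formula above."""
    return (w22 - w12) / ((w11 - w12) + (w22 - w12))

def iterate(p, w11, w12, w22, generations):
    for _ in range(generations):
        p = next_p(p, w11, w12, w22)
    return p

s, h = 0.1, -0.5                       # h < 0: heterozygote is fittest
w = (1.0, 1.0 - h * s, 1.0 - s)        # fitnesses of AA, Aa, aa
p_hat = interior_equilibrium(*w)
p_from_low = iterate(0.05, *w, 2000)   # start with A rare
p_from_high = iterate(0.95, *w, 2000)  # start with A common
print(p_hat, p_from_low, p_from_high)
```

With overdominance both starting points converge to the same interior frequency; with underdominance (h > 1) the same formula gives an unstable equilibrium that trajectories move away from.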
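$$\Delta p_i = \Delta\begin{pmatrix} p_1 \\ p_2 \\ p_3 \end{pmatrix} = \frac{1}{2\bar w}\begin{pmatrix} p_1(1-p_1) & -p_1 p_2 & -p_1 p_3 \\ -p_2 p_1 & p_2(1-p_2) & -p_2 p_3 \\ -p_3 p_1 & -p_3 p_2 & p_3(1-p_3) \end{pmatrix}\begin{pmatrix} \partial \bar w/\partial p_1 \\ \partial \bar w/\partial p_2 \\ \partial \bar w/\partial p_3 \end{pmatrix}$$
$$\Delta \mathbf{p} = \frac{G}{\bar w}\,\nabla \bar w$$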
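The three-allele gradient form can be compared with the direct recursion Δp_i = p_i(w_i − w̄)/w̄. The sketch below assumes random mating, a symmetric fitness matrix W (the values are made up for illustration), and partial derivatives taken with the p_i treated as independent variables, so that ∂w̄/∂p_i = 2w_i; none of these conventions is stated on the slide.

```python
def marginal_fitnesses(p, W):
    """w_i = sum_j W[i][j] * p_j, the average fitness of allele i."""
    return [sum(W[i][j] * p[j] for j in range(len(p))) for i in range(len(p))]

def mean_fitness(p, W):
    return sum(pi * wi for pi, wi in zip(p, marginal_fitnesses(p, W)))

def delta_p_gradient(p, W):
    """Delta p = (1 / (2 w_bar)) * G * grad(w_bar), with G the covariance-like matrix."""
    k = len(p)
    w_bar = mean_fitness(p, W)
    grad = [2.0 * wi for wi in marginal_fitnesses(p, W)]   # assumed convention, see lead-in
    G = [[p[i] * (1 - p[i]) if i == j else -p[i] * p[j] for j in range(k)]
         for i in range(k)]
    return [sum(G[i][j] * grad[j] for j in range(k)) / (2.0 * w_bar) for i in range(k)]

def delta_p_direct(p, W):
    """Delta p_i = p_i (w_i - w_bar) / w_bar."""
    w = marginal_fitnesses(p, W)
    w_bar = mean_fitness(p, W)
    return [pi * (wi - w_bar) / w_bar for pi, wi in zip(p, w)]

p = [0.2, 0.3, 0.5]
W = [[1.00, 0.95, 0.90],     # hypothetical symmetric genotype fitnesses
     [0.95, 0.85, 1.05],
     [0.90, 1.05, 0.80]]
dp_grad = delta_p_gradient(p, W)
dp_direct = delta_p_direct(p, W)
print(dp_grad, dp_direct)
```

Both routes give the same vector, and its entries sum to zero because the frequencies must still add to 1.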
But…
Climb every mountain? Some surprising results
• The power of selection: what is the fixation probability for a new mutation?
• If no selection, the pr of loss in a single generation is 1/e or 0.3679
• In particular: suppose a new mutation has a 1% selection advantage as heterozygote; this is a huge difference
• Yet this will have only a 2% chance of ultimate fixation, starting from 1 copy (in a finite population with a Poisson # of offspring, mean 1+s/2, the Pr of extinction in a single generation is e^−1 (1−s/2), e.g., 0.3642 for s/2 = 0.01)
• Specifically, to be 99% certain a new mutation will fix, for s = 0.001, we need about 4605 allele copies (independent of population size N!)
• Also very possible for a deleterious mutation to fix, if 2Ns is close to 1
• Why? Intuition: look at the shape of the selection curve: flat at the start, strongest at the middle
• To understand this, we’ll have to dig into how variation changes from generation to generation, in finite populations
[Figure: the fate of selected mutations as a function of 2Ns (compare to Nu factor). Regime 1: very low copy #; Regime 2: frequency matters]
Time to fixation for selected genes: can we find this in the face of population size, mutation, drift?
$$\hat\varphi(p) = C\,\bar w^{\,2N_e}\,(1-p)^{4N_e u - 1}\,p^{4N_e v - 1}$$
Here φ̂(p) is the pdf for gene frequency p, w̄ is the mean fitness, Ne is the effective population size, and v is the mutation rate to p (u is the rate away from it)
The fixation probability of selected alleles: large population (no effects from ‘demographic stochastics’)
Pr{Fixation} = 1 – Pr{extinction} = 2s
For N large, this is Poisson with mean 1+s, so the probability that an Aa individual has k surviving offspring is:
Assume a binomial draw with N trials; the pr of success on each trial is (1+s)/N
Substitute back for p_k
This is a Taylor series expansion of e^x, so we can rewrite as:
If s is small, then expand the RHS as a power series in λ, dropping terms beyond the square, i.e., λ is near 1
Solved when either λ = 1 or
So when s is small, the pr of survival of a new mutant is either very nearly 2s or else 0 (if s is less than 0)
When s = 0.01, only 1 new mutant in 50 will succeed in spreading, despite the fact that all are advantageous; if s = 0.1, which fixes very rapidly in the deterministic case, only 1 in 6 will win
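The branching-process argument can be checked numerically. The sketch below assumes the standard fixed-point condition for a Poisson offspring distribution with mean 1+s, namely that the extinction probability x satisfies x = exp((1+s)(x − 1)); this equation is a reconstruction, since the slide's own formulas were lost. Iterating from x = 0 converges to the smallest root. Function names and the iteration count are hypothetical.

```python
import math

def survival_probability(s, iterations=20000):
    """1 - extinction probability for a Poisson(1+s) branching process (s > 0)."""
    lam = 1.0 + s
    x = 0.0                        # extinction probability by generation 0
    for _ in range(iterations):
        x = math.exp(lam * (x - 1.0))   # extinction probability one generation later
    return 1.0 - x

loss_first_generation_neutral = math.exp(-1.0)   # 1/e, the neutral one-generation loss
surv_001 = survival_probability(0.01)
surv_01 = survival_probability(0.1)
print(loss_first_generation_neutral, surv_001, 2 * 0.01, surv_01, 2 * 0.1)
```

For s = 0.01 the exact survival probability is very close to 2s (about 1 in 50); for s = 0.1 it is a little under 2s, roughly 1 in 6.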
So, 2s turns out to be a good approximation to the exactfixation probability for small s
# of copies of allele matters: must get over the initial ‘hump’
Pr that n copies go extinct (since all lineages are independent):
E.g., once 100 copies are present, with s = 0.01, pr of loss is only 0.14; with 1000 copies, less than 3 × 10^−9
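Because lineages are independent, the loss probability for n copies is the single-copy loss probability raised to the n-th power. The sketch below assumes the single-copy survival probability is 2s, so the loss is (1 − 2s)^n, which is approximately e^(−2sn); the function names are hypothetical.

```python
import math

def loss_probability(n_copies, s):
    """Probability that all n independent copies are lost, with survival 2s each."""
    return (1.0 - 2.0 * s) ** n_copies

def loss_probability_approx(n_copies, s):
    """Exponential approximation e^(-2 s n) to the same quantity."""
    return math.exp(-2.0 * s * n_copies)

loss_100 = loss_probability(100, 0.01)
loss_100_approx = loss_probability_approx(100, 0.01)
loss_1000 = loss_probability(1000, 0.01)
print(loss_100, loss_100_approx, loss_1000)
```

Both forms give a loss probability of roughly 0.13 to 0.14 at 100 copies and a few parts in a billion at 1000 copies, which is the ‘hump’ a new mutation has to get over.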
Tells us about the time course of selection with a new mutation: it does follow two regimes…
What about branching process vs. deterministic equations? The difference is between the # of copies and the gene frequency
How do we put back drift?
And a General Rule
But what about the interaction with drift?
The whole banana: the diffusion approximation to evolutionary processes
Our goal: how can we model the full interaction of stochastic forces and selection, mutation, migration…?
Our answer: write an approximating differential equation that involves all these ‘forces’
Set to 0 and solve to find the equilibrium allele frequency distribution for p
Diffusion Theory: Conceptual Framework
The prob distribution of the # of populations (i.e., the prob density) with different allele frequencies shifts under the directional effects of selection & mutation (& migration) and flattens out under the effects of drift
[Figure: density over allele frequency, with mutation rates u and v]
Integrate to find:
Two ‘classes’ of evolutionary ‘processes’ pushing a population into and out of a time slice of allele frequencies from p to p+e (think of heat/water diffusing along a pipe)
1. Directional (‘mean’) processes, M(p): nonzero expected change in allele frequency within any one population (selection, mutation, migration, recombination); measured by expected change over one generation.
2. Nondirectional (‘variance’) processes, V(p): produce expected change of zero but cause the distribution to spread; all driftlike processes; measured by expected variance in next generation
Intuitive formulation of this differential equation (the Kolmogorov forward equation)
Ask: How can we figure out the change in density at a region centered at p*?
Answer: it can change either due to M(p) or V(p). Consider in turn what each can do by figuring out the net inflow minus outflow that each can produce
[Figure: slice of the density centered at p*, between p*−e and p*+e on the p axis, with inflow and outflow arrows]
Contribution from M(p)
Inflow – outflow calculation into slice centered at p* for directional evolutionary process, M(p): shifts entire density distribution over. Note that M(p) is the rate of flow
The direction of flow is fixed; what matters is the magnitude or volume of the region from which it originates: the difference in volume to the left of p* and the volume slice at p*
[Figure: slice centered at p*, between p*−e and p*+e on the p axis, with inflow and outflow arrows]
Therefore: (1) flow into this region is given by density centered at p*−e times rate of flow at p*−e, or:
(2) Flow out of this region is given by density centered at p* times rate of flow at p*, or:
Putting these together:
For nondirectional processes V(p), populations can move either way; net flow is determined by the difference between the differential flow to the left of p* and the differential flow to the right of p*, i.e., the 2nd derivative wrt p
Further, there’s a factor of 1/2 because only half the populations diffusing out from, say, p*−e go to p*
The Kolmogorov forward equation
Now solve for equilibrium by setting this to 0…
Solution for equilibrium frequency
Setting this to 0 and integrating the first term over all values of p (since the eqn holds for all values of p), we get:
This can be solved by standard means…
The Grail Quest ends…
Well, almost… let’s solve this for a particular case of M and V
(selection + mutational change f/back)
(variance in Wright-Fisher model)
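The equations on these slides did not survive extraction, so the following is a hedged reconstruction in the standard textbook form rather than the lecture's exact notation. It assumes M(p) combines selection (written in adaptive-landscape form) with mutation at rate v toward the allele and u away from it, and V(p) is the Wright-Fisher sampling variance; under those assumptions it recovers the stationary density quoted in the lecture.

```latex
% Kolmogorov forward equation: directional flow M(p), diffusive spread V(p)
\frac{\partial \varphi(p,t)}{\partial t}
  = -\frac{\partial}{\partial p}\bigl[M(p)\,\varphi(p,t)\bigr]
  + \frac{1}{2}\,\frac{\partial^2}{\partial p^2}\bigl[V(p)\,\varphi(p,t)\bigr]

% Equilibrium: set the left side to 0 and integrate once in p (zero net flux)
M(p)\,\hat\varphi(p) = \frac{1}{2}\,\frac{d}{dp}\bigl[V(p)\,\hat\varphi(p)\bigr]
\quad\Longrightarrow\quad
\hat\varphi(p) = \frac{C}{V(p)}\,\exp\!\left(2\int \frac{M(p)}{V(p)}\,dp\right)

% Assumed particular case: selection + mutation, Wright-Fisher variance
M(p) = \frac{p(1-p)}{2\bar w}\,\frac{d\bar w}{dp} - u\,p + v\,(1-p),
\qquad
V(p) = \frac{p(1-p)}{2N_e}

% Then the exponent integrates term by term
2\int \frac{M}{V}\,dp = 2N_e \ln \bar w + 4N_e v \ln p + 4N_e u \ln(1-p)

% which gives the stationary density
\hat\varphi(p) = C\,\bar w^{\,2N_e}\,(1-p)^{4N_e u - 1}\,p^{4N_e v - 1}
```

The extra factor 1/(p(1−p)) from 1/V(p) is what turns the exponents 4N_e v and 4N_e u into 4N_e v − 1 and 4N_e u − 1.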
$$\hat\varphi(p) = C\,\bar w^{\,2N_e}\,(1-p)^{4N_e u - 1}\,p^{4N_e v - 1}$$
Drift wins when 4Neu ≪ 1
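The stationary density can be explored numerically. The sketch below evaluates the unnormalised density with C = 1 and normalises it with a midpoint rule; the parameter values (Ne = 1000, u = v = 5×10⁻⁴, so 4Neu = 4Nev = 2) and the linear mean-fitness function used for the selected case are hypothetical choices for illustration only.

```python
def stationary_density(p, w_bar, ne, u, v):
    """Unnormalised density w_bar(p)^(2Ne) * (1-p)^(4Ne u - 1) * p^(4Ne v - 1)."""
    return w_bar(p) ** (2 * ne) * (1 - p) ** (4 * ne * u - 1) * p ** (4 * ne * v - 1)

def integrate_density(density, steps=20000):
    """Midpoint-rule total mass and mean allele frequency on (0, 1)."""
    h = 1.0 / steps
    mids = [(i + 0.5) * h for i in range(steps)]
    vals = [density(x) for x in mids]
    total = sum(vals) * h
    mean = sum(x * y for x, y in zip(mids, vals)) * h / total
    return total, mean

ne, u, v = 1000, 5e-4, 5e-4

def neutral(p):
    return stationary_density(p, lambda x: 1.0, ne, u, v)           # no selection

def selected(p):
    return stationary_density(p, lambda x: 1.0 + 0.001 * x, ne, u, v)  # assumed w_bar(p)

total_neutral, mean_neutral = integrate_density(neutral)
total_selected, mean_selected = integrate_density(selected)
print(1.0 / total_neutral, mean_neutral, mean_selected)
```

With these neutral parameters the density is proportional to p(1−p), so the normalising constant is 6 and the mean is 0.5; adding selection for the allele shifts the mass toward p = 1. When 4Neu and 4Nev fall below 1 the exponents turn negative and the density piles up at the boundaries, which is the sense in which drift wins.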
Figure removed due to copyright reasons.
Balancing selection: $\hat\varphi(p) \propto \dfrac{e^{4N_e s\,p(1-p)}}{p(1-p)}$
Directional selection: $\hat\varphi(p) \propto \dfrac{e^{4N_e s\,p}}{p(1-p)}$
Drift wins when 4Nes ≪ 1
Cannot say how effective selection is without knowing effective population size!
For next time: OK, how do we use this stuff to figure out whether selection’s been at work?
Figures removed due to copyright reasons.
To think about from Nature
“Protein sequences evolve through random mutagenesis with selection for optimal fitness”: Russ, Lowery, Mishra, Yaffe, Ranganathan, Sept. 2005, 437:22, p. 579. Natural-like function in artificial WW domains.