Download pdf - HST.508/Biophysics 170: Quantitative genomics …...HST.508/Biophysics 170: Quantitative genomics Module 1: Evolutionary and population genetics Lecture 3: natural selection Professor

HST.508/Biophysics 170:Quantitative genomics

Module 1: Evolutionary and populationgenetics

Lecture 3: natural selection

Professor Robert C. Berwick

Topics for this module

1. The basic forces of evolution; neutral evolution and drift

2. Computing ‘gene geneaologies’ forwards and backwards;the coalescent

3. Coalescent extensions; Natural selection and itsdiscontents

4. Detecting selection: Molecular evolution; from classicalmethods to modern statistical inference techniques

Harvard-MIT Division of Health Sciences and TechnologyHST.508: Quantitative Genomics, Fall 2005Instructors: Leonid Mirny, Robert Berwick, Alvin Kho, Isaac Kohane

1

Agenda for today

1.Coalescing the coalescent: the Great Obsession;adding complications like demographics,recombination; how you can use the coalescent(simulation, estimation, testing)

2. Natural selection: from the basic dynamical systemequation to the diffusion approximation: how cangenes survive?

Coalescent Summary1. Coalescent theory describes the genealogical relationships

among individuals in a Wright-Fisher population

2. Sample, rather than population.

3. Retrospective (how did things get to be the way theyare?) rather than prospective (what happens if?) – betterfor our situation of sampling from data.

3. That is: the coalescent model differs from the ‘classical’random sampling gene pool model in that it gives us theopportunity to start with polymorphism data and workbackwards – start with simplest model, if doesn’t work,change the model

4. Separate demography (coalescent) from genetics(mutation) - allows to separate the two & so gives usbasic test statistics for diversity/variation (θ, π )

2

The Great Obsession: variation (polymorphism)entangled with descent

Time

Observed sample variation: is it fromdescent (tree), or from biology?

2N genCoalescence

Two ‘competing’ stochastic processes intertwined

1. Gene trees: How long until sample sequences have commonancestor (coalesce)?

Answer: the coalescent models the geneology of a sample of nindividuals drawn from a (putative) population of size N as a randombifurcating tree. The n-1 coalescent times T(n), T(n-1), …, T(1) are(to an approximation) mutually independent, exponentiallydistributed random variables

Rate of coalescence for two lineages is (scaled) at 1, where this is 2Ngenerations; Total rate, for k lineages is ‘k choose 2’

2. Genetics: sprinkle in Poisson mutation process with rate λ=ut,then what is expected distribution of variation?

3

Expected time to coalescence

Time to coalescence for n sequences or genes

4

The structure of the basic coalescent

Expected time to coalescence for 2 genes is 2N; variance 2N(2N–1)For n sequences or genes…If time is measured in units of 2N generations, by t’= t/2NE[Tk, k-1]= 1/(k choose 2); variance is square of thisTime to MRCA for all genes is sum of these times, or 2(1-1/n) [againin units of time measured in 2N, i.e., 4N(1-1/n) unscaled time

10

Estimating nucleotide divergence as θ

Generation t–1

Notation: Ti= time to collapse of i genes, sequences,…

u u

uT uT

2uT= 2u 2N = 4uN =θ

5

Expected # mutations, n allele or sequence case

So this gives us the expected amount ofsequence diversity

Summary results for basic coalescent

• Expected time to coalesce, for 2 alleles, 2N

• Expected time to coalesce, all k alleles (hence avgfixation time)

• Expected # of segregating sites

• Expected amount of sequence diversityE[S

N] = uE[T

C] = !

1

i "1i=2

n

#

E[TC] = iE[T

i

i=2

n

! ] = 4N1

i "1i=2

n

!

6

Estimators for theta = 4Nu

What’s this stuff goodfor?

1.Estimation2.Simulation3.Rejecting null model

7

Basic estimation idea: find coalescents that are‘improbable’ to detect interesting (i.e., unlikely)

patterns of mutations

This helps us untangle two sources of variation:gene/sequence tree divergence from polymorphism

ΘT = Average Pairwise Distance

= (1+3+3+3+2+2+2+2+2)/10=2

A mutation on an interior branch will have higher weight

ΘT estimated from pairwise differences(heterozygosity)

ACCTGAACGTAGTTCGAAGACCTGAACGTAGTTCGAATACCTGACCGTAGTACGAATACATGAACGTAGTACGAATACATGAACGTAGTACGAAT * * * *A B C D

A

B

C

D

1 2 3 4 5

(just the average heterozygosity)

8

ΘW=4Νµ estimated from # segregating sites

ACCTGAACGTAGTTCGAAGACCTGAACGTAGTTCGAATACCTGACCGTAGTACGAATACATGAACGTAGTACGAATACATGAACGTAGTACGAAT * * * *A B C D

A

B

C

D

1 2 3 4 5

Watterson, 1975

Expected number of segregating sites:

ΘW= 4/(1+1/2+1/3+1/4)=24/11=2.1818

Different coalescent patterns (relative branch lengths) yielddifferent estimates for theta even though total branch length

is the same and # segregating sites remains the same

Second type of mutation counted more times when calculating the average pairwise distance – typical when there’s a ‘burst’ after a population bottleneck

Use the difference between the two estimates to figure out a statistical measurethat can pick out these two patterns

9

ΘE = 2

No weight to internal branches

ΘE estimated from external branchesACCTGAACGTAGTTCGAAGACCTGAACGTAGTTCGAATACCTGACCGTAGTACGAATACATGAACGTAGTACGAATACATGAACGTAGTACGAAT * * * *A B C D

A

B

C

D

1 2 3 4 5

Which should we use?????

Consider these coalescent pattern differences & whatthey imply about possible patterns of variation

(heterozygosity) if there are neutral mutations sprinkledon these patterns…

Note that S= # segregating sites remains the same…

Expect: more mutations oninterior branches, sampleheterozygosity higher

Expect: fewer mutations oninterior branches, sampleheterozygosity lower

10

Two estimates of theta

Use of Tajima’s D

11

Human mitachondrial DNA

Factors affecting test power

12

Example 1 – human mitochondrial DNA

52 complete molecules

521 segregating sites

ΘT = 44.2 ΘW = 115.3

Std(ΘT−ΘW)=31.8

D = -2.23 (P<0.01)

Ingman et al. 2000

Example 2 – human Y-chromosome

3 Y-chromosome genes, 40 kb of sequence in 53males

47 polymorphic sites

Tajima’s D: -2.3, -2.0 and –1.8 highly significant

TMRCA: No growth: 84,000 (55,000-149,000) Exponential: 59,000 (40,000-140,000)

With exponential growth more mutations are recentand therefore estimated TMRCA is smaller

Thomson et al. 2000, Shen et al. 2000

13

• Ex: ~1400bp at Sod locus in Drosophila10 taxa

5 were identical. The other 5 had 55 mutations

Q: Is this a chance event, or is there selection for thishaplotype?

Applications– Simulation for model testing

Simulation results1. 10000 coalescentsimulations were performed on 10 taxa

2. 55 mutations placed on the coalescent branches

3. Count the number oftimes 5 lineages areidentical

4.This event happened inonly 1.1% of the cases

Conclusion: selection, orsome other mechanismexplains this data – not theneutral mutations

14

Extensions to the mathematical/computationalmodel

1. Effective population size, not census population size

2. Demographic changes generally: population flux,migration, gene flow

3. Recombination – turns the trees into general networks.

4. Selection–gene copies no longer act ‘independently’

5. Statistical– to get confidence limits, etc., must simulateover many generated ‘trees’ – use likelihood methods(Computer packages: Lamarc; Simcoal2; …)

Complications make the simple coalescent look morecomplicated! Population flow, Recombination,

Migration

15

Demographic correctionsPart 1 – Effective population size, Ne

Ne= # individuals in a theoretical population that,

subjected to the same magnitude of drift, wouldpresent an equivalent level of diversity

Department of Corrections – does this actually work?Estimating N from polymorphism data: must use

effective population size for theta!

16

Patching the model; demographics matters

Effective population size, Ne

Working definition: the size of an ideal population that hasthe same properties with respect to genetic drift as theactual population does

Lots of ways to define what’s in italics…

1. Variance adjustment2. Inbreeding adjustment(first related to # of individuals in offspring generation;second related to # of individuals in parental generation)

17

Fluctuating population size

But Why do we use the harmonicmean???

Example: if population size is 1000 w/ pr 0.9 and 100 w/ pr0.1, arithmetic mean is 901, but the harmonic mean is (0.9 x1/1000 + 0.1 x 1/10)-1 = 91.4, an order of magnitude less!

Thus, if we have a population (like humans, cheetahs) goingthrough a ‘squeeze’, this changes the population sizes, hence θ

In general:Variance effective population size

18

Effective population size must be used to ‘patch’ theWright-Fisher model to keep the variance the same

Variance for N1 is p(1–p)/2N1 with probability rVariance for N2 is p(1–p)/2N2 with probability 1–rAverage these 2 populations together, to get meanvariance, ‘solve’ for Ne

Var[p '] = p(1! p)r

2N1

+1! r2N

2

"

#$%

&' or

Ne=

1

r1

N1

+ (1! r)1

N2

i.e., the harmonic mean of the population sizes (the reciprocal of the averageof the reciprocals) is used because it averages the variation properly!

Always smaller than the mean; Much more sensitive to small numbers

Standard variance is pq/N

Demographic Corrections, part 2: effects oncoalescent

19

The effect of population growth

20

Try different simulations…which matches data best?

21

Why is modeling selection hard with thecoalescent?

Problem: Genealogical and mutation processes no longerindependent!

Two alleles, A and a, A has an advantage of sMutation rate between types = u

Krone and Neuhauser 1997

Summary so far…

A strong bottleneck resembles population growth A weaker bottleneck resembles directional selection for some lociAnd balancing selection for other loci

22

This is where current computer packages take us!

http://evolve.zoo.ox.ac.uk/software.html?id=treevolve

Screenshots removed due to copyright reasons. Please see: University of Oxford, Department of Zoology, Evolutionary Biology Group: http://evolve.zoo.ox.ac.uk/software.html________________________

23

http://evolve.zoo.ox.ac.uk/software.html

Modeling natural selection: fromthe simple auto mechanics or

algebra of selection to the diffusionapproximation

Evolution by natural selection• Natural selection is the process by which individuals

contribute more or less offspring in the next generation due tofitness differences, which can be caused by differentialviability, mating success,…

• The selection coefficient is the fitness effect of a mutationacross genetic backgrounds & environments. In a haploidpopulation with two alleles A and a, with fitness values w1 andw2 , the selection coefficient is w1–w2. Fitness values take onarbitrary units since they are measured relative to apopulation mean fitness, w-bar, which is set to 1

• If w11, w12, w22, are the fitness values associated with AA, Aa,and aa, then:

1. If w11 < w12 < w22 there is positive, directional selection forAA and negative, directional selection against aa

Let’s do the basic algebra, and then thegeneral case…

24

Sewall Wright’s adaptive landscape:Understanding the formula

!! =

Sn

1+1

2+

1

3+"+

1

n "1

=11

2.93= 4.78 (4Nu) for locus

!! for nucleotide site=

4.78

768= 0.0062 }

direction

mean fitness

Some dissection…

Variance component of allele Awithin genotype

Why variance? Draw from pool ofA, a gametes many many times:binomial sampling – frequency of Awithin a genotype is either 1, 1/2, or0; variance is p(1-p)/2(“heterozygosity”)

Slope of fitness function dividedby mean population fitness – apotential function?

25

The new reality game show - “Survivor”1 gene in 2 different forms (alleles)

w11 p2 w12 2pq w22 q

2

genotype

frequency

afterselection

Viability

AA Aa aa

p2 q22pq

w11 w12 w22

survivors

Intuitively, w is a ‘growth rate’

Note that if Nt = # before selection, the total # after selectionis:

w11 p2 w12 2pq w22 q

2

genotype

frequency

afterselection

relativefitness

AA Aa aa

p2 q22pq

w11 w12 w22

What is the average (marginal) fitness of A’s?

This is the expectation that A will survive

26

Two allele case: we can now calculate p – p′ i.e., thechange in allele frequency, or evolution

Think about what this means: what if w1 is greater than average fitness? Less?

To derive the rest of the ‘jet fuel’ formula

27

The ‘jet fuel’ formula

Rate proportional to differencein relativefitnesses

Adaptation is not instantaneous:The ratio of p to (1-p) changes by w1/w2 every generation

After t generations,

Getting a feel for the dynamicsGenotype: AA Aa aaRelative fitness: 1

1-hs 1-s1–hs= w12/w111–s = w22 /w11

s= selection coefficient. Measure of fitness of AA relative to aa.If positive, aa is less fit than AA; if negative, aa is more fith= heterozygous effect. Measure of fitness of heterozygote relativeto selective difference between the two homozygotes –a measure ofdominance:

h=0, A dominant, a recessiveh=1, a dominant, A recessive0< h <1 incomplete dominanceh<0 overdominanceh >1 underdominance

28

Dynamical system analysis of ‘adaptive topography’or mean fitness vs. p - nondegenerate case

Equillibrium value of p

The delta p equation in these terms (relative fitnesses)

where

h determines where allele frequency ends up;s determines how quickly it gets there

There turn out to be three kinds of selection:dominant (AA> Aa > aa);overdominant (Aa> AA, aa);underdominant (AA, aa > Aa)

29

Some dissection…

Variance component of allele Awithin genotype

Why variance? Draw from pool ofA, a gametes many many times:binomial sampling – frequency of Awithin a genotype is either 1, 1/2, or0; variance is p(1-p)/2(“heterozygosity”)

Slope of fitness function dividedby mean population fitness – apotential function?

Some exploration, fitness AA is 1.0; Aa = 0.95, aa= 0.90

Screenshots removed due to copyright reasons.

30

Plot avg fitness vs p to get feel for the dynamics…Note that avg fitness is a quadratic function so it can have at most 1 minimumor maximum…

‘Degenerate’ case: quadratic mean fitness, withw

12= (w

11+w

22)/2

One locus, 2 allele case: graphs, p vs. w

p p p

‘Degenerate’ case: quadratic mean fitness, withw

12= (w

11+w

22)/2

w w w

0 1 0 01 1

frequency A frequency A frequency A

Avg fitness Avg fitness Avg fitness

Directional selection Directional selection Zip selection

31

The four nonlinear cases - selection at one locus, 2 alleles - adaptive topography

! p̂ =w22" w

12

(w11" w

12) + (w

22" w

12)=w22" w

12

w22" w

12

= 1

p

w

underdominancep

w

overdominance

Directional selection

p

w

frequency A0 1

Avg fitness w11 < w12 = w22

p

w

0 1frequency A

Avg fitness w11 = w12 > w22

p

w

p

w

The four nonlinear cases - selection at one locus, 2 alleles - adaptivetopography

0 0 11

frequency A frequency A

Avg fitness Avg fitness

Balancing selection

Disruptive selection

32

!!pi= !

p1

p2

p3

"

#

$$$

%

&

'''

=1

2w

p1(1( p

1) ( p

1p2

( p1p3

( p2p1

p2(1( p

2) ( p

2p3

( p3p1

( p3p2

p3(1( p

3)

"

#

$$$

%

&

'''

)w

)p1

)w

)p2

)w

)p3

"

#

$$$$$$$

%

&

'''''''

!!p =

G

w"w

But…

33

Climb every mountain? Some surprising results

• The power of selection: what is the fixation probability for a new mutation?

• If no selection, the pr of loss in a single generation is 1/e or 0.3679

• In particular: suppose new mutation has 1% selection advantage as heterozygote – this is a huge difference

• Yet this will have only a 2% chance of ultimate fixation, starting from 1 copy (in a finite population aPoisson # of offspring, mean 1+s/2, the Pr of extinction in a single generation is e-1(1-s/2), e.g., 0.3642 fors= 0.01)

• Specifically, to be 99% certain a new mutation will fix, for s= 0.001, we need about 4605 allele copies(independent of population size N !!)

• Also very possible for a deleterious mutation to fix, if 2Ns is close to 1

• Why? Intuition: look at the shape of the selection curve – flat at the start, strongest at the middle

• To understand this, we’ll have to dig into how variation changes from generation to generation, in finitepopulations

Regime 1: very lowcopy # Regime 2:

Frequencymatters

The fate of selected mutations

2Ns (compare to Nu factor)

34

Time to fixation for selected genes:can we find this in face of

population size, mutation, drift?

pdf forgene freqp

Mutationrate to pMean

fitness

EffectivePopulationsize

!̂ (p) = Cw2Ne (1" p)4Neu"1 p4Nev"1

The fixation probability of selected alleles – large population(no effects from ‘demographic stochastics’)

Pr{Fixation}= 1–Pr{extinction} = 2s

35

For N large, this is Poisson with mean 1+s, so the # of Aawith k surviving offspring has probability:

Assume binomial draw with N trials, pr of success on each trial is (1+s)/N

Substitute back for pk

This is a Taylor series expansion of ex so we can rewrite as:

If s is small, then expand RHS as power series in lambda, droppingterms beyond square, ie, lambda is near 1

Solved when either λ= 1 or

So when s is small, pr of survival of new mutant is either verynearly 2s or else 0 (if s less than 0)When s=0.01, only 1 new mutant in 50 will succeed inspreading, despite that all are advantageous; if s=0.1, whichfixes very rapidly in deterministic case, only 1 in 6 will win

36

So, 2s turns out to be a good approximation to the exactfixation probability for small s

# of copies of allele matters - must get over theinitial ‘hump’

Pr that n copies go extinct (since all lineages are independent):

Eg, once 100 copies present, s=0.01, pr loss is only 0.14; with 1000copies, less than 3 x 10-9

Tells us about time course of selection with new mutation: it doesfollow two regimes…

What about branching process vs. deterministic equations - thedifference is between the # of copies and the gene frequency

How do we put back drift?

37

And a General Rule

But what about the interaction with drift??????

The whole banana: thediffusion approximationto evolutionary processes

38

Our goal: how can we model the full interaction ofstochastic forces and selection, mutation,

migration…?

Our answer: write an approximating differentialequation that involves all these ‘forces’

Set to 0 and solve to find equilibrium allele frequencydistribution for p

Diffusion Theory: Conceptual Framework

The prob distribution of the # populations – i.e., the probdensity – with different allele frequencies shifts under thedirectional effects of selection & mutation (& migration) andflattens out under the effects of drift

u v

Integrate to find:

39

Two ‘classes’ of evolutionary ‘processes’ pushing apopulation into and out of a time slice of allelefrequencies from p to p+e (think of heat/water

diffusing along a pipe)

1.Directional (‘mean’) processes, M(p): nonzeroexpected change in allele frequency within any onepopulation (selection, mutation, migration,recombination) – measured by expected change overone generation.

2.Nondirectional (‘variance’) processes, V(p):produce expected changed of zero but causedistribution to spread – all driftlike processes –measured by expected variance in next generation

Intuitive formulation of this differential equation (the Kolmogorov forward equation)

Ask: How can we figure out the change in density at a region densitycentered at p*?

p*+e p* p*–e p

Answer: it can changeeither due to M(p) orV(p) – consider in turnwhat each can do byfiguring out the netinflow-outflow thateach can produce

InflowOutflow

40

Contribution from M(p) Inflow – outflow calculation into slice centered at p* for

directional evolutionary process, M(p) – shifts entiredensity distribution over. Note that M(p) is the rate of flow

The direction of flow is fixed; what matters is the magnitude orvolume of the region from which it originates – the difference involume to the left of p* and the volume slice at p*

p*+e p* p*–e p

InflowOutflow

Therefore: (1) flow into thisregion is given by densitycentered at p*-e times rate offlow at p*-e, or:

(2) Flow out of this regionis given by density centered atp* times rate of flow at p, or:

Putting these together:

41

For nondirectional processes V(p), populations canmove either way – net flow is determined by the

difference between the differential flow to the left ofp* and the differential flow to the right of p*, i.e., the

2nd derivative wrt p

Further, there’s a factorof only 1/2 because: only halfthe populations diffusing out from, say, p*-e go to p*

The Kolmogorov forward equation

Now solve for equilibrium by setting this 0….

42

Solution for equilibrium frequency

Setting this to 0 and integrating first term over all values of p(since the eqn holds for all values of p, we get:

This can be solved by standard means…

The Grail Quest ends…

Well, almost… let’s solve this for particular case of M and V

(selection + mutational change f/back)

(variance in Wright-Fisher model)

43

!̂ (p) = Cw2Ne (1" p)4Neu"1 p4Nev"1

Drift wins when 4Neu« 1

!̂ (p) = Cw2Ne (1" p)4Neu"1 p4Nev"1

Figure removed due to copyright reasons.

44

Balancing selection Directional selection

Drift wins when 4Ne« 1Cannot say how effective selection is without knowingeffective population size!!!

!̂ (p)"e4Nesp(1# p)

p(1# p)

!̂ (p)"e4Nesp

p(1# p)

For next time:OK, how do we use this stuff tofigure out whether selection’sbeen at work????

Figures removed due to copyright reasons.

45

To think about from Nature

“Protein sequences evolve through random mutagenesiswith selection for optimal fitness” – Russ, Lowery,Mishra, Yaffe, Ranganathan, sept. 2005, 437:22, p. 579.Natural-like function in artificial WW domains.

46