Sampling distributions of alleles under models of neutral evolution

Plan

1. Genetic drift and mutation
2. Coalescent
3. Pairwise differences and numbers of segregating sites
4. Population with time-varying size

Mathematical model for sampling distributions of alleles: genetic drift and mutation

Genetic drift

Alleles: A1, A2

Replication = sampling with replacement

A1 – becomes fixed

A2 – becomes lost

[Figure: successive generations G1, G2, ..., Gn]
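The statement "replication = sampling with replacement" can be illustrated with a minimal simulation sketch. The function name `wright_fisher`, the population of 20 gene copies, and the starting count of 10 are illustrative assumptions, not values from the slides.

```python
import random


def wright_fisher(n_copies, count_a1, rng, max_generations=100_000):
    """Simulate genetic drift for two alleles until A1 is fixed or lost.

    In each generation every one of the n_copies gene copies picks its
    parent uniformly at random (sampling with replacement) from the
    previous generation. Returns the list of A1 counts per generation.
    """
    trajectory = [count_a1]
    while 0 < count_a1 < n_copies and len(trajectory) <= max_generations:
        freq = count_a1 / n_copies
        # Each offspring copy is A1 with probability equal to the parental frequency.
        count_a1 = sum(1 for _ in range(n_copies) if rng.random() < freq)
        trajectory.append(count_a1)
    return trajectory


rng = random.Random(1)  # hypothetical seed, for reproducibility only
path = wright_fisher(n_copies=20, count_a1=10, rng=rng)
outcome = "fixed" if path[-1] == 20 else "lost"
print(f"A1 {outcome} after {len(path) - 1} generations")
```

Without mutation, one of the two alleles always ends up fixed and the other lost; only the waiting time and the winner vary between runs.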

Mutation

[Figure: a mutation arising between generations Gk and Gk+1]

Mutation introduces genetic variability to the evolution process.

Mutation follows a Poisson process with intensity μ measured per locus (per site) per generation. A mutation model is further specified by the spatial characterization of where mutations occur and what effects they cause. The most often applied models are:

Infinite sites model: each mutation takes place at a DNA site that never mutated before.
Infinite alleles model: each mutation produces an allele never present in the population before.
Recurrent mutation model: multiple changes of the nucleotide at a site are possible.
Stepwise mutation model: mutation acts bidirectionally, increasing or reducing the number of repeats of a fixed DNA motif.

Infinite sites model

Mutation configuration in the infinite sites model is fully described by a map between numbers of sequences and numbers of mutations.

[Figure: sequences 1-5 (rows) versus mutations 1-6 (columns)]

Statistics of mutations (segregating sites)

Number of segregating sites

[Figure: sequences 1-5 (rows) versus mutations 1-6 (columns); number of segregating sites S = 6]

Pairwise differences

[Figure: sequences 1-5 (rows) versus mutations 1-6 (columns); number of differences between sequences 2 and 3: d23 = 3]

Average number of pairwise differences = 3
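The two statistics above can be computed directly from a 0/1 matrix of sequences by mutations. The matrix `SEQUENCES` below is a hypothetical configuration, chosen only so that it reproduces the summary values quoted on the slides (S = 6, d23 = 3, average 3); the actual configuration drawn in the figure is not recoverable from the text.

```python
from itertools import combinations

# Hypothetical data: rows are sequences 1-5, columns are mutations 1-6,
# and 1 marks a sequence that carries the mutation.
SEQUENCES = [
    [1, 1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
]


def segregating_sites(seqs):
    """Count the sites at which the sequences are not all identical."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def pairwise_differences(seqs):
    """Map each pair (i, j), numbered from 1, to its number of differing sites."""
    return {
        (i + 1, j + 1): sum(a != b for a, b in zip(seqs[i], seqs[j]))
        for i, j in combinations(range(len(seqs)), 2)
    }


S = segregating_sites(SEQUENCES)
d = pairwise_differences(SEQUENCES)
mean_d = sum(d.values()) / len(d)
print(S, d[(2, 3)], mean_d)  # prints: 6 3 3.0
```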

Histogram of pairwise differences

[Figure: histogram; x-axis: no. of differences (0-6), y-axis: no. of pairs (0-3)]

Classes of mutations

[Figure: sequences 1-5 (rows) versus mutations 1-6 (columns), with one mutation marked as a mutation of class 2]

Histogram of classes of mutations

[Figure: histogram; x-axis: class of mutation (1, 2), y-axis: frequency (0 to 1)]
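A sketch of how a histogram of mutation classes could be computed, assuming that the class of a mutation is the number of sampled sequences that carry it. The matrix `SEQUENCES` is hypothetical and serves only as an illustration.

```python
from collections import Counter

# Hypothetical data: rows are sequences 1-5, columns are mutations 1-6,
# and 1 marks a sequence that carries the mutation.
SEQUENCES = [
    [1, 1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
]


def mutation_classes(seqs):
    """Return, for every mutation, the number of sequences carrying it."""
    return [sum(column) for column in zip(*seqs)]


classes = mutation_classes(SEQUENCES)
counts = Counter(classes)
# Relative frequency of each class among all mutations.
frequencies = {b: counts[b] / len(classes) for b in sorted(counts)}
print(frequencies)  # prints: {1: 0.5, 2: 0.5}
```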

Coalescence method

One looks at the past of an n-sample of sequences taken at present. Possible events that happen in the past are coalescences leading to common ancestors of sequences, and mutations along branches of the ancestral tree.

[Figure: Coalescence method. An n-sample taken at present is traced back into the past through generation 1 (τ=1), generation 2 (τ=2), ..., generation k (τ=k); population size is 2N in every generation.]

Coalescence – pairwise statistics

Two sequences. For each sequence, draw a parent at random in generation 1 (τ=1); then for each parent draw a (grand)parent at random in generation 2 (τ=2), and so on. p2(i) is the probability that the common ancestor of the two sequences lived in generation i (τ=i).

$p_2(1) = \frac{1}{2N}$

$p_2(2) = \left(1 - \frac{1}{2N}\right) \frac{1}{2N}$

$p_2(k) = \left(1 - \frac{1}{2N}\right)^{k-1} \frac{1}{2N}$
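The parent-drawing procedure for two sequences can be simulated directly and compared with the geometric distribution of the generation of the common ancestor. This is a minimal sketch; the population size 2N = 20, the number of replicates, and the function names are illustrative assumptions.

```python
import random


def generations_to_common_ancestor(two_n, rng):
    """Trace two sequences back in time until both draw the same parent."""
    generation = 0
    while True:
        generation += 1
        # Each sequence independently picks one of the 2N possible parents.
        if rng.randrange(two_n) == rng.randrange(two_n):
            return generation


def p2(k, two_n):
    """Probability that the common ancestor lived k generations ago."""
    return (1 - 1 / two_n) ** (k - 1) / two_n


rng = random.Random(0)  # hypothetical seed
two_n = 20              # hypothetical population size 2N
draws = [generations_to_common_ancestor(two_n, rng) for _ in range(20_000)]
mean = sum(draws) / len(draws)
print(f"simulated mean = {mean:.2f}, theoretical mean 2N = {two_n}")
```

The mean of the geometric distribution is 2N generations, which motivates the population time scale of 1 unit = 2N generations.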

Coalescence – continuous time approximations

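Population time scale: 1 unit = 2N generations

$t = \frac{\tau}{2N}, \qquad p_2(t) = e^{-t}$

Mutational time scale: 1 unit = 1/(2μ) generations

$t = 2\mu\tau, \qquad \theta = 4\mu N, \qquad p_2(t) = \frac{1}{\theta} e^{-t/\theta}$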

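Coalescence n-sample

s_k: independent, exponentially distributed random variables
μ: mutation intensity
N: population's effective size
θ = 4μN: product parameter
t = 2μτ: mutational time scale (τ is time in number of generations)

$p(s_2, \ldots, s_n) = \prod_{k=2}^{n} \frac{\binom{k}{2}}{\theta} \exp\left( -\frac{\binom{k}{2}}{\theta} s_k \right)$

$p(s_k) = \frac{\binom{k}{2}}{\theta} \exp\left( -\frac{\binom{k}{2}}{\theta} s_k \right)$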
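Because the s_k are independent exponential variables with rate C(k,2)/θ, where C(k,2) = k(k-1)/2, one realization of the coalescence intervals can be drawn in a few lines. This is a sketch under the constant-size model; the values n = 5 and θ = 1 and the function name are illustrative assumptions.

```python
import random
from math import comb


def sample_coalescence_intervals(n, theta, rng):
    """Draw s_n, ..., s_2 on the mutational time scale.

    s_k is the time during which the sample has exactly k ancestors;
    it is exponential with rate C(k, 2) / theta, i.e. mean theta / C(k, 2).
    """
    return {k: rng.expovariate(comb(k, 2) / theta) for k in range(n, 1, -1)}


rng = random.Random(0)  # hypothetical seed
n, theta = 5, 1.0       # hypothetical sample size and product parameter
heights = [
    sum(sample_coalescence_intervals(n, theta, rng).values())
    for _ in range(20_000)
]
mean_height = sum(heights) / len(heights)
expected_height = 2 * theta * (1 - 1 / n)  # sum over k of theta / C(k, 2)
print(f"mean tree height = {mean_height:.3f}, expected = {expected_height:.3f}")
```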

Coalescence method

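The use of coalescence theory allows efficient formulation of appropriate models and gives a good basis for approaching model analysis problems, like hypothesis testing or parameter estimation.

[Figure: coalescent tree of a sample of 5 sequences (leaves 1-5), with coalescence times t2, t3, t4, t5 and the intervals s2, s3, s4, s5 between successive coalescence events]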

Independence of metrics (coalescence times) and topology

Topologies of trees (with ordered branches) are all equally probable.

Metrics (distributions of branch lengths) of trees are determined by the coalescence process which, in turn, depends on population parameters.

Coalescence – statistics of pairwise differences

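Assume the mutational time scale. Then mutations occur with intensity μ = 1/2. Let A2 denote a Z+-valued random variable defined as the number of segregating sites between sequence 1 and sequence 2, and let T be the random variable given by the coalescence time t. The conditional probability that A2 = n is Poisson with λ = t:

$P[A_2 = n \mid T = t] = e^{-t} \frac{t^n}{n!}$

Unconditional distribution:

$P[A_2 = n] = \frac{1}{1+\theta} \left( \frac{\theta}{1+\theta} \right)^n$

Probability generating functions:

$\psi_2(s) = \sum_{n=0}^{\infty} s^n P[A_2 = n]$

$\psi_2(s \mid T = t) = e^{t(s-1)}$

$\psi_2(s) = \frac{1}{1+\theta(1-s)} = \frac{1}{1+\theta} \cdot \frac{1}{1 - \frac{\theta}{1+\theta} s}$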
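The geometric form of the distribution of pairwise differences can be checked by simulation: draw the coalescence time T from an exponential distribution with mean θ, then count the events of a unit-rate Poisson process on [0, T]. This is a sketch only; θ = 3 and the function names are illustrative assumptions.

```python
import random


def pairwise_difference_pmf(n, theta):
    """P[A2 = n] for a constant-size population (geometric distribution)."""
    return (1 / (1 + theta)) * (theta / (1 + theta)) ** n


def simulate_pairwise_differences(theta, rng):
    """Draw A2: exponential coalescence time, then Poisson mutations."""
    t = rng.expovariate(1 / theta)  # coalescence time with mean theta
    count = 0
    elapsed = rng.expovariate(1.0)  # waiting time to the first mutation
    while elapsed <= t:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


rng = random.Random(0)  # hypothetical seed
theta = 3.0             # hypothetical product parameter
sample = [simulate_pairwise_differences(theta, rng) for _ in range(20_000)]
mean_a2 = sum(sample) / len(sample)
share_zero = sample.count(0) / len(sample)
print(f"mean = {mean_a2:.2f} (theory {theta}), "
      f"P[A2 = 0] = {share_zero:.3f} (theory {pairwise_difference_pmf(0, theta):.3f})")
```

Two lineages, each mutating with intensity 1/2, give a total rate of 1 on the mutational time scale, which is why the Poisson parameter equals t.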

Coalescence – population with time varying size

Population with time-varying size

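The population's effective size N(t) changes in time, so the product parameter is also a function of time: θ(t) = 4μN(t).

Joint probability density function of the coalescence times:

$p(t_2, \ldots, t_n) = \prod_{k=2}^{n} \frac{\binom{k}{2}}{\theta(t_k)} \exp\left( -\int_{t_{k+1}}^{t_k} \frac{\binom{k}{2}}{\theta(\sigma)} \, d\sigma \right), \qquad 0 = t_{n+1} \le t_n \le \ldots \le t_3 \le t_2$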
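For a concrete size history, coalescence times with a time-varying product parameter can be sampled by inverting the cumulative coalescence intensity. The sketch below assumes, purely for illustration, the exponential form θ(t) = θ0 exp(-rt) with t measured backwards from the present and r > 0; this form, the parameter values, and the function names are not taken from the slides.

```python
import random
from math import comb, exp, log


def cumulative_intensity(t, theta0, r):
    """Integral of 1/theta(sigma) from 0 to t, for theta(sigma) = theta0 * exp(-r * sigma)."""
    return (exp(r * t) - 1) / (r * theta0)


def inverse_cumulative_intensity(y, theta0, r):
    """Time t at which the cumulative intensity reaches the value y."""
    return log(1 + r * theta0 * y) / r


def sample_coalescence_times(n, theta0, r, rng):
    """Draw t_n <= ... <= t_2 under the assumed exponential size history.

    Given t_{k+1}, the next time t_k satisfies
    C(k, 2) * (Lambda(t_k) - Lambda(t_{k+1})) = E, with E ~ Exp(1).
    """
    times = {}
    level = 0.0  # cumulative intensity at t_{n+1} = 0
    for k in range(n, 1, -1):
        level += rng.expovariate(1.0) / comb(k, 2)
        times[k] = inverse_cumulative_intensity(level, theta0, r)
    return times


rng = random.Random(0)  # hypothetical seed
times = sample_coalescence_times(n=5, theta0=10.0, r=1.0, rng=rng)
print({k: round(t, 3) for k, t in times.items()})
```

Other size histories can be handled the same way whenever the cumulative intensity can be inverted, analytically or numerically.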

How is the history of population size, N(t) and θ(t), encoded in histograms of pairwise differences and mutation classes?

Pairwise differences

[Figures: Pairwise differences I, II, III. Each slide pairs a history of the product parameter (x-axis: time t, y-axis: θ(t)) with the corresponding histogram of pairwise differences (x-axis: no. of differences, y-axis: frequency).]

Mutation classes

Frequencies are computed under the assumption that mutation intensity is low.

[Figures: Mutation classes I, II, III. Each slide pairs a population size history (x-axis: time t, y-axis: N(t)) with the corresponding histogram of mutation classes (x-axis: SNP type, 1 to 10; y-axis: frequency). Labelled histories: N(t) = const; N(t) = N0 exp(rt) with N0 r = 10.]
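For the constant-size case, a standard result for the neutral infinite sites model is that the expected number of mutations of class b is proportional to 1/b. The sketch below normalizes these weights into class frequencies; the sample size n = 11 (giving classes 1 to 10) and the function name are assumptions made for illustration.

```python
def expected_class_frequencies(n):
    """Expected frequencies of mutation classes 1..n-1 for N(t) = const.

    Uses the standard neutral result that the expected number of
    mutations of class b is proportional to 1 / b.
    """
    weights = [1 / b for b in range(1, n)]
    total = sum(weights)
    return [w / total for w in weights]


n = 11  # hypothetical sample size, so that classes run from 1 to 10
for b, freq in enumerate(expected_class_frequencies(n), start=1):
    print(f"class {b:2d}: {freq:.3f}")
```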

Conclusions

Different histories of population sizes lead to different sampling distributions of alleles

Parametric models of different form (exponential, stepwise, logistic) can lead to similar (difficult to distinguish) distributions of alleles

Estimation of population size history from DNA data can be unstable

Models versus data

Parametric and nonparametric estimation of population size histories from DNA samples.

Testing hypotheses on values of parameters under parametric models; testing the hypothesis of a time-constant versus a time-varying scenario.

[Figure: Models versus data. Histogram of pairwise differences and estimation of the history of human population size. Data on worldwide distribution of mtDNA pairwise differences, R. Cann et al. 1987.]

[Figure: Models versus data II. Histogram of classes of mutations. Data on worldwide distribution of mtDNA pairwise differences, R. Cann et al. 1987.]

[Figure: Models versus data III. Data on types of 44 SNPs randomly located in the genome, Picoult, Newberg 2000. Parametric estimates of N(t) based on the above data.]
