Coalescent Module- Faro July 26th-28th 04 Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography

Coalescent Module- Faro July 26th-28th 04www.coalescent.dk

Monday H: The Basic CoalescentW: Forest FireW: The Coalescent + History, Geography & SelectionH: The Coalescent with Recombination

TuesdayH: Recombination cont. W: The Coalescent & CombinatoricsHW: Computer Session H: The Coalescent & Human Evolution

WednesdayH: The Coalescent & Statistics HW: Linkage Disequilibrium Mapping

http://www.coalescent/

globin

Exon 2Exon 1 Exon 3

5’ flanking 3’ flanking

(chromosome 11)

Zooming in!(from Harding + Sanger)

*5.000

*20

6*104 bp

3*109 bp

*103

3*103 bp

ATTGCCATGTCGATAATTGGACTATTTTTTTTTT 30 bp

From Cavalli-Sforza,2001

Human Migrations

Data: -globin from sampled humans. From Griffiths, 2001

Assume:

1. At most 1 substitution per position.

2.No recombination

Reducing nucleotide columns to bi-partitions gives a bijection between data & unrooted gene trees.

C G

Africa Non-Africa

0.2 Mu

tation rate: 2.5

Rate of com

mon

ancestry: 1

Past

Present

Simplified model of human sequence evolution.

Wait to com

mon

ancestry: 2N

e

From Griffiths, 2001

Models and their benefits.

Models + Data

1. probability of data (statistics...)

2. probability of individual histories

3. hypothesis testing

4. parameter estimation

Fixed Parameters: Population Structure, Mutation, Selection, Recombination,...

Reproductive Structure

Genealogies of non-sequenced data

Genealogies of sequenced data

Parameter Estimation

Model Testing

Coalescent Theory in Biologywww. coalescent.dk

TGTTGT CATAGTCGTTAT

Haploid Model

Diploid Model

Wright-Fisher Model of Population Reproduction

i. Individuals are made by sampling with replacement in the previous generation.

ii. The probability that 2 alleles have same ancestor in previous generation is 1/2N

Individuals are made by sampling a chromosome from the female and one from the male previous generation with replacement

Assumptions

1. Constant population size

2. No geography

3. No Selection

4. No recombination

10 Alleles’ Ancestry for 15 generations

Mean, E(X2) = 2N.

Ex.: 2N = 20.000, Generation time 30 years, E(X2) = 600000 years.

Waiting for most recent common ancestor - MRCA

P(X2 = j) = (1-(1/2N))j-1 (1/2N)

Distribution until 2 alleles had a common ancestor, X2?:

P(X2 > j) = (1-(1/2N))jP(X2 > 1) = (2N-1)/2N = 1-(1/2N)

1 2N 1 2N

1 1

2

j

1 2N

1

2

j

P(k):=P{k alleles had k distinct parents}

1 2N

1

2N *(2N-1) *..* (2N-(k-1)) =: (2N)[k]

(2N)k

k -> any k -> k k -> k-1

Ancestor choices:

€

P(k) =2N[k ]

(2N)k≈ (k 2 < 2N) 1−

k

2

⎛

⎝ ⎜

⎞

⎠ ⎟/2N ≈ e

−k

2

⎛

⎝ ⎜

⎞

⎠ ⎟/ 2N

€

k

2

⎛

⎝ ⎜

⎞

⎠ ⎟(2N)[k−1]

k -> j

€

Sk, j (2N)[ j ]

For k << 2N:

Sk,j - the number of ways to group k labelled objects into j groups.(Stirling Numbers of second kind.

Geometric/Exponential DistributionsThe Geometric Distribution: {1,..} Geo(p): P{Z=j)=pj(1-p) P{Z>j)=pj E(Z)=1/p.

The Exponential Distribution: R+ Exp (a) Density: f(t) = ae-at, P(X>t)= e-at

Properties: X ~ Exp(a) Y ~ Exp(b) independent

i. P(X>t2|X>t1) = P(X>t2-t1) (t2 > t1)

ii. E(X) = 1/a.

iii. P(Z>t)=(≈)P(X>t) small a (p=e-a).

iv. P(X < Y) = a/(a + b).

v. min(X,Y) ~ Exp (a + b).

2 56 3 0.0

1.0

1.0 corresponds to 2N generations

1 40

2N

0

6 6/2Ne

tc:=td/2Ne

€

Xk is exp[k

2

⎛

⎝ ⎜

⎞

⎠ ⎟] distributed. E(Xk ) =1/

k

2

⎛

⎝ ⎜

⎞

⎠ ⎟

Discrete Continuous Time

Probability for two genes being identical:

P(Coalescence < Mutation) = 1/(1+).

m mutation pr. nucleotide pr.generation. L: seq. length

µ = m*L Mutation pr. allele pr.generation. 2Ne - allele number.

:= 4N*µ -- Mutation intensity in scaled process.

Adding Mutations

sequence

time

Discrete timeDiscrete sequence

Continuous timeContinuous sequence

1/L

1/(2Ne)time

sequence

/2 /2

mutation mutation coalescence

Note: Mutation rate and population size usually appear together as a product, making separate estimation difficult.

1

The Standard Coalescent

Two independent Processes

Continuous: Exponential Waiting Times

Discrete: Choosing Pairs to Coalesce.

1 2 3 4 5

Waiting Coalescing

4--5

3--(4,5)

(1,2)--(3,(4,5))

1--2

€

Exp5

2

⎛

⎝ ⎜ ⎜

⎞

⎠ ⎟ ⎟

€

Exp4

2

⎛

⎝ ⎜ ⎜

⎞

⎠ ⎟ ⎟

€

Exp2

2

⎛

⎝ ⎜ ⎜

⎞

⎠ ⎟ ⎟

€

Exp3

2

⎛

⎝ ⎜ ⎜

⎞

⎠ ⎟ ⎟

{1}{2}{3}{4}{5}

{1,2}{3,4,5}

{1,2,3,4,5}

{1}{2}{3}{4,5}

{1}{2}{3,4,5}

)1(

2

2/1

−=⎟⎟

⎠

⎞⎜⎜⎝

⎛kk

k

Expected Height and Total Branch Length

Expected Total height of tree: Hk= 2(1-1/k)

i.Infinitely many alleles finds 1 allele in finite time. ii. In takes less than twice as long for k alleles to find 1 ancestors as it does for 2 alleles.

Expected Total branch length in tree, Lk:

2*(1 + 1/2 + 1/3 +..+ 1/(k-1)) ca= 2*ln(k-1)

1

2

3

k

1/3

1 2

1

2/(k-1)

Time Epoch Branch Lengths

B. The Paint Box & exchangable distributions on Partitions.

C. All coalescents are restrictions of “The Coalescent” – a process with entrance boundary infinity.

D. Robustness of “The Coalescent”: If offspring distribution is exchangeable and Var(1) --> 2 & E(1m) < Mm for all m, then genealogies follows ”The Coalescent” in distribution.

E. A series of combinatorial results.

Kingman (Stoch.Proc. & Appl. 13.235-248 + 2 other articles,1982)

A. Stochastic Processes on Equivalence Relations.

={(i,i);i= 1,..n} ={(i,j);i,j=1,..n}

1 if s < tqs,t = 0 otherwise

This defines a process, Rt , going from to through equivalence relations on {1,..,n}.

Effective Populations Size, Ne.

In an idealised Wright-Fisher model:

i. loss of variation per generation is 1-1/(2N).

ii. Waiting time for random alleles to find a common ancestor is 2N.

Factors that influences Ne:

i. Variance in offspring. WF: 1. If variance is higher, then effective population size is smaller.

ii. Population size variation - example k cycle:

N1, N2,..,Nk. k/Ne= 1/N1+..+ 1/Nk. N1 = 10 N2= 1000 => Ne= 50.5

iii. Two sexes Ne = 4NfNm/(Nf+Nm)I.e. Nf- 10 Nm -1000 Ne - 40

6 Realisations with 25 leaves

Observations:

Variation great close to root.

Trees are unbalanced.

Sampling more sequences

The probability that the ancestor of the sample of size n is in a sub-sample of size k is

Letting n go to infinity gives (k-1)/(k+1), i.e. even for quite small samples it is quite large.

€

(n+1)(k−1)(n−1)(k+1)

Three Models of Alleles and Mutations.

Infinite Allele Infinite Site Finite Site

acgtgcttacgtgcgtacctgcattcctgcattcctgcat

acgtgcttacgtgcgtacctgcattcctggcttcctgcat

i. Only identity, non-identity is determinable

ii. A mutation creates a new type.

i. Allele is represented by a line.

ii. A mutation always hits a new position.

i. Allele is represented by a sequence.

ii. A mutation changes nucleotide at chosen position.

1 2 3 4 5

11)}1{( →

12)}2,1{( →

21)}2(),1{( →21)}2(),1{( →

1121)}3,2(),1{( →1121)}3,2(),1{( →

2121)}5,4)(3(),2,1{( →

1321)}5,4)(3(),2(),1{( →

Infinite Allele Model

Final Aligned Data Set:

Infinite Site Model

0

1

2

3

4

5

6

7

8

1

1

4

2

5

3

1

5

5

0

1

2

3

4

5

6

7

8

1

1

4

2

5

3

1

5

5

Number of paths:

2 22

3 4 4 6 2

7 7 14 8 2

28 22 10

50 32

82

2

1

345

2

1

3

45

2

{ }, ,

Ignoring mutation position

Ignoring sequence label

Ignoring mutation position

Ignoring sequence label

Labelling and unlabelling:positions and sequences

€

2θ

5(4 + θ)

€

1

(4 + θ)9 coalescence events incompatible with data

4 classes of mutation events incompatible with data

The forward-backward argument

Infinite Site Model: An example

Theta=2.12

2

3 2 3

5 54

910 5

19 14

33

Impossible Ancestral States

Final Aligned Data Set:

acgtgcttacgtgcgtacctgcattcctgcattcctgcats s s

Finite Site Model

1) Only substitutions. s1 TCGGTA s1 TCGGA s2 TGGT-T s2 TGGTT

2) Processes in different positions of the molecule are independent.

3) A nucleotide follows a continuous time Markov Chain.

4) Time reversibility: I.e. πi Pi,j(t) = πj Pj,i(t), where πi is the stationary distribution of i. This implies that

Simplifying assumptions

*)( 21121 2121) l(l)PP(N)(l)*P(lPaP ,NNa,Na,N

a

+=∑

5) The rate matrix, Q, for the continuous time Markov Chain is the same at all times.

=

a

N1 N2

l2+l1l1 l2 N2N1

Evolutionary Substitution Process

t1 t2

CCA

Pi,j(t) = probability of going from i to j in time t.

ijji q

P=>−

)(lim ,

0 iiii q

P−=

−>−

1)(lim ,

0

Jukes-Cantor 69: Total Symmetry.

-3* -3* -3* -3*

TO A C G TFROM

A. Stationary Distribution: (.25,.25,.25,.25)

B. Expected number of substitutions: 3t

ACGT

0 t

)1(41),( 4

,t

t eGCP

−−=

Higher CellsChimp Mouse Fish E.coli

ATTGTGTATATAT….CAG

ATTGCGTATCTAT….CCG

History of Coalescent Approach to Data Analysis

1930-40s: Genealogical arguments well known to Wright & Fisher.

1964: Crow & Kimura: Infinite Allele Model

1968: Motoo Kimura proposes neutral explanation of molecular evolution & population variation. So does King & Jukes

1971: Kimura & Otha proposes infinite sites model.

1972: Ewens’ Formula: Probability of data under infinite allele model.

1975: Watterson makes explicit use of “The Coalescent”

1982: Kingman introduces “The Coalescent”.

1983: Hudson introduces “The Coalescent with Recombination”

1983: Kreitman publishes first major population sequences.

History of Coalescent Approach to Data Analysis

1987-95: Griffiths, Ethier & Tavare calculates site data probability under infinite site model.

1994-: Griffiths-Tavaré + Kuhner-Yamoto-Felsenstein introduces highly computer intensitive simulation techniquees to estimate parameters in population models.

1996- Krone-Neuhauser introduces selection in Coalescent

1998- Donnelly, Stephens, Fearnhead et al.: Major accelerations in coalescent based data analysis.

2000-: Several groups combines Coalescent Theory & Gene Mapping.

2002: HapMap project is started.

Basic Coalescent Summary

i. Genealogical approach to population genetics.

ii. ”The Coalescent” - generic probability distribution on allele trees.

iii. Combining ”The Coalescent” with Allele/Mutation Models allows the calculation the probability of data.

Documents

Coalescent Module- Faro July 26th-28th 04 Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography