37
Coalescent Module- Faro July 26th-28th 04 www.coalescent .dk Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography & Selection H: The Coalescent with Recombination Tuesday H: Recombination cont. W: The Coalescent & Combinatorics HW: Computer Session H: The Coalescent & Human Evolution Wednesday H: The Coalescent & Statistics HW: Linkage Disequilibrium Mapping

Coalescent Module- Faro July 26th-28th 04 Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography

Embed Size (px)

Citation preview

Page 1: Coalescent Module- Faro July 26th-28th 04  Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography

Coalescent Module- Faro July 26th-28th 04www.coalescent.dk

Monday H: The Basic CoalescentW: Forest FireW: The Coalescent + History, Geography & SelectionH: The Coalescent with Recombination

TuesdayH: Recombination cont. W: The Coalescent & CombinatoricsHW: Computer Session H: The Coalescent & Human Evolution

WednesdayH: The Coalescent & Statistics HW:  Linkage Disequilibrium Mapping

Page 2: Coalescent Module- Faro July 26th-28th 04  Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography

globin

Exon 2Exon 1 Exon 3

5’ flanking 3’ flanking

(chromosome 11)

Zooming in!(from Harding + Sanger)

*5.000

*20

6*104 bp

3*109 bp

*103

3*103 bp

ATTGCCATGTCGATAATTGGACTATTTTTTTTTT 30 bp

Page 3: Coalescent Module- Faro July 26th-28th 04  Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography

From Cavalli-Sforza,2001

Human Migrations

Page 4: Coalescent Module- Faro July 26th-28th 04  Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography

Data: -globin from sampled humans. From Griffiths, 2001

Assume:

1. At most 1 substitution per position.

2.No recombination

Reducing nucleotide columns to bi-partitions gives a bijection between data & unrooted gene trees.

C G

Page 5: Coalescent Module- Faro July 26th-28th 04  Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography

Africa Non-Africa

0.2 Mu

tation rate: 2.5

Rate of com

mon

ancestry: 1

Past

Present

Simplified model of human sequence evolution.

Wait to com

mon

ancestry: 2N

e

Page 6: Coalescent Module- Faro July 26th-28th 04  Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography

From Griffiths, 2001

Page 7: Coalescent Module- Faro July 26th-28th 04  Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography

Models and their benefits.

Models + Data

1. probability of data (statistics...)

2. probability of individual histories

3. hypothesis testing

4. parameter estimation

Page 8: Coalescent Module- Faro July 26th-28th 04  Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography

Fixed Parameters: Population Structure, Mutation, Selection, Recombination,...

Reproductive Structure

Genealogies of non-sequenced data

Genealogies of sequenced data

Parameter Estimation

Model Testing

Coalescent Theory in Biologywww. coalescent.dk

TGTTGT CATAGTCGTTAT

Page 9: Coalescent Module- Faro July 26th-28th 04  Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography

Haploid Model

Diploid Model

Wright-Fisher Model of Population Reproduction

i. Individuals are made by sampling with replacement in the previous generation.

ii. The probability that 2 alleles have same ancestor in previous generation is 1/2N

Individuals are made by sampling a chromosome from the female and one from the male previous generation with replacement

Assumptions

1. Constant population size

2. No geography

3. No Selection

4. No recombination

Page 10: Coalescent Module- Faro July 26th-28th 04  Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography

10 Alleles’ Ancestry for 15 generations

Page 11: Coalescent Module- Faro July 26th-28th 04  Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography

Mean, E(X2) = 2N.

Ex.: 2N = 20.000, Generation time 30 years, E(X2) = 600000 years.

Waiting for most recent common ancestor - MRCA

P(X2 = j) = (1-(1/2N))j-1 (1/2N)

Distribution until 2 alleles had a common ancestor, X2?:

P(X2 > j) = (1-(1/2N))jP(X2 > 1) = (2N-1)/2N = 1-(1/2N)

1 2N 1 2N

1 1

2

j

1 2N

1

2

j

Page 12: Coalescent Module- Faro July 26th-28th 04  Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography

P(k):=P{k alleles had k distinct parents}

1 2N

1

2N *(2N-1) *..* (2N-(k-1)) =: (2N)[k]

(2N)k

k -> any k -> k k -> k-1

Ancestor choices:

P(k) =2N[k ]

(2N)k≈ (k 2 < 2N) 1−

k

2

⎝ ⎜

⎠ ⎟/2N ≈ e

−k

2

⎝ ⎜

⎠ ⎟/ 2N

k

2

⎝ ⎜

⎠ ⎟(2N)[k−1]

k -> j

Sk, j (2N)[ j ]

For k << 2N:

Sk,j - the number of ways to group k labelled objects into j groups.(Stirling Numbers of second kind.

Page 13: Coalescent Module- Faro July 26th-28th 04  Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography

Geometric/Exponential DistributionsThe Geometric Distribution: {1,..} Geo(p): P{Z=j)=pj(1-p) P{Z>j)=pj E(Z)=1/p.

The Exponential Distribution: R+ Exp (a) Density: f(t) = ae-at, P(X>t)= e-at

Properties: X ~ Exp(a) Y ~ Exp(b) independent

i. P(X>t2|X>t1) = P(X>t2-t1) (t2 > t1)

ii. E(X) = 1/a.

iii. P(Z>t)=(≈)P(X>t) small a (p=e-a).

iv. P(X < Y) = a/(a + b).

v. min(X,Y) ~ Exp (a + b).

Page 14: Coalescent Module- Faro July 26th-28th 04  Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography

2 56 3 0.0

1.0

1.0 corresponds to 2N generations

1 40

2N

0

6 6/2Ne

tc:=td/2Ne

Xk is exp[k

2

⎝ ⎜

⎠ ⎟] distributed. E(Xk ) =1/

k

2

⎝ ⎜

⎠ ⎟

Discrete Continuous Time

Page 15: Coalescent Module- Faro July 26th-28th 04  Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography

Probability for two genes being identical:

P(Coalescence < Mutation) = 1/(1+).

m mutation pr. nucleotide pr.generation. L: seq. length

µ = m*L Mutation pr. allele pr.generation. 2Ne - allele number.

:= 4N*µ -- Mutation intensity in scaled process.

Adding Mutations

sequence

time

Discrete timeDiscrete sequence

Continuous timeContinuous sequence

1/L

1/(2Ne)time

sequence

/2 /2

mutation mutation coalescence

Note: Mutation rate and population size usually appear together as a product, making separate estimation difficult.

1

Page 16: Coalescent Module- Faro July 26th-28th 04  Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography

The Standard Coalescent

Two independent Processes

Continuous: Exponential Waiting Times

Discrete: Choosing Pairs to Coalesce.

1 2 3 4 5

Waiting Coalescing

4--5

3--(4,5)

(1,2)--(3,(4,5))

1--2

Exp5

2

⎝ ⎜ ⎜

⎠ ⎟ ⎟

Exp4

2

⎝ ⎜ ⎜

⎠ ⎟ ⎟

Exp2

2

⎝ ⎜ ⎜

⎠ ⎟ ⎟

Exp3

2

⎝ ⎜ ⎜

⎠ ⎟ ⎟

{1}{2}{3}{4}{5}

{1,2}{3,4,5}

{1,2,3,4,5}

{1}{2}{3}{4,5}

{1}{2}{3,4,5}

Page 17: Coalescent Module- Faro July 26th-28th 04  Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography

)1(

2

2/1

−=⎟⎟

⎞⎜⎜⎝

⎛kk

k

Expected Height and Total Branch Length

Expected Total height of tree: Hk= 2(1-1/k)

i.Infinitely many alleles finds 1 allele in finite time. ii. In takes less than twice as long for k alleles to find 1 ancestors as it does for 2 alleles.

Expected Total branch length in tree, Lk:

2*(1 + 1/2 + 1/3 +..+ 1/(k-1)) ca= 2*ln(k-1)

1

2

3

k

1/3

1 2

1

2/(k-1)

Time Epoch Branch Lengths

Page 18: Coalescent Module- Faro July 26th-28th 04  Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography

B. The Paint Box & exchangable distributions on Partitions.

C. All coalescents are restrictions of “The Coalescent” – a process with entrance boundary infinity.

D. Robustness of “The Coalescent”: If offspring distribution is exchangeable and Var(1) --> 2 & E(1m) < Mm for all m, then genealogies follows ”The Coalescent” in distribution.

E. A series of combinatorial results.

Kingman (Stoch.Proc. & Appl. 13.235-248 + 2 other articles,1982)

A. Stochastic Processes on Equivalence Relations.

={(i,i);i= 1,..n} ={(i,j);i,j=1,..n}

1 if s < tqs,t = 0 otherwise

This defines a process, Rt , going from to through equivalence relations on {1,..,n}.

Page 19: Coalescent Module- Faro July 26th-28th 04  Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography

Effective Populations Size, Ne.

In an idealised Wright-Fisher model:

i. loss of variation per generation is 1-1/(2N).

ii. Waiting time for random alleles to find a common ancestor is 2N.

Factors that influences Ne:

i. Variance in offspring. WF: 1. If variance is higher, then effective population size is smaller.

ii. Population size variation - example k cycle:

N1, N2,..,Nk. k/Ne= 1/N1+..+ 1/Nk. N1 = 10 N2= 1000 => Ne= 50.5

iii. Two sexes Ne = 4NfNm/(Nf+Nm)I.e. Nf- 10 Nm -1000 Ne - 40

Page 20: Coalescent Module- Faro July 26th-28th 04  Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography

6 Realisations with 25 leaves

Observations:

Variation great close to root.

Trees are unbalanced.

Page 21: Coalescent Module- Faro July 26th-28th 04  Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography

Sampling more sequences

The probability that the ancestor of the sample of size n is in a sub-sample of size k is

Letting n go to infinity gives (k-1)/(k+1), i.e. even for quite small samples it is quite large.

(n+1)(k−1)(n−1)(k+1)

Page 22: Coalescent Module- Faro July 26th-28th 04  Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography

Three Models of Alleles and Mutations.

Infinite Allele Infinite Site Finite Site

acgtgcttacgtgcgtacctgcattcctgcattcctgcat

acgtgcttacgtgcgtacctgcattcctggcttcctgcat

i. Only identity, non-identity is determinable

ii. A mutation creates a new type.

i. Allele is represented by a line.

ii. A mutation always hits a new position.

i. Allele is represented by a sequence.

ii. A mutation changes nucleotide at chosen position.

Page 23: Coalescent Module- Faro July 26th-28th 04  Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography

1 2 3 4 5

11)}1{( →

12)}2,1{( →

21)}2(),1{( →21)}2(),1{( →

1121)}3,2(),1{( →1121)}3,2(),1{( →

2121)}5,4)(3(),2,1{( →

1321)}5,4)(3(),2(),1{( →

Infinite Allele Model

Page 24: Coalescent Module- Faro July 26th-28th 04  Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography

Final Aligned Data Set:

Infinite Site Model

Page 25: Coalescent Module- Faro July 26th-28th 04  Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography

0

1

2

3

4

5

6

7

8

1

1

4

2

5

3

1

5

5

Page 26: Coalescent Module- Faro July 26th-28th 04  Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography

0

1

2

3

4

5

6

7

8

1

1

4

2

5

3

1

5

5

Number of paths:

2 22

3 4 4 6 2

7 7 14 8 2

28 22 10

50 32

82

2

Page 27: Coalescent Module- Faro July 26th-28th 04  Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography

1

345

2

1

3

45

2

{ }, ,

Ignoring mutation position

Ignoring sequence label

Ignoring mutation position

Ignoring sequence label

Labelling and unlabelling:positions and sequences

5(4 + θ)

1

(4 + θ)9 coalescence events incompatible with data

4 classes of mutation events incompatible with data

The forward-backward argument

Page 28: Coalescent Module- Faro July 26th-28th 04  Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography

Infinite Site Model: An example

Theta=2.12

2

3 2 3

5 54

910 5

19 14

33

Page 29: Coalescent Module- Faro July 26th-28th 04  Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography

Impossible Ancestral States

Page 30: Coalescent Module- Faro July 26th-28th 04  Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography

Final Aligned Data Set:

acgtgcttacgtgcgtacctgcattcctgcattcctgcats s s

Finite Site Model

Page 31: Coalescent Module- Faro July 26th-28th 04  Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography

1) Only substitutions. s1 TCGGTA s1 TCGGA s2 TGGT-T s2 TGGTT

2) Processes in different positions of the molecule are independent.

3) A nucleotide follows a continuous time Markov Chain.

4) Time reversibility: I.e. πi Pi,j(t) = πj Pj,i(t), where πi is the stationary distribution of i. This implies that

Simplifying assumptions

*)( 21121 2121) l(l)PP(N)(l)*P(lPaP ,NNa,Na,N

a

+=∑

5) The rate matrix, Q, for the continuous time Markov Chain is the same at all times.

=

a

N1 N2

l2+l1l1 l2 N2N1

Page 32: Coalescent Module- Faro July 26th-28th 04  Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography

Evolutionary Substitution Process

t1 t2

CCA

Pi,j(t) = probability of going from i to j in time t.

ijji q

P=>−

)(lim ,

0 iiii q

P−=

−>−

1)(lim ,

0

Page 33: Coalescent Module- Faro July 26th-28th 04  Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography

Jukes-Cantor 69: Total Symmetry.

-3* -3* -3* -3*

TO A C G TFROM

A. Stationary Distribution: (.25,.25,.25,.25)

B. Expected number of substitutions: 3t

ACGT

0 t

)1(41),( 4

,t

t eGCP

−−=

Higher CellsChimp Mouse Fish E.coli

ATTGTGTATATAT….CAG

ATTGCGTATCTAT….CCG

Page 34: Coalescent Module- Faro July 26th-28th 04  Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography

History of Coalescent Approach to Data Analysis

1930-40s: Genealogical arguments well known to Wright & Fisher.

1964: Crow & Kimura: Infinite Allele Model

1968: Motoo Kimura proposes neutral explanation of molecular evolution & population variation. So does King & Jukes

1971: Kimura & Otha proposes infinite sites model.

1972: Ewens’ Formula: Probability of data under infinite allele model.

1975: Watterson makes explicit use of “The Coalescent”

1982: Kingman introduces “The Coalescent”.

1983: Hudson introduces “The Coalescent with Recombination”

1983: Kreitman publishes first major population sequences.

Page 35: Coalescent Module- Faro July 26th-28th 04  Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography

History of Coalescent Approach to Data Analysis

1987-95: Griffiths, Ethier & Tavare calculates site data probability under infinite site model.

1994-: Griffiths-Tavaré + Kuhner-Yamoto-Felsenstein introduces highly computer intensitive simulation techniquees to estimate parameters in population models.

1996- Krone-Neuhauser introduces selection in Coalescent

1998- Donnelly, Stephens, Fearnhead et al.: Major accelerations in coalescent based data analysis.

2000-: Several groups combines Coalescent Theory & Gene Mapping.

2002: HapMap project is started.

Page 36: Coalescent Module- Faro July 26th-28th 04  Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography

Basic Coalescent Summary

i. Genealogical approach to population genetics.

ii. ”The Coalescent” - generic probability distribution on allele trees.

iii. Combining ”The Coalescent” with Allele/Mutation Models allows the calculation the probability of data.

Page 37: Coalescent Module- Faro July 26th-28th 04  Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography