Upload
tia-wayson
View
214
Download
0
Embed Size (px)
Citation preview
Coalescent Module- Faro July 26th-28th 04www.coalescent.dk
Monday H: The Basic CoalescentW: Forest FireW: The Coalescent + History, Geography & SelectionH: The Coalescent with Recombination
TuesdayH: Recombination cont. W: The Coalescent & CombinatoricsHW: Computer Session H: The Coalescent & Human Evolution
WednesdayH: The Coalescent & Statistics HW: Linkage Disequilibrium Mapping
globin
Exon 2Exon 1 Exon 3
5’ flanking 3’ flanking
(chromosome 11)
Zooming in!(from Harding + Sanger)
*5.000
*20
6*104 bp
3*109 bp
*103
3*103 bp
ATTGCCATGTCGATAATTGGACTATTTTTTTTTT 30 bp
From Cavalli-Sforza,2001
Human Migrations
Data: -globin from sampled humans. From Griffiths, 2001
Assume:
1. At most 1 substitution per position.
2.No recombination
Reducing nucleotide columns to bi-partitions gives a bijection between data & unrooted gene trees.
C G
Africa Non-Africa
0.2 Mu
tation rate: 2.5
Rate of com
mon
ancestry: 1
Past
Present
Simplified model of human sequence evolution.
Wait to com
mon
ancestry: 2N
e
From Griffiths, 2001
Models and their benefits.
Models + Data
1. probability of data (statistics...)
2. probability of individual histories
3. hypothesis testing
4. parameter estimation
Fixed Parameters: Population Structure, Mutation, Selection, Recombination,...
Reproductive Structure
Genealogies of non-sequenced data
Genealogies of sequenced data
Parameter Estimation
Model Testing
Coalescent Theory in Biologywww. coalescent.dk
TGTTGT CATAGTCGTTAT
Haploid Model
Diploid Model
Wright-Fisher Model of Population Reproduction
i. Individuals are made by sampling with replacement in the previous generation.
ii. The probability that 2 alleles have same ancestor in previous generation is 1/2N
Individuals are made by sampling a chromosome from the female and one from the male previous generation with replacement
Assumptions
1. Constant population size
2. No geography
3. No Selection
4. No recombination
10 Alleles’ Ancestry for 15 generations
Mean, E(X2) = 2N.
Ex.: 2N = 20.000, Generation time 30 years, E(X2) = 600000 years.
Waiting for most recent common ancestor - MRCA
P(X2 = j) = (1-(1/2N))j-1 (1/2N)
Distribution until 2 alleles had a common ancestor, X2?:
P(X2 > j) = (1-(1/2N))jP(X2 > 1) = (2N-1)/2N = 1-(1/2N)
1 2N 1 2N
1 1
2
j
1 2N
1
2
j
P(k):=P{k alleles had k distinct parents}
1 2N
1
2N *(2N-1) *..* (2N-(k-1)) =: (2N)[k]
(2N)k
k -> any k -> k k -> k-1
Ancestor choices:
€
P(k) =2N[k ]
(2N)k≈ (k 2 < 2N) 1−
k
2
⎛
⎝ ⎜
⎞
⎠ ⎟/2N ≈ e
−k
2
⎛
⎝ ⎜
⎞
⎠ ⎟/ 2N
€
k
2
⎛
⎝ ⎜
⎞
⎠ ⎟(2N)[k−1]
k -> j
€
Sk, j (2N)[ j ]
For k << 2N:
Sk,j - the number of ways to group k labelled objects into j groups.(Stirling Numbers of second kind.
Geometric/Exponential DistributionsThe Geometric Distribution: {1,..} Geo(p): P{Z=j)=pj(1-p) P{Z>j)=pj E(Z)=1/p.
The Exponential Distribution: R+ Exp (a) Density: f(t) = ae-at, P(X>t)= e-at
Properties: X ~ Exp(a) Y ~ Exp(b) independent
i. P(X>t2|X>t1) = P(X>t2-t1) (t2 > t1)
ii. E(X) = 1/a.
iii. P(Z>t)=(≈)P(X>t) small a (p=e-a).
iv. P(X < Y) = a/(a + b).
v. min(X,Y) ~ Exp (a + b).
2 56 3 0.0
1.0
1.0 corresponds to 2N generations
1 40
2N
0
6 6/2Ne
tc:=td/2Ne
€
Xk is exp[k
2
⎛
⎝ ⎜
⎞
⎠ ⎟] distributed. E(Xk ) =1/
k
2
⎛
⎝ ⎜
⎞
⎠ ⎟
Discrete Continuous Time
Probability for two genes being identical:
P(Coalescence < Mutation) = 1/(1+).
m mutation pr. nucleotide pr.generation. L: seq. length
µ = m*L Mutation pr. allele pr.generation. 2Ne - allele number.
:= 4N*µ -- Mutation intensity in scaled process.
Adding Mutations
sequence
time
Discrete timeDiscrete sequence
Continuous timeContinuous sequence
1/L
1/(2Ne)time
sequence
/2 /2
mutation mutation coalescence
Note: Mutation rate and population size usually appear together as a product, making separate estimation difficult.
1
The Standard Coalescent
Two independent Processes
Continuous: Exponential Waiting Times
Discrete: Choosing Pairs to Coalesce.
1 2 3 4 5
Waiting Coalescing
4--5
3--(4,5)
(1,2)--(3,(4,5))
1--2
€
Exp5
2
⎛
⎝ ⎜ ⎜
⎞
⎠ ⎟ ⎟
€
Exp4
2
⎛
⎝ ⎜ ⎜
⎞
⎠ ⎟ ⎟
€
Exp2
2
⎛
⎝ ⎜ ⎜
⎞
⎠ ⎟ ⎟
€
Exp3
2
⎛
⎝ ⎜ ⎜
⎞
⎠ ⎟ ⎟
{1}{2}{3}{4}{5}
{1,2}{3,4,5}
{1,2,3,4,5}
{1}{2}{3}{4,5}
{1}{2}{3,4,5}
)1(
2
2/1
−=⎟⎟
⎠
⎞⎜⎜⎝
⎛kk
k
Expected Height and Total Branch Length
Expected Total height of tree: Hk= 2(1-1/k)
i.Infinitely many alleles finds 1 allele in finite time. ii. In takes less than twice as long for k alleles to find 1 ancestors as it does for 2 alleles.
Expected Total branch length in tree, Lk:
2*(1 + 1/2 + 1/3 +..+ 1/(k-1)) ca= 2*ln(k-1)
1
2
3
k
1/3
1 2
1
2/(k-1)
Time Epoch Branch Lengths
B. The Paint Box & exchangable distributions on Partitions.
C. All coalescents are restrictions of “The Coalescent” – a process with entrance boundary infinity.
D. Robustness of “The Coalescent”: If offspring distribution is exchangeable and Var(1) --> 2 & E(1m) < Mm for all m, then genealogies follows ”The Coalescent” in distribution.
E. A series of combinatorial results.
Kingman (Stoch.Proc. & Appl. 13.235-248 + 2 other articles,1982)
A. Stochastic Processes on Equivalence Relations.
={(i,i);i= 1,..n} ={(i,j);i,j=1,..n}
1 if s < tqs,t = 0 otherwise
This defines a process, Rt , going from to through equivalence relations on {1,..,n}.
Effective Populations Size, Ne.
In an idealised Wright-Fisher model:
i. loss of variation per generation is 1-1/(2N).
ii. Waiting time for random alleles to find a common ancestor is 2N.
Factors that influences Ne:
i. Variance in offspring. WF: 1. If variance is higher, then effective population size is smaller.
ii. Population size variation - example k cycle:
N1, N2,..,Nk. k/Ne= 1/N1+..+ 1/Nk. N1 = 10 N2= 1000 => Ne= 50.5
iii. Two sexes Ne = 4NfNm/(Nf+Nm)I.e. Nf- 10 Nm -1000 Ne - 40
6 Realisations with 25 leaves
Observations:
Variation great close to root.
Trees are unbalanced.
Sampling more sequences
The probability that the ancestor of the sample of size n is in a sub-sample of size k is
Letting n go to infinity gives (k-1)/(k+1), i.e. even for quite small samples it is quite large.
€
(n+1)(k−1)(n−1)(k+1)
Three Models of Alleles and Mutations.
Infinite Allele Infinite Site Finite Site
acgtgcttacgtgcgtacctgcattcctgcattcctgcat
acgtgcttacgtgcgtacctgcattcctggcttcctgcat
i. Only identity, non-identity is determinable
ii. A mutation creates a new type.
i. Allele is represented by a line.
ii. A mutation always hits a new position.
i. Allele is represented by a sequence.
ii. A mutation changes nucleotide at chosen position.
1 2 3 4 5
11)}1{( →
12)}2,1{( →
21)}2(),1{( →21)}2(),1{( →
1121)}3,2(),1{( →1121)}3,2(),1{( →
2121)}5,4)(3(),2,1{( →
1321)}5,4)(3(),2(),1{( →
Infinite Allele Model
Final Aligned Data Set:
Infinite Site Model
0
1
2
3
4
5
6
7
8
1
1
4
2
5
3
1
5
5
0
1
2
3
4
5
6
7
8
1
1
4
2
5
3
1
5
5
Number of paths:
2 22
3 4 4 6 2
7 7 14 8 2
28 22 10
50 32
82
2
1
345
2
1
3
45
2
{ }, ,
Ignoring mutation position
Ignoring sequence label
Ignoring mutation position
Ignoring sequence label
Labelling and unlabelling:positions and sequences
€
2θ
5(4 + θ)
€
1
(4 + θ)9 coalescence events incompatible with data
4 classes of mutation events incompatible with data
The forward-backward argument
Infinite Site Model: An example
Theta=2.12
2
3 2 3
5 54
910 5
19 14
33
Impossible Ancestral States
Final Aligned Data Set:
acgtgcttacgtgcgtacctgcattcctgcattcctgcats s s
Finite Site Model
1) Only substitutions. s1 TCGGTA s1 TCGGA s2 TGGT-T s2 TGGTT
2) Processes in different positions of the molecule are independent.
3) A nucleotide follows a continuous time Markov Chain.
4) Time reversibility: I.e. πi Pi,j(t) = πj Pj,i(t), where πi is the stationary distribution of i. This implies that
Simplifying assumptions
*)( 21121 2121) l(l)PP(N)(l)*P(lPaP ,NNa,Na,N
a
+=∑
5) The rate matrix, Q, for the continuous time Markov Chain is the same at all times.
=
a
N1 N2
l2+l1l1 l2 N2N1
Evolutionary Substitution Process
t1 t2
CCA
Pi,j(t) = probability of going from i to j in time t.
ijji q
P=>−
)(lim ,
0 iiii q
P−=
−>−
1)(lim ,
0
Jukes-Cantor 69: Total Symmetry.
-3* -3* -3* -3*
TO A C G TFROM
A. Stationary Distribution: (.25,.25,.25,.25)
B. Expected number of substitutions: 3t
ACGT
0 t
)1(41),( 4
,t
t eGCP
−−=
Higher CellsChimp Mouse Fish E.coli
ATTGTGTATATAT….CAG
ATTGCGTATCTAT….CCG
History of Coalescent Approach to Data Analysis
1930-40s: Genealogical arguments well known to Wright & Fisher.
1964: Crow & Kimura: Infinite Allele Model
1968: Motoo Kimura proposes neutral explanation of molecular evolution & population variation. So does King & Jukes
1971: Kimura & Otha proposes infinite sites model.
1972: Ewens’ Formula: Probability of data under infinite allele model.
1975: Watterson makes explicit use of “The Coalescent”
1982: Kingman introduces “The Coalescent”.
1983: Hudson introduces “The Coalescent with Recombination”
1983: Kreitman publishes first major population sequences.
History of Coalescent Approach to Data Analysis
1987-95: Griffiths, Ethier & Tavare calculates site data probability under infinite site model.
1994-: Griffiths-Tavaré + Kuhner-Yamoto-Felsenstein introduces highly computer intensitive simulation techniquees to estimate parameters in population models.
1996- Krone-Neuhauser introduces selection in Coalescent
1998- Donnelly, Stephens, Fearnhead et al.: Major accelerations in coalescent based data analysis.
2000-: Several groups combines Coalescent Theory & Gene Mapping.
2002: HapMap project is started.
Basic Coalescent Summary
i. Genealogical approach to population genetics.
ii. ”The Coalescent” - generic probability distribution on allele trees.
iii. Combining ”The Coalescent” with Allele/Mutation Models allows the calculation the probability of data.