Genome Evolution. Amos Tanay 2010
Genome evolution
Lecture 6: Inference through sampling. Basic phylogenetics
The paradigm
Alignment
Ancestral Inference on a phylogenetic tree
Tree
Learning a model
Evolutionary rates
Detecting selection and function
Phylogenetics
Rates and transition probabilities
The process’s rate matrix:
$$Q = \begin{pmatrix} q_{00} & q_{01} & q_{02} & \cdots & q_{0n} \\ q_{10} & q_{11} & q_{12} & \cdots & q_{1n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ q_{n0} & q_{n1} & q_{n2} & \cdots & q_{nn} \end{pmatrix}$$
Transition differential equations (backward form):
$$P_{ij}(t + s) = \sum_k P_{ik}(s)\,P_{kj}(t)$$
$$P_{ij}(t + s) - P_{ij}(t) = \sum_{k \neq i} P_{ik}(s)\,P_{kj}(t) + \big[P_{ii}(s) - 1\big]\,P_{ij}(t)$$
Letting $s \to 0$:
$$P'_{ij}(t) = \sum_{k \neq i} q_{ik}\,P_{kj}(t) + q_{ii}\,P_{ij}(t)$$
$$P'(t) = Q\,P(t) \;\Rightarrow\; P(t) = \exp(Qt)$$
Matrix exponential
The differential equation:
$$P'(t) = Q\,P(t) \;\Rightarrow\; P(t) = \exp(Qt)$$
Series solution:
$$\exp(Qt) = \sum_{i=0}^{\infty} \frac{1}{i!}\,Q^i t^i$$
$$\big(\exp(Qt)\big)' = \sum_{i=1}^{\infty} \frac{i}{i!}\,Q^i t^{i-1} = Q \sum_{i=0}^{\infty} \frac{1}{i!}\,Q^i t^i = Q \exp(Qt)$$
Summing over different path lengths: the i-th term of the series accounts for the i-step paths (1-path, 2-path, 3-path, 4-path, 5-path, ...).
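The truncated series can be coded directly; a minimal pure-Python sketch (the Jukes-Cantor-like rate matrix and the number of terms are illustrative assumptions, not from the slides):

```python
def mat_mult(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp_series(Q, t, terms=30):
    """Approximate exp(Qt) by the truncated series sum_{i<terms} (Qt)^i / i!."""
    n = len(Q)
    Qt = [[Q[i][j] * t for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # 0-path term: I
    power = [row[:] for row in result]
    fact = 1.0
    for i in range(1, terms):
        power = mat_mult(power, Qt)  # i-path term: (Qt)^i
        fact *= i
        for r in range(n):
            for c in range(n):
                result[r][c] += power[r][c] / fact
    return result

# Illustrative rate matrix (equal rates, total leaving rate 1 per state):
a = 1.0 / 3.0
Q = [[-1.0, a, a, a], [a, -1.0, a, a], [a, a, -1.0, a], [a, a, a, -1.0]]
P = mat_exp_series(Q, 0.1)  # each row of P(t) is a probability distribution
```

For this symmetric matrix the result matches the closed-form Jukes-Cantor probabilities discussed later in the lecture.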
Computing the matrix exponential using spectral decomposition
If Q is diagonalizable, $Q = S \Lambda S^{-1}$ with $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$, then:
$$\exp(Qt) = \sum_{i=0}^{\infty} \frac{1}{i!}\,Q^i t^i = S \Big(\sum_{i=0}^{\infty} \frac{1}{i!}\,\Lambda^i t^i\Big) S^{-1} = S\,\mathrm{diag}\big(e^{\lambda_1 t}, \ldots, e^{\lambda_n t}\big)\,S^{-1}$$
The eigenvalues determine the process's convergence properties:
The largest eigenvalue of P(t) must be 1, and its associated eigenvector is the stationary distribution of the process.
The second largest eigenvalue dominates the rate of convergence of the process.
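For a case where the spectral decomposition is available in closed form, consider a two-state process; everything below (the rates a, b and the time t) is an illustrative assumption:

```python
import math

# Two-state process: rate a from state 0 to 1, rate b from 1 to 0.
a, b, t = 0.3, 0.7, 1.5

# Eigenvalues of Q = [[-a, a], [b, -b]]: l1 = 0 and l2 = -(a + b).
l1, l2 = 0.0, -(a + b)
# Right eigenvectors as columns of S: (1, 1) for l1 and (a, -b) for l2.
S = [[1.0, a], [1.0, -b]]
det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
Sinv = [[-b / det, -a / det], [-1.0 / det, 1.0 / det]]

# exp(Qt) = S diag(e^{l1 t}, e^{l2 t}) S^{-1}
E = [math.exp(l1 * t), math.exp(l2 * t)]
P = [[sum(S[i][k] * E[k] * Sinv[k][j] for k in range(2)) for j in range(2)]
     for i in range(2)]

# The eigenvalue 0 of Q maps to eigenvalue 1 of P(t); the associated
# stationary distribution is (b, a)/(a + b), approached as t grows.
pi = [b / (a + b), a / (a + b)]
```

The second eigenvalue, $-(a+b)$, appears as the factor $e^{-(a+b)t}$ that decays toward stationarity, matching the claim above.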
Computing the matrix exponential
$$\exp(Qt) = \sum_{i=0}^{\infty} \frac{(Qt)^i}{i!}$$
Series methods: just take the first k summands; reasonable when ||Q|| <= 1 — if the terms are converging, you are OK.
Can do scaling/squaring:
$$e^{Q} = \big(e^{Q/m}\big)^{m}$$
Eigenvalues/decomposition:
$$e^{B} = S\,e^{\Lambda}\,S^{-1}$$
Good when the matrix is symmetric; problems when having similar eigenvalues.
Multiple methods exist for other types of B (e.g., triangular).
Models for nucleotide substitutions
[Figure: substitution-rate diagrams over {A, C, G, T}: Jukes-Cantor and Kimura]
How to model the evolution of a nucleotide?
We discussed its potential allele frequency dynamics and fixation probability
The rate of substitution in a neutral locus:
$$K = 2N\mu \cdot \frac{1}{2N} = \mu$$
A beneficial mutation with s > 0:
$$K = 2N\mu \cdot 2s = 4N\mu s$$
But mutations can happen at different rates for different nucleotides. The two simplest models describing substitution rates date from the 1960s, when sequence data was very scarce – we will discuss more sophisticated models later.
Once we assume the evolutionary duration, we can work with probabilities:
$$\Pr(x_i \mid \mathrm{pa}\,x_i) = \exp(Q t_i)_{\mathrm{pa}\,x_i,\ x_i}$$
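For the Jukes-Cantor model specifically, exp(Qt) has a closed form; a minimal sketch, assuming a total substitution rate of 1 per unit time:

```python
import math

def jc_prob(t, same):
    """Jukes-Cantor transition probability along a branch of length t:
    Pr(same nucleotide) = 1/4 + 3/4 e^{-4t/3};
    Pr(each specific change) = 1/4 - 1/4 e^{-4t/3}."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e

# Short branches barely change the nucleotide; very long branches
# converge to the uniform stationary distribution (1/4 each).
assert jc_prob(0.0, True) == 1.0
assert abs(jc_prob(100.0, True) - 0.25) < 1e-6
```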
The simple tree model
[Figure: a three-leaf tree with leaves S1, S2, S3 and internal nodes H1, H2 (the root)]
Sequences of extant and ancestral species are random variables, with Val(X) = {A,C,G,T}
Extant species S_j, j = 1,...,n; ancestral species H_j, j = 1,...,(n-1)
Tree T: a parents relation pa S_i, pa H_i
(pa S1 = H1, pa S3 = H2; the root: H2)
For multiple loci we can assume independence and use the same parameters (today):
$$\Pr(s, h) = \prod_j \Pr(s^j, h^j)$$
$$\Pr(x_i \mid \mathrm{pa}\,x_i) = \exp(Q t_i)_{\mathrm{pa}\,x_i,\ x_i}$$
$$\Pr(s, h) = \Pr(h_{\mathrm{root}}) \prod_{i \neq \mathrm{root}} \Pr(x_i \mid \mathrm{pa}\,x_i)$$
In the triplet:
$$\Pr(s, h) = \Pr(h_2)\,\Pr(h_1 \mid h_2)\,\Pr(s_3 \mid h_2)\,\Pr(s_2 \mid h_1)\,\Pr(s_1 \mid h_1)$$
Structure
The model is defined using conditional probability distributions and the root “prior” probability distribution
Joint distribution
The model parameters can be the conditional probability distribution tables (CPDs)
Or we can have a single rate matrix Q and branch lengths:
$$\Pr(x \mid y) = \begin{pmatrix} 0.96 & 0.01 & 0.02 & 0.01 \\ 0.01 & 0.96 & 0.01 & 0.02 \\ 0.02 & 0.01 & 0.96 & 0.01 \\ 0.01 & 0.02 & 0.01 & 0.96 \end{pmatrix}$$
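A brute-force version of the triplet joint distribution, using the slide's Pr(x|y) matrix and a uniform root prior (the nucleotide ordering A, C, G, T is an assumption):

```python
# Pr(x|y) matrix from the slide; rows index the parent state.
P = [[0.96, 0.01, 0.02, 0.01],
     [0.01, 0.96, 0.01, 0.02],
     [0.02, 0.01, 0.96, 0.01],
     [0.01, 0.02, 0.01, 0.96]]
prior = [0.25] * 4              # uniform root prior
idx = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def joint(h2, h1, s1, s2, s3):
    """Pr(s,h) = Pr(h2) Pr(h1|h2) Pr(s3|h2) Pr(s2|h1) Pr(s1|h1)."""
    a, b = idx[h2], idx[h1]
    return (prior[a] * P[a][b] * P[a][idx[s3]]
            * P[b][idx[s2]] * P[b][idx[s1]])

# Total probability of the observations (A, C, A), marginalizing the
# hidden nodes by brute force:
total = sum(joint(h2, h1, 'A', 'C', 'A')
            for h2 in 'ACGT' for h1 in 'ACGT')
```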
Ancestral inference
Alignment
Ancestral Inference on a phylogenetic tree
Tree
Learning a model
Evolutionary rates
We assume the model (structure, parameters) is given, and denote it by θ:
$$\Pr(s, h \mid \theta) = \Pr(h_{\mathrm{root}}) \prod_{i \neq \mathrm{root}} \Pr(x_i \mid \mathrm{pa}\,x_i)$$
The total probability of the data s:
$$\Pr(s \mid \theta) = \sum_h \Pr(h, s \mid \theta)$$
This is also called the likelihood L(θ). Computing Pr(s) is the inference problem.
Given the total probability it is easy to compute the posterior of h given the data (easy!):
$$\Pr(h \mid s, \theta) = \Pr(h, s \mid \theta)\,/\,\Pr(s \mid \theta)$$
and the posterior of h_i given the data, by marginalization over the other hidden variables (exponential?):
$$\Pr(h_i = x \mid s, \theta) = \sum_{h:\,h_i = x} \Pr(h, s \mid \theta)\,/\,\Pr(s \mid \theta)$$
Tree models
[Figure: a tree with observed leaves A, C, A and two hidden internal nodes (?)]
Given partial observations s, the total probability of the data:
$$\Pr((A, C, A))$$
and posteriors such as:
$$\Pr(h_1 = A \mid (A, C, A)), \qquad \Pr(h_i = x \mid s) = \sum_{h:\,h_i = x} \Pr(h, s)\,/\,\Pr(s)$$
using the matrix
$$\Pr(x \mid y) = \begin{pmatrix} 0.96 & 0.01 & 0.02 & 0.01 \\ 0.01 & 0.96 & 0.01 & 0.02 \\ 0.02 & 0.01 & 0.96 & 0.01 \\ 0.01 & 0.02 & 0.01 & 0.96 \end{pmatrix}$$
and a uniform prior at the root.
Algorithm (Following Felsenstein 1981):
Up(i):
  if extant(i): up[i][a] = (a == S_i ? 1 : 0); return
  up(l(i)); up(r(i))
  for each a:
    up[i][a] = (Σ_b Pr(X_l(i) = b | X_i = a) up[l(i)][b]) × (Σ_c Pr(X_r(i) = c | X_i = a) up[r(i)][c])
Down(i):
  for each a:
    down[i][a] = Σ_c Pr(X_i = a | X_pa(i) = c) down[pa(i)][c] × Σ_b Pr(X_sib(i) = b | X_pa(i) = c) up[sib(i)][b]
  down(l(i)); down(r(i))
Algorithm:
  up(root)
  L = Σ_a Pr(root = a) up[root][a]; LL = log(L)
  for each a: down[root][a] = Pr(root = a)
  down(l(root)); down(r(root))
Dynamic programming to compute the total probability?
[Figure: the example tree — observed leaves S1, S2, S3 and hidden internal nodes annotated with up[4], up[5]] (Felsenstein)
Computing marginals and posteriors
[Figure: the same tree, annotated with down[4], down[5], up[3]] (Felsenstein)
$$\Pr(h_i = c \mid s) = \frac{\mathrm{up}[i][c]\,\mathrm{down}[i][c]}{\sum_j \mathrm{up}[i][j]\,\mathrm{down}[i][j]}$$
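The up-down recursions can be sketched in Python for the three-leaf tree used above (root H2 with children H1 and S3; H1 with children S1 and S2). The matrix, prior, and observations (A, C, A) follow the slides' running example; all variable names are my own:

```python
P = [[0.96, 0.01, 0.02, 0.01],   # Pr(child = j | parent = i)
     [0.01, 0.96, 0.01, 0.02],
     [0.02, 0.01, 0.96, 0.01],
     [0.01, 0.02, 0.01, 0.96]]
prior = [0.25] * 4               # uniform root prior

def up_leaf(obs):
    # Observed leaf: indicator of the observed nucleotide.
    return [1.0 if a == obs else 0.0 for a in range(4)]

def up_internal(up_left, up_right):
    # up[i][a] = (sum_b P[a][b] up_l[b]) * (sum_c P[a][c] up_r[c])
    return [sum(P[a][b] * up_left[b] for b in range(4)) *
            sum(P[a][c] * up_right[c] for c in range(4))
            for a in range(4)]

up_s1, up_s2, up_s3 = up_leaf(0), up_leaf(1), up_leaf(0)  # observe A, C, A
up_h1 = up_internal(up_s1, up_s2)
up_h2 = up_internal(up_h1, up_s3)
likelihood = sum(prior[a] * up_h2[a] for a in range(4))   # Pr(s)

# Down pass: the root's down message is the prior; H1's sibling is S3.
down_h2 = prior[:]
down_h1 = [sum(P[c][a] * down_h2[c] *
               sum(P[c][b] * up_s3[b] for b in range(4))
               for c in range(4))
           for a in range(4)]
# Posterior of H1: up * down, normalized.
post_h1 = [up_h1[a] * down_h1[a] for a in range(4)]
Z = sum(post_h1)                 # equals Pr(s) again
post_h1 = [p / Z for p in post_h1]
```

The normalizer Z of any node's posterior equals the total probability Pr(s), which is a useful internal consistency check.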
Transition posteriors: not independent!
[Figure: two adjacent positions with observed data A, C, A; using the Pr(x|y) matrix above,
Down messages: (0.25), (0.25), (0.25), (0.25);
Up messages: (0.01)(0.96), (0.01)(0.96), (0.01)(0.02), (0.02)(0.01)]
Practical inference can be hard
$$\Pr(s, h \mid \theta) = \Pr(h_{\mathrm{root}}) \prod_{i \neq \mathrm{root}} \Pr(x_i \mid \mathrm{pa}\,x_i)$$
We want to perform inference in an extended tree model expressing context effects:
[Figure: a 3×3 grid of tree nodes 1-9, with extra context (horizontal) edges]
With undirected cycles, the model is well defined but inference becomes hard
We want to perform inference on the tree structure itself!
Each structure imposes a probability on the observed data, so we can perform inference over the space of all possible tree structures, or tree structures plus branch lengths:
$$\Pr(\tau \mid D) = \Pr(D \mid \tau)\,\Pr(\tau)\,\Big/\,\sum_{\tau'} \Pr(D \mid \tau')\,\Pr(\tau')$$
What makes these examples difficult?
Learning from complete data using the Maximum Likelihood
$$\hat\theta = \arg\max_\theta L(\theta \mid D) = \arg\max_\theta \Pr(D \mid \theta)$$
Likelihood: function of parameters (and not a distribution!)
Transforming learning into an optimization problem
Simplest version: “dice problem”. Counts are transformed to probabilities
Proof: using Lagrange multipliers (in the Ex)
[Figure: two aligned sequences X = ...AGACGAATAACGAGTAA..., Y = ...AGACGAATATCGACTAA..., summarized by a 4×4 count table n over A, G, C, T and the corresponding probabilities p]
We assume alignment positions represent independent observation from the same model
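The "dice problem" in one line: maximum-likelihood parameters are just normalized counts (the counts below are hypothetical):

```python
# ML for the "dice problem": counts are transformed into probabilities.
counts = {'A': 40, 'C': 10, 'G': 12, 'T': 38}   # hypothetical observed counts
N = sum(counts.values())
theta = {x: n / N for x, n in counts.items()}    # theta_x = n_x / N
```

The Lagrange-multiplier proof referenced in the exercise shows this is the unique maximum of the multinomial likelihood under the constraint that the probabilities sum to 1.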
Expectation-Maximization
[Figure: a five-leaf tree with branch lengths marked on each lineage]
The EM update:
$$\theta^{(k+1)} = \arg\max_\theta \sum_i E\big[LL(X_i, X_{\mathrm{pa}(i)}) \mid S, \theta^{(k)}\big]$$
The expected counts decompose over lineages (a node X, its parent pa X, and its sibling sib X):
$$E\big[\mathbf{1}\{X_i = a, X_{\mathrm{pa}(i)} = b\} \mid S\big] = \frac{1}{Z} \sum_{\mathrm{loci}} \mathrm{up}[i][a]\,\Pr(X_i = a \mid X_{\mathrm{pa}(i)} = b)\,\mathrm{down}[\mathrm{pa}(i)][b] \sum_{c} \Pr(X_{\mathrm{sib}(i)} = c \mid X_{\mathrm{pa}(i)} = b)\,\mathrm{up}[\mathrm{sib}(i)][c]$$
Inference brings us back to the complete-data scenario.
The simple tree algorithm: summary
Inference using dynamic programming (up-down message passing):
$$\mathrm{up}_i(x_i) = \begin{cases} \mathbf{1}\{x_i = s_i\} & i\ \text{observed} \\ \big(\sum_{x_{\mathrm{left}}} \Pr(x_{\mathrm{left}} \mid x_i)\,\mathrm{up}_{\mathrm{left}}(x_{\mathrm{left}})\big)\big(\sum_{x_{\mathrm{right}}} \Pr(x_{\mathrm{right}} \mid x_i)\,\mathrm{up}_{\mathrm{right}}(x_{\mathrm{right}})\big) & \text{otherwise} \end{cases}$$
$$\mathrm{down}_i(x_i) = \sum_{x_{\mathrm{pa}}} \Pr(x_i \mid x_{\mathrm{pa}})\,\mathrm{down}_{\mathrm{pa}}(x_{\mathrm{pa}}) \sum_{x_{\mathrm{sib}}} \Pr(x_{\mathrm{sib}} \mid x_{\mathrm{pa}})\,\mathrm{up}_{\mathrm{sib}}(x_{\mathrm{sib}}), \qquad \mathrm{down}_{\mathrm{root}}(x) = \Pr(x_{\mathrm{root}} = x)$$
Marginal/posterior probabilities:
$$\Pr(s) = \sum_{x_{\mathrm{root}}} \Pr(x_{\mathrm{root}})\,\mathrm{up}_{\mathrm{root}}(x_{\mathrm{root}})$$
$$\Pr(x_i \mid s) = \frac{1}{Z}\,\mathrm{up}_i(x_i)\,\mathrm{down}_i(x_i)$$
$$\Pr(x_i, x_{\mathrm{pa}(i)} \mid s) = \frac{1}{Z}\,\mathrm{up}_i(x_i)\,\Pr(x_i \mid x_{\mathrm{pa}(i)})\,\mathrm{down}_{\mathrm{pa}(i)}(x_{\mathrm{pa}(i)}) \sum_{x_{\mathrm{sib}}} \Pr(x_{\mathrm{sib}} \mid x_{\mathrm{pa}(i)})\,\mathrm{up}_{\mathrm{sib}}(x_{\mathrm{sib}})$$
The EM update rule (conditional probabilities per lineage):
$$\Pr(x_i = a \mid x_{\mathrm{pa}(i)} = b,\ \theta^{k+1}) = \frac{\sum_{j=1}^{N} \Pr(x_i^j = a,\ x_{\mathrm{pa}(i)}^j = b \mid s^j, \theta^k)}{\sum_{j=1}^{N} \Pr(x_{\mathrm{pa}(i)}^j = b \mid s^j, \theta^k)}$$
Learning: Sharing the rate matrix
[Figure: the same five-leaf tree, with branch lengths t_k shared through a single rate matrix Q]
$$L(s \mid Q, t) = \prod_{k,i,j} \big(\exp(Q t_k)\big)_{ij}^{\,n_{kij}}$$
(where $n_{kij}$ counts i-to-j substitutions along branch k)
Use generic nonlinear optimization methods (e.g., BFGS):
$$\frac{\partial}{\partial t_k}\,LL(s \mid Q, t) = ? \qquad \frac{\partial}{\partial q_{uv}}\,\log\big(\exp(Qt)\big) = ?$$
Bayesian learning vs. Maximum likelihood
$$\hat\theta_{ML} = \arg\max_\theta L(\theta \mid D) \qquad \text{(maximum likelihood estimator)}$$
Introducing prior beliefs on the process (alternatively: think of virtual evidence), and computing posterior probabilities on the parameters:
$$\hat\theta_{MAP} = \arg\max_\theta \Pr(D \mid \theta)\,\Pr(\theta), \qquad \hat\theta_{PME} = \int \theta\,\Pr(\theta \mid D)\,d\theta$$
[Figure: parameter space with no prior beliefs (the MLE) vs. with beliefs (the MAP and PME estimators)]
Prior parameterization: Dirichlet
Multinomial distribution (e.g., over A, C, G, T): θ = (θ_A, θ_C, θ_G, θ_T).
Quantify your belief as "virtual evidence" α = (α_A, α_C, α_G, α_T). Given new evidence (counts) n = (n_A, n_C, n_G, n_T), it makes sense that:
$$\theta_i^{PME} = \frac{n_i + \alpha_i}{N + A}, \qquad N = \sum_i n_i,\ A = \sum_i \alpha_i$$
Which prior to use in
$$\theta^{PME} = \int \theta\,\Pr(\theta \mid D)\,d\theta\ ?$$
The Dirichlet distribution:
$$P(\theta \mid \alpha) = \frac{1}{Z} \prod_{i=1}^{k} \theta_i^{\alpha_i - 1}, \quad 0 \le \theta_i \le 1,\ \sum_i \theta_i = 1, \qquad Z = \frac{\prod_i \Gamma(\alpha_i)}{\Gamma\big(\sum_i \alpha_i\big)}$$
Claim: using the Dirichlet as prior, the posterior is again Dirichlet, and
$$\theta_i^{PME} = \frac{n_i + \alpha_i}{N + A}$$
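The claim makes the posterior mean estimate a one-liner; a sketch with a uniform Dirichlet(1,1,1,1) prior and hypothetical counts:

```python
# Posterior mean estimate under a Dirichlet(alpha) prior on multinomial
# parameters: theta_i = (n_i + alpha_i) / (N + A).
alpha = {'A': 1.0, 'C': 1.0, 'G': 1.0, 'T': 1.0}   # virtual evidence
counts = {'A': 12, 'C': 3, 'G': 4, 'T': 11}          # hypothetical counts
N = sum(counts.values())
A = sum(alpha.values())
theta_pme = {x: (counts[x] + alpha[x]) / (N + A) for x in counts}
```

Note how the prior acts exactly like A extra pseudo-observations; as N grows, the estimate converges to the ML counts-to-probabilities answer.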
Learning as inference
[Figure: a model in which the parameters θ are themselves a parent node of the hidden H and observed S]
$$\Pr(s \mid H, \theta)$$
$$\text{Posterior} \propto \Pr(s \mid \theta)\,\Pr(\theta)$$
The Dirichlet distribution:
$$P(\theta \mid \alpha) = \frac{1}{Z} \prod_{i=1}^{k} \theta_i^{\alpha_i - 1}, \quad 0 \le \theta_i \le 1,\ \sum_i \theta_i = 1, \qquad Z = \frac{\prod_i \Gamma(\alpha_i)}{\Gamma\big(\sum_i \alpha_i\big)}$$
Variable rates: gamma priors
The gamma distribution:
$$g(x; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\,x^{\alpha - 1} e^{-\beta x}, \qquad x \ge 0,\ \alpha, \beta > 0$$
$$E[g(\alpha, \beta)] = \frac{\alpha}{\beta}, \qquad \mathrm{var}(g(\alpha, \beta)) = \frac{\alpha}{\beta^2}$$
[Figure: gamma densities modeling slow vs. fast site rates]
Integrating over a per-locus rate r:
$$P(x \mid Q, t, \alpha, \beta) = \prod_{\mathrm{loci}\ i} \int_0^\infty \Pr(x_i, h \mid Q, r t_1, \ldots, r t_n)\,g(r; \alpha, \beta)\,dr$$
(Compare the Poisson distribution: $P(n; p) = e^{-p} p^n / n!$, with $E(n(p)) = \mathrm{var}(n(p)) = p$.)
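The gamma moments can be checked numerically; a sketch using a midpoint Riemann sum (the shape and rate values are arbitrary):

```python
import math

# Numeric sanity check that the gamma density integrates to 1 with
# mean alpha/beta and variance alpha/beta^2.
def gamma_pdf(x, alpha, beta):
    return beta ** alpha / math.gamma(alpha) * x ** (alpha - 1) * math.exp(-beta * x)

alpha, beta = 2.0, 4.0
dx = 0.0005
xs = [(i + 0.5) * dx for i in range(40000)]   # midpoints covering (0, 20)
mass = sum(gamma_pdf(x, alpha, beta) * dx for x in xs)
mean = sum(x * gamma_pdf(x, alpha, beta) * dx for x in xs)
var = sum((x - mean) ** 2 * gamma_pdf(x, alpha, beta) * dx for x in xs)
```

In practice the rate integral over r on the slide is handled the same way: by discretizing the gamma into a small number of rate classes.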
Symmetric and reversible Markov processes
Definition: we call a Markov process symmetric if its rate matrix is symmetric:
$$\forall i, j:\ q_{ij} = q_{ji}$$
What would a symmetric process converge to?
Definition: a reversible Markov process is one for which (time s before t):
$$\Pr(X_t = j \mid X_s = i) = \Pr(X_s = j \mid X_t = i)$$
[Figure: states i, j exchanged between times s and t; rates q_ij, q_ji]
Claim: a Markov process is reversible iff there exist $\pi_i$ such that:
$$\pi_i q_{ij} = \pi_j q_{ji}$$
If this holds, we say the process is in detailed balance, and the $\pi_i$ are its stationary distribution.
Proof: Bayes' law and the definition of reversibility.
Reversibility
Claim: a Markov process is reversible iff we can write:
$$q_{ij} = s_{ij}\,\pi_j$$
where S is a symmetric matrix.
[Figure: composing reversible steps — (Q, t) followed by (Q, t') equals (Q, t + t')]
As before: a Markov process is reversible iff there exist $\pi_i$ with $\pi_i q_{ij} = \pi_j q_{ji}$; if this holds, we say the process is in detailed balance.
Proof: Bayes' law and the definition of reversibility.
Markov Chain Monte Carlo (MCMC)
We don’t know how to sample from P(h)=P(h|s) (or any complex distribution for that matter)
The idea: think of P(h|s) as the stationary distribution of a Reversible Markov chain
)()|()()|( yPyxxPxy
Find a process with transition probabilities for which:
Then sample a trajectory ,,...,, 21 myyy
)()(1lim xPxyCn in
Theorem: (C a counter)
Process must be irreducible (you can reach from anywhere to anywhere with p>0)
(Start from anywhere!)
The Metropolis(-Hastings) Algorithm
Why reversible? Because detailed balance makes it easy to define the stationary distribution in terms of the transitions
So how can we find appropriate transition probabilities? We want:
$$T(y \mid x)\,P(x) = T(x \mid y)\,P(y)$$
Define a symmetric proposal distribution:
$$F(y \mid x) = F(x \mid y)$$
and an acceptance probability:
$$\min\big(1,\ P(y)/P(x)\big)$$
so the chain moves $x \to y$ with probability $F(y \mid x)\min(1, P(y)/P(x))$. Detailed balance then holds:
$$P(x)\,F(y \mid x)\,\min\Big(1, \frac{P(y)}{P(x)}\Big) = F(y \mid x)\,\min(P(x), P(y)) = F(x \mid y)\,\min(P(x), P(y)) = P(y)\,F(x \mid y)\,\min\Big(1, \frac{P(x)}{P(y)}\Big)$$
What is the big deal? We reduced the problem to computing ratios between P(x) and P(y).
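A minimal Metropolis sketch over a four-state toy target (the target weights and proposal are illustrative assumptions); note that only the ratios P(y)/P(x) are ever needed:

```python
import random

random.seed(0)
target = [1.0, 2.0, 3.0, 4.0]       # unnormalized P over {0, 1, 2, 3}

def metropolis_step(x):
    y = random.randrange(4)          # symmetric proposal: F(y|x) = F(x|y)
    if random.random() < min(1.0, target[y] / target[x]):
        return y                     # accept the move
    return x                         # reject: stay at x

x, counts = 0, [0, 0, 0, 0]
for step in range(200000):
    x = metropolis_step(x)
    if step >= 1000:                 # discard a burn-in prefix
        counts[x] += 1
freqs = [c / sum(counts) for c in counts]
# freqs approaches target / sum(target) = (0.1, 0.2, 0.3, 0.4)
```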
Acceptance ratio for a BN
To sample from Pr(h | s), the acceptance ratio only needs:
$$\min\big(1,\ \Pr(h' \mid s)/\Pr(h \mid s)\big) = \min\big(1,\ \Pr(h', s)/\Pr(h, s)\big)$$
For example, if the proposal distribution changes only one variable $h_i$, what would the ratio be?
$$\min\big(1,\ \Pr(h_1, \ldots, h_{i-1}, h_i', h_{i+1}, \ldots, h_n \mid s)\,/\,\Pr(h_1, \ldots, h_{i-1}, h_i, h_{i+1}, \ldots, h_n \mid s)\big)$$
We affected only the CPDs of $h_i$ and its children.
Definition: the minimal Markov blanket of a node in a BN includes its children, its parents, and its children's parents.
To compute the ratio, we care only about the values of $h_i$ and its Markov blanket.
Gibbs sampling
A very similar algorithm (in fact, a special case of the Metropolis algorithm):
Start from any state h
do {
  Choose a variable H_i
  Form h^{t+1} by sampling a new h_i from Pr(h_i | the other variables of h^t, s)
}
This is a reversible process with our target stationary distribution:
$$P(h \to h') = \Pr(h_i' \mid h_1, \ldots, h_{i-1}, h_{i+1}, \ldots, h_n, s)\,\Pr(h_1, \ldots, h_i, \ldots, h_n \mid s)$$
$$= \Pr(h_1, \ldots, h_i', \ldots, h_n \mid s)\,\Pr(h_1, \ldots, h_i, \ldots, h_n \mid s)\,/\,\Pr(h_1, \ldots, h_{i-1}, h_{i+1}, \ldots, h_n \mid s)$$
which is symmetric in $h_i$ and $h_i'$, so detailed balance holds.
Gibbs sampling is easy to implement for BNs, since only the Markov blanket of $h_i$ is involved:
$$\Pr(h_i' \mid h_1, \ldots, h_{i-1}, h_{i+1}, \ldots, h_n, s) = \frac{\Pr(h_i' \mid \mathrm{pa}\,h_i)\,\prod_j \Pr(h_j \mid \mathrm{pa}'\,h_j)}{\sum_{h_i''} \Pr(h_i'' \mid \mathrm{pa}\,h_i)\,\prod_j \Pr(h_j \mid \mathrm{pa}''\,h_j)}$$
(the products run over the children j of $h_i$; $\mathrm{pa}'$ denotes the parent assignment with $h_i$ set to $h_i'$)
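A Gibbs sampling sketch on a toy joint over two binary hidden variables (the joint table is an illustrative assumption); each step resamples one variable from its exact conditional:

```python
import random

random.seed(1)
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def sample_conditional(var, state):
    # Pr(h_var | the other variable): renormalize the joint's slice.
    other = state[1 - var]
    w = []
    for v in (0, 1):
        key = (v, other) if var == 0 else (other, v)
        w.append(joint[key])
    return 0 if random.random() < w[0] / (w[0] + w[1]) else 1

state = [0, 0]
counts = {k: 0 for k in joint}
for step in range(200000):
    var = step % 2                       # sweep over the variables
    state[var] = sample_conditional(var, state)
    if step >= 1000:                     # burn-in
        counts[tuple(state)] += 1
total = sum(counts.values())
freqs = {k: c / total for k, c in counts.items()}
# freqs approaches the joint table itself
```

In a BN, `sample_conditional` would renormalize only the Markov-blanket factors rather than a full joint table.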
Sampling in practice
$$\lim_{n \to \infty} \frac{1}{n}\,C(y_i = x) = P(x)$$
How much time until convergence to P? (Burn-in time)
[Figure: a sampling trajectory — burn-in period, then sampling; mixing]
Consecutive samples are still correlated! Should we sample only every n steps?
We sample while fixing the evidence: start from anywhere, but wait some time before starting to collect data.
A problematic space would be loosely connected (an example of a bad space for sampling).
Inferring/learning phylogenetic trees
Distance based methods: computing pairwise distances and building a tree based only on those (how would you implement this?)
More elaborate methods use a scoring scheme that takes the whole tree into account, using parsimony or likelihood
Likelihood methods: universal rate matrices (BLOSUM62, PAM)
Searching for the optimal tree (min parsimony or max likelihood) is NP hard
Many search heuristics were developed, finding a high quality solution and repeating computation using partial dataset to test for the robustness of particular features (Bootstrap)
Bayesian inference methods – assuming some prior on trees (e.g., uniform) and trying to sample trees from the probability space P(τ|D).
Using MCMC, we only need a proposal distribution that span all possible trees and a way to compute the likelihood ratio between two trees (polynomial for the simple tree model)
From sampling we can extract any desirable parameter on the tree (e.g., the number of times X, Y and Z are in the same clade)
Curated set of universal proteins
Eliminating Lateral transfer
Multiple alignment and removal of bad domains
Maximum likelihood inference, with 4 classes of rate and a fixed matrix
Bootstrap validation
Ciccarelli et al 2005
How much DNA? (Following M. Lynch)
Viral particles in the oceans: 10^30 particles times 10^4 bases = 10^34
Global number of prokaryotic cells: 10^30 cells times 3×10^6 bases ≈ 10^36
10^7 eukaryotic species (~1.6 million were characterized)
One may assume that they each occupy the same biomass
For human: 6×10^9 (population) times 6×10^9 (genome) times 10^13 (cells) ≈ 10^32
Assuming the average eukaryotic genome size is 1% of the human's, we have ~10^37 bases
[Figure: early evolution — RNA-based genomes; ribosome, proteins, and the genetic code; DNA-based genomes and membranes; then diversity!]
3.4–3.8 BYA – fossils?? 3.2 BYA – good fossils
3 BYA – methanogenesis
2.8 BYA – photosynthesis
...
1.7–1.5 BYA – eukaryotes
...
0.55 BYA – Cambrian explosion
0.44 BYA – jawed vertebrates
0.4 BYA – land plants
0.14 BYA – flowering plants
0.10 BYA – mammals
Think ecology...
EUKARYOTES | PROKARYOTES
Presence of a nuclear membrane | Also present in the Planctomycetes
Organelles derived from endosymbionts | Also in beta-proteobacteria
Cytoskeleton and vesicle transport | Tubulin-related proteins, no microtubules
Trans-splicing | –
Introns in protein-coding genes, spliceosome | Rare – almost never in coding regions
Expansion of untranslated regions of transcripts | Short UTRs
Translation initiation by scanning for start | Ribosome binds directly to a Shine-Dalgarno sequence
mRNA surveillance | Nonsense-mediated decay pathway is absent
Multiple linear chromosomes, telomeres | Single linear chromosomes in a few eubacteria
Mitosis, meiosis | Absent
Gene number expansion | –
Expansion of cell size | Some exceptions, but cells are small
[Figure: eukaryote phylogeny, divided into unikonts and bikonts]
Eukaryotes
Unikonts – one flagellum at some developmental stage: fungi, animals, animal parasites, amoebas
Bikonts – ancestrally two flagella: green plants, red algae, ciliates, Plasmodium, brown algae, more amoebae
Strange biology!
A big-bang phylogeny: speciations across a short time span? Ambiguity – and not much hope for really resolving it
Vertebrates
Sequenced-genomes phylogeny; fossil-based, large-scale phylogeny
Primates
[Figure: primate phylogeny — Human, Chimp, Gorilla, Orangutan, Gibbon, Baboon, Macaque, Marmoset, with branch divergences of 0.5%, 0.5%, 0.8%, 1.2%, 1.5%, 3%, and 9%]
Flies
Yeasts
Inference by forward sampling
• We did not cover these slides this year
Sampling from a BN
Naively: If we could draw h,s’ according to the distribution Pr(h,s’) then: Pr(s) ~ (#samples with s)/(# samples)
Forward sampling: use a topological order on the network. Select a node whose parents are already determined and sample from its conditional distribution (all parents are already determined!)
Claim: forward sampling is correct:
$$E_P\big[\mathbf{1}\{(h, s)\}\big] = \Pr(h, s)$$
[Figure: a BN with nodes 1-9]
How do we sample from the CPD?
Focus on the observations
Naïve sampling is terribly inefficient, why?
What is the sampling error?
Why don’t we constraint the sampling to fit the evidence s?
[Figure: the same BN with nodes 1-9; some nodes carry evidence]
Two tasks: P(s) and P(f(h)|s), how to approach each/both?
This can be done, but we no longer sample from P(h,s), and not from P(h|s) (why?)
Likelihood weighting
Likelihood weighting: weight = 1.
Use a topological order on the network; select a node whose parents are already determined:
if the variable was not observed: sample from its conditional distribution
else: weight *= P(x_i | pa x_i), and fix the observation
Store the sample x and its weight w[x].
Pr(h | s) ≈ (total weight of samples with h) / (total weight)
[Figure: a BN in which each evidence node contributes its CPD to the weight:]
$$\mathrm{Weight} = \prod_{i \in E} \Pr(s_i \mid \mathrm{pa}\,s_i)$$
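The algorithm on the smallest possible BN — a hidden root H with one observed child S (the CPDs and evidence are illustrative assumptions):

```python
import random

random.seed(2)
prior = [0.5, 0.5]                     # Pr(H)
cpd = [[0.9, 0.1], [0.2, 0.8]]         # cpd[h][s] = Pr(S = s | H = h)
s_obs = 1                              # evidence: S = 1

weights = [0.0, 0.0]                   # total weight per value of H
for _ in range(100000):
    h = 0 if random.random() < prior[0] else 1   # sample unobserved node
    weights[h] += cpd[h][s_obs]                   # observed node: add weight
post_h1 = weights[1] / sum(weights)

# Exact: Pr(H=1 | S=1) = 0.5*0.8 / (0.5*0.1 + 0.5*0.8) = 8/9
```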
Importance sampling
Naive Monte Carlo from M samples of P:
$$E_P[f] \approx \frac{1}{M} \sum_{m=1}^{M} f(h[m])$$
But it can be difficult or inefficient to sample from P. Assume we sample instead from Q; then:
$$E_{P(H)}[f(H)] = E_{Q(H)}\Big[f(H)\,\frac{P(H)}{Q(H)}\Big]$$
Unnormalized importance sampling:
$$w(h[m]) = \frac{P(h[m])}{Q(h[m])}$$
Sample $D = \{h[1], \ldots, h[M]\}$; our estimator from M samples is:
$$\hat E_D[f] = \frac{1}{M} \sum_{m=1}^{M} f(h[m])\,w(h[m])$$
Claim (prove it!):
$$\mathrm{Var}_Q\big(f(H)\,w(H)\big) = E_Q\big[(f(H)\,w(H))^2\big] - \big(E_P[f(H)]\big)^2$$
To minimize the variance, use a Q distribution proportional to the target function:
$$Q(H) \propto |f(H)|\,P(H)$$
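The estimator in a few lines, on a toy discrete target with a uniform proposal (all distributions are illustrative assumptions):

```python
import random

random.seed(3)
P = [0.1, 0.2, 0.3, 0.4]         # target (known and normalized here)
Q = [0.25, 0.25, 0.25, 0.25]     # proposal we can actually sample from
f = [1.0, 2.0, 3.0, 4.0]

M = 200000
est = 0.0
for _ in range(M):
    x = random.randrange(4)      # sample x ~ Q (uniform)
    est += f[x] * P[x] / Q[x]    # reweight by w(x) = P(x)/Q(x)
est /= M

exact = sum(p * v for p, v in zip(P, f))   # E_P[f] = 3.0
```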
Correctness of likelihood weighting: Importance sampling
Unnormalized importance sampling with the likelihood weighting proposal distribution Q, for any function on the hidden variables.
For the likelihood weighting algorithm, our proposal distribution Q is defined by fixing the evidence at the nodes in a set E and ignoring the CPDs of variables with evidence:
$$Q(x \mid D) = \prod_{i \notin E} \Pr(x_i \mid \mathrm{pa}\,x_i)$$
We sample from Q just like forward sampling from a Bayesian network in which all edges going into evidence nodes were eliminated!
$$w(h) = \frac{P(h, s)}{Q(h, s)} = \prod_{i \in E} \Pr(x_i \mid \mathrm{pa}\,x_i)$$
$$\hat E_D\big[\mathbf{1}\{h_i\}\big] = \frac{1}{M} \sum_{m=1}^{M} \mathbf{1}\{h_i[m]\}\,w(h[m])$$
Proposition: the likelihood weighting algorithm is correct (in the sense that it defines an estimator with the correct expected value).
Normalized Importance sampling
When sampling from P(h|s) we don't know P, so we cannot compute w = P/Q. We do know P(h,s) = P(h|s)P(s), so define the unnormalized target P'(h) = P(h,s). Then:
$$E_{Q(H)}[w(H)] = \sum_h Q(h)\,\frac{P'(h)}{Q(h)} = \sum_h P'(h)$$
$$E_{P(H)}[f(H)] = E_{Q(H)}\big[f(H)\,w(H)\big]\,/\,E_{Q(H)}[w(H)]$$
Sample $D = \{h[1], \ldots, h[M]\}$; normalized importance sampling:
$$w(h[m]) = \frac{P'(h[m])}{Q(h[m])}, \qquad \hat E_D[f] = \frac{\frac{1}{M}\sum_m f(h[m])\,w(h[m])}{\frac{1}{M}\sum_m w(h[m])}$$
So we use sampling to estimate both terms:
$$\hat\Pr(h' \mid s) = \frac{\frac{1}{M}\sum_m \mathbf{1}\{h[m] = h'\}\,w(h[m])}{\frac{1}{M}\sum_m w(h[m])}$$
Using the likelihood weighting Q, we can compute posterior probabilities in one pass (no need to sample for P(s) and P(h,s) separately).
Limitations of forward sampling
Likelihood weighting is effective here: [Figure: a network with one arrangement of observed vs. unobserved nodes]
But not here: [Figure: a network with the opposite arrangement of observed vs. unobserved nodes]