Genome Evolution. Amos Tanay 2010
Genome evolution
Lecture 6: Inference through sampling. Basic phylogenetics
The paradigm
Alignment
Ancestral Inference on a phylogenetic tree
Tree
Learning a model
Evolutionary rates
Detecting selection and function
Phylogenetics
Rates and transition probabilities
The process’s rate matrix:
$$Q = \begin{pmatrix} q_{00} & q_{01} & q_{02} & \cdots & q_{0n} \\ q_{10} & q_{11} & q_{12} & \cdots & q_{1n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ q_{n0} & q_{n1} & q_{n2} & \cdots & q_{nn} \end{pmatrix}$$
Transition differential equations (backward form):
$$P_{ij}(t + s) = \sum_k P_{ik}(s)\,P_{kj}(t)$$
$$P_{ij}(t + s) - P_{ij}(t) = \sum_{k \neq i} P_{ik}(s)\,P_{kj}(t) + \big[P_{ii}(s) - 1\big]\,P_{ij}(t)$$
Letting $s \to 0$:
$$P'_{ij}(t) = \sum_{k \neq i} q_{ik}\,P_{kj}(t) + q_{ii}\,P_{ij}(t)$$
$$P'(t) = Q\,P(t) \;\Rightarrow\; P(t) = \exp(Qt)$$
Matrix exponential
The differential equation:
$$P'(t) = Q\,P(t) \;\Rightarrow\; P(t) = \exp(Qt)$$
Series solution:
$$\exp(Qt) = \sum_{i=0}^{\infty} \frac{1}{i!}\,Q^i t^i$$
$$\big(\exp(Qt)\big)' = \sum_{i=1}^{\infty} \frac{i}{i!}\,Q^i t^{i-1} = Q \sum_{i=0}^{\infty} \frac{1}{i!}\,Q^i t^i = Q \exp(Qt)$$
Summing over different path lengths: the i-th term of the series accounts for the i-step paths (1-path, 2-path, 3-path, 4-path, 5-path, ...).
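The truncated series can be coded directly; a minimal pure-Python sketch (the Jukes-Cantor-like rate matrix and the number of terms are illustrative assumptions, not from the slides):

```python
def mat_mult(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp_series(Q, t, terms=30):
    """Approximate exp(Qt) by the truncated series sum_{i<terms} (Qt)^i / i!."""
    n = len(Q)
    Qt = [[Q[i][j] * t for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # 0-path term: I
    power = [row[:] for row in result]
    fact = 1.0
    for i in range(1, terms):
        power = mat_mult(power, Qt)  # i-path term: (Qt)^i
        fact *= i
        for r in range(n):
            for c in range(n):
                result[r][c] += power[r][c] / fact
    return result

# Illustrative rate matrix (equal rates, total leaving rate 1 per state):
a = 1.0 / 3.0
Q = [[-1.0, a, a, a], [a, -1.0, a, a], [a, a, -1.0, a], [a, a, a, -1.0]]
P = mat_exp_series(Q, 0.1)  # each row of P(t) is a probability distribution
```

For this symmetric matrix the result matches the closed-form Jukes-Cantor probabilities discussed later in the lecture.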
Computing the matrix exponential using spectral decomposition
If Q is diagonalizable, $Q = S \Lambda S^{-1}$ with $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$, then:
$$\exp(Qt) = \sum_{i=0}^{\infty} \frac{1}{i!}\,Q^i t^i = S \Big(\sum_{i=0}^{\infty} \frac{1}{i!}\,\Lambda^i t^i\Big) S^{-1} = S\,\mathrm{diag}\big(e^{\lambda_1 t}, \ldots, e^{\lambda_n t}\big)\,S^{-1}$$
The eigenvalues determine the process's convergence properties:
The largest eigenvalue of P(t) must be 1, and its associated eigenvector is the stationary distribution of the process.
The second largest eigenvalue dominates the rate of convergence of the process.
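For a case where the spectral decomposition is available in closed form, consider a two-state process; everything below (the rates a, b and the time t) is an illustrative assumption:

```python
import math

# Two-state process: rate a from state 0 to 1, rate b from 1 to 0.
a, b, t = 0.3, 0.7, 1.5

# Eigenvalues of Q = [[-a, a], [b, -b]]: l1 = 0 and l2 = -(a + b).
l1, l2 = 0.0, -(a + b)
# Right eigenvectors as columns of S: (1, 1) for l1 and (a, -b) for l2.
S = [[1.0, a], [1.0, -b]]
det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
Sinv = [[-b / det, -a / det], [-1.0 / det, 1.0 / det]]

# exp(Qt) = S diag(e^{l1 t}, e^{l2 t}) S^{-1}
E = [math.exp(l1 * t), math.exp(l2 * t)]
P = [[sum(S[i][k] * E[k] * Sinv[k][j] for k in range(2)) for j in range(2)]
     for i in range(2)]

# The eigenvalue 0 of Q maps to eigenvalue 1 of P(t); the associated
# stationary distribution is (b, a)/(a + b), approached as t grows.
pi = [b / (a + b), a / (a + b)]
```

The second eigenvalue, $-(a+b)$, appears as the factor $e^{-(a+b)t}$ that decays toward stationarity, matching the claim above.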
Computing the matrix exponential
$$\exp(Qt) = \sum_{i=0}^{\infty} \frac{(Qt)^i}{i!}$$
Series methods: just take the first k summands; reasonable when ||Q|| <= 1 — if the terms are converging, you are OK.
Can do scaling/squaring:
$$e^{Q} = \big(e^{Q/m}\big)^{m}$$
Eigenvalues/decomposition:
$$e^{B} = S\,e^{\Lambda}\,S^{-1}$$
Good when the matrix is symmetric; problems when having similar eigenvalues.
Multiple methods exist for other types of B (e.g., triangular).
Models for nucleotide substitutions
[Figure: substitution-rate diagrams over {A, C, G, T}: Jukes-Cantor and Kimura]
How to model the evolution of a nucleotide?
We discussed its potential allele frequency dynamics and fixation probability
The rate of substitution in a neutral locus:
$$K = 2N\mu \cdot \frac{1}{2N} = \mu$$
A beneficial mutation with s > 0:
$$K = 2N\mu \cdot 2s = 4N\mu s$$
But mutations can happen at different rates for different nucleotides. The two simplest models describing substitution rates date from the 1960s, when sequence data was very scarce – we will discuss more sophisticated models later.
Once we assume the evolutionary duration, we can work with probabilities:
$$\Pr(x_i \mid \mathrm{pa}\,x_i) = \exp(Q t_i)_{\mathrm{pa}\,x_i,\ x_i}$$
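For the Jukes-Cantor model specifically, exp(Qt) has a closed form; a minimal sketch, assuming a total substitution rate of 1 per unit time:

```python
import math

def jc_prob(t, same):
    """Jukes-Cantor transition probability along a branch of length t:
    Pr(same nucleotide) = 1/4 + 3/4 e^{-4t/3};
    Pr(each specific change) = 1/4 - 1/4 e^{-4t/3}."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e

# Short branches barely change the nucleotide; very long branches
# converge to the uniform stationary distribution (1/4 each).
assert jc_prob(0.0, True) == 1.0
assert abs(jc_prob(100.0, True) - 0.25) < 1e-6
```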
The simple tree model
[Figure: a three-leaf tree with leaves S1, S2, S3 and internal nodes H1, H2 (the root)]
Sequences of extant and ancestral species are random variables, with Val(X) = {A,C,G,T}
Extant species S_j, j = 1,...,n; ancestral species H_j, j = 1,...,(n-1)
Tree T: a parents relation pa S_i, pa H_i
(pa S1 = H1, pa S3 = H2; the root: H2)
For multiple loci we can assume independence and use the same parameters (today):
$$\Pr(s, h) = \prod_j \Pr(s^j, h^j)$$
$$\Pr(x_i \mid \mathrm{pa}\,x_i) = \exp(Q t_i)_{\mathrm{pa}\,x_i,\ x_i}$$
$$\Pr(s, h) = \Pr(h_{\mathrm{root}}) \prod_{i \neq \mathrm{root}} \Pr(x_i \mid \mathrm{pa}\,x_i)$$
In the triplet:
$$\Pr(s, h) = \Pr(h_2)\,\Pr(h_1 \mid h_2)\,\Pr(s_3 \mid h_2)\,\Pr(s_2 \mid h_1)\,\Pr(s_1 \mid h_1)$$
Structure
The model is defined using conditional probability distributions and the root “prior” probability distribution
Joint distribution
The model parameters can be the conditional probability distribution tables (CPDs)
Or we can have a single rate matrix Q and branch lengths:
$$\Pr(x \mid y) = \begin{pmatrix} 0.96 & 0.01 & 0.02 & 0.01 \\ 0.01 & 0.96 & 0.01 & 0.02 \\ 0.02 & 0.01 & 0.96 & 0.01 \\ 0.01 & 0.02 & 0.01 & 0.96 \end{pmatrix}$$
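A brute-force version of the triplet joint distribution, using the slide's Pr(x|y) matrix and a uniform root prior (the nucleotide ordering A, C, G, T is an assumption):

```python
# Pr(x|y) matrix from the slide; rows index the parent state.
P = [[0.96, 0.01, 0.02, 0.01],
     [0.01, 0.96, 0.01, 0.02],
     [0.02, 0.01, 0.96, 0.01],
     [0.01, 0.02, 0.01, 0.96]]
prior = [0.25] * 4              # uniform root prior
idx = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def joint(h2, h1, s1, s2, s3):
    """Pr(s,h) = Pr(h2) Pr(h1|h2) Pr(s3|h2) Pr(s2|h1) Pr(s1|h1)."""
    a, b = idx[h2], idx[h1]
    return (prior[a] * P[a][b] * P[a][idx[s3]]
            * P[b][idx[s2]] * P[b][idx[s1]])

# Total probability of the observations (A, C, A), marginalizing the
# hidden nodes by brute force:
total = sum(joint(h2, h1, 'A', 'C', 'A')
            for h2 in 'ACGT' for h1 in 'ACGT')
```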
Ancestral inference
Alignment
Ancestral Inference on a phylogenetic tree
Tree
Learning a model
Evolutionary rates
We assume the model (structure, parameters) is given, and denote it by θ:
$$\Pr(s, h \mid \theta) = \Pr(h_{\mathrm{root}}) \prod_{i \neq \mathrm{root}} \Pr(x_i \mid \mathrm{pa}\,x_i)$$
The total probability of the data s:
$$\Pr(s \mid \theta) = \sum_h \Pr(h, s \mid \theta)$$
This is also called the likelihood L(θ). Computing Pr(s) is the inference problem.
Given the total probability it is easy to compute the posterior of h given the data (easy!):
$$\Pr(h \mid s, \theta) = \Pr(h, s \mid \theta)\,/\,\Pr(s \mid \theta)$$
and the posterior of h_i given the data, by marginalization over the other hidden variables (exponential?):
$$\Pr(h_i = x \mid s, \theta) = \sum_{h:\,h_i = x} \Pr(h, s \mid \theta)\,/\,\Pr(s \mid \theta)$$
Tree models
[Figure: a tree with observed leaves A, C, A and two hidden internal nodes (?)]
Given partial observations s, the total probability of the data:
$$\Pr((A, C, A))$$
and posteriors such as:
$$\Pr(h_1 = A \mid (A, C, A)), \qquad \Pr(h_i = x \mid s) = \sum_{h:\,h_i = x} \Pr(h, s)\,/\,\Pr(s)$$
using the matrix
$$\Pr(x \mid y) = \begin{pmatrix} 0.96 & 0.01 & 0.02 & 0.01 \\ 0.01 & 0.96 & 0.01 & 0.02 \\ 0.02 & 0.01 & 0.96 & 0.01 \\ 0.01 & 0.02 & 0.01 & 0.96 \end{pmatrix}$$
and a uniform prior at the root.
Algorithm (Following Felsenstein 1981):
Up(i):
  if extant(i): up[i][a] = (a == S_i ? 1 : 0); return
  up(l(i)); up(r(i))
  for each a:
    up[i][a] = (Σ_b Pr(X_l(i) = b | X_i = a) up[l(i)][b]) × (Σ_c Pr(X_r(i) = c | X_i = a) up[r(i)][c])
Down(i):
  for each a:
    down[i][a] = Σ_c Pr(X_i = a | X_pa(i) = c) down[pa(i)][c] × Σ_b Pr(X_sib(i) = b | X_pa(i) = c) up[sib(i)][b]
  down(l(i)); down(r(i))
Algorithm:
  up(root)
  L = Σ_a Pr(root = a) up[root][a]; LL = log(L)
  for each a: down[root][a] = Pr(root = a)
  down(l(root)); down(r(root))
Dynamic programming to compute the total probability?
[Figure: the example tree — observed leaves S1, S2, S3 and hidden internal nodes annotated with up[4], up[5]] (Felsenstein)
Computing marginals and posteriors
[Figure: the same tree, annotated with down[4], down[5], up[3]] (Felsenstein)
$$\Pr(h_i = c \mid s) = \frac{\mathrm{up}[i][c]\,\mathrm{down}[i][c]}{\sum_j \mathrm{up}[i][j]\,\mathrm{down}[i][j]}$$
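The up-down recursions can be sketched in Python for the three-leaf tree used above (root H2 with children H1 and S3; H1 with children S1 and S2). The matrix, prior, and observations (A, C, A) follow the slides' running example; all variable names are my own:

```python
P = [[0.96, 0.01, 0.02, 0.01],   # Pr(child = j | parent = i)
     [0.01, 0.96, 0.01, 0.02],
     [0.02, 0.01, 0.96, 0.01],
     [0.01, 0.02, 0.01, 0.96]]
prior = [0.25] * 4               # uniform root prior

def up_leaf(obs):
    # Observed leaf: indicator of the observed nucleotide.
    return [1.0 if a == obs else 0.0 for a in range(4)]

def up_internal(up_left, up_right):
    # up[i][a] = (sum_b P[a][b] up_l[b]) * (sum_c P[a][c] up_r[c])
    return [sum(P[a][b] * up_left[b] for b in range(4)) *
            sum(P[a][c] * up_right[c] for c in range(4))
            for a in range(4)]

up_s1, up_s2, up_s3 = up_leaf(0), up_leaf(1), up_leaf(0)  # observe A, C, A
up_h1 = up_internal(up_s1, up_s2)
up_h2 = up_internal(up_h1, up_s3)
likelihood = sum(prior[a] * up_h2[a] for a in range(4))   # Pr(s)

# Down pass: the root's down message is the prior; H1's sibling is S3.
down_h2 = prior[:]
down_h1 = [sum(P[c][a] * down_h2[c] *
               sum(P[c][b] * up_s3[b] for b in range(4))
               for c in range(4))
           for a in range(4)]
# Posterior of H1: up * down, normalized.
post_h1 = [up_h1[a] * down_h1[a] for a in range(4)]
Z = sum(post_h1)                 # equals Pr(s) again
post_h1 = [p / Z for p in post_h1]
```

The normalizer Z of any node's posterior equals the total probability Pr(s), which is a useful internal consistency check.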
Transition posteriors: not independent!
[Figure: two adjacent positions with observed data A, C, A; using the Pr(x|y) matrix above,
Down messages: (0.25), (0.25), (0.25), (0.25);
Up messages: (0.01)(0.96), (0.01)(0.96), (0.01)(0.02), (0.02)(0.01)]
Practical inference can be hard
$$\Pr(s, h \mid \theta) = \Pr(h_{\mathrm{root}}) \prod_{i \neq \mathrm{root}} \Pr(x_i \mid \mathrm{pa}\,x_i)$$
We want to perform inference in an extended tree model expressing context effects:
[Figure: a 3×3 grid of tree nodes 1-9, with extra context (horizontal) edges]
With undirected cycles, the model is well defined but inference becomes hard
We want to perform inference on the tree structure itself!
Each structure imposes a probability on the observed data, so we can perform inference over the space of all possible tree structures, or tree structures plus branch lengths:
$$\Pr(\tau \mid D) = \Pr(D \mid \tau)\,\Pr(\tau)\,\Big/\,\sum_{\tau'} \Pr(D \mid \tau')\,\Pr(\tau')$$
What makes these examples difficult?
Learning from complete data using the Maximum Likelihood
$$\hat\theta = \arg\max_\theta L(\theta \mid D) = \arg\max_\theta \Pr(D \mid \theta)$$
Likelihood: function of parameters (and not a distribution!)
Transforming learning into an optimization problem
Simplest version: “dice problem”. Counts are transformed to probabilities
Proof: using Lagrange multipliers (in the Ex)
[Figure: two aligned sequences X = ...AGACGAATAACGAGTAA..., Y = ...AGACGAATATCGACTAA..., summarized by a 4×4 count table n over A, G, C, T and the corresponding probabilities p]
We assume alignment positions represent independent observation from the same model
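The "dice problem" in one line: maximum-likelihood parameters are just normalized counts (the counts below are hypothetical):

```python
# ML for the "dice problem": counts are transformed into probabilities.
counts = {'A': 40, 'C': 10, 'G': 12, 'T': 38}   # hypothetical observed counts
N = sum(counts.values())
theta = {x: n / N for x, n in counts.items()}    # theta_x = n_x / N
```

The Lagrange-multiplier proof referenced in the exercise shows this is the unique maximum of the multinomial likelihood under the constraint that the probabilities sum to 1.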
Expectation-Maximization
[Figure: a five-leaf tree with branch lengths marked on each lineage]
The EM update:
$$\theta^{(k+1)} = \arg\max_\theta \sum_i E\big[LL(X_i, X_{\mathrm{pa}(i)}) \mid S, \theta^{(k)}\big]$$
The expected counts decompose over lineages (a node X, its parent pa X, and its sibling sib X):
$$E\big[\mathbf{1}\{X_i = a, X_{\mathrm{pa}(i)} = b\} \mid S\big] = \frac{1}{Z} \sum_{\mathrm{loci}} \mathrm{up}[i][a]\,\Pr(X_i = a \mid X_{\mathrm{pa}(i)} = b)\,\mathrm{down}[\mathrm{pa}(i)][b] \sum_{c} \Pr(X_{\mathrm{sib}(i)} = c \mid X_{\mathrm{pa}(i)} = b)\,\mathrm{up}[\mathrm{sib}(i)][c]$$
Inference brings us back to the complete-data scenario.
The simple tree algorithm: summary
Inference using dynamic programming (up-down message passing):
$$\mathrm{up}_i(x_i) = \begin{cases} \mathbf{1}\{x_i = s_i\} & i\ \text{observed} \\ \big(\sum_{x_{\mathrm{left}}} \Pr(x_{\mathrm{left}} \mid x_i)\,\mathrm{up}_{\mathrm{left}}(x_{\mathrm{left}})\big)\big(\sum_{x_{\mathrm{right}}} \Pr(x_{\mathrm{right}} \mid x_i)\,\mathrm{up}_{\mathrm{right}}(x_{\mathrm{right}})\big) & \text{otherwise} \end{cases}$$
$$\mathrm{down}_i(x_i) = \sum_{x_{\mathrm{pa}}} \Pr(x_i \mid x_{\mathrm{pa}})\,\mathrm{down}_{\mathrm{pa}}(x_{\mathrm{pa}}) \sum_{x_{\mathrm{sib}}} \Pr(x_{\mathrm{sib}} \mid x_{\mathrm{pa}})\,\mathrm{up}_{\mathrm{sib}}(x_{\mathrm{sib}}), \qquad \mathrm{down}_{\mathrm{root}}(x) = \Pr(x_{\mathrm{root}} = x)$$
Marginal/posterior probabilities:
$$\Pr(s) = \sum_{x_{\mathrm{root}}} \Pr(x_{\mathrm{root}})\,\mathrm{up}_{\mathrm{root}}(x_{\mathrm{root}})$$
$$\Pr(x_i \mid s) = \frac{1}{Z}\,\mathrm{up}_i(x_i)\,\mathrm{down}_i(x_i)$$
$$\Pr(x_i, x_{\mathrm{pa}(i)} \mid s) = \frac{1}{Z}\,\mathrm{up}_i(x_i)\,\Pr(x_i \mid x_{\mathrm{pa}(i)})\,\mathrm{down}_{\mathrm{pa}(i)}(x_{\mathrm{pa}(i)}) \sum_{x_{\mathrm{sib}}} \Pr(x_{\mathrm{sib}} \mid x_{\mathrm{pa}(i)})\,\mathrm{up}_{\mathrm{sib}}(x_{\mathrm{sib}})$$
The EM update rule (conditional probabilities per lineage):
$$\Pr(x_i = a \mid x_{\mathrm{pa}(i)} = b,\ \theta^{k+1}) = \frac{\sum_{j=1}^{N} \Pr(x_i^j = a,\ x_{\mathrm{pa}(i)}^j = b \mid s^j, \theta^k)}{\sum_{j=1}^{N} \Pr(x_{\mathrm{pa}(i)}^j = b \mid s^j, \theta^k)}$$
Learning: Sharing the rate matrix
[Figure: the same five-leaf tree, with branch lengths t_k shared through a single rate matrix Q]
$$L(s \mid Q, t) = \prod_{k,i,j} \big(\exp(Q t_k)\big)_{ij}^{\,n_{kij}}$$
(where $n_{kij}$ counts i-to-j substitutions along branch k)
Use generic nonlinear optimization methods (e.g., BFGS):
$$\frac{\partial}{\partial t_k}\,LL(s \mid Q, t) = ? \qquad \frac{\partial}{\partial q_{uv}}\,\log\big(\exp(Qt)\big) = ?$$
Bayesian learning vs. Maximum likelihood
$$\hat\theta_{ML} = \arg\max_\theta L(\theta \mid D) \qquad \text{(maximum likelihood estimator)}$$
Introducing prior beliefs on the process (alternatively: think of virtual evidence), and computing posterior probabilities on the parameters:
$$\hat\theta_{MAP} = \arg\max_\theta \Pr(D \mid \theta)\,\Pr(\theta), \qquad \hat\theta_{PME} = \int \theta\,\Pr(\theta \mid D)\,d\theta$$
[Figure: parameter space with no prior beliefs (the MLE) vs. with beliefs (the MAP and PME estimators)]
Prior parameterization: Dirichlet
Multinomial distribution (e.g., over A, C, G, T): θ = (θ_A, θ_C, θ_G, θ_T).
Quantify your belief as "virtual evidence" α = (α_A, α_C, α_G, α_T). Given new evidence (counts) n = (n_A, n_C, n_G, n_T), it makes sense that:
$$\theta_i^{PME} = \frac{n_i + \alpha_i}{N + A}, \qquad N = \sum_i n_i,\ A = \sum_i \alpha_i$$
Which prior to use in
$$\theta^{PME} = \int \theta\,\Pr(\theta \mid D)\,d\theta\ ?$$
The Dirichlet distribution:
$$P(\theta \mid \alpha) = \frac{1}{Z} \prod_{i=1}^{k} \theta_i^{\alpha_i - 1}, \quad 0 \le \theta_i \le 1,\ \sum_i \theta_i = 1, \qquad Z = \frac{\prod_i \Gamma(\alpha_i)}{\Gamma\big(\sum_i \alpha_i\big)}$$
Claim: using the Dirichlet as prior, the posterior is again Dirichlet, and
$$\theta_i^{PME} = \frac{n_i + \alpha_i}{N + A}$$
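The claim makes the posterior mean estimate a one-liner; a sketch with a uniform Dirichlet(1,1,1,1) prior and hypothetical counts:

```python
# Posterior mean estimate under a Dirichlet(alpha) prior on multinomial
# parameters: theta_i = (n_i + alpha_i) / (N + A).
alpha = {'A': 1.0, 'C': 1.0, 'G': 1.0, 'T': 1.0}   # virtual evidence
counts = {'A': 12, 'C': 3, 'G': 4, 'T': 11}          # hypothetical counts
N = sum(counts.values())
A = sum(alpha.values())
theta_pme = {x: (counts[x] + alpha[x]) / (N + A) for x in counts}
```

Note how the prior acts exactly like A extra pseudo-observations; as N grows, the estimate converges to the ML counts-to-probabilities answer.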
Learning as inference
[Figure: a model in which the parameters θ are themselves a parent node of the hidden H and observed S]
$$\Pr(s \mid H, \theta)$$
$$\text{Posterior} \propto \Pr(s \mid \theta)\,\Pr(\theta)$$
The Dirichlet distribution:
$$P(\theta \mid \alpha) = \frac{1}{Z} \prod_{i=1}^{k} \theta_i^{\alpha_i - 1}, \quad 0 \le \theta_i \le 1,\ \sum_i \theta_i = 1, \qquad Z = \frac{\prod_i \Gamma(\alpha_i)}{\Gamma\big(\sum_i \alpha_i\big)}$$
Variable rates: gamma priors
The gamma distribution:
$$g(x; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\,x^{\alpha - 1} e^{-\beta x}, \qquad x \ge 0,\ \alpha, \beta > 0$$
$$E[g(\alpha, \beta)] = \frac{\alpha}{\beta}, \qquad \mathrm{var}(g(\alpha, \beta)) = \frac{\alpha}{\beta^2}$$
[Figure: gamma densities modeling slow vs. fast site rates]
Integrating over a per-locus rate r:
$$P(x \mid Q, t, \alpha, \beta) = \prod_{\mathrm{loci}\ i} \int_0^\infty \Pr(x_i, h \mid Q, r t_1, \ldots, r t_n)\,g(r; \alpha, \beta)\,dr$$
(Compare the Poisson distribution: $P(n; p) = e^{-p} p^n / n!$, with $E(n(p)) = \mathrm{var}(n(p)) = p$.)
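The gamma moments can be checked numerically; a sketch using a midpoint Riemann sum (the shape and rate values are arbitrary):

```python
import math

# Numeric sanity check that the gamma density integrates to 1 with
# mean alpha/beta and variance alpha/beta^2.
def gamma_pdf(x, alpha, beta):
    return beta ** alpha / math.gamma(alpha) * x ** (alpha - 1) * math.exp(-beta * x)

alpha, beta = 2.0, 4.0
dx = 0.0005
xs = [(i + 0.5) * dx for i in range(40000)]   # midpoints covering (0, 20)
mass = sum(gamma_pdf(x, alpha, beta) * dx for x in xs)
mean = sum(x * gamma_pdf(x, alpha, beta) * dx for x in xs)
var = sum((x - mean) ** 2 * gamma_pdf(x, alpha, beta) * dx for x in xs)
```

In practice the rate integral over r on the slide is handled the same way: by discretizing the gamma into a small number of rate classes.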
Symmetric and reversible Markov processes
Definition: we call a Markov process symmetric if its rate matrix is symmetric:
$$\forall i, j:\ q_{ij} = q_{ji}$$
What would a symmetric process converge to?
Definition: a reversible Markov process is one for which (time s before t):
$$\Pr(X_t = j \mid X_s = i) = \Pr(X_s = j \mid X_t = i)$$
[Figure: states i, j exchanged between times s and t; rates q_ij, q_ji]
Claim: a Markov process is reversible iff there exist $\pi_i$ such that:
$$\pi_i q_{ij} = \pi_j q_{ji}$$
If this holds, we say the process is in detailed balance, and the $\pi_i$ are its stationary distribution.
Proof: Bayes' law and the definition of reversibility.
Reversibility
Claim: a Markov process is reversible iff we can write:
$$q_{ij} = s_{ij}\,\pi_j$$
where S is a symmetric matrix.
[Figure: composing reversible steps — (Q, t) followed by (Q, t') equals (Q, t + t')]
As before: a Markov process is reversible iff there exist $\pi_i$ with $\pi_i q_{ij} = \pi_j q_{ji}$; if this holds, we say the process is in detailed balance.
Proof: Bayes' law and the definition of reversibility.
Markov Chain Monte Carlo (MCMC)
We don’t know how to sample from P(h)=P(h|s) (or any complex distribution for that matter)
The idea: think of P(h|s) as the stationary distribution of a Reversible Markov chain
)()|()()|( yPyxxPxy
Find a process with transition probabilities for which:
Then sample a trajectory ,,...,, 21 myyy
)()(1lim xPxyCn in
Theorem: (C a counter)
Process must be irreducible (you can reach from anywhere to anywhere with p>0)
(Start from anywhere!)
The Metropolis(-Hastings) Algorithm
Why reversible? Because detailed balance makes it easy to define the stationary distribution in terms of the transitions
So how can we find appropriate transition probabilities? We want:
$$T(y \mid x)\,P(x) = T(x \mid y)\,P(y)$$
Define a symmetric proposal distribution:
$$F(y \mid x) = F(x \mid y)$$
and an acceptance probability:
$$\min\big(1,\ P(y)/P(x)\big)$$
so the chain moves $x \to y$ with probability $F(y \mid x)\min(1, P(y)/P(x))$. Detailed balance then holds:
$$P(x)\,F(y \mid x)\,\min\Big(1, \frac{P(y)}{P(x)}\Big) = F(y \mid x)\,\min(P(x), P(y)) = F(x \mid y)\,\min(P(x), P(y)) = P(y)\,F(x \mid y)\,\min\Big(1, \frac{P(x)}{P(y)}\Big)$$
What is the big deal? We reduced the problem to computing ratios between P(x) and P(y).
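A minimal Metropolis sketch over a four-state toy target (the target weights and proposal are illustrative assumptions); note that only the ratios P(y)/P(x) are ever needed:

```python
import random

random.seed(0)
target = [1.0, 2.0, 3.0, 4.0]       # unnormalized P over {0, 1, 2, 3}

def metropolis_step(x):
    y = random.randrange(4)          # symmetric proposal: F(y|x) = F(x|y)
    if random.random() < min(1.0, target[y] / target[x]):
        return y                     # accept the move
    return x                         # reject: stay at x

x, counts = 0, [0, 0, 0, 0]
for step in range(200000):
    x = metropolis_step(x)
    if step >= 1000:                 # discard a burn-in prefix
        counts[x] += 1
freqs = [c / sum(counts) for c in counts]
# freqs approaches target / sum(target) = (0.1, 0.2, 0.3, 0.4)
```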
Acceptance ratio for a BN
To sample from Pr(h | s), the acceptance ratio only needs:
$$\min\big(1,\ \Pr(h' \mid s)/\Pr(h \mid s)\big) = \min\big(1,\ \Pr(h', s)/\Pr(h, s)\big)$$
For example, if the proposal distribution changes only one variable $h_i$, what would the ratio be?
$$\min\big(1,\ \Pr(h_1, \ldots, h_{i-1}, h_i', h_{i+1}, \ldots, h_n \mid s)\,/\,\Pr(h_1, \ldots, h_{i-1}, h_i, h_{i+1}, \ldots, h_n \mid s)\big)$$
We affected only the CPDs of $h_i$ and its children.
Definition: the minimal Markov blanket of a node in a BN includes its children, its parents, and its children's parents.
To compute the ratio, we care only about the values of $h_i$ and its Markov blanket.
Gibbs sampling
A very similar algorithm (in fact, a special case of the Metropolis algorithm):
Start from any state h
do {
  Choose a variable H_i
  Form h^{t+1} by sampling a new h_i from Pr(h_i | the other variables of h^t, s)
}
This is a reversible process with our target stationary distribution:
$$P(h \to h') = \Pr(h_i' \mid h_1, \ldots, h_{i-1}, h_{i+1}, \ldots, h_n, s)\,\Pr(h_1, \ldots, h_i, \ldots, h_n \mid s)$$
$$= \Pr(h_1, \ldots, h_i', \ldots, h_n \mid s)\,\Pr(h_1, \ldots, h_i, \ldots, h_n \mid s)\,/\,\Pr(h_1, \ldots, h_{i-1}, h_{i+1}, \ldots, h_n \mid s)$$
which is symmetric in $h_i$ and $h_i'$, so detailed balance holds.
Gibbs sampling is easy to implement for BNs, since only the Markov blanket of $h_i$ is involved:
$$\Pr(h_i' \mid h_1, \ldots, h_{i-1}, h_{i+1}, \ldots, h_n, s) = \frac{\Pr(h_i' \mid \mathrm{pa}\,h_i)\,\prod_j \Pr(h_j \mid \mathrm{pa}'\,h_j)}{\sum_{h_i''} \Pr(h_i'' \mid \mathrm{pa}\,h_i)\,\prod_j \Pr(h_j \mid \mathrm{pa}''\,h_j)}$$
(the products run over the children j of $h_i$; $\mathrm{pa}'$ denotes the parent assignment with $h_i$ set to $h_i'$)
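A Gibbs sampling sketch on a toy joint over two binary hidden variables (the joint table is an illustrative assumption); each step resamples one variable from its exact conditional:

```python
import random

random.seed(1)
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def sample_conditional(var, state):
    # Pr(h_var | the other variable): renormalize the joint's slice.
    other = state[1 - var]
    w = []
    for v in (0, 1):
        key = (v, other) if var == 0 else (other, v)
        w.append(joint[key])
    return 0 if random.random() < w[0] / (w[0] + w[1]) else 1

state = [0, 0]
counts = {k: 0 for k in joint}
for step in range(200000):
    var = step % 2                       # sweep over the variables
    state[var] = sample_conditional(var, state)
    if step >= 1000:                     # burn-in
        counts[tuple(state)] += 1
total = sum(counts.values())
freqs = {k: c / total for k, c in counts.items()}
# freqs approaches the joint table itself
```

In a BN, `sample_conditional` would renormalize only the Markov-blanket factors rather than a full joint table.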
Sampling in practice
$$\lim_{n \to \infty} \frac{1}{n}\,C(y_i = x) = P(x)$$
How much time until convergence to P? (Burn-in time)
[Figure: a sampling trajectory — burn-in period, then sampling; mixing]
Consecutive samples are still correlated! Should we sample only every n steps?
We sample while fixing the evidence: start from anywhere, but wait some time before starting to collect data.
A problematic space would be loosely connected (an example of a bad space for sampling).
Inferring/learning phylogenetic trees
Distance based methods: computing pairwise distances and building a tree based only on those (how would you implement this?)
More elaborate methods use a scoring scheme that takes the whole tree into account, using parsimony or likelihood
Likelihood methods: universal rate matrices (BLOSUM62, PAM)
Searching for the optimal tree (min parsimony or max likelihood) is NP hard
Many search heuristics were developed, finding a high quality solution and repeating computation using partial dataset to test for the robustness of particular features (Bootstrap)
Bayesian inference methods – assuming some prior on trees (e.g., uniform) and trying to sample trees from the probability space P(τ|D).
Using MCMC, we only need a proposal distribution that span all possible trees and a way to compute the likelihood ratio between two trees (polynomial for the simple tree model)
From sampling we can extract any desirable parameter on the tree (e.g., the number of times X, Y and Z are in the same clade)
Curated set of universal proteins
Eliminating Lateral transfer
Multiple alignment and removal of bad domains
Maximum likelihood inference, with 4 classes of rate and a fixed matrix
Bootstrap validation
Ciccarelli et al 2005
How much DNA? (Following M. Lynch)
Viral particles in the oceans: 10^30 particles times 10^4 bases = 10^34
Global number of prokaryotic cells: 10^30 cells times 3×10^6 bases ≈ 10^36
10^7 eukaryotic species (~1.6 million were characterized)
One may assume that they each occupy the same biomass
For human: 6×10^9 (population) times 6×10^9 (genome) times 10^13 (cells) ≈ 10^32
Assuming the average eukaryotic genome size is 1% of the human's, we have ~10^37 bases
[Figure: early evolution — RNA-based genomes; ribosome, proteins, and the genetic code; DNA-based genomes and membranes; then diversity!]
3.4–3.8 BYA – fossils?? 3.2 BYA – good fossils
3 BYA – methanogenesis
2.8 BYA – photosynthesis
...
1.7–1.5 BYA – eukaryotes
...
0.55 BYA – Cambrian explosion
0.44 BYA – jawed vertebrates
0.4 BYA – land plants
0.14 BYA – flowering plants
0.10 BYA – mammals
Think ecology...
EUKARYOTES | PROKARYOTES
Presence of a nuclear membrane | Also present in the Planctomycetes
Organelles derived from endosymbionts | Also in beta-proteobacteria
Cytoskeleton and vesicle transport | Tubulin-related proteins, no microtubules
Trans-splicing | –
Introns in protein-coding genes, spliceosome | Rare – almost never in coding regions
Expansion of untranslated regions of transcripts | Short UTRs
Translation initiation by scanning for start | Ribosome binds directly to a Shine-Dalgarno sequence
mRNA surveillance | Nonsense-mediated decay pathway is absent
Multiple linear chromosomes, telomeres | Single linear chromosomes in a few eubacteria
Mitosis, meiosis | Absent
Gene number expansion | –
Expansion of cell size | Some exceptions, but cells are small
[Figure: eukaryote phylogeny, divided into unikonts and bikonts]
Eukaryotes
Unikonts – one flagellum at some developmental stage: fungi, animals, animal parasites, amoebas
Bikonts – ancestrally two flagella: green plants, red algae, ciliates, Plasmodium, brown algae, more amoebae
Strange biology!
A big-bang phylogeny: speciations across a short time span? Ambiguity – and not much hope for really resolving it
Vertebrates
Sequenced-genomes phylogeny; fossil-based, large-scale phylogeny
Primates
[Figure: primate phylogeny — Human, Chimp, Gorilla, Orangutan, Gibbon, Baboon, Macaque, Marmoset, with branch divergences of 0.5%, 0.5%, 0.8%, 1.2%, 1.5%, 3%, and 9%]
Flies
Yeasts
Inference by forward sampling
• We did not cover these slides this year
Sampling from a BN
Naively: If we could draw h,s’ according to the distribution Pr(h,s’) then: Pr(s) ~ (#samples with s)/(# samples)
Forward sampling: use a topological order on the network. Select a node whose parents are already determined and sample from its conditional distribution (all parents are already determined!)
Claim: forward sampling is correct:
$$E_P\big[\mathbf{1}\{(h, s)\}\big] = \Pr(h, s)$$
[Figure: a BN with nodes 1-9]
How do we sample from the CPD?
Focus on the observations
Naïve sampling is terribly inefficient, why?
What is the sampling error?
Why don’t we constraint the sampling to fit the evidence s?
[Figure: the same BN with nodes 1-9; some nodes carry evidence]
Two tasks: P(s) and P(f(h)|s), how to approach each/both?
This can be done, but we no longer sample from P(h,s), and not from P(h|s) (why?)
Likelihood weighting
Likelihood weighting: weight = 1.
Use a topological order on the network; select a node whose parents are already determined:
if the variable was not observed: sample from its conditional distribution
else: weight *= P(x_i | pa x_i), and fix the observation
Store the sample x and its weight w[x].
Pr(h | s) ≈ (total weight of samples with h) / (total weight)
[Figure: a BN in which each evidence node contributes its CPD to the weight:]
$$\mathrm{Weight} = \prod_{i \in E} \Pr(s_i \mid \mathrm{pa}\,s_i)$$
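The algorithm on the smallest possible BN — a hidden root H with one observed child S (the CPDs and evidence are illustrative assumptions):

```python
import random

random.seed(2)
prior = [0.5, 0.5]                     # Pr(H)
cpd = [[0.9, 0.1], [0.2, 0.8]]         # cpd[h][s] = Pr(S = s | H = h)
s_obs = 1                              # evidence: S = 1

weights = [0.0, 0.0]                   # total weight per value of H
for _ in range(100000):
    h = 0 if random.random() < prior[0] else 1   # sample unobserved node
    weights[h] += cpd[h][s_obs]                   # observed node: add weight
post_h1 = weights[1] / sum(weights)

# Exact: Pr(H=1 | S=1) = 0.5*0.8 / (0.5*0.1 + 0.5*0.8) = 8/9
```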
Importance sampling
Naive Monte Carlo from M samples of P:
$$E_P[f] \approx \frac{1}{M} \sum_{m=1}^{M} f(h[m])$$
But it can be difficult or inefficient to sample from P. Assume we sample instead from Q; then:
$$E_{P(H)}[f(H)] = E_{Q(H)}\Big[f(H)\,\frac{P(H)}{Q(H)}\Big]$$
Unnormalized importance sampling:
$$w(h[m]) = \frac{P(h[m])}{Q(h[m])}$$
Sample $D = \{h[1], \ldots, h[M]\}$; our estimator from M samples is:
$$\hat E_D[f] = \frac{1}{M} \sum_{m=1}^{M} f(h[m])\,w(h[m])$$
Claim (prove it!):
$$\mathrm{Var}_Q\big(f(H)\,w(H)\big) = E_Q\big[(f(H)\,w(H))^2\big] - \big(E_P[f(H)]\big)^2$$
To minimize the variance, use a Q distribution proportional to the target function:
$$Q(H) \propto |f(H)|\,P(H)$$
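The estimator in a few lines, on a toy discrete target with a uniform proposal (all distributions are illustrative assumptions):

```python
import random

random.seed(3)
P = [0.1, 0.2, 0.3, 0.4]         # target (known and normalized here)
Q = [0.25, 0.25, 0.25, 0.25]     # proposal we can actually sample from
f = [1.0, 2.0, 3.0, 4.0]

M = 200000
est = 0.0
for _ in range(M):
    x = random.randrange(4)      # sample x ~ Q (uniform)
    est += f[x] * P[x] / Q[x]    # reweight by w(x) = P(x)/Q(x)
est /= M

exact = sum(p * v for p, v in zip(P, f))   # E_P[f] = 3.0
```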
Correctness of likelihood weighting: Importance sampling
Unnormalized importance sampling with the likelihood weighting proposal distribution Q, for any function on the hidden variables.
For the likelihood weighting algorithm, our proposal distribution Q is defined by fixing the evidence at the nodes in a set E and ignoring the CPDs of variables with evidence:
$$Q(x \mid D) = \prod_{i \notin E} \Pr(x_i \mid \mathrm{pa}\,x_i)$$
We sample from Q just like forward sampling from a Bayesian network in which all edges going into evidence nodes were eliminated!
$$w(h) = \frac{P(h, s)}{Q(h, s)} = \prod_{i \in E} \Pr(x_i \mid \mathrm{pa}\,x_i)$$
$$\hat E_D\big[\mathbf{1}\{h_i\}\big] = \frac{1}{M} \sum_{m=1}^{M} \mathbf{1}\{h_i[m]\}\,w(h[m])$$
Proposition: the likelihood weighting algorithm is correct (in the sense that it defines an estimator with the correct expected value).
Normalized Importance sampling
When sampling from P(h|s) we don't know P, so we cannot compute w = P/Q. We do know P(h,s) = P(h|s)P(s), so define the unnormalized target P'(h) = P(h,s). Then:
$$E_{Q(H)}[w(H)] = \sum_h Q(h)\,\frac{P'(h)}{Q(h)} = \sum_h P'(h)$$
$$E_{P(H)}[f(H)] = E_{Q(H)}\big[f(H)\,w(H)\big]\,/\,E_{Q(H)}[w(H)]$$
Sample $D = \{h[1], \ldots, h[M]\}$; normalized importance sampling:
$$w(h[m]) = \frac{P'(h[m])}{Q(h[m])}, \qquad \hat E_D[f] = \frac{\frac{1}{M}\sum_m f(h[m])\,w(h[m])}{\frac{1}{M}\sum_m w(h[m])}$$
So we use sampling to estimate both terms:
$$\hat\Pr(h' \mid s) = \frac{\frac{1}{M}\sum_m \mathbf{1}\{h[m] = h'\}\,w(h[m])}{\frac{1}{M}\sum_m w(h[m])}$$
Using the likelihood weighting Q, we can compute posterior probabilities in one pass (no need to sample for P(s) and P(h,s) separately).
Limitations of forward sampling
Likelihood weighting is effective here: [Figure: a network with one arrangement of observed vs. unobserved nodes]
But not here: [Figure: a network with the opposite arrangement of observed vs. unobserved nodes]