The distance between sequences, Part I. Foundations M. Elizabeth Corey, Ph.D. UCSC Extension and UC Berkeley Extension

A simple start

Suppose we have two sequences, A and B

A = {a1, a2, …, am}

and

B = {b1, b2, …, bn}

and we want to know how similar they are.

What is the basis for their similarity?

The practical measure

What we usually do is obtain an alignment and then score using the sum of the pairwise scores:

$S_{AB} = \sum_{i=1}^{n} s(a_i, b_i)$

The nice metric

Wouldn’t it be nice if we could simply say that the distance between sequences was the geometric sum of the distances between loci in the sequences?

$D_{AB} = \sum_{i=1}^{n} (a_i - b_i)^2$

[Figure: dot matrix comparing the sequence a b c n j r q c l c r p m (columns 1 to 13) with the sequence a j c j n r c k c r b p (rows 1 to 12); a 1 marks each cell where the row and column residues match.]

Dynamic programming methods

• Computational method dating from the 1950s, introduced to biology as “Needleman-Wunsch” in 1970.

• A numerical value is assigned to every cell in the array giving the similarity/dissimilarity of residues

• The example shown

– match = +1

– mismatch = null (value 0)

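   a  b  c  n  j  r  q  c  l  c  r  p  m
a  1  .  .  .  .  .  .  .  .  .  .  .  .
j  .  .  .  .  1  .  .  .  .  .  .  .  .
c  .  .  1  .  .  .  .  1  .  1  .  .  .
j  .  .  .  .  1  .  .  .  .  .  .  .  .
n  .  .  .  1  .  .  .  .  .  .  .  .  .
r  .  .  .  .  .  1  4  3  3  2  2  0  0
c  3  3  4  3  3  3  3  4  3  3  1  0  0
k  3  3  3  3  3  3  3  3  3  2  1  0  0
c  2  2  3  2  2  2  2  3  2  3  1  0  0
r  2  1  1  1  1  2  1  1  1  1  2  0  0
b  1  2  1  1  1  1  1  1  1  1  1  0  0
p  0  0  0  0  0  0  0  0  0  0  0  1  0

(. = cell not yet summed; the array is being filled from the bottom row upward.)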

Dynamic programming methods

• GOAL: For each cell find the maximum possible score for an alignment ending at that point

• Searches the subrow and subcolumn, as shown, for the highest score

• Adds this to the score for the current cell

• Proceeds row by row through the array
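The fill described above can be sketched in Python. This is a minimal sketch, assuming the bottom-up, right-to-left fill order of the worked array, match = 1, mismatch = 0 and no gap penalty; the function name `nw_matrix` and its parameters are illustrative and do not come from the slides.

```python
def nw_matrix(rows, cols, match=1, mismatch=0):
    """Fill the score array from the bottom-right corner toward the top-left.

    H[i][j] is the best score of an alignment that starts by pairing
    rows[i] with cols[j] and continues toward the ends of both sequences.
    """
    n, m = len(rows), len(cols)
    H = [[0] * m for _ in range(n)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            s = match if rows[i] == cols[j] else mismatch
            best = 0
            if i + 1 < n and j + 1 < m:
                # Search the subrow and subcolumn that start at the
                # diagonal neighbour (i+1, j+1) for the highest score.
                subrow = H[i + 1][j + 1:]
                subcol = [H[k][j + 1] for k in range(i + 1, n)]
                best = max(subrow + subcol)
            H[i][j] = s + best
    return H


H = nw_matrix("ajcjnrckcrbp", "abcnjrqclcrpm")
best_score = max(max(row) for row in H)
```

With these settings the bottom rows of `H` should reproduce the filled rows of the example array, and the best alignment score is the largest value in the array.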

Maximum bipartite matching

Series of solutions, starting with Dijkstra, 1950’s

Find the set of matches that provide maximum flow.

Each match, ai to bj, has a capacity equal to its pairwise score.

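[Figure: bipartite graph joining the residues of A to the residues of B; the edge from a1 to b1 is labelled s(a1, b1).]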

Alignment’s not really the problem

• Optimal alignment falls into a set of problems with a long history in computer science.

• The underlying metric for distances between sequences falls in the province of biology.

Beguiled by a matrix (PAM)

PAM

• PAM starts with closely related sequences from 34 superfamilies, grouped into 71 evolutionary trees.

• PAM rests on a measure of amino acid “mutability”.

• PAM attempts to capture a representative slice of evolutionary behavior.

PAM (From Dayhoff, Schwartz and Orcutt)

• Obtain alignments for homologous proteins

• Compute scoring matrix elements using:

$s_{ij}(\tau_0) = \lambda m_i a_{ij} / \sum_j a_{ij}, \quad i \ne j$

$s_{ii}(\tau_0) = 1 - \lambda m_i$

where a_ij is the substitution frequency, m_i is the mutability of i and λ is a proportionality constant.

• Extrapolate to longer evolutionary distances by using $\{S(\tau_0)\}^n$

Limitations of PAM matrices

PAM matrices are built from alignments with > 85% identity.

The entries in the initial scoring matrix, S(t=1) arise from short time interval substitutions; raising S(1) to a higher power may not capture some interesting substitutions with longer rate constants.

The Gutzwiller temptation

• An abstract dynamic system (M, μ, φ_t)

– a measurable space, M, composed of the set of all sequences.

– a measure, μ, based on transition probabilities

– a group of automorphisms, φ_t, that map M onto itself and preserve μ, where the variable t runs through the integers.

What’s Bernoulli got to do with it?

• A scheme with subshift

– The measure on M is generated by the sets A_{i,j,k} = {a | a_i = j, a_{i+1} = k}, whose measure is given by a matrix of transition probabilities p_jk ≥ 0.

– A future event a1 depends on a0; hence, memory.

– Realized in the geodesic flow on a compact closed surface of constant negative curvature.

System behaviors

• Ergodicity: Transition probabilities are positive recurrent and aperiodic.

• Mixing: Inheritance and Mendelian exceptions lead to mixing.

• K-systems: Speciation events rigidly segregate M; other segregations exist.

Our salad days

• Jukes-Cantor

• HKY

• Kimura 2-Parameter

• PAM

• BLOSUM

General Stationary Time-reversible Model

R =

.       pCrCA   pGrGA   pTrTA
pArAC   .       pGrGC   pTrTC
pArAG   pCrCG   .       pTrTG
pArAT   pCrCT   pGrGT   .

Time reversibility: p_i r_ij = p_j r_ji

(Diagonal elements such that rows sum to zero)

General Stationary Time-reversible Model

$P(t) = e^{Rt}$

Given rates, one can find transition probabilities, and vice-versa.
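Going from rates to transition probabilities can be sketched in pure Python. This is a minimal sketch, assuming a truncated Taylor series with scaling and squaring for the matrix exponential and a Jukes-Cantor rate matrix as the test case; `mat_exp`, `jukes_cantor_rates` and the chosen numbers of terms are illustrative, and a numerical library would normally be used instead.

```python
import math


def mat_mul(A, B):
    """Multiply two square matrices given as lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(R, t, squarings=10, terms=12):
    """Approximate P(t) = exp(R t) by a Taylor series on R t / 2**squarings,
    followed by repeated squaring of the result."""
    n = len(R)
    scale = t / (2 ** squarings)
    X = [[R[i][j] * scale for j in range(n)] for i in range(n)]
    P = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in P]
    for k in range(1, terms + 1):
        term = mat_mul(term, X)                       # X^k / (k-1)!
        term = [[v / k for v in row] for row in term]  # X^k / k!
        P = [[P[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        P = mat_mul(P, P)
    return P


def jukes_cantor_rates(a):
    """Rate matrix with every off-diagonal rate equal to a; rows sum to zero."""
    return [[-3 * a if i == j else a for j in range(4)] for i in range(4)]


P = mat_exp(jukes_cantor_rates(0.1), t=2.0)
```

For the Jukes-Cantor case the result can be checked against the known closed form, where the probability of no net change is 1/4 + 3/4 e^{-4at}.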

Jukes-Cantor

R =

-3a    a    a    a
  a  -3a    a    a
  a    a  -3a    a
  a    a    a  -3a

Kimura 2-Parameter

R =

    A   C   G   T
A   .   b   a   b
C   b   .   b   a
G   a   b   .   b
T   b   a   b   .

a/b = transition/transversion bias

HKY (Hasegawa, Kishino, Yano)

R =

.     βpC   αpG   βpT
βpA   .     βpG   αpT
αpA   βpC   .     βpT
βpA   αpC   βpG   .

β/α = transversion / transition

The BLOSUMn matrices

• Start with multiple, ungapped alignments of proteins found using PROTOMAT.

• Build clusters by placing together sequences with N% identity.

• Measure the score for each pair defined as:

s_ij = 2*log2(p_ij/e_ij)

e_ij is the expected probability of occurrence of the i,j pair

p_ij is the observed probability of the i,j pair.
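The log-odds score can be sketched as follows. This is a minimal sketch, assuming the expected pair probabilities are derived from the marginal residue frequencies of the observed pairs (e_ii = q_i², e_ij = 2 q_i q_j); the function name `log_odds_scores` and the toy counts are hypothetical, and every pair in the input must have a positive count.

```python
import math
from collections import defaultdict


def log_odds_scores(pair_counts):
    """pair_counts maps unordered residue pairs (i, j) to observed counts."""
    total = sum(pair_counts.values())
    p = {pair: c / total for pair, c in pair_counts.items()}

    # Marginal frequency of each residue among the aligned pairs.
    q = defaultdict(float)
    for (i, j), pij in p.items():
        if i == j:
            q[i] += pij
        else:
            q[i] += pij / 2
            q[j] += pij / 2

    scores = {}
    for (i, j), pij in p.items():
        eij = q[i] * q[j] if i == j else 2 * q[i] * q[j]
        scores[(i, j)] = 2 * math.log2(pij / eij)  # half-bit units
    return scores


toy = {("A", "A"): 6, ("A", "B"): 2, ("B", "B"): 2}
scores = log_odds_scores(toy)
```

In the toy data the A:B pair is seen less often than its marginals predict, so its score is negative, while both identities score positive.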

Limitations

Naive approach: measure frequencies of aligned pairs and gaps in randomly selected confirmed alignments to get p_ij, use a “random” set of sequences to obtain e_ij.

• Difficulty 1: it is difficult to get a good random sample of sequences or alignments – databases are biased.

• Difficulty 2: When sequences diverged from a common ancestor recently, p_ij for i ≠ j is small and s is strongly negative. When sequences diverged long ago, p_ij tends to e_ij and s approaches zero.

A short compendium of distances and scores

• Jukes-Cantor distance

• Kimura distance

• Dayhoff evolutionary distance

• BLOSUM scores

• Profile scores

• Average scores

References

• Gu, X. & Li, W, 1996. A general additive distance with time-reversibility and rate variation among nucleotide sites. Proc. Natl. Acad. Sci. USA 93: 4671-4676.

• Hasegawa, M., Kishino, H., & Yano, T., 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22: 160-174.

• Sanderson, M. J. & Shaffer, H. B., 2002. Troubleshooting molecular phylogenetic analyses. Annu. Rev. Ecol. Syst. 33: 49-72.

The distance between sequences, Part II.

Careful Measures



Exceptions to Mendel’s Laws

The theory: a chromosomal basis of inheritance

Some so-called exceptions:

• linkage and recombination

• gene conversion

• transposition and mobile genetic elements

• A plethora of other mutations: point mutations, reversions, deletions, frameshifts, duplications, inversions

“Exceptions” do not result in rejection of Mendelian genetics but a better understanding of the mechanisms underlying Mendelian inheritance.

Mutation frequencies (#mutations/generation)

• Frequency of point mutation: 10^-7 to 10^-8

• Reversion of point mutations: ~10^-8. Sometimes called back mutation, sometimes called convergence.

• Reversion of deletion mutations: undetectably small.

“Loss of function” mutations result in grossly lower biological fitness. The rate of extinction due to gross “loss of function” is much greater than the rate of reversion, so the line will die long before reversion can occur. In the aggregate, the record will show a pseudo-reversion.

Mutation frequencies

• Deletions: 10^-6 – dependent on chromosomal region. Caveat: May be underestimated; less detectable because they are often lethal.

• Frameshifts: 10^-6 – often repaired.

• Duplications: 10^-3 – E. coli: approximately 0.1% of a culture for a given region of the chromosome.

• Inversions: hard to detect, not always mutations

• Gene Conversions: still unknown. Reparative.

Mutators increase mutation frequencies by a factor of ~100; they work on “hot spots”

Protein-based inheritance – Prions

• Proteins that change their shape in response to fluctuating environmental pressures, and then maintain that shape during mitosis and meiosis, constitute a form of cellular memory.

• Various structural conformations are propagated outside of the traditional genetic framework.

Hsp90 and Sup35

• A buffer for silent polymorphisms: Hsp90

– promotes the folding of signal transducers

– buffers the effects of many silent polymorphisms

– may serve as a capacitor of evolutionary change – storing and releasing genetic variation

• “Epigenetic inheritance”: The Sup35 prion

James Joyce’s List

Milk

Call mom!

Lettuce

Plumb the smithy of my soul for the unborn race-consciousness…

Rent

--------------------------------------

Thriving in fluctuating environments by exploiting pre-existing genetic variations.

References

Recent Publications on Conformational Change and Evolution

• Queitsch, C., Sangster, T.A. and Lindquist, S. 2002. Hsp90 as a capacitor of phenotypic variation. Nature 417: 618-624.

• Jensen, M.A., True, H.L., Chernoff, Y.O., and Lindquist, S., 2001. Molecular Population Genetics and Evolution of a Prion-like Protein in Saccharomyces cerevisiae. Genetics 159: 527-535.

• True, H.L., and Lindquist, S.L. 2000. A yeast prion provides an exploratory mechanism for genetic variation and phenotypic diversity. Nature 407: 477-483.

• Rutherford, S.L. and Lindquist, S. 1998. Hsp90 as a capacitor for morphological evolution. Nature 396: 336-342.

Mutations and time

Take a series of sequences and figure out how different they are by counting up their substitutions.

[Figure: sequences A, B and C, with pairwise differences labelled 5 substitutions, 3 substitutions and 6 substitutions.]

Mutations and time

What process takes us from A to B to C?

[Figure: A to B: gene conversion. B to C: frameshift (repairable) and 2 point accepted mutations. A to C: no direct ancestry.]

Counting mutations

Consider a counting process {N(t), t ∈ T} where N(tj) – N(ti) is the number of mutations in the time interval (ti, tj].

[Figure: tree relating A, B and C.]

N(AB) = 1 GC

N(BC) = 1 FS, 2 PM

No direct ancestry between A and C, but we can still count substitutions: N(AC) = 6 PM

Times on the edges of the tree

The “interoccurrence” times between mutations, τ_i = t_i – t_{i-1} for i = 1, 2, … (with t_0 = 0),

are exponential variables with mean 1/λ such that

$P[\tau_i > h] = e^{-\lambda h}$

and

$P[\tau_i \le h] = 1 - e^{-\lambda h}$

for h ≥ 0.
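The exponential waiting times can be checked numerically. This is a minimal sketch, assuming a single constant rate; the function names and the rate of 0.5 events per unit time are illustrative only.

```python
import math
import random


def prob_wait_exceeds(rate, h):
    """P[tau > h] for an exponential waiting time with the given rate."""
    return math.exp(-rate * h)


def sample_waits(rate, n, seed=0):
    """Draw n interoccurrence times from a Poisson process with this rate."""
    rng = random.Random(seed)
    return [rng.expovariate(rate) for _ in range(n)]


waits = sample_waits(rate=0.5, n=100_000)
mean_wait = sum(waits) / len(waits)                     # close to 1 / rate
frac_long = sum(w > 2.0 for w in waits) / len(waits)    # close to exp(-1)
```

The sample mean should sit near 1/λ and the fraction of waits longer than h near e^{-λh}, up to sampling noise.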

Edge times

Gene conversions: λ_gc = 1 gc/2,000* years

Frame shifts: λ_fs = 1 shift/5,000 years

Point mutations: λ_pm = 1 pm/10,000 years

*Just a wild guess

A to B: 1/λ_gc = 2,000 yrs

B to C: 2/λ_pm + 1/λ_fs = 25,000 yrs

A to C: 6/λ_pm = 60,000 yrs
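The edge-time arithmetic can be written out as a short calculation. This is a minimal sketch, assuming each event type contributes its count divided by its rate, and that a population of size N multiplies every rate by N; the names `RATES` and `expected_edge_time` are hypothetical.

```python
# Events per year for each mutation type (the slide's guessed rates).
RATES = {"gc": 1 / 2000, "fs": 1 / 5000, "pm": 1 / 10000}


def expected_edge_time(events, rates=RATES, population=1):
    """Sum of count / (population * rate) over the event types on an edge."""
    return sum(count / (population * rates[kind])
               for kind, count in events.items())


a_to_b = expected_edge_time({"gc": 1})              # one gene conversion
b_to_c = expected_edge_time({"fs": 1, "pm": 2})     # one frameshift, two point mutations
a_to_c = expected_edge_time({"pm": 6})              # six point mutations
b_to_c_pop = expected_edge_time({"fs": 1, "pm": 2}, population=10**6)
```

Scaling by population size is what turns edge times of thousands of years into small fractions of a year.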

Edge times

Population of A = 10^5

Population of B = 10^6

Population of C = I don’t care.

A to B: 1/(N_a λ_gc) = 2.0 × 10^-2 yrs

B to C: 2/(N_b λ_pm) + 1/(N_b λ_fs) = 25 × 10^-3 yrs

A to C: 6/(N_a λ_pm) = 60 × 10^-2 yrs

Calculating divergence times

Doolittle, R.F., Feng, D-F, Tsang, S., Cho, G and Little, E. “Determining Divergence Times of the Major Kingdoms of Living Organisms With a Protein Clock.” Science, 271, pp. 470-477, 1996.

Calculating divergence times

Task: Build a model for evolutionary time based on pairwise distances, dij, and the fossil record

– Start with the vertebrate fossil record - the biogeochemistry gives reliable times.

– Map the fossil-based phylogeny to the sequence-based phylogeny and compare edge lengths.

– Adjust the sequence-based time model to match the vertebrate fossil record.

Using the fossil record

Vertebrates: Time of first appearance in fossil record versus sequence similarity

[Figure: plot of Distance Measure (y-axis, 0 to 30) against Time (ma) (x-axis, 0 to 600).]

Readjusting the clock

After sampling the vertebrate fossil record and fitting the sequence data to it, they maintain the same clock.

Result: Eukaryotes and Prokaryotes diverged about 2.5 billion years ago.

On fitting the fossil record to sequence data

Challenges: unequal rates of change in different species due to:

– different reproductive cycles in different species

– different base population sizes in different species.

How viable is it to obtain bacterial mutation rates from vertebrate mutation rates when we are looking at the evolution of populations?

Population mutation

Suppose an average rate of mutation per site is about 10^-7 (ignoring duplications).

Compare lengths of reproductive cycles:

– Prokaryotes (blue-green algae and bacteria): 20 minutes to an hour per generation.

– Humans: US, average time to first child is 24.8 years.

How many times does a bacterium reproduce in the time it takes a human being to reproduce? At one generation per hour:

24 * 365 * 25 = 219,000

So if we are comparing bacterial mutation rates to human mutation rates and we are looking at aggregate populations, we have to adjust by a factor of up to roughly 10^6 (657,000 at 20 minutes per generation).
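The generation count is simple enough to check directly. This is a minimal sketch, assuming a 25-year human generation and a 365-day year as in the arithmetic above; the function name `generations_ratio` is illustrative.

```python
def generations_ratio(human_generation_years, bacterial_generation_minutes):
    """Number of bacterial generations that fit into one human generation."""
    minutes_per_year = 365 * 24 * 60
    return human_generation_years * minutes_per_year / bacterial_generation_minutes


slow = generations_ratio(25, 60)   # one bacterial generation per hour
fast = generations_ratio(25, 20)   # one bacterial generation per 20 minutes
```

The two generation times bracket the ratio between about 2 × 10^5 and 7 × 10^5, which is where the order-of-magnitude adjustment comes from.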

Population mutation

Size of the base population on planet earth:

5 × 10^30 prokaryotes (UG, Bill Whitman) - including about a mole of bacteria

3 × 10^9 humans

How many bacteria are there, propagating how fast, in comparison to humans? Worst case ratio?

Calculate using base population × rate of generation × number of mutable genes:

(10^23 × 10^6 × 10^3) / (10^9 × 1 × 10^4) = 10^19

One final issue: The Success Question

When mutations succeed, they succeed within an ecological niche.

So when we ask “When did a species arise?”, it is not enough to ask about the likelihood of a certain kind of mutation, one must also ask: what is the likelihood that that mutation arose in a niche that would support it?

So, don’t forget about acceptance rates.

The FOXP2 point mutation

Enard et al., “Molecular evolution of FOXP2, a gene involved in speech and language”, Nature, Vol. 418, August 22, 2002

Silent/expressed mutations in FOXP2

Edge labels are: Amino Acid / DNA substitutions

[Figure: phylogeny of Human, Gorilla and Orangutan with internal nodes HG and OHG; edge labels are 0/7, 0/2, 1/2 and 2/2.]

Selective sweeps

Measures for determining the existence of a sweep:

– Tajima’s D: from Genetics, 1989 (conservative)

– Fay and Wu’s H: from “Hitchhiking under positive Darwinian selection”, Genetics, 2000.

Also, Griffiths and Tavare estimate selection using linked SNP data

Population mutation rates

θ_ia = 4 N_a μ_i - the population mutation rate for site i in species a, where N_a is the effective population size of species a and μ_i is the mutation rate per generation at site i.

Tajima’s D for FOXP2

[Figure: plot with x-axis “# of point mutations” (1 to 16) and y-axis from 0 to 0.2.]

0.03%

S/a_n = 0.079%

S is the number of segregating sites.

a_n = 1 + 1/2 + … + 1/(n-1), where n is the sample size.
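The quantity S/a_n can be computed as follows. This is a minimal sketch, assuming the usual definition in which S is the number of segregating sites in a sample of n sequences and a_n is the sum of 1/i for i from 1 to n-1; the function names and the example numbers are hypothetical and are not the FOXP2 data.

```python
def harmonic_a(n):
    """a_n = 1 + 1/2 + ... + 1/(n-1) for a sample of n sequences."""
    return sum(1.0 / i for i in range(1, n))


def watterson_theta(segregating_sites, n):
    """Estimate of the population mutation rate from segregating sites."""
    return segregating_sites / harmonic_a(n)


theta = watterson_theta(segregating_sites=11, n=4)
```

Dividing the result by the sequence length gives a per-site value that can be compared with a per-site diversity estimate, which is the comparison Tajima's D formalises.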

Discovering different rate constants

Finding the time of appearance of the FOXP2 segregation

• Sample current human population worldwide.

• Generate trees with different times for the human sequence data.

• Measure the likelihood of the different trees.

Multiple rates

The automorphism mapping M onto itself used to be a simple shift operation.

Now, it incorporates several underlying processes, including: – mutation of the bases (mutation rate)

– expression of the mutations (expression rate)

– stabilization of a conformational phenotype (stabilization rate)

– success of the substitution (acceptance rate)

The distance between sequences,

Part III.

Algorithms for phylogenies



Motivation

• Phylogenies provide measures of similarity and can lay a foundation for scoring alignments.

• Rate structures provide indicators for motifs.

• Branch points allow us to identify and classify interesting bases.

– If the branch points are in phenotypic trees, the mutating bases can be used as phenotypic identifiers.

– If the branch points are in genotypic trees, mutating (nonsilent) bases can be used as genetic identifiers.

What goes into a phylogeny?

• Distance measures (UPGMA, NN)

• Site info (MLE and Parsimony)

• Substitution scores

• Equilibrium distributions for MLE

[Diagram boxes: Pairwise Alignment, Multiple Alignment, Phylogenies, Transitional probability data.]

What do we get in return?

• Guide trees

• Rates and probabilities

• Scoring matrices

[Diagram boxes: Pairwise Alignment, Multiple Alignment, Phylogenies, Transitional probability data.]

Part III: Goals

• Depict methods for finding guide trees for progressive multiple alignment.

• Clarify the differences between MLE, Maximum Parsimony and Distance Methods and identify the optimization techniques appropriate for each.

• Define a new approach for faster identification of near-optimal phylogenies.

Progressive multiple alignment

• Choose a set of scores for sequence comparison

– Alignment scores from Needleman-Wunsch, Smith-Waterman and variants.

– Consensus word score from BLAST, PSI-BLAST and others

– Substitution (scoring) matrices – PAM, BLOSUM, Jukes-Cantor, etc.

• Construct a reputable guide tree

– Hierarchical clustering (UPGMA, Neighbor-Joining, Fitch and Margoliash)

– Maximum Parsimony (simple or weighted).

– Maximum Likelihood Estimation (MLE)

• Use the guide tree to produce an alignment

Tree evaluation - Parsimony

• Given a semi-labeled tree, it is possible to determine the tree’s internal nodes (ancestral sequences) using a parsimony algorithm.

• Evaluation function: A summation of the scored mutations in the parsimonious tree.

Parsimony - Illustrated

Leaves ABC and ADC join at node 1: A(B or D)C, cost is 1

Leaves ABE and ACC join at node 2: A(B or C)(E or C), cost is 2

Nodes 1 and 2 join at node 3: ABC, cost is 3

Example: Simple Parsimony

Initialization: Set the cost, C = 0. Set k = 2n-1, where n is the number of sequences.

Recursion to compute node, N_k:

if k is a leaf node, N_k = sequence k

if k is not a leaf node:

Compute N_i and N_j for the daughter nodes of N_k.

Set $N_k = N_i \cap N_j$ where the intersection of N_i and N_j is nonempty,

otherwise increment the cost by the number of nonmatching residues and set $N_k = N_i \cup N_j$.

Termination: Minimum cost of tree = C.
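The recursion can be sketched in Python and run on the four sequences from the illustration. This is a minimal sketch, assuming unit cost per change and a tree written as nested tuples with equal-length strings at the leaves; the function name `fitch` and this tree encoding are illustrative.

```python
def fitch(tree):
    """Return (per-site state sets, cost) for a nested-tuple binary tree.

    A leaf is a string; an internal node is a pair (left, right).
    """
    if isinstance(tree, str):
        return [{ch} for ch in tree], 0
    left, right = tree
    sets_i, cost_i = fitch(left)
    sets_j, cost_j = fitch(right)
    sets_k, cost = [], cost_i + cost_j
    for a, b in zip(sets_i, sets_j):
        common = a & b
        if common:
            sets_k.append(common)      # nonempty intersection: no new change
        else:
            sets_k.append(a | b)       # empty intersection: one change at this site
            cost += 1
    return sets_k, cost


root_sets, total_cost = fitch((("ABC", "ADC"), ("ABE", "ACC")))
```

On this tree the root comes out as the single sequence ABC with a total cost of 3, in line with the illustration.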

Tree evaluation – Distance methods

• Given a set of alignment scores, but without assuming a tree topology, it is possible to determine a tree and its edge lengths using a distance method. This is sometimes called minimum evolution and includes the hierarchical clustering methods.

• Evaluation function: The sum of the edge lengths.

Hierarchical Clustering – Illustrated UPGMA

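[Figure: UPGMA applied to five sequences, 1 to 5. Sequences 1 and 2 merge into node 6 at height t1 = t2 = ½d12; sequences 4 and 5 merge into node 7; sequence 3 and node 7 merge into node 8; nodes 6 and 8 merge at the root, node 9, at height ½d68. Leaf order in the final tree: 1 2 4 5 3. From Durbin et al, 2001]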

Algorithm: UPGMA

Input: N sequences and their relative distances, dij

Initialization:

Assign each sequence to its own cluster, Ci.

Define a leaf of T for each sequence and place at height = 0.

Iteration:

Pick two clusters Ci, Cj such that dij is minimal.

Define a new cluster k by Ck = {Ci,Cj}.

Define a new set of distances {dkl} between Ck and all current clusters.

Define a node k with daughter nodes i and j, and place it at height ½dij.

Add k to the set of current clusters and remove i and j.

Termination:

Rooted: When only two clusters i, j, remain, add the root at height ½dij.
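A compact Python version of this procedure follows. This is a minimal sketch, assuming the new distances are size-weighted averages of the merged clusters' distances (the average-linkage rule) and that ties are broken by input order; the function name `upgma` and the nested-tuple output are illustrative.

```python
from itertools import combinations


def upgma(dist):
    """dist maps leaf pairs (i, j) to d_ij. Returns (root, heights).

    Clusters are nested tuples of leaf labels; heights maps every
    leaf and cluster to its height in the tree.
    """
    d = {frozenset(pair): float(v) for pair, v in dist.items()}
    active = sorted({x for pair in dist for x in pair})
    size = {x: 1 for x in active}
    height = {x: 0.0 for x in active}
    while len(active) > 1:
        # Pick the two clusters with minimal distance.
        i, j = min(combinations(active, 2), key=lambda p: d[frozenset(p)])
        k = (i, j)
        size[k] = size[i] + size[j]
        height[k] = d[frozenset((i, j))] / 2
        active = [x for x in active if x not in (i, j)]
        for l in active:
            d[frozenset((k, l))] = (size[i] * d[frozenset((i, l))]
                                    + size[j] * d[frozenset((j, l))]) / size[k]
        active.append(k)
    return active[0], height


root, heights = upgma({("A", "B"): 2, ("A", "C"): 4, ("B", "C"): 6})
```

Here A and B merge first at height 1, and C joins them at height 2.5, half the averaged distance of 5.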

Tree evaluation - MLE

• Given a tree topology and sequences preassigned to each leaf, it is possible to determine a tree’s edge lengths using maximum likelihood estimation.

• Evaluation function: the likelihood of the tree.

Estimating Likelihood

• Estimate branch lengths by viewing evolution as a random process

• Requires a probability model of evolution as a function of time.

– For DNA one can use Jukes-Cantor model (all nucleotides have same substitution rates), or Kimura model (different rates for transitions and transversions).

– For proteins one can use Dayhoff, but in the probability form not the log-odds form.

Estimating Likelihood

S1, etc. are the bases or residues observed in the extant and ancestral taxa.

v = μt where μ is the substitution rate and t is absolute time

P_{i,j}(v) is the probability that the residue at node s_i becomes the residue at node s_j in time v

π_0 is the prior probability of the bases or residues at any position

The likelihood for this tree is:

$L = \pi_0 P_{0,5}(v_5) P_{5,1}(v_1) P_{5,2}(v_2) P_{0,6}(v_6) P_{6,3}(v_3) P_{6,4}(v_4)$

Example: Likelihood

For each mutating site, u, in a set of sequences:

Initialization: Set k = 2n-1, where n is the number of sequences.

Recursion: Compute P(Lk|a) for each symbol, a, in the alphabet as follows:

If k is a leaf node: if xk,u = “a”, then P(Lk|a) = 1, else P(Lk|a) = 0.

If k is not a leaf node:

Compute P(Li|a), P(Lj|a) for all a at daughters i, j

Set $P(L_k|a) = \sum_{b,c} P(b|a,t_i) P(L_i|b) P(c|a,t_j) P(L_j|c)$.

Termination: Likelihood for site u = $\sum_a \pi_a P(L_{2n-1}|a)$

(π_a is the equilibrium value of the probability distribution for a.)

Concluding step: Combine the likelihoods for each site.
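The recursion can be sketched for DNA as follows. This is a minimal sketch, assuming a Jukes-Cantor substitution model with uniform equilibrium frequencies and a tree written as nested tuples (left, t_left, right, t_right) with strings at the leaves; all function names and the encoding are illustrative.

```python
import math

ALPHABET = "ACGT"


def jc_prob(b, a, t, rate=1.0):
    """P(b | a, t) under Jukes-Cantor with off-diagonal rate `rate`."""
    e = math.exp(-4.0 * rate * t)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def partial(node, site):
    """Return {a: P(L_k | a)} for one site of the subtree rooted at node."""
    if isinstance(node, str):                       # leaf: observed residue
        return {a: 1.0 if node[site] == a else 0.0 for a in ALPHABET}
    left, t_i, right, t_j = node
    p_i, p_j = partial(left, site), partial(right, site)
    return {a: sum(jc_prob(b, a, t_i) * p_i[b] for b in ALPHABET)
               * sum(jc_prob(c, a, t_j) * p_j[c] for c in ALPHABET)
            for a in ALPHABET}


def site_likelihood(tree, site):
    """Sum over root states of equilibrium frequency times P(L_root | a)."""
    p = partial(tree, site)
    return sum(0.25 * p[a] for a in ALPHABET)


def log_likelihood(tree, n_sites):
    """Combine sites by summing their log-likelihoods."""
    return sum(math.log(site_likelihood(tree, u)) for u in range(n_sites))


tree = (("AC", 0.1, "AG", 0.2), 0.05, "CC", 0.3)
ll = log_likelihood(tree, n_sites=2)
```

The double sum over b and c factorises into a product of two single sums, which is what `partial` computes; sites are combined by multiplying likelihoods, done here in log space.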

Maximizing Likelihood Estimation over edge times

Likelihood estimation includes a step for computing the likelihood of some character “a” at node k given the subtree of k.

While we know that there is the possibility of substitutions leading to a, these depend on how long a time we have to make those substitutions and we do not know the edge times of the tree. We must explore a series of possible times in order to maximize the likelihood.

• A method that maximizes likelihoods over edge times is what is usually referred to as MLE.

• Standard MLE procedures do not maximize likelihoods over all topologies of the tree.

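Comparisons between MLE, Parsimony and Distance Methods

MLE

– Requires semi-labeled tree: Yes

– Requires scored alignments: No

– Order: La^(2n-1); 2an^2

– Results, edge weights: Transitional probabilities

– Results, internal tree nodes: subtree probability

– Resulting tree is ultrametric: Yes

Parsimony

– Requires semi-labeled tree: Yes

– Requires scored alignments: No

– Order: 2an^2

– Results, edge weights: Mutation counts

– Results, internal tree nodes: Ancestral sequences

– Resulting tree is ultrametric: No

Distance Methods

– Requires semi-labeled tree: No

– Requires scored alignments: Yes

– Order: 2n^2

– Results, edge weights: Distance measures, e.g. alignment scores

– Results, internal tree nodes: UPGMA: a cluster of sequences

– Resulting tree is ultrametric: UPGMA - no; NN - yes.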

Exploring different topologies

• Successive addition and rearrangement

– Very common method (see Phylip programs including: PROTPARS, DNAPARS, DNACOMP, DNAML, DNAMLK, RESTML, KITSCH, FITCH, CONTML, MIX and DOLLOP)

– Sequences are taken in the order that they appear in the input file and successively added to a tree.

• MCMC

Successive addition

• Initialization:

– Place the set of sequences into L.

– Create a tree,T, with one node – the root.

• Iteration: for each sequence in L

– Remove a sequence from L and add it as a leaf to T.

– Apply a process of local rearrangement (in Felsenstein’s package, there are (n-1)(2n-3) arrangements.)

– Score each locally arranged tree.

– Set T to the best-scoring tree.

• Termination: Globally rearrange the tree by swapping subtrees, score each globally rearranged tree and accept the tree with the best score.

Markov Chain Monte Carlo

A Bayesian method for phylogenetic inference – Moderately new method rooted in molecular dynamics.

– Topologies are randomly generated and scored so that a representative set of most likely tree topologies can be identified.

Mau, Newton and Larget (1998) apply MCMC to sample trees using Bayes theorem. The following explanation is based on their methodology - the mistakes are mine, the facts and foundations, theirs.

Introduction to the method

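The sample space is the set of all semi-labeled trees.

Introduction to the method

Sampling the set of trees

[Figure: the set of trees sampled by successive proposals Q1, Q2, Q3.]

Introduction to the method

[Figure: a chain of accepted samples, with leaf orders a b c d, a c b d, a b c d, … linked by proposals Q01, Q12, Q23, …]

Introduction to the method

The partitioned space with representatives {τ1, τ3}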

MCMC propaganda

MCMC methods:

• allow exact inference provided certain convergence criteria are demonstrated.

• are efficient and can handle many more taxa or sequences.

• measure uncertainty during tree construction (no bootstrapping needed.)

Summary of the Algorithm

1. Choose a starting tree

2. Perturb the current tree’s topology and branch lengths to find a new tree.

3. Measure the likelihood for the new tree.

4. Compare the new tree to the last tree and decide whether or not to accept it into the chain.

5. If you’ve got a sufficiently long chain, check the characteristics of your sample to see if there is convergence to a set of representative topologies. If so, stop. Otherwise, go to 2.

Subproblems to be discussed

1. How do we represent the tree so that it is easy to operate on? Cophenetic matrices.

2. What is our perturbation operator?

3. How do we build our sampling chain?

4. When are we done sampling?

The Cophenetic Matrix

Some Notation

τ – a topology

n – a node

a(n) – the ancestor of a node

L – a leaf node (the leaves are the current record)

I – an internal node (the historical record)

Cophenetic Trees

Labeled history (t1, t2) provides an order on coalescent levels.

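[Figure: tree with leaves L1, L2 and L3 and internal nodes I0, I1 and I2 placed on coalescent levels 0, 1 and 2; t1 and t2 mark the times between levels.]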

Example: A Cophenetic Tree

These trees are described in terms of nodes coalescing or merging backwards in time.

t1= 0.8

t2=0.3

t3=0.7

t4=0.5

t5=0.9

t6=1.5

total: 4.7

Example: Cophenetic Matrix

Leaf    5     7     4     1     2     6     3
5       0   9.4   9.4   9.4   9.4   9.4   9.4
7             0   1.6   4.6   6.4   6.4   6.4
4                   0   4.6   6.4   6.4   6.4
1                         0   6.4   6.4   6.4
2                               0   3.6   3.6
6                                     0   2.2
3                                           0

The cophenetic matrix for the previous tree.

The tree representation (σ, a) is {(5,7,4,1,2,6,3), (4.7, 0.8, 2.3, 3.2, 1.8, 1.1)}
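The matrix can be rebuilt from the tree representation. This is a minimal sketch, assuming the second tuple holds the coalescent height between each pair of adjacent leaves in the given leaf order, so that the distance between two leaves is twice the largest height lying between them; the function name `cophenetic_matrix` is illustrative.

```python
def cophenetic_matrix(order, heights):
    """Return d[x][y], the within-tree distance between leaves x and y.

    order   -- leaf labels in left-to-right order
    heights -- heights[i] is the coalescent height between order[i] and order[i+1]
    """
    n = len(order)
    d = {leaf: {} for leaf in order}
    for x in range(n):
        d[order[x]][order[x]] = 0.0
        for y in range(x + 1, n):
            # Two leaves coalesce at the highest node separating them.
            dist = 2 * max(heights[x:y])
            d[order[x]][order[y]] = dist
            d[order[y]][order[x]] = dist
    return d


d = cophenetic_matrix((5, 7, 4, 1, 2, 6, 3), (4.7, 0.8, 2.3, 3.2, 1.8, 1.1))
```

Doubling the listed heights gives the superdiagonal 9.4, 1.6, 4.6, 6.4, 3.6, 2.2, and every other entry is the maximum over the intervening superdiagonal entries.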

The Cophenetic Matrix

Theorem: For any weighted binary tree with labeled leaf nodes, the tree topology and branch lengths can be uniquely determined using the within-tree distances between all pairs of leaf nodes. (Lapointe and Legendre, 1992)

Note, each permutation of the leaf labels generates a different n x n symmetric matrix of distances.

What is the perturbation operator?

Q is the proposal function and it has two stages:

• Q1 randomly selects a new leaf order

• Q2 perturbs the values of the matrix superdiagonal.

The proposal mechanism is symmetrical: Q(τ_n, τ_{n+1}) = Q(τ_{n+1}, τ_n)

Details on Q1 and Q2

Q1 samples one of the 2^(n-1) leaf orderings of the current tree model.

Q2 simultaneously and independently modifies the elements of the superdiagonal by drawing each from a uniform distribution on (a_i ± d), where d is constant.

By applying both types of perturbations, Q1 and Q2, all the permutations of trees can be reached.

Illustration of Q2

Subproblems to be discussed

1. How do we represent the tree so that it is easy to operate on? Use cophenetic matrices.

2. What is our perturbation operator? Q.

3. How do we build our sampling chain? Apply Metropolis-Hastings

4. When are we done?

Acceptance with Metropolis-Hastings

Given a tree τ, Metropolis-Hastings:

1. Applies Q to build a new tree, τ*.

2. Always accepts the new tree when it is more likely than the old one and sometimes accepts it when it is less likely than the old one.

Acceptance with Metropolis-Hastings – the algorithm

If P(τ*) > P(τ)

accept τ* into the chain.

else

accept τ* into the chain with probability P(τ*) / P(τ)

Acceptance with Metropolis-Hastings

The final step in evaluating the acceptance test is evaluating

P(τ*) / P(τ)

This is easy: P(τ) is approximated using the likelihood estimate (LE) of τ
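The acceptance rule can be tried on a toy problem. This is a minimal sketch, assuming a symmetric proposal and four states on a ring standing in for tree topologies, with fixed weights standing in for the likelihoods; `acceptance_probability`, `metropolis_chain` and the weights are all illustrative, not the tree sampler itself.

```python
import random


def acceptance_probability(p_new, p_old):
    """Accept with probability min(1, P(new) / P(old)) for a symmetric proposal."""
    return min(1.0, p_new / p_old)


def metropolis_chain(weights, steps, seed=0):
    """Sample states 0..n-1 in proportion to positive `weights`.

    The proposal moves one step left or right on a ring, so it is symmetric.
    """
    rng = random.Random(seed)
    n = len(weights)
    state = 0
    chain = []
    for _ in range(steps):
        proposal = (state + rng.choice((-1, 1))) % n
        if rng.random() < acceptance_probability(weights[proposal], weights[state]):
            state = proposal          # accept the proposed state
        chain.append(state)           # a rejection repeats the current state
    return chain


chain = metropolis_chain([1, 2, 3, 4], steps=200_000)
frequencies = [chain.count(s) / len(chain) for s in range(4)]
```

The visit frequencies should approach 0.1, 0.2, 0.3 and 0.4, which mirrors how the most frequent topologies in a tree chain are read off as the reported reconstructions.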

Size of chain and convergence

• How many trees do you have to propose before you begin to get a good enough sample? Mau et al 1998 sample over about 2500 trees for Clarkia, a phylogeny with 9 leaves

• How do you test that you are done? At the end of the run, we say that we have convergence if there is a small set of topologies with high relative frequency in the chain.

• What’s the result? The topologies with the highest frequencies are the reported reconstructions.

Mixing

• To obtain a confidence measure, the algorithm must be run more than once: each run generates a chain of accepted trees.

• Chains “mix” well when they come up with the same representative topologies, starting from different tree topologies.

• If running a sufficient number of independent chains is computationally prohibitive, Suchard et al, 2002, provide a “poor man's estimate of the uncertainty”.

Example with binary data (from Mau, et al, 1998)

9 species of genus Clarkia (California plants)

120 restriction sites

Data translated into a 9 x 120 matrix of zeroes and ones, representing the absence or presence of a restriction site in the genome of each species.

Running the MCMC algorithm

Random starting trees

Chains of length 250,000 were subsampled at rate of 1/100 = 2500 trees

Each run took 20 minutes on a Sparc 10.

Convergence was inferred by reproducibility across runs with very different starting trees.

The most common topologies for Clarkia

A = 1,2; B = 3,4; C = 5,6; D=8,9

References

Smouse and Li (1989) introduced the Bayesian paradigm, but not the notation, to the phylogeny reconstruction problem.

Goldman (1993) used non-Bayesian Monte Carlo tests of significance to assess the adequacy of evolutionary models.

Griffiths and Tavare (1994) constructed Markov chains to compute likelihoods for ancestral inference.

Mau, Newton and Larget (1998) apply MCMC to sample trees using Bayes theorem.

Drill-down: Rates

The way I use it, and I admit this is quirky, motif means the genetic profile for a functional structure. Using the following definitions:

– Let rG be the rate of mutation for a gene.

– Let rE be the rate of expressed mutation for the protein G encodes.

– Let rS be the rate of structural mutation for the protein G encodes.

– Let rF be the rate of functional mutation for the protein G encodes.

rG > rE > rS > rF

Note that the rate of neutral mutations is rN = rG – rF.

The “true” rate of mutation for a motif is rF; the observed rate of mutation for members of a motif in a genotypic tree is rG. If we want motif branchings, we eliminate all branchings in the phylogeny occurring with rates rN.

Drill-down: Semi-labeled trees

Trees with a defined branching pattern and defined leaf labels but WITHOUT edge lengths or internal node labels.

In our terms, phylogenies with known branching patterns but without information about ancestors or mutation times.

[Figure: semi-labeled tree with leaves nccbac, nacbac, ncbbbc and nccnaa.]

Drill-down: Progressive Alignments

• As you move up the tree, add to sum of characters in growing alignment

Progressive Alignments

Sum of characters in growing alignment can be represented in a table of values called a frequency matrix or a profile

Progressive Alignments

Alignments are frozen once they are made. Scores are then calculated between aligned positions tabulated in a frequency matrix, using a scoring table

Sij = 2 × G:G + 1 × A:G

     A    G    S    T
A    4    1    3   10
G         3    2    6
S              2   14
T                   8

Algorithm: Neighbor-joining

Input: N sequences and their relative distances, dij

Initialization:

Define a leaf of T for each sequence

Iteration:

Pick two nodes i, j such that dij – (ri + rj) is minimal.

Define a new node k and a new set of distances, {dkl}, between k and all current nodes.

Give node k the daughter nodes i and j, joined by edges of length eik = ½(dij + ri – rj) and ejk = dij – eik.

Add k to the set of current nodes and remove i and j.

Termination:

Unrooted: When only two nodes i, j, remain, add an edge of length dij between them.
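Neighbor-joining can be sketched in Python and checked on distances that come from a known tree. This is a minimal sketch, assuming r_i is the sum of distances from i divided by (number of current nodes - 2), new distances d_km = ½(d_im + d_jm - d_ij), and a final edge of length d_ij joining the last two nodes; the function name, the generated node names `node0`, `node1`, … and the edge-list output are illustrative.

```python
def neighbor_joining(dist):
    """dist maps leaf pairs (i, j) to d_ij. Returns a list of (node, node, length).

    Internal nodes are named node0, node1, ...; leaf labels should not
    collide with those names.
    """
    d = {frozenset(p): float(v) for p, v in dist.items()}
    nodes = sorted({x for p in dist for x in p})
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) / (n - 2)
             for i in nodes}
        pairs = [(a, b) for x, a in enumerate(nodes) for b in nodes[x + 1:]]
        i, j = min(pairs, key=lambda p: d[frozenset(p)] - r[p[0]] - r[p[1]])
        k = f"node{next_id}"
        next_id += 1
        d_ij = d[frozenset((i, j))]
        e_ik = 0.5 * (d_ij + r[i] - r[j])
        edges += [(k, i, e_ik), (k, j, d_ij - e_ik)]
        for m in nodes:
            if m not in (i, j):
                d[frozenset((k, m))] = 0.5 * (d[frozenset((i, m))]
                                              + d[frozenset((j, m))] - d_ij)
        nodes = [m for m in nodes if m not in (i, j)] + [k]
    i, j = nodes
    edges.append((i, j, d[frozenset((i, j))]))
    return edges


# Distances read off a tree with leaf edges A=1, B=2, C=3, D=4 and internal edge 5.
additive = {("A", "B"): 3, ("A", "C"): 9, ("A", "D"): 10,
            ("B", "C"): 10, ("B", "D"): 11, ("C", "D"): 7}
edges = neighbor_joining(additive)
```

Because the input distances are additive, the recovered edge lengths should match the tree they were generated from.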

Comparison: Neighbor-joining and UPGMA

Minimization:

– UPGMA uses dij

– Neighbor-joining uses dij – (ri + rj), where $r_i = \frac{1}{|L| - 2} \sum_{k \in L} d_{ik}$

Distance measures:

For distances between leaves i and j:

• dij is the same in both algorithms.

For distances between nodes:

• UPGMA uses $d_{ij} = \frac{1}{|C_i||C_j|} \sum_{p \in C_i, q \in C_j} d_{pq}$

• Neighbor-joining uses dkm = ½(dim + djm – dij), where i and j are the daughters of k.

Edge lengths:

UPGMA sets the height of node k to ½ the distance between daughters i, j (½dij).

Neighbor-joining sets the edge length between k and daughter i to ½(dij + ri – rj), and between k and daughter j to dij – dik.

Drill-down: MLE

Simplest case, site u = 3:

Daughter i is the leaf nccbabc, so P(Li|c) = 1.

Daughter j is the leaf ncbbcbc, so P(Lj|b) = 1.

For ancestral state a at node k: P(Lk|a) = P(c|a,ti) P(b|a,tj)

Drill-down: Enumerating topologies

Number of semi-labeled trees with n leaves:

$\frac{(2n-2)!}{2^{n-1}(n-1)!} = (2n-3)!!$
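The growth in the number of topologies is easy to tabulate. This is a minimal sketch, assuming the count in question is the number of rooted binary trees with n labeled leaves, (2n-3)!! = (2n-2)! / (2^(n-1) (n-1)!); the function names are illustrative.

```python
import math


def count_rooted_trees(n):
    """(2n-3)!!: product of the odd numbers 3, 5, ..., 2n-3 (1 for n = 2)."""
    count = 1
    for odd in range(3, 2 * n - 2, 2):
        count *= odd
    return count


def count_rooted_trees_closed_form(n):
    """The same count written with factorials."""
    return math.factorial(2 * n - 2) // (2 ** (n - 1) * math.factorial(n - 1))


table = {n: count_rooted_trees(n) for n in range(2, 10)}
```

Even for 9 leaves, the size of the Clarkia example, there are already more than two million candidate topologies, which is why exhaustive scoring gives way to sampling.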

Drill-down: Acceptance with Metropolis-Hastings

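A proposed tree τ* is accepted with probability:

$\min\left(1, \frac{P(\tau^*) Q(\tau^*, \tau)}{P(\tau) Q(\tau, \tau^*)}\right)$

However, by detailed balance you can step forward or backward with equal probability:

$Q(\tau, \tau^*) = Q(\tau^*, \tau)$

Hence our test becomes

$\min\left(1, \frac{P(\tau^*)}{P(\tau)}\right)$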
