Phylogenetics: What is a tree & how many are there? Principles of phylogenetic reconstruction. Special issues: Rooting a tree. The Molecular Clock. Almost Clocks.


Phylogenetics

What is a tree & how many are there?

Principles of phylogenetic reconstruction.

Special Issues

Rooting a tree

The Molecular Clock

Almost Clocks.

Trees – graphical & biological.

A graph is a set of vertices (nodes) {v1,..,vk} and a set of edges {e1=(vi1,vj1),..,en=(vin,vjn)}. Edges can be directed, in which case (vi,vj) is viewed as different from (vj,vi) (the opposite direction), or undirected.

Nodes can be labelled or unlabelled. In phylogenies the leaves are labelled and the rest unlabelled.

The degree of a node is the number of edges it is a part of. A leaf has degree 1.

A graph is connected if any two nodes have a path connecting them.

A tree is a connected graph without any cycles, i.e. there is exactly one path between any two nodes.

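[Figure: a graph on the nodes v1, v2, v3 and v4. An undirected edge is written (v1,v2); a directed edge is written (v2,v4) or (v4,v2), depending on its direction.]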

Trees & phylogenies.

A tree with k nodes has k-1 edges (easy to show by induction).

A root is a special node with degree 2 that is interpreted as the point furthest back in time. The leaves are interpreted as being contemporary.

A root introduces a time direction in a tree.

A rooted tree is said to be bifurcating if every node other than the leaves and the root has degree 3, corresponding to 1 ancestor and 2 children. An unrooted tree with this property is said to have valency 3.

Edges can be labelled with a positive real number interpreted as time duration or amount of evolution.

If the length of the path from the root to any leaf is the same, the tree obeys a molecular clock.

Tree Topology: Discrete structure – phylogeny without branch lengths.

[Figure: a rooted tree with its root, internal nodes and leaves labelled.]

Enumerating Trees: Unrooted & valency 3

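[Figure: the unrooted valency-3 trees on 3, 4 and 5 labelled leaves: one tree on leaves 1-3, three trees on leaves 1-4, and the five trees obtained by attaching leaf 5 to each edge of a four-leaf tree.]

Number of unrooted valency-3 trees with n labelled leaves:

$$T_n = \prod_{j=3}^{n-1} (2j-3) = \frac{(2n-5)!}{(n-3)!\,2^{n-3}}$$

n     4   5    6     7     8       9          10         15          20
T_n   3   15   105   945   10395   1.4×10^5   2.0×10^6   7.9×10^12   2.2×10^20

Recursion: T_n = (2n-5) T_{n-1}. Initialisation: T_1 = T_2 = T_3 = 1.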
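The recursion can be checked with a few lines of Python. This is a minimal sketch; the function name `unrooted_tree_count` is an illustrative choice, not something from the slides.

```python
def unrooted_tree_count(n: int) -> int:
    """Number of unrooted, valency-3 tree topologies on n labelled leaves."""
    if n < 1:
        raise ValueError("n must be at least 1")
    count = 1  # initialisation: T_1 = T_2 = T_3 = 1
    for j in range(4, n + 1):
        # A tree on j-1 leaves has 2(j-1)-3 = 2j-5 edges,
        # and leaf j can be attached to any one of them.
        count *= 2 * j - 5
    return count


for n in (4, 5, 6, 7, 8, 9, 10, 15, 20):
    print(n, unrooted_tree_count(n))
```

Python integers do not overflow, so the exact counts for n = 15 and n = 20 come out in full rather than in scientific notation.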

Local operations on trees.

Nearest Neighbor Interchange:

Subtree cut and regrafting – (subtree root kept)

Subtree cut and regrafting – (subtree root possibly new)

[Figure: the four subtrees A, B, C and D around an internal edge, before and after a rearrangement.]

Central Principles of Phylogeny Reconstruction

Parsimony

Distance

Likelihood

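[Figure: the four sequences TTCAGT, TCCAGT, GCCAAT and GCCAAT at the leaves s1-s4 of a tree, analysed under each of the three principles. Parsimony: numbers of substitutions on the branches, total weight 4. Distance: branch lengths on the tree. Likelihood: L = 3.1×10^-7, with parameter estimates.]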

Distance Concepts on Trees I

A: Metric, d( , ):
i: d(a,b) = 0 <=> a = b
ii: d(a,b) = d(b,a)
iii: d(a,b) <= d(a,c) + d(c,b)

[Figure: three points a, b and c illustrating the triangle inequality.]

Tree Metric (the distance function originates from a tree):

d(x,y) + d(z,w) = d(x,z) + d(y,w) > d(x,w) + d(y,z), where x,y,z,w is a permutation of a,b,c,d.

(> implies that no branch has length 0)
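The tree metric condition says that, of the three ways to pair up four leaves, the two largest sums of distances must be equal. A minimal sketch of that check follows; the function name, the tolerance and the example matrix are illustrative assumptions, and the test is the non-strict version, so it also accepts trees with a zero-length internal branch.

```python
from itertools import combinations


def satisfies_four_point(d, tol=1e-9):
    """Return True if the symmetric distance matrix d (list of lists)
    satisfies the four-point condition for every quadruple of leaves."""
    for a, b, c, e in combinations(range(len(d)), 4):
        sums = sorted([
            d[a][b] + d[c][e],
            d[a][c] + d[b][e],
            d[a][e] + d[b][c],
        ])
        # The two largest sums must coincide.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


# Hypothetical example: tree ((0,1),(2,3)) with leaf branches 1, 2, 3, 4
# and an internal branch of length 5.
tree_d = [
    [0, 3, 9, 10],
    [3, 0, 10, 11],
    [9, 10, 0, 7],
    [10, 11, 7, 0],
]
print(satisfies_four_point(tree_d))
```

Changing a single entry pair, for example setting d(0,3) = d(3,0) = 12, breaks the equality and the function returns False.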

Distance Concepts on Trees II

[Figure: a tree on the leaves s1-s4, and the part of it where s1, s2 and s3 meet at the internal node i.]

Reconstruction Principle: d(s1,i) = (d(s1,s2) + d(s1,s3) - d(s2,s3))/2

Ultrametric (the distance function originates from a tree obeying a molecular clock):

d(x,y) = d(x,z) > d(y,z), where x,y,z is a permutation of a,b,c.
(> implies that no branch has length 0)

Distance Concepts on Trees III

[Figure: a clock tree on s1, s2 and s3 in which s1 and s2 meet at the internal node i.]

Reconstruction Principle: d(s1,i) = d(s1,s2)/2

Unweighted Pair-Group Method with Arithmetic Mean

Input: Matrix with pairwise distances between sequences, D.

1: Find the smallest distance, di,j.

2: i and j are now siblings with a distance, di,j/2, to their MRCA (i,j).

3: Form a new distance matrix of dimension (n-1)*(n-1) in which i and j have been replaced by (i,j). The distances to (i,j) are dk,(i,j) = (ni*dk,i + nj*dk,j)/(ni + nj), where ni and nj are the numbers of leaves below i and j; when i and j are both single leaves this is (dk,i + dk,j)/2.

4: This is done n-1 times and the tree has been reconstructed.

Output: An ultrametric tree.

Comment: i. If UPGMA is given an ultrametric, it will reconstruct the same ultrametric.

UPGMA (Sokal and Michener, 1958)
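The four steps can be sketched in Python as follows. This is an illustrative implementation rather than the one behind the slides: the function name `upgma`, the nested-tuple tree representation and the example matrix are assumptions, and the cluster update weights each distance by the number of leaves in the merged clusters.

```python
def upgma(labels, dist):
    """labels: leaf names; dist: symmetric matrix as a list of lists.
    Returns (tree as nested tuples, list of (merged pair, node height))."""
    n = len(labels)
    clusters = {i: (labels[i], 1) for i in range(n)}  # id -> (subtree, leaf count)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    heights = []
    next_id = n
    while len(clusters) > 1:
        # Step 1: smallest distance among the active clusters.
        (i, j), dij = min(d.items(), key=lambda kv: kv[1])
        (ti, ni), (tj, nj) = clusters.pop(i), clusters.pop(j)
        # Step 2: i and j become siblings at height dij / 2.
        heights.append(((ti, tj), dij / 2))
        # Step 3: distances from every other cluster to the merged one.
        for k in clusters:
            dki = d.pop((min(k, i), max(k, i)))
            dkj = d.pop((min(k, j), max(k, j)))
            d[(k, next_id)] = (ni * dki + nj * dkj) / (ni + nj)
        del d[(i, j)]
        clusters[next_id] = ((ti, tj), ni + nj)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree, heights


# Hypothetical ultrametric input: (((A,B),C),D) with node heights 1, 3 and 5.
example = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
tree, heights = upgma(["A", "B", "C", "D"], example)
print(tree)
print([h for _, h in heights])
```

Because the example input is ultrametric, the node heights returned are exactly half of the pairwise distances across each node, in line with the comment on the slide.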

Assignment to internal nodes: The simple way.

[Figure: a tree with observed nucleotides at the leaves and unknown nucleotides (?) at the internal nodes.]

What is the cheapest assignment of nucleotides to internal nodes, given some (symmetric) distance function d(N1,N2)?

If there are k leaves, there are k-2 internal nodes and 4^(k-2) possible assignments of nucleotides. For k=22, this is more than 10^12.

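Cost of a history - minimizing over internal states

[Figure: a node with state G above a left and a right subtree, each child carrying a cost for A, C, G and T. One term is highlighted: d(C,G) + w_C(left subtree).]

$$w_G(\text{subtree}) = \min_{N \in \text{Nucleotides}} \{ d(G,N) + w_N(\text{left subtree}) \} + \min_{N \in \text{Nucleotides}} \{ d(G,N) + w_N(\text{right subtree}) \}$$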

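Cost of a history – leaves (initialisation).

[Figure: two leaves, G and A, each with a cost for A, C, G and T; the subtrees below a leaf are empty and have cost 0.]

Initialisation at the leaves: Cost(N) = 0 if N is the nucleotide at the leaf, otherwise infinity.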

Fitch-Hartigan-Sankoff Algorithm

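Costs: transition 2, transversion 5.

                 (A, C, G, T)
Root:            (9, 7, 7, 7)
Internal node:   (10, 2, 10, 2)
Leaf 1:          (*, 0, *, *)
Leaf 2:          (*, *, *, 0)
Leaf 3:          (*, *, 0, *)

The internal node joins leaves 1 and 2; the root joins the internal node and leaf 3. A * marks an infinite cost.

Each entry is the cost of the cheapest tree hanging from this node given that there is that nucleotide (e.g. a “C”) at this node.

[Figure: the four nucleotides A, C, G and T.]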
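The cost vectors in the example can be reproduced with a short recursive sketch. The tree encoding (a leaf is a one-letter string, an internal node is a pair of subtrees) and the function names are illustrative assumptions; the costs are the transition 2, transversion 5 scheme used on the slide.

```python
NUCS = "ACGT"
PURINES = {"A", "G"}


def d(x, y, transition=2, transversion=5):
    """Substitution cost: 0 if equal, transition within purines or
    within pyrimidines, transversion otherwise."""
    if x == y:
        return 0
    same_class = (x in PURINES) == (y in PURINES)
    return transition if same_class else transversion


def sankoff(tree):
    """Return {nucleotide: cost of the cheapest subtree given that
    nucleotide at the top node}."""
    if isinstance(tree, str):
        # Initialisation: cost 0 for the observed nucleotide, infinity otherwise.
        return {n: (0 if n == tree else float("inf")) for n in NUCS}
    left, right = (sankoff(child) for child in tree)
    return {
        g: min(d(g, n) + left[n] for n in NUCS)
        + min(d(g, n) + right[n] for n in NUCS)
        for g in NUCS
    }


internal = sankoff(("C", "T"))
root = sankoff((("C", "T"), "G"))
print([internal[n] for n in NUCS])
print([root[n] for n in NUCS])
```

The minimum of the root vector is the parsimony weight of the tree; an actual assignment of nucleotides to internal nodes would additionally need a backtracking pass, which this sketch leaves out.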

5S RNA Alignment & Phylogeny (Hein, 1990)

10 tatt-ctggtgtcccaggcgtagaggaaccacaccgatccatctcgaacttggtggtgaaactctgccgcggt--aaccaatact-cg-gg-gggggccct-gcggaaaaatagctcgatgccagga--ta
17 t--t-ctggtgtcccaggcgtagaggaaccacaccaatccatcccgaacttggtggtgaaactctgctgcggt--ga-cgatact-tg-gg-gggagcccg-atggaaaaatagctcgatgccagga--t-
 9 t--t-ctggtgtctcaggcgtggaggaaccacaccaatccatcccgaacttggtggtgaaactctattgcggt--ga-cgatactgta-gg-ggaagcccg-atggaaaaatagctcgacgccagga--t-
14 t----ctggtggccatggcgtagaggaaacaccccatcccataccgaactcggcagttaagctctgctgcgcc--ga-tggtact-tg-gg-gggagcccg-ctgggaaaataggacgctgccag-a--t-
 3 t----ctggtgatgatggcggaggggacacacccgttcccataccgaacacggccgttaagccctccagcgcc--aa-tggtact-tgctc-cgcagggag-ccgggagagtaggacgtcgccag-g--c-
11 t----ctggtggcgatggcgaagaggacacacccgttcccataccgaacacggcagttaagctctccagcgcc--ga-tggtact-tg-gg-ggcagtccg-ctgggagagtaggacgctgccag-g--c-
 4 t----ctggtggcgatagcgagaaggtcacacccgttcccataccgaacacggaagttaagcttctcagcgcc--ga-tggtagt-ta-gg-ggctgtccc-ctgtgagagtaggacgctgccag-g--c-
15 g----cctgcggccatagcaccgtgaaagcaccccatcccat-ccgaactcggcagttaagcacggttgcgcccaga-tagtact-tg-ggtgggagaccgcctgggaaacctggatgctgcaag-c--t-
 8 g----cctacggccatcccaccctggtaacgcccgatctcgt-ctgatctcggaagctaagcagggtcgggcctggt-tagtact-tg-gatgggagacctcctgggaataccgggtgctgtagg-ct-t-
12 g----cctacggccataccaccctgaaagcaccccatcccgt-ccgatctgggaagttaagcagggttgagcccagt-tagtact-tg-gatgggagaccgcctgggaatcctgggtgctgtagg-c--t-
 7 g----cttacgaccatatcacgttgaatgcacgccatcccgt-ccgatctggcaagttaagcaacgttgagtccagt-tagtact-tg-gatcggagacggcctgggaatcctggatgttgtaag-c--t-
16 g----cctacggccatagcaccctgaaagcaccccatcccgt-ccgatctgggaagttaagcagggttgcgcccagt-tagtact-tg-ggtgggagaccgcctgggaatcctgggtgctgtagg-c--t-
 1 a----tccacggccataggactctgaaagcactgcatcccgt-ccgatctgcaaagttaaccagagtaccgcccagt-tagtacc-ac-ggtgggggaccacgcgggaatcctgggtgctgt-gg-t--t-
18 a----tccacggccataggactctgaaagcaccgcatcccgt-ccgatctgcgaagttaaacagagtaccgcccagt-tagtacc-ac-ggtgggggaccacatgggaatcctgggtgctgt-gg-t--t-
 2 a----tccacggccataggactgtgaaagcaccgcatcccgt-ctgatctgcgcagttaaacacagtgccgcctagt-tagtacc-at-ggtgggggaccacatgggaatcctgggtgctgt-gg-t--t-
 5 g---tggtgcggtcataccagcgctaatgcaccggatcccat-cagaactccgcagttaagcgcgcttgggccagaa-cagtact-gg-gatgggtgacctcccgggaagtcctggtgccgcacc-c--c-
13 g----ggtgcggtcataccagcgttaatgcaccggatcccat-cagaactccgcagttaagcgcgcttgggccagcc-tagtact-ag-gatgggtgacctcctgggaagtcctgatgctgcacc-c--t-
 6 g----ggtgcgatcataccagcgttaatgcaccggatcccat-cagaactccgcagttaagcgcgcttgggttggag-tagtact-ag-gatgggtgacctcctgggaagtcctaatattgcacc-c-tt-

[Figure: phylogeny of the 18 5S RNA sequences, with the groups Fungi, Animals, Mitochondria, Plants and Prokaryotes marked.]

Transitions 2, transversions 5.

Total weight 843.

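The Felsenstein Zone (Felsenstein, Cavender, 1979)

[Figure: a four-leaf tree on s1-s4.]

Patterns (16, only 8 shown). Each column is one pattern, with one row per sequence:

0 1 0 0 0 0 0 0
0 0 1 0 0 1 0 1
0 0 0 1 0 1 1 0
0 0 0 0 1 0 1 1

[Figure: the True Tree and the Reconstructed Tree on s1-s4.]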

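Bootstrapping (Felsenstein, 1985)

[Figure: an alignment of four sequences with 11 columns (each row is shown as ATCTGTAGTCT). The column counts 1 0 2 3 0 1 0 1 2 0 1 give the number of times each column is drawn in one bootstrap sample; the resampled alignment is then used to reconstruct a tree on the leaves 1-4.]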
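Drawing one bootstrap sample amounts to sampling alignment columns with replacement, as many times as there are columns. The sketch below shows only this resampling step; the function name and the seeded generator are illustrative assumptions, and the tree reconstruction that would follow each sample is left out.

```python
import random


def bootstrap_columns(alignment, rng):
    """alignment: list of equal-length strings (one per sequence).
    Returns (counts, resampled): counts[c] is how often column c was
    drawn, and resampled is the new alignment with the same dimensions."""
    length = len(alignment[0])
    picks = [rng.randrange(length) for _ in range(length)]
    counts = [picks.count(c) for c in range(length)]
    resampled = ["".join(row[c] for c in picks) for row in alignment]
    return counts, resampled


alignment = ["ATCTGTAGTCT"] * 4  # the four rows as they appear in the figure
counts, resampled = bootstrap_columns(alignment, random.Random(1))
print(counts, sum(counts))
print(resampled[0])
```

The same column picks are applied to every row, so each resampled column is still a valid site pattern across the four sequences.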

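Probability of a pattern - summing over internal states

[Figure: a tree with observed nucleotides at the leaves and unknown states (?) at the internal nodes; each internal node can be A, C, G or T.]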

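Probability of leaf observations - summing over internal states

[Figure: a node with state G above a left and a right subtree, each child carrying a probability for A, C, G and T. One term is highlighted: P(G→C) * P_C(left subtree).]

$$P_G(\text{subtree}) = \Big[ \sum_{N \in \text{Nucleotides}} P(G \to N)\, P_N(\text{left subtree}) \Big] \Big[ \sum_{N \in \text{Nucleotides}} P(G \to N)\, P_N(\text{right subtree}) \Big]$$

Initialisation: $P_G(\text{leaf}) = 1$ if the nucleotide at the leaf is G, and 0 otherwise.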
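The recursion has the same shape as the parsimony one, with sums and products in place of minima and sums. A minimal sketch follows; the Jukes-Cantor substitution model, the uniform root distribution, the tree encoding and all names are assumptions made for illustration, since the slide does not say which model is used.

```python
import math
from itertools import product

NUCS = "ACGT"


def jc_prob(x, y, t):
    """Jukes-Cantor probability of seeing y after time t given x,
    with t in expected substitutions per site (assumed model)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def partial(tree):
    """tree: a leaf is a nucleotide string; an internal node is
    ((left, t_left), (right, t_right)). Returns {G: P_G(subtree)}."""
    if isinstance(tree, str):
        return {n: 1.0 if n == tree else 0.0 for n in NUCS}
    (left, tl), (right, tr) = tree
    pl, pr = partial(left), partial(right)
    return {
        g: sum(jc_prob(g, n, tl) * pl[n] for n in NUCS)
        * sum(jc_prob(g, n, tr) * pr[n] for n in NUCS)
        for g in NUCS
    }


def site_likelihood(tree):
    """Sum over root states, weighting each by 1/4."""
    root = partial(tree)
    return sum(0.25 * root[g] for g in NUCS)


def three_leaf_tree(x, y, z):
    # Hypothetical branch lengths, chosen only for the example.
    return (((((x, 0.1), (y, 0.2))), 0.3), (z, 0.4))


total = sum(site_likelihood(three_leaf_tree(x, y, z))
            for x, y, z in product(NUCS, repeat=3))
print(site_likelihood(three_leaf_tree("C", "T", "G")), total)
```

Summing the likelihood over all 64 possible leaf patterns gives 1 up to rounding, which is a convenient sanity check on any implementation of the recursion.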

[Figure: two maximum likelihood trees for s1-s5, one estimated with a clock and one without, annotated with branch lengths and standard deviations (sd).]

Likelihood with clock: 7.9×10^-14 (parameter estimates 0.31 (.1), 0.18 (.1)).
Likelihood without clock: 6.2×10^-12 (parameter estimates 0.34 (.1), 0.16 (.1)).

2[ln(6.2×10^-12) - ln(7.9×10^-14)] is χ²-distributed with (n-2) degrees of freedom.

Output from Likelihood Method
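As a worked example of the clock test, the two likelihoods on the slide can be plugged into the likelihood ratio statistic. This sketch assumes n = 5 leaves (s1 to s5 in the figure) and uses the standard table value 7.815 for the 95% quantile of a χ² distribution with 3 degrees of freedom; the variable names are illustrative.

```python
import math

lik_clock = 7.9e-14      # likelihood with the clock (from the slide)
lik_no_clock = 6.2e-12   # likelihood without the clock (from the slide)
n_leaves = 5             # assumption: the five sequences s1..s5

# Twice the log-likelihood difference between the two nested models.
statistic = 2 * (math.log(lik_no_clock) - math.log(lik_clock))
dof = n_leaves - 2
critical_5pct = 7.815    # chi-square 95% quantile for 3 degrees of freedom

print(round(statistic, 2), dof, statistic > critical_5pct)
```

Under these assumptions the statistic is above the 5% critical value, so the clock would be rejected at that level for this data set.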

First noted by Zuckerkandl & Pauling (1964) as an empirical fact.

How can one detect it?

[Figure: two cases. Known ancestor time: s1 and s2 descend from an ancestor a at time T. Unknown ancestor time: s1 and s2 share an ancestor of unknown age (?), and s3 joins them further back.]

The Molecular Clock

3 billion years ago: no reliable clock, no outgroup.

Given 2 sets of homologous proteins, e.g. MDH & LDH, can the archaea, prokaryotes and eukaryotes be rooted?

[Figure: unrooted LDH and MDH trees on A, P and E, and below them a rooted tree on P, A and E for each protein.]

Rooting the 3 kingdoms

Purpose:
1) To give a time direction in the phylogeny & identify the most ancient point.
2) To be able to define concepts such as a monophyletic group.

Methods:
1) Outgroup: Enhance the data set with a sequence from a species definitely distant to all of them. It will be joined at the root of the original data set.

2) Midpoint: Find midpoint of longest path in tree.

3) Assume Molecular Clock.

Rootings

(Illustration of Langley-Fitch)

[Figure: an unrooted tree on s1, s2 and s3 with branch lengths l1, l2 and l3, next to the rooted clock tree on s1, s2 and s3. The clock imposes {l1 = l2 < l3}.]

Given root: (2k-3)-(k-1) = (k-2) degrees of freedom lost in imposing a clock.

Assumptions
1. Ancestral sequences are observable.
2. The number of events on a branch is Poisson distributed with a mean proportional to the branch length. The same proportionality constant applies to all branches.
3. The observed number of differences between sequences at two neighboring nodes is the actual number of events.

[Figure: two sets of sequences on the same tree. Sequences 1 (s1, s2, s3) have branch lengths l1, l2, l3; sequences 2 (s1', s2', s3') have branch lengths c*l1, c*l2, c*l3.]

k sequences, s species: s(2k-3), s(k-1), (2k-3)+s, s+(k-1)

The generation/year-time clock

I Smoothing a non-clock tree onto a clock tree (Sanderson).

II Rate of Evolution of the Rate of Evolution (Thorne et al.). The rate of evolution can change at each bifurcation.

III Relaxed Molecular Clock (Huelsenbeck et al.). At random points in time, the rate changes by multiplication with a (gamma distributed) random variable.

Almost Clocks

M.J. Sanderson (1997): “A Nonparametric Approach to Estimating Divergence Times in the Absence of Rate Constancy.” Mol.Biol.Evol. 14(12):1218-31.
J.L. Thorne et al. (1998): “Estimating the Rate of Evolution of the Rate of Evolution.” Mol.Biol.Evol. 15(12):1647-57.
J.P. Huelsenbeck et al. (2000): “A Compound Poisson Process for Relaxing the Molecular Clock.” Genetics 154:1879-92.

Non-contemporaneous leaves.

A. Rambaut (2000): “Estimating the rate of molecular evolution: incorporating non-contemporaneous sequences into maximum likelihood phylogenies.” Bioinformatics 16(4):395-399.

In the presence of recombination and gene conversion, the relationship among sequences might not be describable by a phylogeny!

Recombination and the Molecular Clock I

Common Practice:
I Finding “the phylogeny” anyway.
II Testing for the molecular clock.

What are the consequences of this practice?
I Simulate data with a model including recombination.
II Reconstruct the phylogeny.
III Test for a clock.

Recombination and the Molecular Clock II

Schierup & Hein (2000): “Recombination and the Molecular Clock.” Mol.Biol.Evol. 17(10):1578-79.
Schierup & Hein (2000): “Consequences of Recombination on Traditional Phylogenetic Analysis.” Genetics 156:879-91.

History of Phylogenetic Methods

1958 Sokal and Michener publish the UPGMA method for making distance trees with a clock.

1964 Parsimony principle defined, but not advocated, by Edwards and Cavalli-Sforza.

1962-65 Zuckerkandl and Pauling introduce the notion of a Molecular Clock.

1967 First large molecular phylogenies by Fitch and Margoliash.

1969 Heuristic method used by Dayhoff to make trees and reconstruct ancestral sequences.

1970 Neyman analyzes a three sequence stochastic model with Jukes-Cantor substitution.

1971-73 Fitch, Hartigan & Sankoff independently come up with the same algorithm for reconstructing parsimony ancestral sequences.

1973 Sankoff treats alignment and phylogenies as one general problem – phylogenetic alignment.

1979 Cavender and Felsenstein independently come up with the same evolutionary model where parsimony is inconsistent. Later called the “Felsenstein Zone”.

1981 Felsenstein: Maximum Likelihood Model & Program DNAML (in the PHYLIP program package).

1981 Parsimony tree problem is shown to be NP-Complete.

1985 Felsenstein introduces bootstrapping as confidence interval on phylogenies.

1986 Bandelt and Dress introduce split decomposition as a generalization of trees.

1985- Many authors (Sawyer, Hein, Stephens, M.Smith) try to address the problem of recombinations in phylogenies.

1997-9 Thorne et al., Sanderson & Huelsenbeck introduce the Almost Clock.

2000 Rambaut (and others) makes methods that can find trees with non-contemporaneous leaves.

2001- Major rise in the interest in phylogenetic statistical alignment.

Books:
Molecular Systematics (1996) (eds. Hillis and Craig)
New Uses for Phylogenies (1996) (eds. P. Harvey)
W. Maddison and D. Maddison: MacClade
Semple & Steel (2003): Phylogenetics. OUP

Journals:
Molecular Biology and Evolution
J. Molecular Evolution
Molecular Phylogenetics
Systematic Biology
J. of Classification

www-pages:
PAUP – probably the best package for phylogenetic analysis available. David Swofford. http://www.lms.si.edu/PAUP/about.html

MacClade – W. & D. Maddison http://phylogeny.arizona.edu/macclade/macclade.html

PHYLIP – J. Felsenstein. http://depts.washington.edu/genetics/faculty/felsenstein.html

PAML – Z. Yang http://abacus.gene.ucl.ac.uk/

Phylogeny: literature, www and packages.

1: Error function: Σ wi,j * (di,j - pi,j)^a, summed over all pairs of leaves i,j.

2: Minimisation has two parts: topology & branch lengths. Try all topologies and solve the branch length problem for each.

3: A(i,j),k is an (n*(n-1)/2)*(2n-3) matrix with 1 if k is an edge on the path from i to j, 0 otherwise.

4: The path length between i & j, pi,j, in the given topology is given by: pi,j = Σk A(i,j),k*sk, where sk is the length of edge k.

5: If wi,j = 1 and a = 2, this can be solved by linear algebra: minimise Σ (di,j - Σk A(i,j),k*sk)^2.

Global Fit Methods

Input: Distance matrix D.

1: For each leaf the summed distance to the others is calculated: ri = di,1 + di,2 + ... + di,n.

2: The rate corrected distance matrix, M, is constructed: mi,j = di,j - (ri + rj)/(n-2). Only the minimal mi,j is needed.

3: Make an ancestral node, u, of the i & j giving the minimal mi,j. New branch lengths are defined by si,u = di,j/2 + (ri - rj)/[2*(n-2)] and sj,u = di,j - si,u.

4: The distances from u to the others are set to dk,u = (di,k + dj,k - di,j)/2.

Do this n-2 times.

Alternative characterisation of the method: Start with the best least-squares fit of a tree with k internal nodes (k<n), and add the internal branch that gives the largest improvement in the least-squares fit (now k+1 nodes). This continues until the whole tree is built (k-1 internal nodes have been added).

Neighbor Joining (Saitou and Nei, 1987)
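One round of the method (steps 1 to 4) can be sketched as below. The sketch takes r_i to be the sum of row i of the distance matrix, which is the convention under which the (n-2) denominators in steps 2 and 3 give the usual neighbor joining criterion; the function name and the example matrix are illustrative assumptions, and at least 3 leaves are required.

```python
def nj_step(dist):
    """One neighbor joining round on a symmetric matrix (list of lists).
    Returns ((i, j), (s_iu, s_ju), {k: d_ku}) for the joined pair."""
    n = len(dist)
    r = [sum(row) for row in dist]  # step 1: summed distances

    def m(pair):
        # Step 2: rate corrected distance for one pair.
        i, j = pair
        return dist[i][j] - (r[i] + r[j]) / (n - 2)

    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    i, j = min(pairs, key=m)
    # Step 3: branch lengths from i and j to their new ancestor u.
    s_iu = dist[i][j] / 2 + (r[i] - r[j]) / (2 * (n - 2))
    s_ju = dist[i][j] - s_iu
    # Step 4: distances from u to every remaining leaf.
    d_u = {k: (dist[i][k] + dist[j][k] - dist[i][j]) / 2
           for k in range(n) if k not in (i, j)}
    return (i, j), (s_iu, s_ju), d_u


# Hypothetical additive input: tree ((0,1),(2,3)) with leaf branches
# 1, 2, 3, 4 and an internal branch of length 5.
example_d = [
    [0, 3, 9, 10],
    [3, 0, 10, 11],
    [9, 10, 0, 7],
    [10, 11, 7, 0],
]
print(nj_step(example_d))
```

On this additive matrix the step joins leaves 0 and 1 and recovers their branch lengths 1 and 2 exactly, even though the tree does not obey a molecular clock.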

Ø = a low overestimate (upper bound) of the weight of the tree, possibly the weight of a well-guessed tree.

W(n) = the weight of the tree at node n.
R(n) = a high underestimate (lower bound) of the weight increase from adding the remaining sequences.
Condition for bounding: W(n) + R(n) >= Ø

[Figure: a search tree with the values 97, 7 and 102.]

How is R(n) computed?

A T C G A C G G T C G G *

Branch and Bound Algorithm

I. Bootstrapping columns in the alignment.

Example: Human, Chimp, Gorilla & Orangutan with root.

position  1 2 3 4 5 6 7 8 9 12.586
H         T C T G A C G T T T G A ... C
C         T C T G A C G G T T G A ... C
G         T C T G A C G G T T G A ... C
O         T C A G A C G G T C G A ... C
root      T C A G A C G T A A G A ... C

15 possible trees, only 3 of relevance:

                                 (((H,C),G),O)   (((H,G),C),O)   (((C,G),H),O)
I. Bootstrap probabilities:      0.80            0.09            0.11
II. Differences in likelihood:   0.0             -16.63          -15.12
    s.d.:                                        14.22           13.95

Tree topology comparison.