Selecting Methods and Models for Phylogenetic Inference & Assessing the Reliability of Inferred Trees Jim Wilgenbusch School of Computational Science and

Selecting Methods and Models for Phylogenetic Inference & Assessing the Reliability of Inferred Trees

Jim Wilgenbusch
School of Computational Science and Information Technology

Overview

Part 1: Distance Methods (picking up from last week)
• Clustering Algorithms
• As an Optimality Criterion

Part 2: Method and Model Selection
• Methods: Assumptions, Efficiency, Power, Robustness, Falsifiability, and Accuracy
• Models: Likelihood Ratio Test, Parametric Bootstrap, AIC

Part 3: Assessing Reliability of Trees
• Bootstrap/Jackknife, KH-test, Parametric Bootstrap, Bayesian Posterior Probability of Tree Distribution

Part 4: Practical Component
• Align Sequences
• Infer Tree(s)

Resources

This lecture closely follows:
Page, R. D. and Holmes, E. C. 1998. Molecular Evolution: A Phylogenetic Approach. Blackwell Science, Oxford.

For more details see:
Li, W. 1997. Molecular Evolution. Sinauer Associates, Sunderland, Massachusetts.

Swofford, D. L., G. J. Olsen, P. J. Waddell, and D. M. Hillis. 1996. Phylogenetic inference. Pages 407-514 in D. M. Hillis, C. Moritz, and B. Mable (eds.) Molecular Systematics (2nd ed.), Sinauer Associates, Sunderland, Massachusetts.

Sequences to Distances

Lemur  AAGCTTCATAG…TTACATCATCCA
Homo   AAGCTTCACCG…TTACATCCTCAT
Pan    AAGCTTCACCG…TTACATCCTCAT
Goril  AAGCTTCACCG…CCCACGGACTTA
Pongo  AAGCTTCACCG…GCAACCACCCTC
Hylo   AAGCTTTACAG…TGCAACCGTCCT
Maca   AAGCTTTTCCG…CGCAACCATCCT

Uncorrected ("p") distance matrix

                    1        2        3        4        5        6        7
1 Lemur catta       -
2 Homo sapiens      0.30794  -
3 Pan               0.30906  0.08929  -
4 Gorilla           0.29341  0.10379  0.10603  -
5 Pongo             0.29265  0.16087  0.17093  0.16645  -
6 Hylobates         0.29559  0.18192  0.18862  0.18862  0.18768  -
7 Macaca fuscata    0.28215  0.23214  0.24330  0.23661  0.24467  0.24665  -
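As a rough illustration of how an uncorrected ("p") distance is obtained, the sketch below counts the proportion of aligned sites at which two sequences differ. The function name `p_distance`, the treatment of gaps, and the short toy sequences are assumptions made for this example; they are not the alignment behind the matrix above.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of comparable aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: columns with a gap or missing data in either sequence are skipped.
    pairs = [(x, y) for x, y in zip(seq_a.upper(), seq_b.upper())
             if x not in "-?N" and y not in "-?N"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


# Hypothetical toy alignment (12 sites), for illustration only.
toy = {
    "Lemur": "AAGCTTCATAGG",
    "Homo":  "AAGCTTCACCGG",
    "Pan":   "AAGCTTCACCGA",
}

names = list(toy)
for i, a in enumerate(names):
    for b in names[:i]:
        print(f"{a:6s} {b:6s} {p_distance(toy[a], toy[b]):.5f}")
```

Because p-distances ignore multiple substitutions at the same site, they underestimate divergence for distant sequences, which is why corrected distances are usually preferred.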

Distance Methods

• Clustering Algorithms
  – Unweighted Pair-Group Method with Arithmetic Means (UPGMA)
    • Ultrametric tree
  – Neighbor Joining
    • Additive trees
• Distance as an Optimality Criterion
  – Fitch-Margoliash and Related Methods
  – Minimum Evolution

Metric Distances

Metric distances:
• d(a,b) ≥ 0 (non-negativity)
• d(a,b) = d(b,a) (symmetry)
• d(a,c) ≤ d(a,b) + d(b,c) (triangle inequality)
• d(a,b) = 0 if and only if a = b (distinctness)

Ultrametric distances:
• d(a,b) ≤ max[d(a,c), d(b,c)] (isosceles triangle: the two largest of the three distances are equal)

Additive distances:
• d(a,b) + d(c,d) ≤ max[d(a,c) + d(b,d), d(a,d) + d(b,c)] (four-point condition)

[Figure: an additive tree and an ultrametric tree for taxa a, b, and c, with branch lengths labeled.]
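The ultrametric and four-point conditions can be checked directly on a distance matrix. The sketch below is one way to do so; the helper names (`symmetric`, `is_ultrametric`, `is_additive`) and the second example matrix are assumptions for illustration. The first matrix is the four-taxon UPGMA example from these slides.

```python
from itertools import combinations, permutations


def symmetric(pairs):
    """Expand {(x, y): distance} into a full symmetric lookup d[x][y]."""
    d = {}
    for (x, y), value in pairs.items():
        d.setdefault(x, {x: 0.0})[y] = value
        d.setdefault(y, {y: 0.0})[x] = value
    return d


def is_ultrametric(d, tol=1e-9):
    """Three-point condition: d(a,b) <= max(d(a,c), d(b,c)) for every triple."""
    return all(d[a][b] <= max(d[a][c], d[b][c]) + tol
               for a, b, c in permutations(d, 3))


def is_additive(d, tol=1e-9):
    """Four-point condition: the two largest of the three pair sums are equal."""
    for a, b, c, e in combinations(d, 4):
        sums = sorted([d[a][b] + d[c][e], d[a][c] + d[b][e], d[a][e] + d[b][c]])
        if sums[2] - sums[1] > tol:
            return False
    return True


# Clock-like distances (the UPGMA example matrix).
clock = symmetric({("a", "b"): 2, ("a", "c"): 6, ("b", "c"): 6,
                   ("a", "d"): 10, ("b", "d"): 10, ("c", "d"): 10})
# Hypothetical tree-like distances with unequal rates: additive, not ultrametric.
no_clock = symmetric({("a", "b"): 4, ("a", "c"): 4, ("a", "d"): 4,
                      ("b", "c"): 6, ("b", "d"): 6, ("c", "d"): 4})

print(is_ultrametric(clock), is_additive(clock))
print(is_ultrametric(no_clock), is_additive(no_clock))
```

Every ultrametric matrix is also additive, but not the reverse, which is why methods that assume a clock are the more restrictive ones.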

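UPGMA: Ultrametric Tree

Step 1: a and b are the closest pair (distance 2), so they are joined by branches of length 1 each.

      a    b    c    d
a     -
b     2    -
c     6    6    -
d    10   10   10    -

Step 2: cluster a-b is closest to c (distance 6), so c joins at height 3 (branch of length 3 to c, length 2 from the a-b node).

      a-b   c    d
a-b   -
c     6     -
d    10    10    -

Step 3: cluster a-b-c joins d (distance 10) at height 5 (branch of length 5 to d, length 2 from the a-b-c node).

        a-b-c   d
a-b-c   -
d       10      -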
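The clustering steps in this example can be reproduced with a small UPGMA routine. This is a minimal sketch under stated assumptions: the function name `upgma`, the tuple-based cluster representation, and the returned list of merges are illustrative choices, not the lecture's own code.

```python
def upgma(names, dist):
    """Return the merges as (cluster_x, cluster_y, node_height) in the order made."""
    # Each cluster is a tuple of member names; d maps a pair of clusters to a distance.
    clusters = [(n,) for n in names]
    d = {frozenset(((a,), (b,))): v for (a, b), v in dist.items()}
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of clusters.
        x, y = min(((p, q) for i, p in enumerate(clusters) for q in clusters[i + 1:]),
                   key=lambda pq: d[frozenset(pq)])
        height = d[frozenset((x, y))] / 2  # ultrametric: node sits halfway
        new = x + y
        rest = [c for c in clusters if c not in (x, y)]
        for c in rest:
            # Arithmetic mean over all members: size-weighted average of the two parts.
            d[frozenset((new, c))] = (len(x) * d[frozenset((x, c))]
                                      + len(y) * d[frozenset((y, c))]) / len(new)
        clusters = rest + [new]
        merges.append((x, y, height))
    return merges


# The four-taxon matrix from the slide.
slide_dist = {("a", "b"): 2, ("a", "c"): 6, ("b", "c"): 6,
              ("a", "d"): 10, ("b", "d"): 10, ("c", "d"): 10}

for x, y, h in upgma(["a", "b", "c", "d"], slide_dist):
    print("join", "-".join(x), "with", "-".join(y), "at height", h)
```

The node heights come out as 1, 3, and 5, matching the branch lengths drawn on the slide's tree.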

Neighbor Joining Method: Additive Tree

• Similar to other clustering methods except that the ultrametric assumption is removed.
  – In practical terms this means that the tree is no longer constrained to be clock-like.
• NJ sequentially finds pairs of neighbors that may minimize the total length of the tree.
• The separation between each pair of nodes is adjusted on the basis of their average divergence from all other nodes.
• The NJ algorithm is a special case of the star decomposition method.
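The "adjusted separation" used by NJ is commonly written as Q(i,j) = (n − 2)·d(i,j) − r_i − r_j, where r_i is the sum of distances from i to all other nodes; the pair with the smallest Q is joined. The sketch below shows only this selection step. The function name `nj_pair` and the example distances (a hypothetical four-taxon tree with two long branches, b and d) are assumptions for illustration.

```python
from itertools import combinations


def nj_pair(taxa, d):
    """Return the pair minimizing the NJ criterion Q, plus the full Q table."""
    n = len(taxa)
    # r[i]: total divergence of i from all other nodes.
    r = {i: sum(d[frozenset((i, k))] for k in taxa if k != i) for i in taxa}
    q = {(i, j): (n - 2) * d[frozenset((i, j))] - r[i] - r[j]
         for i, j in combinations(taxa, 2)}
    return min(q, key=q.get), q


# Hypothetical additive distances from an unrooted tree ((a:1, b:4):1, (c:1, d:4)).
pairs = {("a", "b"): 5, ("a", "c"): 3, ("a", "d"): 6,
         ("b", "c"): 6, ("b", "d"): 9, ("c", "d"): 5}
d = {frozenset(k): v for k, v in pairs.items()}

best, q = nj_pair(["a", "b", "c", "d"], d)
print("smallest raw distance:", min(pairs, key=pairs.get))  # a and c, not true neighbors
print("NJ joins:", best)
```

In this example the smallest raw distance is between a and c, which are not neighbors on the generating tree, whereas the Q criterion picks a true neighbor pair. This is the practical consequence of dropping the clock assumption.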

Star Decomposition

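[Figure: star decomposition of five taxa (A, B, C, D, E), starting from an unresolved star tree and resolving it step by step into a bifurcating tree.]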

Advantages to Clustering methods

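Number of possible unrooted bifurcating trees as a function of the number of sequences:

Sequences   Number of Trees
 3          1
 4          3
 5          15
 6          105
 7          945
 8          10,395
 9          135,135
10          2,027,025
11          34,459,425
12          654,729,075
13          13,749,310,575
14          316,234,143,225
15          7,905,853,580,625

[Figure: the three possible unrooted trees for four taxa (sp1, sp2, sp3, sp4).]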
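The counts in this table follow the standard formula for unrooted bifurcating trees, (2n − 5)!! = 3 × 5 × … × (2n − 5). A short sketch (the function name `num_unrooted_trees` is an assumption for this example):

```python
def num_unrooted_trees(n_taxa: int) -> int:
    """Number of unrooted bifurcating trees for n_taxa >= 3: (2n - 5)!!"""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


for n in range(3, 16):
    print(f"{n:2d}  {num_unrooted_trees(n):,}")
```

The rapid growth is the reason exhaustive evaluation under an optimality criterion becomes infeasible, while clustering algorithms build a single tree directly.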

Distance as an Optimality Criterion

• Real data are rarely metric
• Fitch and Margoliash (1967): attempts to minimize the difference between the observed pairwise distances and the path-length distances between two taxa on a tree topology

$$E = \sum_{i=1}^{T-1} \sum_{j=i+1}^{T} w_{ij} \, |d_{ij} - p_{ij}|^{\alpha}$$

  – T is the number of taxa.
  – d_ij is the observed pairwise distance between taxa i and j.
  – p_ij is the path-length distance between taxa i and j on the tree.
  – w_ij is a weighting factor that in some cases might be used to assign less weight to those distances likely to be more variable.
  – α is often set to 2, so that E becomes the least-squares fit criterion.
• Minimum Evolution (Kidd and Sgaramella-Zonta, 1971)
  – E from FM is used to fit branch lengths
  – the best tree is the one with the smallest sum of branch lengths
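The fit criterion E can be evaluated for any candidate tree once its path-length distances are known. The sketch below assumes the path lengths have already been computed; the function name `fm_error` and the numbers are hypothetical. The 1/d² weighting shown as an option is the one usually associated with Fitch and Margoliash; unweighted least squares uses w = 1.

```python
def fm_error(observed, path, weights=None, alpha=2):
    """E = sum over pairs of w_ij * |d_ij - p_ij| ** alpha."""
    total = 0.0
    for pair, d_ij in observed.items():
        w_ij = 1.0 if weights is None else weights[pair]
        total += w_ij * abs(d_ij - path[pair]) ** alpha
    return total


# Hypothetical observed distances and path lengths on one candidate tree.
observed = {("a", "b"): 2.0, ("a", "c"): 6.0, ("b", "c"): 7.0}
path = {("a", "b"): 2.0, ("a", "c"): 6.5, ("b", "c"): 6.5}

unweighted = fm_error(observed, path)
# Down-weight large (more variable) distances with w_ij = 1 / d_ij**2.
fm_weights = {pair: 1.0 / d ** 2 for pair, d in observed.items()}
weighted = fm_error(observed, path, weights=fm_weights)
print(unweighted, weighted)
```

In a real analysis the branch lengths (and hence the path lengths) are themselves chosen to minimize E on each topology before topologies are compared.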

Part II: A. Choosing Among General Methods

Parsimony, Likelihood, or Distance?

Choosing Among General Methods (from Page and Holmes, 1998)

• Efficiency
  – How fast is a method?
• Power
  – How much data does the method need to produce a reasonable result?
• Robustness
  – Will minor violations of the method’s assumptions result in poor estimates of phylogeny?
• Falsifiability
  – Will the method tell us when its assumptions are violated (i.e., that we should not be using the method at all)?

Testing the Accuracy of Methods

• Known phylogenies
  – Fitch and Atchley (1985); Atchley and Fitch (1991)
    • used genetic data from inbred strains of laboratory mice to test phylogenetic methods
  – Hillis et al. (1992, 1994)
    • used virus sequences (bacteriophage T7), which have a fast rate of mutation, and sped things up further by introducing mutagens
• Simulate data
  – Select a known tree (topology and branch lengths)
  – “Mutate” data on the tree according to the model so that the number of changes on a branch is proportional to the defined branch length

Testing Accuracy of Methods: Simulate Data on Four-taxa Trees (Parametric Bootstrap)

[Figure: data matrices for seqA, seqB, seqC, and seqD are generated under a model of sequence evolution on a known four-taxon tree; each matrix is analyzed with parsimony, distance, and likelihood, each of which may return any of the three possible four-taxon topologies; the procedure is repeated many times.]

The Methods

• Distance
  – Pick the tree or set of trees that minimizes the difference between observed pairwise distances and path-length distances calculated on a tree (i.e., the sum of branch lengths between a pair of external nodes).
• Maximum Parsimony
  – Pick the tree or set of trees that minimizes the total amount of implied evolutionary change (often counted as steps).
• Maximum Likelihood
  – Pick the tree or set of trees that maximizes the probability of having observed the data.

Accuracy of Methods: Simulated Data on Four-taxa Tree

[Figure: Fig. 3 from Huelsenbeck and Hillis, 1996.]

[Figure: performance of methods over parameter space (from Holder, 2001).]

Distance Methods

• Pros
  – Fast
    • algorithmic methods (UPGMA, NJ) are very fast
    • but NP-complete when distance is used as an optimality criterion (not enough time in the universe to solve exactly when N is large)
  – When distances are small, corrected distances can perform well
    • too small and you might observe large statistical error
• Cons
  – UPGMA requires ultrametric data (i.e., a clock)
    • even when distances are small the method may fail
  – Loss of information
    • cannot trace character transformations over the tree
  – Branch lengths may be difficult to interpret for some distances

Parsimony

• Pros
  – Relatively fast, easy to calculate tree scores (still an NP-complete problem)
  – No explicit assumptions about the evolutionary process
  – For large trees, fairly robust to long-branch attraction (LBA)
• Cons
  – No explicit assumptions
  – No objective function for determining the best weighting scheme
  – Inconsistent when branch lengths are heterogeneous (i.e., long branches attract)

Long-Branch Attraction

[Figure: a true tree and the estimated tree obtained when it is reconstructed with parsimony; character states (A, C, G, T) are shown at the tips.]

Breaking Up Long Branches: Can parsimony artifacts be cured by dense taxonomic sampling?

Sometimes, but the problem is more complicated (see Poe and Swofford, 1999).

[Figure: a true tree with additional taxa sampled along the long branches, and the estimated tree obtained when it is reconstructed with parsimony.]

Maximum Likelihood

• Pros
  – Requires an explicit model of nucleotide evolution
  – Compensates well for unobserved changes
  – The likelihood can be used as an objective function to determine the relative fit between data and model
  – Robust to violations of the underlying model
• Cons
  – Requires an explicit model of nucleotide evolution
  – Relatively slow: estimating parameters and calculating tree scores can take a long time for more than 20 sequences (the tree search is also an NP-complete problem)

Part II: B. Choosing among models within each general class of method

• Parsimony: Fitch, Transversion, or Generalized?
• Likelihood: GTR, TrN, HKY85, K3ST, K2P, F81, or JC (+ Pinv + Γ)?
• Distance: Absolute, Mean, Evolutionary, or LogDet?

Choosing a model

• Goal: “parsimonious” models
  – Find the simplest model (fewest parameters) that is adequate to explain the data
  – Don’t improve the “fit” of the model by trying to explain everything

Overparameterizing a Model

[Figure: two fits to the same scatter of y against x (x from 0 to 100). First panel: straight line, y = 1.30 + 0.965x (r² = 0.963). Second panel: seventh-degree polynomial, y = −330 + 134x − 15.5x² + 0.816x³ − 0.0225x⁴ + 0.000335x⁵ − 0.00000255x⁶ + 0.00000000777x⁷ (r² = 1.000).]

Choosing among Parsimony Methods

• Equal Weighted Parsimony
  – Change from one base to another = 1
• Transversion Parsimony
  – Transversions (purines {A, G} ↔ pyrimidines {C, T}) = 1
  – Transitions (A ↔ G or C ↔ T) = 0
• General or Weighted Parsimony
  – A “step matrix” or cost matrix for transformations from one base to another:

       A   G   C   T
   A   -   1   3   3
   G   1   -   3   3
   C   3   3   -   1
   T   3   3   1   -
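The three weighting schemes differ only in the cost they assign to a single change. The sketch below encodes those costs; the function names and the scheme labels (`"equal"`, `"transversion"`, `"weighted"`) are assumptions for this example, and the `"weighted"` costs reproduce the step matrix above (transition = 1, transversion = 3).

```python
PURINES = {"A", "G"}


def is_transition(x: str, y: str) -> bool:
    """True for A<->G or C<->T (a change within purines or within pyrimidines)."""
    return x != y and ((x in PURINES) == (y in PURINES))


def change_cost(x: str, y: str, scheme: str) -> int:
    """Cost of changing base x into base y under one of three parsimony schemes."""
    if x == y:
        return 0
    if scheme == "equal":          # every change costs 1
        return 1
    if scheme == "transversion":   # transitions are free
        return 0 if is_transition(x, y) else 1
    if scheme == "weighted":       # step matrix with Tv:Ti = 3:1
        return 1 if is_transition(x, y) else 3
    raise ValueError(f"unknown scheme: {scheme}")


def branch_cost(ancestor: str, descendant: str, scheme: str) -> int:
    """Total cost of the changes between two aligned sequences along one branch."""
    return sum(change_cost(x, y, scheme) for x, y in zip(ancestor, descendant))


# Hypothetical pair of sequences: one transition (A->G) and one transversion (C->A).
for scheme in ("equal", "transversion", "weighted"):
    print(scheme, branch_cost("AACGT", "GAAGT", scheme))
```

The same set of changes receives a different score under each scheme, so scores from different schemes are not directly comparable.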

Comparing Parsimony Scores

[Figure: trees for Gorilla, Pan, Homo sapiens, Hylobates, and Pongo under three parsimony schemes, with branch lengths drawn in numbers of changes.]

• Unweighted or Fitch parsimony score: 356
• Transversion parsimony score: 73
• General or weighted parsimony (Tv:Ti of 3:1) score: 504

No common currency

Choosing among Likelihood Methods

The likelihood can be considered a “goodness of fit” for the data (i.e., common currency)

• Likelihood Ratio test

• Monte Carlo Simulations

• Akaike Information Criterion

A Family of Substitution Models (from Swofford et al., 1996)

[Figure: nested substitution models, linked by the restrictions that reduce one model to another.]

Models:
• GTR (general time-reversible)
• SYM
• TrN (Tamura-Nei)
• HKY85
• F84 (Felsenstein)
• K3ST (Kimura 3-subst. type)
• K2P (Kimura 2-parameter)
• F81 (Felsenstein)
• JC (Jukes-Cantor)

Restrictions labeling the links between models:
• Equal base frequencies
• 3 substitution types (transitions, 2 transversion classes)
• 2 substitution types (transitions vs. transversions)
• Single substitution type

Likelihood Ratio For Nested Phylogenetic Models

[Figure: trees for Gorilla, Pan, Homo sapiens, Hylobates, and Pongo estimated under two nested models; scale bar 0.05 substitutions/site.]

GTR+Γ: −lnL1 = 2625.73859
GTR:   −lnL0 = 2664.43013
∆ = 38.69154

Likelihood Ratio For Nested Models

• Λ = L0/L1; we are less likely to accept the model restriction when Λ is small
  – L1 = maximized likelihood of the data when fewer restrictions are imposed on the model of nucleotide substitution (more free parameters, or with rate heterogeneity).
  – L0 = maximized likelihood of the data when certain parameters are restricted (fewer free parameters, or without rate heterogeneity).
• To assess significance we use the fact that −2 ln Λ is generally χ² distributed with k degrees of freedom.
  – where k is the difference between the number of free parameters used to calculate L0 and L1 (L0 has k fewer parameters than L1).

Likelihood Ratio For Nested Phylogenetic Models

[Figure: trees for Gorilla, Pan, Homo sapiens, Hylobates, and Pongo estimated under GTR+Γ and under GTR.]

GTR+Γ: −lnL1 = 2625.73859
GTR:   −lnL0 = 2664.43013

δ = −2 ln Λ = 2[(−lnL0) − (−lnL1)] = 77.38308

χ² (df = 1) = 77.38308, P < 0.0001
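The test statistic and its P value can be reproduced from the two −lnL scores on this slide. The sketch below uses only the standard library; the function names are assumptions, and the closed form for the upper tail is specific to one degree of freedom (for a χ² variable with 1 df, P(X > x) = erfc(√(x/2))).

```python
import math


def lrt_statistic(neg_lnl_null: float, neg_lnl_alt: float) -> float:
    """-2 ln(L0 / L1), written in terms of the -lnL scores reported by most programs."""
    return 2.0 * (neg_lnl_null - neg_lnl_alt)


def chi2_sf_df1(x: float) -> float:
    """Upper-tail probability of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


# Scores from the slide: GTR (null, restricted) vs. GTR+Gamma (alternative).
delta = lrt_statistic(neg_lnl_null=2664.43013, neg_lnl_alt=2625.73859)
p_value = chi2_sf_df1(delta)
print(f"delta = {delta:.5f}, P = {p_value:.3g}")
```

For more than one degree of freedom a general χ² survival function would be needed instead of this one-line special case.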

Parametric Bootstrap For Non-Nested Models (Efron, 1985)

[Figure: n replicate data sets (seqA to seqE) are simulated on the best tree (Gorilla, Pan, Homo sapiens, Hylobates, Pongo) under the null model (GTR); a statistic δ̂_i = 2[(−lnL0) − (−lnL1)] is computed for each replicate i = 1, 2, …, n.]

• Generate data sets on the best tree using the parameters estimated under the null model from the observed data.
• Calculate the LRT statistic for each replicate.
• Use the LRT statistics from the replicates as your null distribution.
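Once the replicate statistics are in hand, the last step is a simple comparison of the observed statistic with the simulated null distribution. The sketch below shows only that step; the function name and the replicate values are hypothetical, and the "+1" correction (counting the observed value as one draw from the null) is a common convention rather than something stated on the slide.

```python
def bootstrap_p_value(observed: float, simulated: list[float]) -> float:
    """Proportion of the null distribution at least as extreme as the observed value."""
    exceed = sum(s >= observed for s in simulated)
    # Assumption: include the observed statistic itself as one member of the null set.
    return (exceed + 1) / (len(simulated) + 1)


# Hypothetical LRT statistics from simulated replicates.
null_stats = [30.0, 35.0, 40.0, 45.0]
print(bootstrap_p_value(77.4, null_stats))   # observed far in the tail
print(bootstrap_p_value(36.0, null_stats))   # observed well inside the distribution
```

In practice the number of replicates limits how small a P value can be reported: with n replicates the smallest attainable value under this convention is 1/(n + 1).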

Parametric Bootstrap For Non-Nested Phylogenetic Models (Goldman, 1993)

[Figure: histogram of the simulated null distribution. x-axis: δ = 2[(−lnL0) − (−lnL1)], ticks from 25 to 80; y-axis: Frequency, 0 to 45. The observed value of δ̂ is marked.]

Akaike Information Criterion (AIC) (Akaike, 1974)

• AIC = −2 lnL + 2n
• where
  – lnL is the maximized log-likelihood of a specific model of nucleotide sequence evolution and tree topology given the data
  – n is the number of parameters free to vary
• A smaller AIC indicates a better model
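As a worked example, AIC can be computed for the two models compared earlier in the lecture (GTR and GTR+Γ). The −lnL scores are those reported on the likelihood ratio slides; the parameter counts are an assumption made here (8 free substitution parameters for GTR, one more for the gamma shape), with branch lengths left out because they are the same in both models and cancel in the comparison.

```python
def aic(neg_lnl: float, n_params: int) -> float:
    """AIC = -2 lnL + 2n, written in terms of the reported -lnL score."""
    return 2.0 * neg_lnl + 2.0 * n_params


# (-lnL, assumed number of free substitution-model parameters)
models = {
    "GTR":   (2664.43013, 8),
    "GTR+G": (2625.73859, 9),
}

scores = {name: aic(neg_lnl, n) for name, (neg_lnl, n) in models.items()}
best = min(scores, key=scores.get)
for name, value in scores.items():
    print(f"{name:6s} AIC = {value:.5f}")
print("preferred:", best)
```

Unlike the likelihood ratio test, this comparison does not require the models to be nested.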

Part III: Assessing Confidence of the Phylogenetic Tree

[Figure: distribution of tree scores (histogram; x-axis: tree score, from 1289 to 1673; y-axis: frequency, 0 to 14), annotated "MP tree = 1300 ± ?".]

Ways of assessing support for a tree topology

• Bootstrap/Jackknife analyses

• Parametric bootstrap

• KH-test and others

• Bayesian Posterior Probabilities

Bootstrap Technique (Efron, 1979)

[Figure: histograms illustrating the bootstrap. Population: μ = 5 (true mean). Sample (one replicate): sample mean μ̂ = 5.3. Pseudo-replicate 1: μ̂ = 4.8; …; pseudo-replicate n: μ̂ = 5.1.]
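The idea in this figure, resampling the one observed sample with replacement to see how much the estimate varies, can be sketched in a few lines. The sample values, the seed, and the function name `bootstrap_means` are hypothetical choices for this example.

```python
import random
import statistics


def bootstrap_means(sample, n_reps=1000, seed=0):
    """Means of n_reps pseudo-replicates drawn with replacement from the sample."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistics.mean(rng.choices(sample, k=n)) for _ in range(n_reps)]


# Hypothetical sample from a population whose true mean is unknown to the analyst.
sample = [3.1, 4.7, 5.2, 5.9, 6.4, 4.4, 5.8, 7.0, 4.9, 5.6]

reps = bootstrap_means(sample)
print("sample mean:", round(statistics.mean(sample), 3))
print("bootstrap standard error:", round(statistics.stdev(reps), 3))
```

The spread of the pseudo-replicate means stands in for the sampling variability that would otherwise require drawing many new samples from the population.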


Bootstrapping Phylogenetic Data (Felsenstein, 1985)

[Figure: the original data set and pseudo-replicates 1 … n each yield a tree for the seven taxa; the replicate trees are summarized as a majority-rule consensus tree with node support values 100, 100, 91, and 57.]

Taxa: (1) Lemur catta, (2) Homo sapiens, (3) Pan, (4) Gorilla, (5) Pongo, (6) Hylobates, (7) Macaca fuscata

Bipartition frequencies:

1234567    Freq
-----------------
.***...   100.00
.*****.   100.00
.****..    91.17
..**...    57.33
.**....    42.00
.***.*.     7.83
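For sequence data the units being resampled are the alignment columns (sites), and support is the percentage of replicate trees that contain a given bipartition. The sketch below shows those two pieces only; tree inference on each pseudo-replicate is left out. The function names, the toy alignment, and the replicate bipartitions are hypothetical; the bipartition strings follow the dot/star notation of the table above.

```python
import random
from collections import Counter


def bootstrap_alignment(aln, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    n_sites = len(next(iter(aln.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in cols) for name, seq in aln.items()}


def split_support(splits_per_replicate):
    """Percentage of replicates whose tree contains each bipartition."""
    counts = Counter(s for splits in splits_per_replicate for s in set(splits))
    n = len(splits_per_replicate)
    return {s: 100.0 * c / n for s, c in counts.items()}


# Hypothetical toy alignment.
aln = {"Homo": "AAGCTTCACC", "Pan": "AAGCTTCACT", "Gorilla": "AAGCTTTACC"}
replicate = bootstrap_alignment(aln, random.Random(1))
print(replicate)

# Hypothetical bipartitions found in the trees from four pseudo-replicates.
found = [
    [".***...", "..**..."],
    [".***...", ".**...."],
    [".***...", "..**..."],
    [".***...", ".**...."],
]
print(split_support(found))
```

A majority-rule consensus tree then keeps the bipartitions found in more than half of the replicates.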

Jackknifing Phylogenetic Data

• Also used to assess support for nodes on a given tree
• Data are sampled without replacement
• Replicates represent some fraction of the total data set
• The jackknife tree is also displayed as a consensus tree, where support for a node is given as the percentage of the jackknife replicates that contain the node

Parametric Bootstrap

[Figure: data sets are simulated on a tree of Homo sapiens, Pan, Gorilla, Pongo, and Hylobates, given its branch lengths and substitution parameters; the tree is then re-estimated from each simulated data set.]

Is Cyclura Sister to Iguana? A Case of Long Branch Attraction? The Parametric Bootstrap Applied

[Figure: two competing hypotheses, Hyp A and Hyp B, for the relationships of Cyclura, Iguana, and Sauromalus relative to the outgroup and the other taxa; bar charts of results for categories A to E under parsimony and under likelihood.]

Kishino-Hasegawa Test (KH-test) (Kishino and Hasegawa, 1989)

[Figure: two alternative trees for the same 12 taxa: Lemur catta, Tarsius syrichta, Saimiri sciureus, Macaca fuscata, M. mulatta, M. fascicularis, M. sylvanus, Homo sapiens, Pan, Gorilla, Pongo, and Hylobates.]

Tree 1: −lnL = 5728.06210
Tree 2: −lnL = 5735.81631
Difference: 7.75420

Site log likelihood scores

Site          1        2        3        4        5        6        7       …   n-sites   SUM
Tree 1       -1.8035  -1.8035  -3.2596  -4.8071  -2.2836  -2.2836  -9.7611  …   -2.1021   -5728.0621
Tree 2       -1.8045  -1.8045  -3.2608  -4.8120  -2.2846  -2.2846  -9.7737  …   -2.1030   -5735.8163
Difference    0.0010   0.0010   0.0012   0.0048   0.0010   0.0010   0.0126  …    0.0009       7.7542
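The test works on the per-site differences: their sum is the log-likelihood difference between the trees, and their variance across sites gives its standard error. The sketch below uses the normal approximation (the large-sample form of the paired t-test). The function name is an assumption, and only the seven site values shown in the table are used, so the resulting numbers are illustrative and are not the test result for the full data set.

```python
import math


def kh_test(site_lnl_1, site_lnl_2):
    """Paired z-test on site log-likelihood differences between two trees."""
    diffs = [a - b for a, b in zip(site_lnl_1, site_lnl_2)]
    n = len(diffs)
    total = sum(diffs)                       # lnL(tree 1) - lnL(tree 2)
    mean = total / n
    var = sum((x - mean) ** 2 for x in diffs) / (n - 1)
    se_total = math.sqrt(n * var)            # standard error of the summed difference
    z = total / se_total
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return total, se_total, z, p_two_sided


# First seven site scores from the slide's table (a subset of all sites).
tree1 = [-1.8035, -1.8035, -3.2596, -4.8071, -2.2836, -2.2836, -9.7611]
tree2 = [-1.8045, -1.8045, -3.2608, -4.8120, -2.2846, -2.2846, -9.7737]

total, se_total, z, p = kh_test(tree1, tree2)
print(f"difference = {total:.4f}, SE = {se_total:.4f}, z = {z:.2f}, P = {p:.3f}")
```

With the full set of sites, the summed difference would be the 7.7542 shown in the table, judged against its own standard error in the same way.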

Distribution of Log Likelihood Differences

[Figure: histogram of the site log-likelihood differences. x-axis: difference in site log likelihood; y-axis: Frequency, 0 to 900.]

KH-test Assumptions

• A large number of sites is sampled. In this case the paired t-test is essentially equal to a z-test.
• Sites are independently and identically distributed.
• Trees must be selected a priori.
• Alternatives to the KH-test exist that relax some of these assumptions
  – SH-test (see Goldman et al., 2000)

Bayesian Approach to Phylogenetic Inference (see Huelsenbeck et al., 2001)

• Based on a quantity called the posterior probability of a tree
• The posterior probability of a tree is the probability of the tree conditional on the observations. It is obtained by combining the prior and the likelihood for each tree using Bayes' rule.
• Bayes' rule:

$$\Pr[\text{Tree} \mid \text{Data}] = \frac{\Pr[\text{Data} \mid \text{Tree}] \times \Pr[\text{Tree}]}{\Pr[\text{Data}]}$$

  – Pr[Data | Tree] is the likelihood
  – Pr[Tree] is the prior probability
  – Pr[Data] is the normalizing constant
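When only a handful of candidate trees are considered, Bayes' rule can be applied directly by normalizing likelihood × prior over those trees. The sketch below does this on the log scale for numerical stability. It is a deliberate simplification: a full Bayesian analysis integrates each tree's likelihood over branch lengths and substitution parameters, whereas here a single log-likelihood per tree is assumed to be given. The function name and the use of the two KH-test scores as inputs are illustrative assumptions.

```python
import math


def tree_posteriors(log_like, priors=None):
    """Posterior over a finite set of trees from log-likelihoods and priors."""
    trees = list(log_like)
    if priors is None:
        priors = {t: 1.0 / len(trees) for t in trees}   # flat prior
    log_joint = {t: log_like[t] + math.log(priors[t]) for t in trees}
    # Subtract the maximum before exponentiating to avoid underflow.
    top = max(log_joint.values())
    weights = {t: math.exp(v - top) for t, v in log_joint.items()}
    norm = sum(weights.values())                        # plays the role of Pr[Data]
    return {t: w / norm for t, w in weights.items()}


# Illustrative inputs: the two lnL scores from the KH-test example.
post = tree_posteriors({"Tree 1": -5728.0621, "Tree 2": -5735.8163})
for tree, prob in post.items():
    print(f"{tree}: {prob:.5f}")
```

The normalization is trivial here because only two trees are listed; it is the sum over all possible trees that makes the real problem hard.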

Calculating the Posterior Probability of a tree

• Usually the prior P[Tree] is taken to be equal for all trees
• The likelihood is calculated in the same way as was described last week
• Easy to formulate but hard to solve analytically
  – requires summation over all trees and, for each tree, integration over all combinations of branch lengths and substitution parameters
• Markov chain Monte Carlo (MCMC) can be used to approximate the posterior probability
• MCMC algorithm
  – propose a new tree by stochastically perturbing the current tree
  – accept or reject with a probability given by the Metropolis-Hastings algorithm
  – repeat many thousands of times (a.k.a. generations)
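The propose/accept/repeat loop can be illustrated on a toy problem with three candidate trees and fixed log-likelihoods. This is a sketch under strong assumptions: real programs also perturb branch lengths and model parameters, and propose topologies by local rearrangements rather than by picking from a short list. The tree labels, log-likelihood values, seed, and function name are all hypothetical. The proposal used here is symmetric and the prior is flat, so the Metropolis-Hastings acceptance probability reduces to min(1, likelihood ratio).

```python
import math
import random


def mcmc_tree_posterior(log_like, n_gen=20000, seed=1):
    """Approximate posterior tree probabilities by the fraction of visits."""
    rng = random.Random(seed)
    trees = list(log_like)
    current = trees[0]
    visits = dict.fromkeys(trees, 0)
    for _ in range(n_gen):
        # Symmetric proposal: pick one of the other trees uniformly at random.
        proposal = rng.choice([t for t in trees if t != current])
        log_ratio = log_like[proposal] - log_like[current]
        # Accept with probability min(1, L(proposal) / L(current)).
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current = proposal
        visits[current] += 1
    return {t: v / n_gen for t, v in visits.items()}


# Hypothetical log-likelihoods for three candidate trees.
toy_lnl = {"T1": -100.0, "T2": -101.0, "T3": -103.0}
freq = mcmc_tree_posterior(toy_lnl)
print(freq)
```

With a flat prior the exact posterior here is proportional to exp(lnL), roughly 0.71, 0.26, and 0.04, and the visit frequencies should settle near those values as the number of generations grows.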

Approximation of the Posterior Probabilities using MCMC

[Figure: trace plot of −lnL (y-axis) against generations ×10⁶ (x-axis).]

Tree with Bayesian Posterior Probabilities for node support

[Figure: tree with support values at the nodes (the figure also carries the label "Bootstrap") for 31 taxa: Uta stansburiana, Urosaurus ornatus, P. hernandesi, P. platyrhinos, C. d. crinitis, C. d. carmenesis, C. d. rhodostictus, C. d. myurus, C. d. ventralis, C. d. bogerti, C. d. ssp., C. t. texanus, C. t. scitulus, H. l. lacerata, H. l. subcaudalis, H. m. elegans, H. m. thermophila, H. m. campi, H. m. bunkeri, H. m. ruthveni, H. m. flavilenta, H. m. maculata, H. p. propinqua, U. scoparia 1, U. scoparia 2, U. n. rufopunctata, U. n. notata, U. inornata, U. exsul, U. paraphygas, Sceloporus jarrovii.]

Bibliography

Akaike, H. 1974. A new look at the statistical model identification. IEEE Trans. Autom. Contr. 19:716-723.

Atchley, W. R. and W. M. Fitch. 1991. Gene trees and the origins of inbred strains of mice. Science 254: 554-558.

Efron, B. (1985) Bootstrap confidence intervals for a class of parametric problems. Biometrika, 72, 45-58.

Fitch, W. M. and W.R. Atchley. 1985. Evolution in inbred strains of mice appears rapid. Science 228:1169-1175.

Fitch, W. M. and E. Margoliash. 1967. Construction of phylogenetic trees. Science 155:279-284.

Hillis, D. M., J. J. Bull, M. E. White, M. R. Badgett, and I. J. Molineux. 1992. Experimental phylogenetics: generation of a known phylogeny. Science 255:589-592.

Hillis, D. M., J. P. Huelsenbeck, and C. W. Cunningham. 1994. Application and accuracy of molecular phylogenies. Science 264:671-677.

Holder, M. T. 2001. Using a Complex Model of Sequence Evolution to Evaluate and Improve Phylogenetic Methods. Ph.D. Dissertation. Univ. of Texas at Austin.

Huelsenbeck, J. P., Hillis, D. M. and Jones, R. 1996. Parametric bootstrapping in molecular phylogenetics: Applications and performance. In Ferraris, J. D. and Palumbi, S. R. (eds.), Molecular Zoology. Advances, strategies and protocols. Wiley-Liss, New York, pp. 19-45.

Huelsenbeck, J. P., Ronquist, F., Nielsen, R., Bollback, J. P. 2001. Bayesian inference of phylogeny and its impact on evolutionary biology. Science 294: 2310- 2314.

Goldman, N. 1993. Statistical tests of models of DNA substitution. Journal of Molecular Evolution 36:182-198.

Goldman, N., J. P. Anderson, and A. G. Rodrigo. 2000. Likelihood-based tests of topologies in phylogenetics. Systematic Biology 49:652-670.

Kishino, H. and M. Hasegawa. 1989. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. Journal of Molecular Evolution 29:170-179.

Lewis, P. O. 2001. Phylogenetic systematics turns over a new leaf. Trends in Ecology and Evolution 16:30-37.

Li, W. 1997. Molecular Evolution. Sinauer Associates. Sunderland, Massachusetts.

Page, R. D. and Holmes, E. C. 1998. Molecular Evolution: A Phylogenetic Approach. Blackwell Science, Oxford.

Poe, S., and D. L. Swofford. 1999. Taxon sampling revisited. Nature 398:299-300.

Shimodaira, H. and M. Hasegawa. 1999. Multiple Comparisons of Log-Likelihoods with Applications to Phylogenetic Inference. Molecular Biology and Evolution 16:1114-1116.

Steel, M. and Penny, D. 2000. Parsimony, likelihood, and the role of models in molecular phylogenetics. Molecular Biology and Evolution 17:839-850.

Swofford, D. L., G. J. Olsen, P. J. Waddell, and D. M. Hillis. 1996. Phylogenetic inference. Pages 407-514 in D. M. Hillis, C. Moritz, and B. Mable (eds.) Molecular Systematics (2nd ed.), Sinauer Associates, Sunderland, Massachusetts.
