Tree Inference Methods

• Methods to infer phylogenetic trees: Introduction
• There is no one correct method
• Methods are grouped according to two criteria
  – Does it use discrete character states or distance matrices?
  – Does it cluster OTUs in a stepwise manner or evaluate a number of possible trees?

• Discrete character state methods
  – Includes sequences, morphological characters, physiological characters, restriction maps, etc.
  – Each character is analyzed separately and independently (usually)
  – Best tree is deduced from a set of possible trees using the character state data
  – Retain information about individual characters throughout the analysis and can be used to reconstruct ancestral states if necessary
  – Extremely computer intensive
  – Beyond certain numbers of taxa, it is impossible to evaluate all possible trees

• Distance matrix methods
  – Calculate a measure of dissimilarity and abandon any information about the actual character states
  – The distance matrix is then used to build a tree from the ground up
  – Distance matrix represents the genetic or evolutionary distance
  – No need to evaluate multiple trees, computationally simple
  – Information is lost
  – No way to reconstruct ancestral states
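The first step of a distance method can be sketched as follows. This is a minimal illustration, assuming a made-up toy alignment; the names `proportion_different`, `distance_matrix` and `toy_alignment` are hypothetical and not from the lecture. It reduces each pair of aligned sequences to a single dissimilarity value, after which the character states themselves are no longer used.

```python
from itertools import combinations

def proportion_different(seq_a, seq_b):
    """Fraction of aligned sites at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)

def distance_matrix(alignment):
    """Return {(name_i, name_j): dissimilarity} for every unordered pair of OTUs."""
    return {
        (i, j): proportion_different(alignment[i], alignment[j])
        for i, j in combinations(sorted(alignment), 2)
    }

# Hypothetical toy alignment, invented for this sketch.
toy_alignment = {
    "otu1": "ACGTACGTAC",
    "otu2": "ACGTACGTTC",
    "otu3": "ACGAACGTTG",
}
toy_matrix = distance_matrix(toy_alignment)
for pair, value in toy_matrix.items():
    print(pair, value)
```

Once `toy_matrix` exists, nothing downstream can tell which sites differed, which is the information loss noted above.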


• Tree evaluation methods
  – With these methods, you have some criterion for selecting a ‘best’ tree based on the data
  – If possible, perform an exhaustive search of all possible trees, evaluate all of them using the criterion and choose the best one
  – Not possible for large numbers of OTUs
  – Algorithms allow us to evaluate subsets but we risk never identifying the best tree
  – Many ‘best’ trees are possible (even likely)
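To see why an exhaustive search breaks down, it helps to count the trees. The sketch below uses the standard result that n OTUs have (2n - 5)!! distinct unrooted, strictly bifurcating trees and (2n - 3)!! rooted ones; the function names are made up for illustration.

```python
def count_unrooted_trees(n_otus):
    """Number of distinct unrooted, strictly bifurcating trees: (2n - 5)!!"""
    if n_otus < 3:
        raise ValueError("need at least 3 OTUs")
    count = 1
    for k in range(3, 2 * n_otus - 4, 2):  # odd factors 3, 5, ..., 2n - 5
        count *= k
    return count

def count_rooted_trees(n_otus):
    """Number of rooted bifurcating trees: (2n - 3)!!, the unrooted count for n + 1."""
    if n_otus < 2:
        raise ValueError("need at least 2 OTUs")
    return count_unrooted_trees(n_otus + 1)

for n in (4, 10, 20, 50):
    print(n, count_unrooted_trees(n))
```

Ten OTUs already give 2,027,025 unrooted trees, and the count grows faster than exponentially from there.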

• Clustering methods
  – Construct a tree from nothing using specific algorithms
  – Cluster the two most closely related taxa
  – Then add a third most closely related, and so on
  – Fast
  – Produce only one tree
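The slide does not name a particular clustering algorithm. As one concrete instance, here is a minimal sketch of average-linkage clustering (UPGMA), chosen here only for illustration. It returns the topology as nested tuples and omits branch lengths; the function name and the toy distances are hypothetical.

```python
from itertools import combinations

def upgma(leaf_distances):
    """Cluster OTUs by average linkage (UPGMA); return the tree as nested tuples.

    leaf_distances maps (name_a, name_b) -> distance for every pair of OTUs.
    """
    dist = {frozenset(pair): d for pair, d in leaf_distances.items()}
    sizes = {name: 1 for pair in leaf_distances for name in pair}  # cluster -> leaf count
    while len(sizes) > 1:
        # Join the two clusters separated by the smallest distance.
        a, b = min(combinations(sizes, 2), key=lambda p: dist[frozenset(p)])
        merged = (a, b)
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        for other in sizes:
            # A size-weighted mean keeps the value an average over all leaf pairs.
            dist[frozenset((merged, other))] = (
                size_a * dist[frozenset((a, other))]
                + size_b * dist[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    (tree,) = sizes
    return tree

# Hypothetical distances between three OTUs.
toy_distances = {("otu1", "otu2"): 0.1, ("otu1", "otu3"): 0.3, ("otu2", "otu3"): 0.2}
print(upgma(toy_distances))
```

The procedure never compares alternative trees: each join is final, which is why it is fast and yields exactly one tree.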


• Clustering Methods: Obtaining Genetic Distances
• Nucleotide substitution models
• In order to calculate a genetic distance, we must have some model of DNA evolution on which to “hang our hat”
• General assumptions of most models (often violated at least slightly)
  – All sites are independent of one another
  – Sites are homogeneous in their rates of change
  – Markovian: Given the present state, future changes are unaffected by past states
  – Temporal homogeneity

Models of DNA Evolution


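Compensatory changes (figure): a change at one site favors a matching change at a paired site, so sites are not always independent of one another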



• Strictly speaking, these assumptions apply only to regions undergoing little or no selection

• Our task is to determine a mathematical method to model the (presumed) stochastic processes that introduced the observed differences among sequences


• A model should:
  – Provide a consistent measure of dissimilarity among sequences
  – Provide distances that are linearly proportional to the time since divergence (if a molecular clock is assumed)
  – Provide distances representing the branch lengths on an evolutionary tree

• The basic model simply counts the differences: the p-distance (p = number of differing sites / number of sites compared)

• Intuitively simple but probably accurate only for very few cases because of homoplasy

• Homoplasy - a character state shared by a set of sequences but not present in the common ancestor; a misleading phylogenetic signal

• Most commonly, homoplasy is introduced because of multiple and back substitutions

• P-distances almost invariably underestimate the actual number of changes
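A small simulation makes the underestimate visible. This is a sketch under simplifying assumptions: substitutions fall on random sites of a single lineage and every change is equally likely (JC69-like). The sequence length, seed and function names are arbitrary choices for illustration.

```python
import random

BASES = "ACGT"

def evolve(sequence, n_substitutions, rng):
    """Apply n substitutions at random sites, each to a different base."""
    seq = list(sequence)
    for _ in range(n_substitutions):
        site = rng.randrange(len(seq))
        seq[site] = rng.choice([b for b in BASES if b != seq[site]])
    return "".join(seq)

def observed_differences(seq_a, seq_b):
    """Count the sites at which two aligned sequences differ."""
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)

rng = random.Random(1)  # fixed seed so the run is repeatable
ancestor = "".join(rng.choice(BASES) for _ in range(100))
for actual in (10, 50, 100, 300):
    descendant = evolve(ancestor, actual, rng)
    seen = observed_differences(ancestor, descendant)
    print(f"actual substitutions: {actual:3d}   observed differences: {seen:3d}")
```

Multiple hits at one site count once at most, and back substitutions count as zero, so the observed number can never exceed the actual number and falls further behind as changes accumulate.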



• Saturation – the point at which any phylogenetic signal is lost; so many changes have occurred, the sequences are essentially random with respect to one another


• Substitutions as homogeneous Markov processes
• Markov processes are specified in Q matrices
• A 4x4 matrix in which each position gives the instantaneous rate of change from one base to another
• μ = mutation rate
• a = rate at which A-C change occurs relative to other possible changes


• Most Q matrices represent a time-homogeneous, time-continuous, stationary Markov process

• Assumptions
  – At any given site in a sequence, the rate of change from base i to base j is independent of the base that occupied the site prior to i
  – Time homogeneous/continuous: substitution rates do not change over time
  – Stationary: the relative frequencies of the bases (πA, πC, πG, πT) are at equilibrium
  – Many models are also time-reversible: at equilibrium, the amount of change from i to j is the same as from j to i (πi·qij = πj·qji, where qij is the rate in the Q matrix)

• These assumptions don’t make much sense biologically but are necessary if substitutions are to be modeled as stochastic processes


• Jukes Cantor (JC69): the simplest model
• Assumptions:
  – Equilibrium frequencies for the four nucleotides are 25% each (πA=πC=πG=πT=1/4)
  – Equal probabilities exist for any substitution (a=b=c=d=e=f=1)
• Once the Q matrix is stated, the probability of change from one base to another over evolutionary time, P(t), is obtained by calculating the matrix exponential, P(t) = e^(Qt)
  – Matrix algebra is involved. I took it back in 1991. Forgive me
• The resulting correction becomes d = -(3/4) ln(1 - (4/3)p)
  – p = the observed distance (p-distance)
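The matrix exponential step can be sketched in plain Python. This is an illustration only: it assumes a JC69 rate matrix in which every off-diagonal entry equals a single value `rate` (the slide's own scaling by μ may differ), and it uses a simple scaling-and-squaring Taylor series rather than a library routine. All function names are made up.

```python
import math

def mat_mul(x, y):
    """Product of two square matrices stored as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(m, terms=30, squarings=10):
    """exp(m) via a truncated Taylor series on m / 2**squarings, then repeated squaring."""
    n = len(m)
    scaled = [[v / 2 ** squarings for v in row] for row in m]
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity matrix
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[v / k for v in row] for row in mat_mul(term, scaled)]
        result = [[r + t for r, t in zip(r_row, t_row)]
                  for r_row, t_row in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def jc69_rate_matrix(rate):
    """Q for JC69: every off-diagonal entry is `rate`, and each row sums to zero."""
    return [[-3 * rate if i == j else rate for j in range(4)] for i in range(4)]

def jc69_transition_probs(rate, t):
    """P(t) = exp(Q t) for the JC69 rate matrix."""
    q_t = [[v * t for v in row] for row in jc69_rate_matrix(rate)]
    return mat_exp(q_t)

P = jc69_transition_probs(rate=0.1, t=2.0)
closed_form_same = 0.25 + 0.75 * math.exp(-4 * 0.1 * 2.0)
print(P[0][0], closed_form_same)
```

Under this parameterization the numerical result agrees with the known closed form for JC69, in which the probability of finding the same base after time t is 1/4 + (3/4)e^(-4·rate·t).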


• Using JC69
• Note the parallel substitution at position 9
• The actual distance is higher than the observed distance
• 6 changes actually occurred
• p = 4/10 = 0.4
• d (JC69) = -(3/4) ln[1 - (4/3)(0.4)] = 0.5716
• A more reasonable estimate of the number of actual changes that occurred (about 5.7 over the 10 sites)
• What assumptions of JC69 are violated?
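The JC69 correction is a one-line calculation. The sketch below reproduces the worked value for p = 0.4; the function name and the explicit saturation check are additions for illustration.

```python
import math

def jc69_distance(p):
    """JC69-corrected distance from an observed p-distance."""
    if not 0 <= p < 0.75:
        # At p >= 3/4 the logarithm is undefined: the sequences look random.
        raise ValueError("JC69 distance is undefined for p >= 0.75 (saturation)")
    return -0.75 * math.log(1 - (4 / 3) * p)

p = 4 / 10
d = jc69_distance(p)
print(round(d, 4))        # 0.5716
print(round(d * 10, 1))   # expected changes over the 10 sites: 5.7
```

The correction grows quickly as p approaches 0.75, which is the numerical face of saturation.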


• Kimura 2-parameter (K2P)
• Generally, transitions occur at higher rates than transversions
• This violates the rate assumptions of JC69
• A different rate must be considered for transitions (α) and transversions (β), changing the Q matrix so that transitions occur at rate α and transversions at rate β
• π remains ¼ for all bases
• d = ½ ln[1/(1 - 2P - Q)] + ¼ ln[1/(1 - 2Q)]
• P and Q are the proportional differences between sequences due to transitions and transversions, respectively (this Q is not the Q matrix)
• Note: if α = β …


• Felsenstein (1981) - F81
• In most taxa, A+T ≠ C+G
• If there are only a few G’s, the rate of substitution from G to A will be low compared to other substitutions
• This violates the equal base frequency assumption of JC69 (and so its equal rates)
• Different frequencies must be considered for all bases while substitution rates stay the same for all, changing the Q matrix so that the rate of change to a base is proportional to that base’s frequency
• π may differ for each base (πA ≠ πC ≠ πG ≠ πT)
• Note that this model assumes similar base composition for all sequences under consideration
• Note: if πA = πC = πG = πT …


• Hasegawa, Kishino and Yano (HKY85)
• Combines F81 and K2P: unequal base frequencies plus separate transition and transversion rates

• General Time Reversible (GTR)
• Allows all six pairs of substitutions to have distinct rates
• Allows unequal base frequencies
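The way these models nest can be sketched by building each Q matrix from the most general one. This assumes the common parameterization q_ij = r_ij · π_j, with a symmetric exchange rate r_ij and no overall rate scaling; `kappa` stands for the transition/transversion rate ratio. All names here are hypothetical.

```python
BASES = "ACGT"
TRANSITIONS = {frozenset("AG"), frozenset("CT")}  # purine-purine, pyrimidine-pyrimidine
EQUAL_FREQS = {b: 0.25 for b in BASES}

def gtr_q(exchange, freqs):
    """GTR-style Q: q[i][j] = exchange[{i, j}] * freqs[j]; each row sums to zero."""
    q = {}
    for i in BASES:
        row = {j: exchange[frozenset((i, j))] * freqs[j] for j in BASES if j != i}
        row[i] = -sum(row.values())
        q[i] = row
    return q

def hky85_q(kappa, freqs):
    """HKY85: one rate for transitions (kappa), one for transversions (1)."""
    exchange = {
        frozenset((i, j)): (kappa if frozenset((i, j)) in TRANSITIONS else 1.0)
        for i in BASES for j in BASES if i != j
    }
    return gtr_q(exchange, freqs)

def f81_q(freqs):
    """F81: HKY85 with no transition bias."""
    return hky85_q(1.0, freqs)

def k2p_q(kappa):
    """K2P: HKY85 with equal base frequencies."""
    return hky85_q(kappa, EQUAL_FREQS)

def jc69_q():
    """JC69: equal frequencies and no transition bias."""
    return hky85_q(1.0, EQUAL_FREQS)

print(k2p_q(2.0)["A"])
```

Each simpler model is the more general one with some parameters fixed, which is what makes the models nested.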



• A variety of other models exist:
• Tajima-Nei (1984): refines JC69 for more accurate rates of nucleotide substitution
• Tamura 3-parameter (1992): corrects for multiple hits
• Tamura-Nei (1993): corrects for multiple hits, considers purine and pyrimidine transitions differently


• Varying substitution rates among sites in sequences (rate heterogeneity) can be compensated for

• Most times, a gamma (Γ) distribution is used
• An α value that determines the shape of the distribution can be estimated from the data and incorporated into calculations (this α is the shape parameter, not the transition rate α of K2P)


• Small values of α = L-shaped Γ-distribution and extreme rate variation among sites, most sites invariable but a few sites have very high substitution rates

• Large values (>1) of α = bell-shaped Γ-distribution and minimal rate variation among sites
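The effect of α can be seen by drawing relative site rates from a gamma distribution with mean 1. This is a sketch: the shape values, the number of sites, the 0.1 threshold for "nearly invariable" and the function names are arbitrary choices for illustration.

```python
import random

def sample_site_rates(alpha, n_sites, rng):
    """Draw relative rates from a gamma distribution with shape alpha and mean 1."""
    # Scale 1/alpha gives mean alpha * (1/alpha) = 1 and variance 1/alpha.
    return [rng.gammavariate(alpha, 1 / alpha) for _ in range(n_sites)]

def fraction_nearly_invariable(rates, threshold=0.1):
    """Share of sites whose relative rate falls below the threshold."""
    return sum(r < threshold for r in rates) / len(rates)

rng = random.Random(42)  # fixed seed so the run is repeatable
for alpha in (0.2, 1.0, 5.0):
    rates = sample_site_rates(alpha, 20000, rng)
    print(f"alpha={alpha}: mean={sum(rates) / len(rates):.2f}, "
          f"max={max(rates):.1f}, "
          f"nearly invariable={fraction_nearly_invariable(rates):.2f}")
```

With a small α a large share of sites barely change while a few change very fast; with a large α the rates bunch around 1.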


• Choosing the wrong model may give the wrong tree
  – Wrong model → incorrect branch lengths, transition/transversion (Ti/Tv) ratios, divergence rate estimates, mutation rates, divergence dates

• What model to choose and how to choose it?
• Generally, more complex models fit the data better
  – Thus, it may seem best to use the most complex model by default
  – However, more parameters must be estimated, making computation more difficult (longer) and increasing the possibility of error in estimation
• Find a balance between complexity and practicality


• Choosing a model
• The fit of a model to the data is proportional to:
  – The probability of the data (D),
  – given a model of evolution (M),
  – a vector of model parameters (θ),
  – a tree topology (τ) and a vector of branch lengths (ν)
• L = P(D | M, θ, τ, ν)
• Often use the log likelihood to ease computation
• l = ln P(D | M, θ, τ, ν)
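The smallest possible case shows what this likelihood looks like in practice: two aligned sequences under JC69. Here the tree is a single branch, the model has no free parameters, and the only unknown is the branch length d. The sketch below, with made-up function names and a coarse grid search standing in for a real optimizer, scores 10 sites of which 4 differ.

```python
import math

def jc69_pair_log_likelihood(n_same, n_diff, d):
    """ln P(D | JC69, d) for two aligned sequences separated by d substitutions per site."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 * (0.25 + 0.75 * decay)   # base frequency * P(same base after d)
    p_diff = 0.25 * (0.25 - 0.25 * decay)   # base frequency * P(one specific other base)
    # Sites are assumed independent, so log-probabilities add across sites.
    return n_same * math.log(p_same) + n_diff * math.log(p_diff)

candidates = [i / 100 for i in range(1, 201)]  # d = 0.01, 0.02, ..., 2.00
best_d = max(candidates, key=lambda d: jc69_pair_log_likelihood(6, 4, d))
print(f"best branch length on the grid: {best_d:.2f}")
```

The grid maximum lands next to the JC69 distance for p = 0.4 (0.5716), since that distance formula is the maximum-likelihood estimate for a pair of sequences.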

• Likelihood ratio test (LRT)
• LRT statistic: LRT = 2 (l1 - l0)
• l1 = the maximum log likelihood under the more complex model (alternative hypothesis)
• l0 = the maximum log likelihood under the less complex model (null hypothesis)
• Always ≥ 0 when the simpler model is nested within the more complex one
• Large value = the more complex model fits significantly better
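A sketch of the test for nested models, assuming the usual χ² approximation with degrees of freedom equal to the number of extra free parameters. The log-likelihood values are invented for illustration, and the small table holds standard χ² critical values at the 0.05 level.

```python
# Standard chi-square critical values at the 0.05 significance level.
CHI2_CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

def likelihood_ratio_test(lnl_null, lnl_alt, extra_params):
    """Return (LRT statistic, True if the simpler model is rejected at 0.05)."""
    statistic = 2 * (lnl_alt - lnl_null)
    return statistic, statistic > CHI2_CRITICAL_05[extra_params]

# Hypothetical maximized log likelihoods: JC69 (null) vs K2P (one extra parameter).
statistic, reject_simpler = likelihood_ratio_test(-1250.4, -1243.1, extra_params=1)
print(round(statistic, 1), reject_simpler)
```

With these made-up numbers the statistic is well above 3.841, so the extra parameter would be judged worthwhile.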


• Hierarchical likelihood ratio test (hLRT)
• Most of the models described above are nested, or hierarchical
  – e.g., JC is a special case of F81 where the base frequencies are equal
• ModelTest will perform all possible comparisons and evaluate them using a χ² test


• Information criteria
• The likelihood of each model is penalized by a function of the number of free parameters (K) in the model; more parameters = higher penalty
• Akaike Information Criterion (AIC)
• AIC = -2l + 2K
• AIC estimates the relative amount of information lost when we use a particular model
• Small values are better
• ModelTest, ProtTest


• Bayesian methods
• Bayes factors are similar to the LRT
• Posterior probabilities can be calculated
• Most commonly, the Bayesian Information Criterion (BIC) is calculated
• BIC = -2l + K ln n, where n is the sample size (often taken as the number of sites in the alignment)
• Smaller = better
• ModelTest & ProtTest
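Both criteria are simple to compute once each model's maximized log likelihood is known. In this sketch the log likelihoods and the alignment length are invented, K counts only substitution-model parameters (an assumption; programs may also count branch lengths), and BIC uses its standard form, -2l + K ln n.

```python
import math

def aic(lnl, k):
    """Akaike Information Criterion: -2 ln L + 2K."""
    return -2 * lnl + 2 * k

def bic(lnl, k, n):
    """Bayesian Information Criterion: -2 ln L + K ln n."""
    return -2 * lnl + k * math.log(n)

# Hypothetical fits: model -> (maximized log likelihood, free parameters K).
fits = {
    "JC69": (-1250.4, 0),
    "K2P": (-1243.1, 1),
    "HKY85": (-1238.0, 4),
    "GTR": (-1236.5, 8),
}
n_sites = 1000  # assumed alignment length

best_by_aic = min(fits, key=lambda m: aic(*fits[m]))
best_by_bic = min(fits, key=lambda m: bic(*fits[m], n_sites))
print("AIC prefers", best_by_aic)
print("BIC prefers", best_by_bic)
```

With these made-up numbers the two criteria disagree: BIC charges ln n per parameter instead of 2, so it leans toward the simpler model.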

