alyson-mcbride
Tree Inference Methods
• Methods to infer phylogenetic trees – Introduction
• There is no one correct method
• Methods are grouped according to two criteria
  – Does the method use discrete character states or distance matrices?
  – Does it cluster OTUs in a stepwise manner or evaluate a number of possible trees?
• Discrete character state methods
  – Characters include sequences, morphological characters, physiological characters, restriction maps, etc.
  – Each character is analyzed separately and independently (usually)
  – The best tree is deduced from a set of possible trees using the character state data
  – Information about individual characters is retained throughout the analysis and can be used to reconstruct ancestral states if necessary
  – Extremely computer intensive
  – Beyond a certain number of taxa, it is impossible to evaluate all possible trees
• Distance matrix methods
  – Calculate a measure of dissimilarity and abandon any information about the actual character states
  – The distance matrix is then used to build a tree from the ground up
  – The distance matrix represents the genetic or evolutionary distance
  – No need to evaluate multiple trees; computationally simple
  – Information is lost
  – No way to reconstruct ancestral states
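To give a feel for why evaluating all possible trees quickly becomes impossible, the sketch below counts unrooted, fully bifurcating trees using the standard formula (2n-5)!! for n taxa. The formula and the function name are not from the slides; they are added here only as an illustration.

```python
def num_unrooted_trees(n_taxa: int) -> int:
    """Count unrooted, fully bifurcating trees: (2n-5)!! = 1*3*5*...*(2n-5)."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):  # odd numbers 3 .. 2n-5
        count *= k
    return count


for n in (4, 5, 10, 20, 50):
    print(n, num_unrooted_trees(n))
```

The count grows faster than exponentially, which is why exhaustive evaluation is limited to small numbers of OTUs.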
Tree Inference Methods
• Tree evaluation methods
  – With these methods, you have some criterion for selecting a ‘best’ tree based on the data
  – If possible, perform an exhaustive search of all possible trees, evaluate all of them using the criterion and choose the best one
  – Not possible for large numbers of OTUs
  – Algorithms allow us to evaluate subsets, but we risk never identifying the best tree
  – Many ‘best’ trees are possible (even likely)
• Clustering methods
  – Construct a tree from nothing using specific algorithms
  – Cluster the two most closely related taxa
  – Then add a third most closely related, and so on…
  – Fast
  – Produce only one tree
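The clustering idea can be sketched with UPGMA (average linkage). The slides do not name a specific algorithm, so UPGMA is an assumption chosen for illustration, and the taxon names and distances below are made up. Note that UPGMA joins the two closest clusters at each step, which is slightly more general than adding one taxon at a time.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a nested-tuple tree by UPGMA.

    names: list of taxon labels.
    dist:  dict mapping frozenset({a, b}) -> pairwise distance.
    """
    clusters = {name: 1 for name in names}  # cluster -> number of taxa in it
    d = dict(dist)                          # working copy, extended as we merge
    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        # Distance to the new cluster is the size-weighted average.
        for other in clusters:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))] + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


# Hypothetical distances among four taxa.
example = {
    frozenset(("A", "B")): 2.0,
    frozenset(("A", "C")): 6.0,
    frozenset(("B", "C")): 6.0,
    frozenset(("A", "D")): 10.0,
    frozenset(("B", "D")): 10.0,
    frozenset(("C", "D")): 10.0,
}
print(upgma(["A", "B", "C", "D"], example))
```

As the slide says, the procedure is fast and returns exactly one tree; it never compares alternative topologies.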
Tree Inference Methods
• Clustering methods: obtaining genetic distances
• Nucleotide substitution models
• In order to calculate a genetic distance, we must have some model of DNA evolution on which to “hang our hat”
• General assumptions of most models (often violated at least slightly)
  – All sites are independent of one another
  – Sites are homogeneous in their rates of change
  – Markovian: given the present state, future changes are unaffected by past states
  – Temporal homogeneity
Models of DNA Evolution
[Figure: compensatory changes]
• Strictly speaking, these assumptions apply only to regions undergoing little or no selection
• Our task is to determine a mathematical method to model the (presumed) stochastic processes that introduced the observed differences among sequences
Models of DNA Evolution
• A model should:
  – Provide a consistent measure of dissimilarity among sequences
  – Provide distances that are linearly proportional to the time since divergence (if a molecular clock is assumed)
  – Provide distances representing the branch lengths on an evolutionary tree
• The basic model is just counting the number of differences: the p-distance (p = number of differences / number of sites compared)
• Intuitively simple but probably accurate only for very few cases because of homoplasy
• Homoplasy - a character state shared by a set of sequences but not present in the common ancestor; a misleading phylogenetic signal
• Most commonly, homoplasy is introduced because of multiple and back substitutions
• P-distances almost invariably underestimate the actual number of changes
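A minimal sketch of the p-distance for two aligned sequences. Skipping positions with gaps or ambiguous bases is an assumption made here, not something the slides specify, and the example sequences are invented.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of compared sites at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: only sites where both sequences have A, C, G or T are compared.
    pairs = [
        (x, y)
        for x, y in zip(seq_a.upper(), seq_b.upper())
        if x in "ACGT" and y in "ACGT"
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


print(p_distance("ACGTACGTAC", "ACGTTCGAAC"))
```

Because a site that changed twice still counts as at most one difference, this value can only undercount the changes that actually occurred.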
• Saturation – the point at which any phylogenetic signal is lost; so many changes have occurred that the sequences are essentially random with respect to one another
Models of DNA Evolution
• Substitutions as homogeneous Markov processes
• Markov processes are specified in Q matrices
• A Q matrix is a 4x4 matrix in which each position gives the instantaneous rate of change from one base to another
• μ = mutation rate
• a = rate at which A-C change occurs relative to other possible changes
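The matrix itself is not reproduced in the text. One common way to write the general form is shown below, as a reconstruction under stated assumptions: rows and columns are ordered A, C, G, T; a is the A-C relative rate as defined above; the assignment of b through f to the remaining five base pairs follows a common convention and is an assumption; π denotes the equilibrium base frequencies. Each diagonal entry is set so that its row sums to zero.

```latex
Q =
\begin{pmatrix}
-\mu(a\pi_C + b\pi_G + c\pi_T) & \mu a \pi_C & \mu b \pi_G & \mu c \pi_T \\
\mu a \pi_A & -\mu(a\pi_A + d\pi_G + e\pi_T) & \mu d \pi_G & \mu e \pi_T \\
\mu b \pi_A & \mu d \pi_C & -\mu(b\pi_A + d\pi_C + f\pi_T) & \mu f \pi_T \\
\mu c \pi_A & \mu e \pi_C & \mu f \pi_G & -\mu(c\pi_A + e\pi_C + f\pi_G)
\end{pmatrix}
```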
Models of DNA Evolution
• Most Q matrices represent time-homogeneous, continuous-time, stationary Markov processes
• Assumptions
  – At any given site in a sequence, the rate of change from base i to base j is independent of the base that occupied the site prior to i
  – Time homogeneous/continuous – substitution rates do not change over time
  – Stationary – the relative frequencies of the bases (πA, πC, πG, πT) are at equilibrium
  – Many models are also time-reversible – at equilibrium, the overall amount of change from i to j is the same as from j to i (πi qij = πj qji); the rates themselves are equal only when the base frequencies are equal
• These assumptions don’t make much sense biologically but are necessary if substitutions are to be modeled as stochastic processes
Models of DNA Evolution
• Jukes-Cantor (JC69) – the simplest model
• Assumptions:
  – Equilibrium frequencies for the four nucleotides are 25% each (πA = πC = πG = πT = 1/4)
  – Equal probabilities exist for any substitution (a = b = c = d = e = f = 1)
• Once the Q matrix is stated, the probability of change from one base to another over evolutionary time, P(t), is obtained by calculating the matrix exponential, P(t) = e^(Qt)
  – Matrix algebra is involved. I took it back in 1991. Forgive me
• The resulting correction becomes d = -¾ ln(1 - (4/3)p)
  – p = the observed distance (p-distance)
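For readers who also took matrix algebra a while ago, the matrix exponential can be approximated numerically. The sketch below uses a plain truncated Taylor series in pure Python (an illustrative choice, not how phylogenetics software does it) and assumes the JC69 matrix is scaled so that one unit of time equals one expected substitution per site. It then checks that the JC69 correction recovers the true distance from the expected p-distance.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(q, t, terms=40):
    """Approximate exp(Q*t) by the truncated Taylor series sum_k (Qt)^k / k!."""
    n = len(q)
    qt = [[q[i][j] * t for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # k = 0 term
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, qt)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result


# Assumed scaling: off-diagonal rates 1/3, so the total rate of leaving a base is 1.
JC69_Q = [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]

t = 0.5                                   # true distance (substitutions per site)
P = mat_exp(JC69_Q, t)
p_expected = 1.0 - P[0][0]                # chance a site looks different after time t
d_recovered = -0.75 * math.log(1.0 - 4.0 * p_expected / 3.0)
print(p_expected, d_recovered)
```

The expected p-distance is smaller than t, and applying the correction to it returns t, which is the whole point of the formula.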
Models of DNA Evolution
• Using JC69
• Note the parallel substitution at position 9
• The actual distance is higher than the observed distance
• 6 changes actually occurred
Models of DNA Evolution
• Using JC69
• p = 4/10 = 0.4
• d (JC69) = -3/4 ln[1 - (4/3)(0.4)] = 0.5716
• A more reasonable estimate of the number of actual changes that occurred
• What assumptions of JC69 are violated?
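The worked example can be reproduced with a few lines. The function name is hypothetical; the guard at p ≥ 0.75 reflects the fact that the logarithm is undefined there, which corresponds to saturated sequences.

```python
import math


def jc69_distance(p: float) -> float:
    """Jukes-Cantor corrected distance from an observed p-distance."""
    if not 0.0 <= p < 0.75:
        raise ValueError("JC69 distance is undefined for p >= 0.75 (saturation)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


d = jc69_distance(4 / 10)
print(round(d, 4))       # corrected distance per site
print(round(d * 10, 1))  # implied number of changes over the 10 sites
```

Over 10 sites the corrected distance implies between 5 and 6 changes, closer to the 6 that actually occurred than the 4 that were observed.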
Models of DNA Evolution
• Kimura 2-parameter (K2P)
• Generally, transitions occur at higher rates than transversions
• This violates the rate assumptions of JC69
Models of DNA Evolution
• Kimura 2-parameter
• A different rate must be considered for transitions (α) and transversions (β), changing the Q matrix to:
[Figure: K2P Q matrix]
• π remains ¼ for all bases
• d = ½ ln[1/(1 - 2P - Q)] + ¼ ln[1/(1 - 2Q)]
• P and Q are the proportional differences between sequences due to transitions and transversions, respectively
• Note: if α = β …
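A sketch of the K2P distance computed directly from two aligned sequences, using the standard form d = ½ ln[1/(1 - 2P - Q)] + ¼ ln[1/(1 - 2Q)]. As before, skipping gapped or ambiguous sites is an assumption, and the example sequences are invented so that one third of the differences are transitions.

```python
import math

PURINES = {"A", "G"}


def k2p_distance(seq_a: str, seq_b: str) -> float:
    """Kimura 2-parameter distance between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # assumption: skip gaps and ambiguity codes
        compared += 1
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1    # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1  # purine<->pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    P = transitions / compared    # proportion of transitional differences
    Q = transversions / compared  # proportion of transversional differences
    return 0.5 * math.log(1.0 / (1.0 - 2.0 * P - Q)) + 0.25 * math.log(1.0 / (1.0 - 2.0 * Q))


seq1 = "A" * 30
seq2 = "GCT" + "A" * 27  # one transition (A->G), two transversions (A->C, A->T)
print(k2p_distance(seq1, seq2))
```

With transitions making up exactly one third of the differences, as in this example, the value coincides with the JC69 correction for the same p-distance.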
Models of DNA Evolution
• Felsenstein (1981) – F81
• In most taxa, A+T ≠ C+G
• If there are only a few G’s, the rate of substitution from G to A will be low compared to other substitutions
• Violates the rate assumptions of JC69
Models of DNA Evolution
• Felsenstein (1981) – F81
• Different frequencies must be considered for all bases; substitution rates are the same for all, changing the Q matrix to:
[Figure: F81 Q matrix]
• π is unique for each base (πA ≠ πC ≠ πG ≠ πT)
• Note that this model assumes similar base composition for all sequences under consideration
• Note: if πA = πC = πG = πT …
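F81 needs values for the four base frequencies. One simple option, assumed here for illustration, is to use the empirical frequencies pooled across all sequences, which only makes sense if base composition is similar among them.

```python
from collections import Counter


def base_frequencies(sequences):
    """Pooled empirical frequencies of A, C, G and T across aligned sequences."""
    counts = Counter(
        base for seq in sequences for base in seq.upper() if base in "ACGT"
    )
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no A, C, G or T characters found")
    return {base: counts[base] / total for base in "ACGT"}


print(base_frequencies(["AACG", "AATT"]))
```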
Models of DNA Evolution
• Hasegawa, Kishino and Yano (HKY85)
  – Combines F81 and K2P
• General Time Reversible (GTR)
  – Allows all six pairs of substitutions to have distinct rates
  – Allows unequal base frequencies
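The nesting of these models can be made concrete with one function that builds a GTR-style rate matrix from six relative rates and four base frequencies. This is a sketch: the matrix is left unscaled (no μ and no normalization), and the parameter values are arbitrary.

```python
def build_q(rates, freqs):
    """Build an unscaled GTR-style Q matrix as a dict of dicts.

    rates: relative rates keyed by 'AC', 'AG', 'AT', 'CG', 'CT', 'GT'.
    freqs: equilibrium base frequencies keyed by 'A', 'C', 'G', 'T'.
    """
    bases = "ACGT"
    q = {}
    for i in bases:
        row = {}
        for j in bases:
            if i != j:
                # Rate i -> j = relative rate of the pair * frequency of target base.
                row[j] = rates["".join(sorted(i + j))] * freqs[j]
        row[i] = -sum(row.values())  # diagonal makes the row sum to zero
        q[i] = row
    return q


def rates_with_kappa(kappa):
    """All transversions at rate 1, both transitions (A-G, C-T) at rate kappa."""
    return {"AC": 1.0, "AG": kappa, "AT": 1.0, "CG": 1.0, "CT": kappa, "GT": 1.0}


equal = {b: 0.25 for b in "ACGT"}
unequal = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}  # arbitrary example values

jc69 = build_q(rates_with_kappa(1.0), equal)     # equal rates, equal frequencies
k2p = build_q(rates_with_kappa(2.0), equal)      # Ti/Tv bias, equal frequencies
f81 = build_q(rates_with_kappa(1.0), unequal)    # equal rates, unequal frequencies
hky85 = build_q(rates_with_kappa(2.0), unequal)  # Ti/Tv bias, unequal frequencies
```

Passing six distinct rates together with unequal frequencies gives the full GTR matrix; every simpler model is the same function with some parameters constrained to be equal.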
Models of DNA Evolution
• A variety of other models exist:
  – Tajima-Nei (1984) – refines JC69 for more accurate rates of nucleotide substitution
  – Tamura 3-parameter (1992) – corrects for multiple hits
  – Tamura-Nei (1993) – corrects for multiple hits, considers purine and pyrimidine transitions differently
Models of DNA Evolution
• Varying substitution rates among sites in sequences (rate heterogeneity) can be compensated for
• Most often, a gamma (Γ) distribution is used
• An α value that determines the shape of the distribution can be estimated from the data and incorporated into calculations
Models of DNA Evolution
• Small values of α = L-shaped Γ-distribution and extreme rate variation among sites, most sites invariable but a few sites have very high substitution rates
• Large values (>1) of α = bell-shaped Γ-distribution and minimal rate variation among sites
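The effect of α can be seen by drawing per-site relative rates from a gamma distribution with mean 1 (shape α, scale 1/α; this mean-1 parameterization is the usual convention and is assumed here). The α values, the site count and the 0.1 cutoff for a "nearly invariable" site are arbitrary choices for illustration.

```python
import random
import statistics


def sample_site_rates(alpha, n_sites, seed=0):
    """Draw relative substitution rates for n_sites from Gamma(shape=alpha, scale=1/alpha)."""
    rng = random.Random(seed)
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]


for alpha in (0.2, 1.0, 5.0):
    rates = sample_site_rates(alpha, 20000)
    nearly_invariable = sum(r < 0.1 for r in rates) / len(rates)
    print(
        f"alpha={alpha}: mean={statistics.mean(rates):.2f} "
        f"variance={statistics.pvariance(rates):.2f} "
        f"fraction of sites with rate < 0.1: {nearly_invariable:.2f}"
    )
```

The mean stays near 1 in every case while the variance is about 1/α, so small α concentrates most sites near zero with a few very fast sites.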
Models of DNA Evolution
• Choosing the wrong model may give the wrong tree
  – Wrong model → incorrect branch lengths, transition/transversion (Ti/Tv) ratios, divergence rate estimates, mutation rates, divergence dates
• What model to choose and how to choose it?
• Generally, more complex models fit the data better
  – Thus, it may seem best to use the most complex model by default
  – However, more parameters must be estimated, making computation more difficult (longer) and increasing the possibility of error in estimation
• Find a medium between complexity and practicality
Models of DNA Evolution
• Choosing a model
• The fit of a model to the data is proportional to:
  – The probability of the data (D),
  – given a model of evolution (M),
  – a vector of model parameters (θ),
  – a tree topology (τ) and a vector of branch lengths (ν)
  – L = P(D | M, θ, τ, ν)
  – Often use the log likelihood to ease computation
  – l = ln P(D | M, θ, τ, ν)
• Likelihood ratio test (LRT)
  – LRT statistic: LRT = 2 (l1 – l0)
  – l1 = the maximum log likelihood under the more complex model (alternative hypothesis)
  – l0 = the maximum log likelihood under the less complex model (null hypothesis)
  – Always ≥ 0
  – Large value = the more complex model is better
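A sketch of the LRT calculation. It assumes the usual approximation for nested models, in which the statistic is compared to a chi-square distribution with degrees of freedom equal to the number of extra free parameters. The chi-square tail probability is computed with a small standard-library helper (integer degrees of freedom only), and the log likelihoods are invented for the example.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x < 0 or df < 1:
        raise ValueError("need x >= 0 and df >= 1")
    y = x / 2.0
    # Start from df = 1 (odd) or df = 2 (even), then step up with the
    # recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1).
    if df % 2:
        a, q = 0.5, math.erfc(math.sqrt(y))
    else:
        a, q = 1.0, math.exp(-y)
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def likelihood_ratio_test(lnl_null, lnl_alt, extra_params):
    """Return (LRT statistic, approximate p-value) for two nested models."""
    stat = 2.0 * (lnl_alt - lnl_null)
    return stat, chi2_sf(max(stat, 0.0), extra_params)


# Hypothetical maximized log likelihoods: JC69 (null) vs K2P (one extra parameter).
stat, p_value = likelihood_ratio_test(-2345.6, -2338.1, extra_params=1)
print(stat, p_value)
```

A small p-value means the extra parameter improves the fit by more than chance alone would explain, so the more complex model is preferred.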
Models of DNA Evolution
• Choosing a model
• Hierarchical likelihood ratio test (hLRT)
• Most of the models described above are nested, or hierarchical
  – e.g. JC69 is a special case of F81 where the base frequencies are equal
• ModelTest will perform all possible comparisons and evaluate them using a χ² test
Models of DNA Evolution
• Choosing a model
• Information criteria
• The likelihood of each model is penalized by a function of the number of free parameters (K) in the model; more parameters = higher penalty
• Akaike Information Criterion (AIC)
  – AIC = -2l + 2K
  – AIC = the amount of information lost when we use a particular model
  – Small values are better
  – ModelTest, ProtTest
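A short sketch of model ranking by AIC. The log likelihoods are invented, and K counts only the free substitution-model parameters (0 for JC69, 1 for K2P, 4 for HKY85, 8 for GTR), on the assumption that branch lengths add the same number of parameters to every model.

```python
def aic(log_likelihood, k):
    """Akaike Information Criterion: -2l + 2K (smaller is better)."""
    return -2.0 * log_likelihood + 2.0 * k


# Hypothetical (maximized log likelihood, free parameters) per model.
candidates = {
    "JC69": (-2345.6, 0),
    "K2P": (-2338.1, 1),
    "HKY85": (-2330.4, 4),
    "GTR": (-2329.9, 8),
}

scores = {name: aic(lnl, k) for name, (lnl, k) in candidates.items()}
best_by_aic = min(scores, key=scores.get)
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: AIC = {score:.1f}")
```

In this made-up example GTR has the highest likelihood but does not win, because its small gain in fit does not pay for four more parameters than HKY85.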
Models of DNA Evolution
• Choosing a model
• Bayesian methods
• Bayes factors are similar to the LRT
• Posterior probabilities can be calculated
• Most commonly, the Bayesian Information Criterion (BIC) is calculated
  – BIC = -2l + K ln n, where n is the sample size (commonly taken as the alignment length)
  – Smaller = better
• ModelTest & ProtTest
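A sketch of BIC using its standard definition, BIC = -2l + K ln n, where n is the sample size. Taking n to be the alignment length is a common convention and an assumption here, as are the log likelihoods and parameter counts.

```python
import math


def bic(log_likelihood, k, n):
    """Bayesian Information Criterion: -2l + K ln(n) (smaller is better)."""
    return -2.0 * log_likelihood + k * math.log(n)


# Hypothetical (maximized log likelihood, free parameters) per model.
models = {
    "JC69": (-2345.6, 0),
    "K2P": (-2338.1, 1),
    "HKY85": (-2330.4, 4),
    "GTR": (-2329.9, 8),
}
n_sites = 1000  # assumed alignment length

bic_scores = {name: bic(lnl, k, n_sites) for name, (lnl, k) in models.items()}
best_by_bic = min(bic_scores, key=bic_scores.get)
for name, score in sorted(bic_scores.items(), key=lambda item: item[1]):
    print(f"{name}: BIC = {score:.1f}")
```

Because ln n exceeds 2 once n is larger than about 7, BIC penalizes each parameter more heavily than AIC and tends to select simpler models.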
Models of DNA Evolution