Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Lee 1
UNIVERSITY OF CALIFORNIA
Santa Barbara
Testing Trait Evolution Models on Phylogenetic Trees
A Thesis submitted in partial satisfaction of the
requirements for the degree Master of Science
in Computer Science
by
Chunghau Lee
Committee in charge:
Professor Ambuj Singh, CoChair
Professor Todd Oakley, CoChair
Professor YuanFang Wang
March 2005
Lee 2
The thesis of Chunghau Lee is approved.
__________________________________________
YuanFang Wang
__________________________________________
Todd Oakley, Committee CoChair
__________________________________________
Ambuj Singh, Committee CoChair
March 2005
Lee 3
ABSTRACT
Testing Trait Evolution Models on Phylogenetic Trees
by
Chunghau Lee
Tracing the ancestry of current species, whether they be whole organisms or features of
organisms, by constructing phylogenetic trees may yield vital information surrounding the
relationships between species. In addition, practical applications of phylogenetic trees include
identifying related species to serve as new drug test targets. Since the construction of these trees are
based on genetic similarity, the phylogeny of actual physical traits may not be well represented.
However, by applying different models of trait evolution, one can test the evolution of observed data.
The testing is provided by the novel tool CoMET, short for Continuouscharacter Model Evaluation
and Testing, which ranks the results of these models using maximum likelihood. CoMET's
approach includes maximizing discontinuous nonlinear functions with an approximative algorithm
and reducing exponentiallygrowing data structures by simplifying tree redundancies. Furthermore,
simulation experiments provide suggestions for improvement by showing that one model variant is
ineffective while some models requiring trees to have zerolength branches unnecessarily add
complexity. A followup experiment then showed that for some models, user input can improve the
usefulness of the CoMET tests. In conclusion, the endproduct CoMET complements phylogenetic
trees through model testing for continuouslyvarying character data, addressing the need to infer the
phylogeny of nongenomic characteristics.
Lee 4
Table of Contents
Chapter 1: Background
1.1 Introduction 6
1.2 Trees, Taxa, and Characters 7
1.3 Tree Evaluation 7
1.4 Oakley's Nine Models of Trait Evolution 10
1.5 The Akaike Information Criterion 11
1.6 The Mesquite Project 11
Chapter 2: CoMET
2.1 CoMET Introduction 13
2.2 Program Flow 13
2.4 Degrees of Freedom for AIC Values 15
2.5 Free Models 16
2.6 Punctuated Models 17
2.7 Conclusion 22
Chapter 3: Simulation Experiments
3.1 Introduction to Simulations 23
3.2 Experiments with Asymmetry Ratios 23
3.3 Experiments with Pure Punctuated Trees 25
3.4 Results and Discussion 26
Lee 5
3.5 Followup Experiment and Results 29
3.6 Conclusions 31
Chapter 4: Summary
4.1 Conclusions 33
4.2 Open Problems and Future Work 34
Appendix: References 35
Lee 6
Chapter 1: Background
1.1 IntroductionPhylogeny is the study of ancestral relationships between current, observed species. One
example application of phylogenetic techniques is the wellknown Tree of Life (http://tolweb.org).
This project's ongoing purpose is to arrange all known organisms onto a tree showing all the
ancestral relationships. One application of phylogenetics is the mapping of gene expression data
(Shulze and Downward 2001) of an organism onto a tree to study the evolution of expression
(Oakley, et al 2004, Gu 2004). With the construction of phylogenetic trees, researchers may gain
insight that may be difficult to elucidate otherwise about the history and relationships between
compared data.
This first chapter discusses the background behind phylogenetic trees and covers techniques
measuring the quality of a hypothesis tree for given data. In addition, it also discusses the
background on the main project of this thesis: CoMET, short for Continuouscharacter Model
Evaluation and Testing. CoMET is a generalpurpose tool that applies phylogenetic techniques and
recently developed models by Dr. Todd Oakley to evaluate the phylogenetic history of observed
data. CoMET calculates the maximum likelihood (ML) (Hulsenbeck and Crandall 1997) of the
input data and hypothesis tree, and presents to the user the likelihoods of different trait evolutionary
models. The second chapter covers in detail the design and implementation of CoMET. The third
chapter discusses the simulation experiments on CoMET for comparing the different Oakley
models. In addition to answering existing questions about the models, the results of the experiments
also produced new insights on using these models.
Lee 7
1.2 Trees, Taxa, and CharactersA phylogenetic tree describes how its taxa are related ancestrally. Each taxon , located at the
tips of the tree, represent the current, observed species. A rooted phylogenetic tree starts at a root
node, representing the common ancestor of all taxa, and branches upward toward the taxa terminal
nodes at the tree tips. Between the root and the tips may be several internal nodes representing
ancestors of the taxa. For CoMET, only binary trees are used: binary in the sense that every parent
node has two and only two daughter nodes. Therefore, for n taxa nodes, there are n1 internal nodes
including the root. Feature data belonging to the taxa are called characters, and they may be
discrete values like DNA sequences, or, in the case of CoMET, continuouslyvarying values
representing various states. Character data is collected for each taxon, and the whole character
matrix may include several rows of characters for each, representing different states for different
conditions.
1.3 Tree EvaluationTo measure the quality of a tree for a given set of taxa and character data, parsimony score
and likelihood are two well known methods. Parsimony scores are generally related to trees built
using the maximum parsimony (MP) method (Eck and Dayhoff 1966, Fitch 1971), in which trees
are first built and the taxa are arranged at the tips. Then, the number of mutations required to
change from daughter states to parent states for all branches are summed together into the MP score.
Under MP, simpler hypotheses are preferred to more complicated ones, and therefore, the
configurations with the lowest scores are deemed as the best trees.
Like MP, ML also looks for the best fit trees, but its statistical approach produces
probabilities instead of scores. ML calculates the probability of data given a hypothesis tree and
model by accruing the individual probabilities of all branching events according to a random
Lee 8
distribution. CoMET calculates likelihood based on the work of J. Felsenstein (Felsenstein 1981),
who developed a system for calculating ML over continuous character data in his Restricted
Evolution Maximum Likelihood (REML) algorithm. In this algorithm, the Brownian motion
diffusion model is used to predict character states given a time from a parent state. The main
advantages of this model are its simplicity and its applicability to a wide range of data. If the root
state is zero, then the mean of all the taxa states is also zero because of how Brownian motion is
defined. This, in turn, becomes a limitation because it restricts the overall evolutionary growth as
nondirectional.
In Felsenstein's algorithm, the overall likelihood can be expressed with the following
equation, reflecting the underlying assumption of Brownian motion as the model for character
evolution:
L=∏ii '
1
2v i '
exp −x i− x i '
2
2 v i '
(Equation 1)
In this formula, i and i' represent two nodes in the tree and vi' is the time distance, usually depicted
as branch lengths, between these two nodes. Their states are represented by xi and xi' while the
parameter ß represents the rate of change. With REML, Felsenstein simplified Equation 1 and
introduced the recursive approach to compute likelihood.
The basic ML element in a tree is the contrast, which involves a parent node p with two
daughter branches b1 and b2, and two daughter states s1 and s2:
contrastp = 0.5 * ln (b1 + b2) – 0.5 * (s1 – s2)2 / (b1 + b2) (Equation 2)
The contrast value is the natural log of the likelihood of this branching event. As an example, let the
Lee 9
parent node have two daughter nodes that are terminal nodes, which contain character data. If there
are a total of three characters to examine, then the contrast operation would be calculated three
times and summed, once for each character and using the states from just one character at a time. If
a daughter node for a contrast is an internal node of the tree, then the state information for that
internal node is inferred from its two daughters based on Brownian motion. The equation below
shows how the state and branch length of p, a parent node, are calculated based on two daughter
branches b1 and b2, and two daughter states s1 and s2:
sp = ((s1 / b1) + (s2 / b2)) / (b11 + b2
1) (Equation 3a)
bp = bp + (b1 b2) / (b1 + b2) (Equation 4)
In the case that the length of one of the daughter branches is zero, then Equation 3a would have a
divisionbyzero problem in either (s1 / b1) or (s2 / b2). But if the parent node p and a daughter node
d node have no distance between them, then their states are assumed equal:
sp = sd | bd = 0 (Equation 3b)
The reason branch lengths are adjusted in Equation 4 is to allow for proper state inferences later.
The tree's actual branches are not changed, just the temporary representation. Again, given the
nature of these equations, the overall likelihood of a tree is computed recursively. Contrast values
are first collected from the tips of the tree and then the computation progresses downward toward
the root, calculating branch length adjustments and inferring states along the way. At the end of the
process, the contrasts from the internal nodes are summed together to form the likelihood.
Lee 10
1.4 Oakley's Nine Models of Trait EvolutionUsing the ML equations, CoMET tests the tree and character data to see which of the nine
Oakley's models of trait evolution fits best. These models represent a combination of different
styles and characteristics in trait evolution, and are grouped into three general classes and three
model types.
The three general classes are phylogenetic, nonphylogenetic, and punctuated. Models under
phylogenetic assume that character traits are directly affected by speciation events: a split in the tree
means a divergence in character states. In nonphylogenetic models, close genetic relatives are
assumed to be just as similar as distant relatives when compared by phenotype. In punctuated
models, phylogeny layout also correlates with the resulting character states, as in phylogenetic
models, but has one key difference: at each branching event in the tree, one daughter node retains
the same phenotype as the parent while the other daughter does not.
The three model types are distance, equal, and free. Models of the distance type assume that
branch lengths represent the amount of divergence a daughter state has from its parent. The equal
models assume equal divergence for all nodes by modeling all branches to have equal lengths.
Lastly, the free models allow for independent and different rates of divergence by individually
scaling the branch lengths.
Crossing three general classes with the three model types results in nine different models,
each combining different models of phylogeny and divergence rates:
1. Phylogenetic / Distance2. Phylogenetic / Equal3. Phylogenetic / Free4. Nonphylogenetic / Distance5. Nonphylogenetic / Equal6. Nonphylogenetic / Free7. Punctuated / Distance8. Punctuated / Equal9. Punctuated / Free
Lee 11
Again, using these models, CoMET examines the input data to see which evolutionary model fits
the data with the greatest likelihood.
1.5 The Akaike Information CriterionThe degrees of freedom (DOF) involved in the nine models are dependent upon the number
of parameters in the models. Consequently, when character data are tested against these models, the
ML results alone do not factor in choosing the best model. By using the Akaike Information
Criterion (AIC) (Akaike 1973) and defining the DOF for each model, the results of each model can
be ranked. The general formula is as follows:
AIC = 2 ln L + 2 d (Equation 5)
In this equation, L is the likelihood, and d is the DOF used in a particular model. The model with
the lowest AIC is deemed the best fit model for the data. Values of d used for the models are
discussed in the CoMET chapter.
1.6 The Mesquite ProjectCoMET is built upon the Mesquite platform,(Maddison and Maddison 2004) a pervasively
modular graphical program written in Java. The CoMET module contains three packages
accessible for Mesquite: the CoMET evaluator for character matrices, the CoMET evaluator for
single characters, and the CoMET simulation program. Mesquite benefits CoMET by providing a
graphical user interface; data structures for trees, taxa, and characters; an implementation of the
Brownian motion model; and also scientific routines from the Phylogenetic Analysis Library (PAL)
(Drummond and Strimmer 2001). The particular PAL classes used by CoMET are separately
Lee 12
packaged using the latest version from the official PAL website because CoMET requires
monitoring features not found in Mesquite's PAL.
Lee 13
Chapter 2: CoMET
2.1 CoMET IntroductionAs mentioned earlier, CoMET's purpose is to determine the most likely model of trait
evolution for a given set of continuouslyvarying data. Since the only limitation for the data is to be
continuouslyvarying, CoMET can be used for problems of many types. For example, physical
characteristics like body sizes (Mooers, et al 1999) and experimental data like microarray
expression levels can all be used as input for CoMET. A hypothesis tree is also needed from the
user, which can be constructed using the same data or a different data source. One example setup
(Oakley, et al 2004) analyzed the phylogeny of a set of related genes by first constructing a tree
using DNA comparison. The character data at the taxa tips contained the logbasedtwo of
expression levels for each gene at different time points in different microarray experiments. This, in
effect, tested how gene expression for this group of genes evolved.
2.2 Program FlowAgain, the input parameters for CoMET consist of a character data matrix and a hypothesis
tree. CoMET begins by making copies of the hypothesis tree and altering these copies to represent
the different models (Figure 1). Then, the character matrix is used in the likelihood calculations for
these altered trees.
Lee 14
Figure 1: CoMET program flow chart.
How CoMET alters a tree depends on which two model combinations the tree represents.
For the distance models, the branch lengths are left unchanged. For the equal models, all branch
lengths are changed to one to represent equal rates of divergence. In the free models, each branch
length is allowed to grow or shrink to achieve the greatest likelihood to represent varying divergence
rates.
In addition to the changes according to model types, CoMET applies changes according to
Lee 15
the general class to which the tree is assigned. For phylogenetic, the branches are once again left
unchanged. For nonphylogenetic, every branch not directly connected to the taxa nodes has its
length reduced to zero, modeling the situation in which all taxa nodes are directly connected to the
root and allowing for close relatives to be just as similar or dissimilar to each other as distant ones.
For punctuated, CoMET presents two different approaches, a punctuated average and a punctuated
maximal version. The finer details about tree alterations for the punctuated models are discussed in
a later section. However, just as a brief description, CoMET models punctuated evolution by
changing the branch length of one daughter branch at every internal parent node to zero. This
effectively models one daughter state to remain the same as the parent state, while the state of the
daughter with the nonzero branch length would diverge.
After the construction of these altered trees, likelihood can be calculated using character
data. One additional parameter is optimized at this stage: ß, from Equation 1, representing the rate
of phenotypic change. This value is multiplied to all branch lengths of the tree to further maximize
the likelihood. CoMET finds the optimal scaling factor through PAL's UnivariateMinimum class,
which employs Brent's numerical method without calculating derivatives. CoMET implements an
extension of the UnivariateFunction class to be used with UnivariateMinimum. This function takes
in the scalar to use, scales all the branches of the tree using this value, and returns the likelihood of
the resulting tree with the given character data. This process quickly finds the scalar yielding
maximal results.
2.4 Degrees of Freedom for AIC ValuesAs mentioned earlier, the use of AIC over ML facilitates direct comparison between the
results from these altered trees because AIC values account for the different DOF's used by the
models. For most models, the DOF d is 1 because the only parameter is the final scaling factor.
Lee 16
The free models have different values for d because each branch length may have a parameter. In
the case of phylogenetic / free, d equals the total number of branches, which for a tree with n taxa is
2n – 2. For nonphylogenetic / free, there are only as many scalable branches as there are taxa
nodes, so this model's DOF is n. For the case of punctuated / free, only half of the branches remain
nonzero and are thus parameters; so for this model, d = n – 1. After adjusting ML into AIC values,
CoMET outputs the results for each model to the user. The model with the lowest AIC is the best fit
for the data.
2.3 Free ModelsTrees under the free model type have each branch length adjusted toward greater overall
likelihood. The simplest way to achieve this is to provide the basis tree and character data, and let
one of PAL's MultivariateMinimum classes optimize every branch. This idea was tried but
abandoned for two reasons. One, the running time is long and does not scale well as the number of
taxa increases. The second reason is more difficult to ignore: the function to minimize is not
continuous, which in turn causes PAL's methods to fail to produce useful results. The reason why
the function is not continuous is because branch lengths must be allowed to scale to as low as zero.
However, only one daughter branch is allowed to do so at every node due to the divisionbyzero in
Equation 4. In addition, the state of the parent node cannot be determined since both daughter
states are of a distance zero away, there is no way to choose one state over another. PAL's methods
do not know of this limitation and will test combinations that result in these situations, which are
called zero forks. When the ML calculator comes to evaluate the choices made by the optimizer
and encounters these zero forks, it returns a low likelihood value to discourage the choice. When
multiple variables are involved, these spikes confuse the optimizer, which surrenders with poor
results.
Lee 17
CoMET's solution is a greedy algorithm that recursively optimizes two sister branches at a
time, resulting in a maximal value as the ML. With just two variables, PAL's conjugate direction
search has no problems dealing with the function spike in the zero fork situation and can quickly
reach an answer. Even though only two branches at a time are optimized, the quality metric for the
choices made by the conjugate direction search is the ML over the entire tree and not just the single
contrast value of the parent node involved. This choice helps guide the subsequent pair
optimizations toward the highest likelihood in the end. Tree recursion also plays a helpful role to
improve the quality of this algorithm. By processing the tips of the tree before the inner branches,
optimizations in the later stages would not affect the results from earlier. An additional assumption
is that the quality of earlier optimizations is less sensitive to later optimizations because the
inference of states starts from tips and move toward root, as described by Equations 3a, 3b, and 4.
In all, this greedy algorithm represents a besteffort, twoatatime maximizer and runs significantly
faster than simultaneously optimizing all the branch lengths.
2.4 Punctuated ModelsCoMET presents two variations of the punctuated class: punctuated maximal and
punctuated average. In the first case, a greedy algorithm recursively visits parent nodes and sets
one of the daughter branch lengths to zero. The daughter branch that gets chosen is the one that will
result in a higher overall likelihood. Like in the greedy algorithm used for the free models, the
quality metric is the overall likelihood and not just the local likelihood at the parent node.
The punctuated average model presents a more conservative model by considering all
combinations of punctuation and averaging the results. The computation challenge here is the
exponentially growing number of combinations possible as the number of taxa increases. For a tree
with t taxa, there are t – 1 internal nodes, each having two choices of punctuation as to which
Lee 18
daughter branch length to set to zero. This results in a total of 2t1 different versions of the tree
under the punctuated average model, as shown in Figure 2. However, within these 2t1 trees are
redundancies that can be exploited to reduce running time. The below algorithm describes how
CoMET handles the punctuated average calculations:
1. If the current node's daughters are taxa nodes, sum the two combinations' contrasts
and return.
2. Let TA be the subtree of daughter A, and TB be the subtree of daughter B. Let I and J
be the number of internal nodes of each respective subtree.
3. Calculate MLA and MLB as the total of ML of the combinations in TA and TB,
respectively.
4. Let MLA' = MLA * 2J + 1, because TA is repeated 2J + 1 times. 2J + 1 is also the number of
different arrangements for TB. The extra 1 added to J accounts for the parent node of
TA and TB.
5. Likewise, MLB' = MLB * 2I + 1
6. Efficiently compile all possible statepairs at this node P and calculate the contrasts
using these pairs and the daughter branch lengths. Let MLP be the sum of these
contrasts.
7. The total ML at current parent node P, covering all combinations, is MLP' = MLP +
MLA' + MLB'.
8. The average ML is then MLP' / k, where k = 2n – 1 and n is the number of taxa and n
– 1 is the number of internal nodes in the tree.
Lee 19
Figure 2: A threetaxa tree has four punctuation possibilities.
Even though significant redundancies have been reduced by steps 4 and 5, step 6 represents another
challenge. In step 6, the sum of all possible contrasts at node P is calculated, so the need to find all
possible states that occur at P's daughter nodes appears, and this collection of states must also be
done as efficiently as possible due to the exponential nature in counting the number of state pairs.
To illustrate, take Figure 2 again as example. In each punctuated version of the tree, the root node's
contrast calculation requires the state at taxon 1 and the state at the internal node p, parent to taxa
nodes 2 and 3. This translates to a pair of states per character per version of punctuation at the root
node. The state at p is either the state at taxon 2 or taxon 3, depending on which daughter branch is
zero due to Equation 3b. Therefore, instead of storing states, CoMET can store taxon numbers and
lookup the states later.
Lee 20
In all, we have these pairs of taxa contributing their states for contrast calculation at the root
node: {(1, 2), (1, 3), (1, 2), (1, 3)}. The size of this list is 2n – 1 if n is the number of internal nodes at
a particular subtree. To save access time and memory, CoMET stores the taxa pair list as a taxa pair
matrix (TPM). The TPM stores taxa pair counts and uses the counts as multipliers in the contrast
total. Because there are two ways to punctuate two daughter branches, CoMET calculates the
contrast twice using the two ways of punctuation and sums the two. This sum is then multiplied by
the count value c stored in the TPM. If the pair information were stored as lists with repeated
elements, then CoMET would have to repeat the same contrast calculations by c times . For the root
node in the above example, its TPM would look like this:
1 2 3
1 1 1
2 1
3 1
Figure 3: The root node TPM for trees in Figure 2.
This small example does not show significant savings in memory, but for larger trees, the savings in
memory and access time is clear. This TPM shows a repeat of counts because of the diagonal
mirroring effect, but CoMET only reads half of this matrix when doing its calculations. There are
two additional advantages for storing taxa pair information as TPM's. The first one is the ability to
build TPM's recursively without requiring much more space. They begin as small tables at upper
internal nodes, but merge together into larger TPM's as the recursion descends back toward the
bottom of the tree. When the recursion returns to the root node, its TPM becomes an n × n matrix,
where n is the total number of taxa. Merging of TPM's is also fast due to indexed lookups for pair
Lee 21
counts. The second benefit of TPM's is reusability. The matrix uses taxa as rows and columns, and
since the layout of the tree does not change with each punctuation version, the same set of matrices
can be used. If instead of TPM's CoMET were to generate lists of propagated states at every parent
node, then for every character, a new list would be required and generated, adding to running time
significantly.
How CoMET merges TPM's is explained next. Every cell in the first TPM is compared with
every cell in the second. If two cells are nonzero, the product of the two cells is added to the
corresponding cells in the new TPM. As an example, Figures 5, 6, and 7 show a sample tree and the
TPM's of the root's daughter nodes.
Figure 4: A fivetaxa tree.
A B
A 1
B 1
Figure 5: The TPM of the left daughter of the root in Figure 4.
C D E
C 1 1
D 1
Lee 22
C D E
E 1
Figure 6: The TPM of the right daughter of the root in Figure 4.
First, the cells AB from Figure 5 and CD from Figure 6 are merged. This adds 1 into the new
matrix's cells AC, AD, BC, and BD. Next, cells AB and CE from the two older tables are processed,
adding 1 to the new matrix's cells AC, AE, BC, and BE (Figure 8). Keep in mind that although
CoMET stores the counts twice due to the mirroring in the TPM, the actual contrast calculation
would only involve reading a diagonal half of this matrix.
A B C D E
A 2 1 1
B 2 1 1
C 2 2
D 1 1
E 1 1
Figure 7: Merged TPM.
2.5 ConclusionCoMET is a generalpurpose tool that matches character data with a bestfit model chosen
from one of Oakley's nine models of trait evolution. As long as an experiment involves
continuouslyvarying data, CoMET can be used to hypothesize the data's phylogeny. CoMET offers
automatic optimization of the ß parameter, a fast algorithm for the free models, and an efficient
process to handle punctuated average models. Through the Mesquite platform, CoMET is easily
accessible through a graphical user interface for end users. In addition, Mesquite's framework
allows CoMET to be easily extensible and integrated into other Mesquite modules.
Lee 23
Chapter 3: Simulation Experiments
3.1 Introduction to SimulationsThe purpose of the simulation experiments conducted was to answer questions concerning
the punctuated models. First, the DOF for punctuated maximal models is unclear, because choices
exist for deciding which branches are to remain unchanged and which to have lengths assumed to be
zero. If these choices are considered as parameters, then the DOF would be as high as the number
in free models, always resulting in poorer AIC values. The punctuated average avoids this
uncertainty by averaging the ML over all versions of punctuation, although this model may be too
conservative because its ML results have mostly been lower than other models. On the other hand,
the punctuated maximal only processes the maximal version of the many ways to punctuate a tree,
and the punctuation choices it makes may count as extra parameters. At this point, the DOF for
punctuated maximal is unclear, and one of the goals of these experiments is to empirically
determine its DOF.
The simulations first compared the two punctuated models with the phylogenetic models to
test their sensitivities to varying levels of punctuation. The findings of these simulations lead to a
greater understanding surrounding the DOF issue, suggesting an addition to CoMET for user
defined punctuation.
3.2 Experiments with Asymmetry RatiosThe experiments consisted of running a subset of the CoMET calculations on characters
simulated over a range of punctuation levels. Specifically, the phylogenetic, punctuated maximal,
and punctuated average classes ran ML and AIC calculations for each of their three model types:
Lee 24
distance, equal, and free for a total of nine models. The basis trees used in the CoMET simulations
were the yeast kinases and HSP dnaK trees borrowed from Oakley's data. The program flow of the
simulations is explained next.
Prior to simulating character data for CoMET, the simulator must first build trees of different
punctuation levels with which to simulate characters. The measure of punctuation is determined by
the length ratio between two sister branches. Trees with asymmetry ratios (AR's) ranging from 20
to 2000 were randomly constructed using a Gaussian distribution centered at the desired asymmetry
ratio to provide exact multipliers for one of every two sister branches. These onesided scaling
actions turned the basis tree into variations with punctuated branching events.
After creating random punctuated trees, the simulator generated character states using
Mesquite's builtin Brownian motion model. During the course of testing, it was discovered that
generating states for more than one character leads to uninteresting results. This was due to the
Brownian motion model producing tip states averaging toward zero in this restricted evolution
model. With enough characters, even the states for a single taxon's multiple characters began to
average toward zero due to the balanced mix of positive and negative states generated by Mesquite's
Brownian motion model. Consequently, the characters did not show indications of punctuation, and
the decision was made to generate only one character per random punctuated tree.
After simulating a character for a random tree, CoMET calculated the ML and AIC values
for the aforementioned nine models. These calculations were repeated for ten iterations per AR. In
each iteration, the best AIC from the three phylogenetic models was paired with the best AIC from
punctuated maximal's three models and the best AIC from punctuated average's three models. The
ratios between the best phylogenetic AIC with the other two best AIC's from the punctuated models
were then computed and averaged over these ten iterations. The overall program flow for this
simulation is show in Figure 8. This test showed how different the two variants of punctuated are to
Lee 25
phylogenetic when using simulated punctuated character data.
3.3 Experiments with Pure Punctuated TreesAlongside the experiments using AR's to generate random punctuated trees were the
experiments that use pure punctuated trees for character simulation. In this case, the AR was
referred to as the nonzero length (NZL) because this random number was used for the exact branch
length of one daughter branch per internal node while the other daughter branch was set to zero for
ultimate punctuation. The results from these tests were compared against the results from the
experiments with the more realistic punctuated trees discussed earlier.
Lee 26
Figure 8: Experiment program flow over a range of AR values.
3.4 Results and DiscussionThe graphs in Figure 9 and 10 below show how well punctuated maximal and punctuated
average detect punctuation as compared to the standard phylogenetic models. In addition, the
Lee 27
results of the experiments generating pure punctuated trees using NZL settings are also plotted.
AIC Results Comparison Mean Median
(Punctuated Max. / Phy.) 1 0.00880 0.00850
(Punctuated Avg. / Phy.) 1 0.03144 0.03144
Pure (Punctuated Max. / Phy.) 1 0.00650 0.00500
Pure (Punctuated Avg. / Phy.) 1 0.03807 0.03746
Figure 9: Results based on the kinases tree.
Lee 28
AIC Results Comparison Mean Median
(Punctuated Max. / Phy.) 1 0.01277 0.01207
(Punctuated Avg. / Phy.) 1 0.04439 0.04317
Pure (Punctuated Max. / Phy.) 1 0.01047 0.00977
Pure (Punctuated Avg. / Phy.) 1 0.03790 0.03632
Figure 10: Results based on the HSP dnaK tree.
From these results, three main observations can be made. The first is the performance of the
punctuated maximal. This model is a significantly better fit for the character data than punctuated
average. In the case of the kinases tree, the punctuated maximal also beats phylogenetic rather
consistently. The second observation finds the punctuated maximal to be insensitive to the changing
asymmetry, unlike the punctuated average model. This insensitivity plus the first observation
together suggest that perhaps the punctuated maximal is too aggressive to model punctuation. In
addition, these findings show that even the conservative punctuated average has the steady trend to
Lee 29
beat phylogenetic as asymmetry increases. As a result, punctuated average was accepted as the
more useful model.
The third main observation is the similarity between results from simulating characters with
pure and impure punctuated trees. Since pure punctuation is infinitely more punctuated than the
impure, one expects results from those types of trees to be significantly different. However, the
results say otherwise. In light of these findings, a likely explanation for the similarity is that even
when the resulting characters show pure punctuation, the ML for pure punctuation would not be
significantly different from the ML for impure punctuation due to the smoothing effect of
randomness in the Brownian motion model.
3.5 Followup Experiment and ResultsUsing the lessons learned from the three observations from the first experiment, a followup
experiment was conducted. This experiment's purpose was to test whether or not redefining
punctuation as a userspecified variable is feasible. The graphs of the punctuated average models in
Figures 9 and 10 show a steady trend toward favoring punctuated average over phylogenetic, so it
would be interesting to see if the user of CoMET can decide what asymmetry ratio represents the
threshold in defining a tree as punctuated or not. The graph would then be shifted downward and
cross into the negatives at some point by multiplying a factor to the DOF to lower the AIC. This
time, the free model was not used because it does not have the same DOF as the distance and equal
models; therefore, an adjustment to the DOF would not account for all three model types properly.
Also in this followup, results using punctuated maximal were no longer calculated, and characters
were simulated using impure punctuated trees only. This is because, as previously discussed, the
punctuated maximal is too aggressive to be useful and the pure punctuated trees offer no advantages
over the more realistic impure punctuation. The chart in Figure 11 below shows the AIC ratio
Lee 30
comparison between punctuated average and phylogenetic using the best AIC from distance and
equal only.
Figure 11: AIC ratio results without influence from the free model.
Suppose that a CoMET user runs a simulation like the one above, and wishes to set a
punctuation threshold at AR = 500 so that when the full CoMET test is run, the punctuated average
model would be favored for situations in which the AR is found to be over 500. CoMET could take
the simulation data and find a DOF multiplier for the punctuated average to be used for later testing
against character data. Using this experiment as an example, the average ML for phylogenetic at AR
= 500 is 25.105. The average ML for punctuated average at AR = 500 is 26.053. If AR = 500 is
the threshold, then for AR < 500, phylogenetic would have a lower AIC than punctuated average,
and vice versa for AR > 500. Therefore, at AR = 500, the AIC's of both should be equal, resulting
in the below expression using Equation 5:
2(25.105) + 2(1) = 2(26.056) + 2(1)x
Lee 31
In this expression, x is the DOF multiplier necessary to establish the redefinition of punctuation.
The value 1 is the default DOF for punctuated models. The value of x in this expression becomes
0.049. If this is multiplied to the DOF for punctuated average / distance and punctuated average /
equal, then punctuated should be favored over phylogenetic when AR > 500 is detected. These
custom DOF multipliers are treespecific, so separate simulations to determine these multipliers are
necessary for different trees.
As for the free model, a different simulation can be run that compares just the punctuated
average / free with the best of the three phylogenetic models. A new DOF multiplier can be derived
and reused for repeated testing with the same tree.
3.6 ConclusionsThe simulation experiments initially began as a quest to find a way to deal with the DOF for
the punctuated maximal models. After the initial findings, the usefulness of punctuated maximal
was devalued so the DOF problem became a nonissue. Although punctuated average shows a good
sensitivity to varying asymmetries, it is usually too conservative to beat phylogenetic. Therefore,
the followup simulation shows how one can give punctuated average a handicap by lowering its
DOF with a multiplier and allow the user to defined the threshold of punctuation.
Another insightful finding is the similarity of ML's between data simulated from pure and
impure punctuated trees. This discovery suggests that zerolength branches may not be that
different from shortlength branches after all. A consequent reduction of some of CoMET's
complexities in the future may follow, because modeling zerolength branches may no longer be
necessary for the free models. In which case, confusing zero fork scenarios would not exist, and
PAL's multivariate optimizers may be able to replace the current greedy algorithm.
Lee 32
Chapter 4: Summary
4.1 ConclusionsPhylogenetic trees' graphical representations of ancestry can lead to a better understanding
about current species as well as new insights concerning their evolutionary conditions. However,
since most trees are based on genetic comparisons, they do not necessarily reflect the phylogeny of
other features of the taxa. To address this issue, CoMET uses the Oakley models to test character
data against phylogenetic trees and answers the question as to how the given data evolved.
The development of CoMET presented challenges in optimizing branch lengths of the free
models and improving the efficiency in calculating the punctuated average models. The free
models were accommodated by an approximative algorithm and the efficiency in calculating the
ML for punctuated average was drastically improved through redundancy analysis and reduction..
Simulation experiments further increased the understanding on the models and the
underlying algorithms in CoMET. After these experiments, the punctuated maximal is no longer
considered a useful model due to its aggressive and insensitive results as well as its uncertainty in
calculating AIC's. In addition, the comparison between pure and impure punctuated trees shows
that the difference in ML results between using zerolength branches and nonzerolength branches
is basically nonexistent.
Over 3600 lines of Java code are available to add to the Mesquite Project. This includes
CoMET in wholematrix and singlecharacter versions as well as the simulation module used in the
experiments. Binaries, source, and documentation are available at
http://www.cs.ucsb.edu/~chunghau/comet, or by contacting Chunghau Lee or Dr. Todd Oakley at
the University of California, Santa Barbara.
Lee 33
4.2 Open Problems and Future WorkAlthough CoMET was released even before simulations were done, the unsolved problems
existing then still exist now. The major hurdle is still the free model. Since the simulations show
that zero and nonzero branch lengths do not show much difference in ML results, we can model the
free trees as continuous multivariable functions without zerofork problems and use the conjugate
gradient method.
Another idea for a new feature is the custom definition of punctuation through simulation
training. After receiving the hypothesis tree and an AR from the user to use as the punctuation
tolerance level, CoMET can run the simulation similar to the one shown in Figure 11 to determine a
proper DOF multiplier for the punctuated average models. This would then allow character data
suggesting punctuation exceeding the userspecified tolerance to be designated as punctuated.
Lee 34
Appendix: References
Akaike, H. 1973. Information theory and an extension of the maximum likelihood principal. Proc. 2nd Int. Symp. Information Theory, Suppl. Problems of Control and Information Theory. 267281.
Drummond, A., and K. Strimmer. 2001. PAL: An objectoriented programming library for molecular evolution and phylogenetics. Bioinformatics 17: 662663.
Eck, R.V. and Margaret O. Dayhoff. 1966. Evolution of the Structure of Ferredoxin Based on Living Relics of Primitive Amino Acid Sequences. Science 152(3720): 363366.
Felsenstein, J. 1981. Evolutionary Trees From Gene Frequencies and Quantitative Characters: Finding Maximum Likelihood Estimates. Evolution 35(6):12291242.
Felsenstein, J. 2003. Inferring Phylogenies. Sinauer Associates, Inc.
Fitch, W. M. 1971. Toward defining the course of evolution: minimum change for a specified tree topology. Systematic Zoology 20:406416.
Gu, Xun. 2004. Statistical framework for phylogenetic analysis of expression profiles. Genetics 167:531542.
Hulsenbeck, J. P. and K. A. Crandall. 1997. Phylogeny Estimation and Hypothesis Testing Using Maximum Likelihood.
Maddison, W. P. and D.R. Maddison. 2004. Mesquite: a modular system for evolutionary analysis. Version 1.05 http://www.mesquiteproject.org.
Mooers, A., S. Vamosi, and D. Schluter. 1999. Using Phylogenies to Test Macroevolutionary Hypotheses of Trait Evolution in Cranes (Gruinae). The American Naturalist 154(2):249259.
Oakley, T., Z. Gu, E. Abouheif, N. H. Patel, and W. H. Li. 2004. Comparative Methods for the Analysis of Gene Expression Evolution: An Example Using Yeast Functional Genomic Data. Molecular Biology and Evolution 22(1):4050.
Saitou, N. and M. Nei. 1987. The neighborjoining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution 4(4):406425.
Schulze, A. and J. Downward. 2001. Navigating gene expression using microarrays a technology review. Nature Cell Biology 3:E190E195.