7/30/2019 -IEEE Transactions on Computational Biology and Bioinformatics (January-March). Volume 2, Number 1(2005)
Guest Editorial: WABI Special Section Part II
Junhyong Kim and Inge Jonassen
The Fourth International Workshop on Algorithms in Bioinformatics (WABI) 2004 was held in Bergen, Norway, September 2004. The program committee consisted of 33 members and selected, among 117 submissions, 39 to be presented at the workshop and included in the proceedings from the workshop (volume 3240 of Lecture Notes in Bioinformatics, series edited by Sorin Istrail, Pavel Pevzner, and Michael Waterman).
The WABI 2004 program committee selected a small number of papers among the 39 to be invited to submit extended versions of their papers to a special section of the IEEE/ACM Transactions on Computational Biology and Bioinformatics. Four papers were published in the October-December 2004 issue of the journal and this issue contains an additional three papers. We would like to thank both the entire program committee for WABI and the reviewers of the papers in this issue for their valuable contributions.

The first of the papers is "A New Distance for High Level RNA Secondary Structure Comparison," authored by Julien Allali and Marie-France Sagot. This paper describes algorithms for comparing secondary structures of RNA molecules where the structures are represented by trees. The problem of classifying RNA secondary structure is becoming critical as biologists are discovering more and more noncoding functional elements in the genome (e.g., miRNA). Most likely, the major functional determinants of the elements are their secondary structure and, therefore, a metric between such secondary structures will also help delineate clusters of functional groups. In Allali and Sagot's paper, two tree representations of secondary structure are compared by analysing how one tree can be transformed into the other using an allowed set of operations. Each operation can be associated with a cost and the distance between two trees can then be defined as the minimum cost associated with a transform of one tree to the other. Allali and Sagot introduce two new operations that they name edge fusion and node fusion and show that these alleviate limitations associated with the classical tree edit operations used for RNA comparison. Importantly, they also present algorithms for calculating the distance between trees allowing the new operations in addition to the classical ones, and analyze the performance of the algorithms.
The second paper is "Topological Rearrangements and Local Search Method for Tandem Duplication Trees" and is authored by Denis Bertrand and Olivier Gascuel. The paper approaches the problem of estimating the evolutionary history of tandem repeats. A tandem repeat is a stretch of DNA sequence that contains an element that is repeated multiple times and where the repeat occurrences are next to each other in the sequence. Since the repeats are subject to mutations, they are not identical. Therefore, tandem repeats occur through evolution by copying (duplication) of repeat elements in blocks of varying size. Bertrand and Gascuel address the problem of finding the most likely sequence of events giving rise to the observed set of repeats. Each sequence of events can be described by a duplication tree and one searches for the tree that is the most parsimonious, i.e., one that explains how the sequence has evolved from an ancestral single copy with a minimum number of mutations along the branches of the tree. The main difference with the standard phylogeny problem is that the linear ordering of the tandem duplications imposes constraints on the possible binary tree form. This paper describes a local search method that allows exploration of the complete space of possible duplication trees and shows that the method is superior to other existing methods for reconstructing the tree and recovering its duplication events.

The third paper is "Optimizing Multiple Seeds for Homology Search," authored by Daniel G. Brown. The paper presents an approach to selecting starting points for pairwise local alignments of protein sequences. The problem of pairwise local alignment is to find a segment from each sequence so that the two local segments can be aligned to obtain a high score. For commonly used scoring schemes, this can be solved exactly using dynamic programming. However, pairwise alignment is frequently applied to large data sets and heuristic methods for restricting alignments to be considered are frequently used, for instance, in the BLAST programs. The key is to restrict the number of alignments as much as possible, by choosing a few good seeds, without missing high scoring alignments. The paper shows that this can be formulated as an integer programming problem and presents an algorithm for choosing optimal seeds. Analysis is presented showing that the approach gives four times fewer false positives (unnecessary seeds) in comparison with BLASTP without losing more good hits.
Junhyong Kim
Inge Jonassen
Guest Editors
IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 2, NO. 1, JANUARY-MARCH 2005 1
J. Kim is with the Department of Biology, University of Pennsylvania, 3451 Walnut Street, Philadelphia, PA 19104. E-mail: [email protected].
I. Jonassen is with the Department of Informatics and Computational Biology Unit, University of Bergen, HIB N5020 Bergen, Norway. E-mail: [email protected].
For information on obtaining reprints of this article, please send e-mail to: [email protected].
1545-5963/05/$20.00 © 2005 IEEE. Published by the IEEE CS, CI, and EMB Societies & the ACM
Junhyong Kim is the Edmund J. and Louise Kahn Term Endowed Professor in the Department of Biology at the University of Pennsylvania. He holds joint appointments in the Department of Computer and Information Science, Penn Center for Bioinformatics, and the Penn Genomics Institute. He serves on the editorial board of Molecular Development and Evolution and the IEEE/ACM Transactions on Computational Biology and Bioinformatics, the council of the Society for Systematic Biology, and the executive committee of the CyberInfrastructure for Phylogenetics Research. His research focuses on computational and experimental approaches to comparative development. The current focus of his lab is in three areas: computational phylogenetics, in silico gene discovery, and comparative development using genome-wide gene expression data.

Inge Jonassen is a professor of computer science in the Department of Informatics at the University of Bergen in Norway, where he is a member of the bioinformatics group. He is also affiliated with the Bergen Center for Computational Science at the same university where he heads the Computational Biology Unit. He is also vice president of the Society for Bioinformatics in the Nordic Countries (SocBiN) and a member of the board of the Nordic Bioinformatics Network. He coordinates the technology platform for bioinformatics funded by the Norwegian Research Council functional genomics programme FUGE. He has worked in the field of bioinformatics since the early 1990s, where he has primarily focused on methods for discovery of patterns with applications to biological sequences and structures and on methods for the analysis of microarray gene expression data.
For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.
A New Distance for High Level RNA Secondary Structure Comparison
Julien Allali and Marie-France Sagot
Abstract: We describe an algorithm for comparing two RNA secondary structures coded in the form of trees that introduces two new operations, called node fusion and edge fusion, besides the tree edit operations of deletion, insertion, and relabeling classically used in the literature. This allows us to address some serious limitations of the more traditional tree edit operations when the trees represent RNAs and what is searched for is a common structural core of two RNAs. Although the algorithm complexity has an exponential term, this term depends only on the number of successive fusions that may be applied to a same node, not on the total number of fusions. The algorithm remains therefore efficient in practice and is used for illustrative purposes on ribosomal as well as on other types of RNAs.

Index Terms: Tree comparison, edit operation, distance, RNA, secondary structure.
1 INTRODUCTION
RNAs are one of the fundamental elements of a cell. Their role in regulation has been recently shown to be far more prominent than initially believed (20 December 2002 issue of Science, which designated small RNAs with regulatory function as the scientific breakthrough of the year). It is now known, for instance, that there is massive transcription of noncoding RNAs. Yet current mathematical and computer tools remain mostly inadequate to identify, analyze, and compare RNAs.

An RNA may be seen as a string over the alphabet of nucleotides (also called bases), {A, C, G, U}. Inside a cell, RNAs do not retain a linear form, but instead fold in space. The fold is given by the set of nucleotide bases that pair. The main type of pairing, called canonical, corresponds to bonds of the type A-U and G-C. Other rarer types of bonds may be observed; the most frequent among them is G-U, also called the wobble pair. Fig. 1 shows the sequence of a folded RNA. Each box represents a consecutive sequence of bonded pairs, corresponding to a helix in 3D space. The secondary structure of an RNA is the set of helices (or the list of paired bases) making up the RNA. Pseudoknots, which may be described as a pair of interleaved helices, are in general excluded from the secondary structure of an RNA. RNA secondary structures can thus be represented as planar graphs. An RNA primary structure is its sequence of nucleotides while its tertiary structure corresponds to the geometric form the RNA adopts in space.
Apart from helices, the other main structural elements in
an RNA are:
1. hairpin loops, which are sequences of unpaired bases closing a helix;
2. internal loops, which are sequences of unpaired bases linking two different helices;
3. bulges, which are internal loops with unpaired bases on one side only of a helix;
4. multiloops, which are unpaired bases linking at least three helices.

Stems are successions of one or more among helices, internal loops, and/or bulges.

The comparison of RNA secondary structures is one of
the main basic computational problems raised by the study
of RNAs. It is the problem we address in this paper. The
motivations are many. RNA structure comparison has been
used in at least one approach to RNA structure prediction
that takes as initial data a set of unaligned sequences
supposed to have a common structural core [1]. For each
sequence, a set of structural predictions are made (for
instance, all suboptimal structures predicted by an algorithm
like Zuker's MFOLD [15], or all suboptimal sets of
compatible helices or stems). The common structure is then
found by comparing all the structures obtained from the
initial set of sequences, and identifying a substructure
common to all, or to some of the sequences. RNA structure
comparison is also an essential element in the discovery of RNA structural motifs, or profiles, or of more general
models that may then be used to search for other RNAs of
the same type in newly sequenced genomes. For instance,
general models for tRNAs and introns of group I have been
derived by hand [3], [10]. It is an open question whether
models at least as accurate as these, or perhaps even more
accurate, could have been derived in an automatic way. The
identification of smaller structural motifs is an equally
important topic that requires comparing structures.

As we saw, the comparison of RNA structures may concern known RNA structures (that is, structures that were experimentally determined) or predicted structures. The objective in both cases is the same: to find the common parts of such structures.

J. Allali is with the Institut Gaspard-Monge, Université de Marne-la-Vallée, Cité Descartes, Champs-sur-Marne, 77454 Marne-la-Vallée Cedex 2, France. E-mail: [email protected].
M.-F. Sagot is with Inria Rhône-Alpes, Université Claude Bernard, Lyon I, 43 Bd du Novembre 1918, 69622 Villeurbanne cedex, France. E-mail: [email protected].
Manuscript received 11 Oct. 2004; accepted 20 Dec. 2004; published online 30 Mar. 2005.
For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCBB-0164-1004.
1545-5963/05/$20.00 © 2005 IEEE. Published by the IEEE CS, CI, and EMB Societies & the ACM
In [11], Shapiro suggested to mathematically model RNA
secondary structures without pseudoknots by means of
trees. The trees are rooted and ordered, which means that
the order among the children of a node matters. This order
corresponds to the 5'-3' orientation of an RNA sequence.
Given two trees representing each an RNA, there are two
main ways for comparing them. One is based on the
computation of the edit distance between the two trees
while the other consists in aligning the trees and using the
score of the alignment as a measure of the distance between
the trees. Contrary to what happens with sequences, the
two, alignment and edit distance, are not equivalent. The
alignment distance is a restrained form of the edit distance
between two trees, where all insertions must be performed
before any deletions. The alignment distance for general
trees was defined in 1994 by Jiang et al. in [9] and extended
to an alignment distance between forests in [6]. More recently, Höchsmann et al. [7] applied the tree alignment
distance to the comparison of two RNA secondary
structures. Because of the restriction on the way edit
operations can be applied in an alignment, we are not
concerned in this paper with tree alignment distance and
we therefore address exclusively from now on the problem
of tree edit distance.
Our way for comparing two RNA secondary structures is then to apply a number of tree edit operations in one or both of the trees representing the RNAs until isomorphic trees are obtained. The currently most popular program using this approach is probably the Vienna package [5], [4]. The tree edit operations considered are derived from the operations classically applied to sequences [13]: substitution, deletion, and insertion. In 1989, Zhang and Shasha [14] gave a dynamic programming algorithm for comparing two trees. Shapiro and Zhang then showed [12] how to use tree editing to compare RNAs. The latter also proposed various tree models that could be used for representing RNA secondary structures. Each suggested tree offers a more or less detailed view of an RNA structure. Figs. 2b, 2c, 2d, and 2e present a few examples of such possible views for the RNA given in Fig. 2a.
In Fig. 2, the nodes of the tree in Fig. 2b represent either
unpaired bases (leaves) or paired bases (internal nodes). Each
node is labeled with, respectively, a base or a pair of bases. A
node of the tree in Fig. 2c represents a set of successive
unpaired bases or of stacked paired ones. The label of a node
is an integer indicating, respectively, the number of unpaired bases or the height of the stack of paired ones. The nodes of the tree in Fig. 2d represent elements of secondary structure: hairpin loop (H), bulge (B), internal loop (I), or multiloop (M). The edges correspond to helices. Finally, the tree in Fig. 2e contains only the information concerning the skeleton of multiloops of an RNA. The last representation, though giving a highly simplified view of an RNA, is important nevertheless as it is generally accepted that it is this skeleton which is usually the most constrained part of an RNA. The last two models may be enriched with information concerning, for instance, the number of (unpaired) bases in a loop (hairpin, internal, multi) or bulge, and the number of paired bases in a helix. The first label the nodes of the tree, the second its edges.
Other types of information may be added (such as overall
composition of the elements of secondary structure). In fact,
one could consider working with various representations
simultaneously or in an interlocked, multilevel fashion. This
goes beyond the scope of this paper which is concerned with
comparing RNA secondary structures using any one among
the many tree representations possible. We shall, however,
comment further on this multilevel approach later on.
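
To make the most detailed of these views concrete, the following is a minimal sketch of how a tree in the style of Fig. 2b (internal nodes are base pairs, leaves are unpaired bases) might be built. It assumes the secondary structure is supplied in dot-bracket notation, which is not used in this paper; the names `Node`, `tree_from_dot_bracket`, and `to_string`, as well as the artificial `root` node, are illustrative choices only.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def tree_from_dot_bracket(seq, structure):
    """Build a Fig. 2b-style tree from a sequence and a dot-bracket string.

    Assumption: '(' and ')' mark paired bases, '.' an unpaired base,
    and the structure contains no pseudoknots.
    """
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    root = Node("root")  # artificial root holding the top-level elements
    stack = [root]
    for pos, sym in enumerate(structure):
        if sym == ".":
            # Unpaired base: a leaf of the current enclosing pair.
            stack[-1].children.append(Node(seq[pos]))
        elif sym == "(":
            # Opening base of a pair: new internal node, label completed later.
            node = Node(seq[pos])
            stack[-1].children.append(node)
            stack.append(node)
        elif sym == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced structure")
            node = stack.pop()
            node.label = node.label + "-" + seq[pos]
        else:
            raise ValueError("unexpected symbol: " + sym)
    if len(stack) != 1:
        raise ValueError("unbalanced structure")
    return root


def to_string(node):
    """Parenthesized rendering that preserves the order of the children."""
    if not node.children:
        return node.label
    return node.label + "(" + " ".join(to_string(c) for c in node.children) + ")"


print(to_string(tree_from_dot_bracket("GGAAACC", "((...))")))
```

The order of the children follows the order of the bases along the sequence, which is what makes the resulting tree ordered.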
Concerning the objectives of this paper, they are twofold.
The first is to give some indications on why the classical edit
operations that have been considered so far in the literature
for comparing trees present some limitations when the trees
stand for RNA structures. Three cases of such limitations will
be illustrated through examples in Section 3. In Section 4, we
then introduce two novel operations, so-called node-fusion
and edge-fusion, that enable us to address some of these
limitations and then give a dynamic programming algorithm
for comparing two RNA structures with these two additional operations. Implementation issues and initial results are presented in Section 4. In Section 5, we give a first application of our algorithm to the comparison of two RNA secondary structures. Finally, in Section 6, we sketch the main ideas behind the multilevel RNA comparison approach mentioned above.

Before that, we start by introducing some notation and by recalling in the next section the basics about classical tree edit operations and tree mapping.

Fig. 1. Primary and secondary structures of a transfer RNA.

Fig. 2. Example of different tree representations ((b), (c), (d), and (e)) of the same RNA (a).

This paper is an extended version of a paper presented at
the Workshop on Algorithms in BioInformatics (WABI) in
2004, in Bergen, Norway. A few more examples are given to
illustrate some of the points made in the WABI paper,
complexity and implementation issues are discussed in
more depth as are the cost functions and a multilevel
approach to comparing RNAs.
2 TREE EDITING AND MAPPING
Let T be an ordered rooted tree, that is, a tree where the order among the children of a node matters. We define three kinds of operations on T: deletion, insertion, and relabeling (corresponding to a substitution in sequence comparison). The operations are shown in Fig. 3. The deletion (Fig. 3b) of a node u removes u from the tree. The children of u become the children of u's father. An insertion (Fig. 3c) is the symmetric of a deletion. Given a node u, we remove a consecutive (in relation to the order among the children) set $u_1, \ldots, u_p$ of its children, create a new node v, make v a child of u by attaching it at the place where the set was, and, finally, make the set $u_1, \ldots, u_p$ (in the same order) the children of v. The relabeling of a node (Fig. 3d) consists simply in changing its label.
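
The three operations can be sketched on a small ordered tree as follows. This is only an illustration of the definitions above: the class `Node`, the functions `delete`, `insert`, `relabel`, and `show`, and the example tree are hypothetical and are not taken from Fig. 3.

```python
class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = list(children or [])
        self.parent = None
        for child in self.children:
            child.parent = self


def delete(u):
    """Remove u; its children take its place, in order, under u's father."""
    father = u.parent
    if father is None:
        raise ValueError("the root cannot be deleted in this sketch")
    k = father.children.index(u)
    father.children[k:k + 1] = u.children
    for child in u.children:
        child.parent = father
    u.parent, u.children = None, []


def insert(u, start, end, label):
    """Insert a new node v under u, adopting the consecutive children
    u.children[start:end] (same order) as the children of v."""
    v = Node(label)
    v.children = u.children[start:end]
    for child in v.children:
        child.parent = v
    u.children[start:end] = [v]
    v.parent = u
    return v


def relabel(u, label):
    """Change the label of u, leaving the shape of the tree untouched."""
    u.label = label


def show(node):
    if not node.children:
        return node.label
    return node.label + "(" + " ".join(show(c) for c in node.children) + ")"


b = Node("B", [Node("D"), Node("E")])
tree = Node("A", [b, Node("C")])
delete(b)
print(show(tree))   # A(D E C)
insert(tree, 0, 2, "I")
print(show(tree))   # A(I(D E) C)
relabel(tree, "K")
print(show(tree))   # K(I(D E) C)
```

Here the insertion exactly undoes the earlier deletion up to the label of the new node, which reflects the symmetry between the two operations.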
Given two trees T and T', we define $S = \{s_1 \ldots s_e\}$ to be a series of edit operations such that, if we apply successively the operations in S to the tree T, we obtain T' (i.e., T and T' become isomorphic). A series of operations like S realizes the editing of T into T' and is denoted by $T \xrightarrow{S} T'$.

We define a function cost from the set of possible edit operations (deletion, insertion, relabeling) to the integers (or the reals) such that $cost(s)$ is the score of the edit operation s. If S is a series of edit operations, we define by extension that $cost(S)$ is $\sum_{s \in S} cost(s)$. We can define the edit distance between two trees as the minimal cost of a series of operations that performs the editing of T into T':

$$distance(T, T') = \min\{cost(S) \mid T \xrightarrow{S} T'\}.$$
Let an insertion or a deletion cost one and the relabeling of a node cost zero if the label is the same and one otherwise. For the two trees of the figure on the left, the series relabel(A → F).delete(B).insert(G) realizes the editing of the left tree into the right one and costs 3. Another possibility is the series delete(B).relabel(A → G).insert(F) which also costs 3. The distance between these two trees is 3.
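
The cost of a series under this unit cost scheme can be checked with a few lines of code. The sketch below only sums the costs of the two series quoted above; the tuple encoding of the operations and the names `op_cost` and `series_cost` are assumptions made for the illustration, and the code does not verify that a series actually transforms one tree into the other.

```python
def op_cost(op):
    """Unit costs: 1 for a deletion or an insertion, 0 or 1 for a relabeling."""
    kind = op[0]
    if kind in ("delete", "insert"):
        return 1
    if kind == "relabel":
        _, old, new = op
        return 0 if old == new else 1
    raise ValueError("unknown operation: " + str(kind))


def series_cost(series):
    """cost(S) is the sum of cost(s) over the operations s of S."""
    return sum(op_cost(op) for op in series)


first = [("relabel", "A", "F"), ("delete", "B"), ("insert", "G")]
second = [("delete", "B"), ("relabel", "A", "G"), ("insert", "F")]
print(series_cost(first), series_cost(second))  # 3 3
```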
Given a series of operations S, let us consider the nodes of T that are not deleted (in the initial tree or after some relabeling). Such nodes are associated with nodes of T'. The mapping $M_S$ relative to S is the set of couples $(u, u')$ with $u \in T$ and $u' \in T'$ such that u is associated with u' by S.

The operations described above are the classical tree edit operations that have been commonly used in the literature for RNA secondary structure comparison. We now present a few results obtained using such classical operations that will allow us to illustrate a few limitations they may present when used for comparing RNA structures.
3 LIMITATIONS OF CLASSICAL TREE EDIT OPERATIONS FOR RNA COMPARISON
As suggested in [12], the tree edit operations recalled in the
previous section can be used on any type of tree coding of
an RNA secondary structure.

Fig. 4 shows two RNase Ps extracted from the database [2]
(they are found, respectively, in Streptococcus gordonii and
Thermotoga maritima). For the example we discuss now, we
code the RNAs using the tree representation indicated in
Fig. 2b where a node represents a base pair and a leaf an
unpaired base. After applying a few edit operations to the
trees, we obtain the result indicated in Fig. 4, with deleted/inserted bases in gray. We have surrounded a few regions that match in the two trees. Bases in the rectangular box at the bottom of the RNA on the left are thus associated with bases in the bottom rightmost rectangular box of the RNA on the right. The same is observed for the bases in the oval boxes for both RNAs. Such matches illustrate one of the main problems with
the classical tree edit operations: Bases in one RNA may be
mapped to identically labeled bases in the other RNA to
minimise the total cost, while such bases should not be
associated in terms of the elements of secondary structure to
which they belong. In fact, such elements are often distant
from one another along the common RNA structure. We call
this problem the "scattering effect." It is related to the
definition of tree edit operations. In the case of this example
and of the representation adopted, the problem might have
been avoided if structural information had been used.
Indeed, the problem appears also because the structural location of an unpaired base is not taken into account. It is therefore possible to match, for instance, an unpaired base from a hairpin loop with an unpaired base from a multiloop. Using another type of representation, as we shall do, would, however, not be enough to solve all problems as we see next.

Fig. 3. Edit operations: (a) the original tree T, (b) deletion of the node labeled D, (c) insertion of the node labeled I, and (d) relabeling of a node in T (the label A of the root is changed into K).

Indeed, to compare the same two RNAs, we can also use a
more abstract tree representation such as the one given in
Fig. 2d. In this case, the internal nodes represent a multiloop,
internal-loop, or bulge, the leaves code for hairpin loops and
edges for helices. The result of the editing of T into T' for some cost function is presented in Fig. 5 (we shall come back later to the cost functions used in the case of such more abstract RNA representations; for the sake of this example, we may assume an arbitrary one is used).

The problem we wish to illustrate in this case is shown
by the boxes in the figure. Consider the boxes at the bottom.
In the left RNA, we have a helix made up of 13 base pairs. In
the right RNA, the helix is formed by seven base pairs
followed by an internal loop and another helix of size 5. By
definition (see Section 2), the algorithm can only associate one element in the first tree to one element in the second
tree. In this case, we would like to associate the helix of the
left tree to the two helices of the second tree since it seems
clear that the internal loop represents either an inserted
element in the second RNA, or the unbonding of one base
pair. This, however, is not possible with classical edit
operations.
A third type of problem one can meet when using only
the three classical edit operations to compare trees standing
for RNAs is similar to the previous one, but concerns this
time a node instead of edges in the same tree representation. Often, an RNA may present a very small helix between two elements (multiloop, internal-loop, bulge, or hairpin-loop) while such helix is absent in the other RNA. In this case, we would therefore have liked to be able to associate one node in a tree representing an RNA with two or more nodes in the tree for the other RNA. Once again, this is not possible with any of the classical tree edit operations. An illustration of this problem is shown in Fig. 6.

Fig. 4. Illustration of the scattering effect problem. Result of the matching of two RNase Ps, of Streptococcus gordonii and of Thermotoga maritima, using the model given in Fig. 2b.

Fig. 5. Illustration of the one-to-one association problem with edges. Result of the matching of the two RNase Ps, of Saccharomyces uvarum and of Saccharomyces kluveri, using the model given in Fig. 2d.

We shall use RNA representations that take the elements
of the structure of an RNA into account to avoid some of the
scattering effect. Furthermore, in addition to considering
information of a structural nature, labels are attached, in
general, to both nodes and edges of the tree representing an
RNA. Such labels are numerical values (integers or reals).
They represent in most cases the size of the corresponding element, but may also further indicate its composition, etc.
Such additional information is then incorporated into the
cost functions for all three edit operations. It is important to
observe that when dealing with trees labeled at both the
nodes and edges, any node and the edge that leads to it (or,
in an alternative perspective, departs from it) represent a
single object from the point of view of computing an edit
distance between the trees.

It remains now to deal with the last two problems that
are a consequence of the one-to-one associations between
nodes and edges enforced by the classical tree edit
operations. To that purpose, we introduce two novel tree
edit operations, called the edge fusion and the node fusion.
4 INTRODUCING NOVEL TREE EDIT OPERATIONS
4.1 Edge Fusion and Node Fusion
In order to address some of the limitations of the classical tree edit operations that were illustrated in the previous section, we need to introduce two novel operations. These are the edge fusion and the node fusion. They may be applied to any of the tree representations given in Figs. 2c, 2d, and 2e.

An example of edge fusion is shown in Fig. 7a. Let $e_u$ be an edge leading to a node $u$, $c_i$ a child of $u$ and $e_{c_i}$ the edge between $u$ and $c_i$. The edge fusion of $e_u$ and $e_{c_i}$ consists in replacing $e_{c_i}$ and $e_u$ with a new single edge $e$. The edge $e$ links the father of $u$ to $c_i$. Its label then becomes a function of the (numerical) labels of $e_u$, $u$ and $e_{c_i}$. For instance, if such labels indicated the size of each element (e.g., for a helix, the number of its stacked pairs, and for a loop, the min, max or the average of its unpaired bases on each side of the loop), the label of $e$ could be the sum of the sizes of $e_u$, $u$ and $e_{c_i}$. Observe that merging two edges implies deleting all subtrees rooted at the children $c_j$ of $u$ for $j$ different from $i$. The cost of such deletions is added to the cost of the edge fusion.

An example of node fusion is given in Fig. 7b. Let $u$ be a node and $c_i$ one of its children. Performing a node fusion of $u$ and $c_i$ consists in making $u$ the father of all children of $c_i$ and in relabeling $u$ with a value that is a function of the values of the labels of $u$, $c_i$ and of the edge between them.
Observe that a node fusion may be simulated using the
classical edit operations by a deletion followed by a
relabeling. However, the difference between a node fusion
and a deletion/relabeling is in the cost associated with both
operations. We shall come back to this point later.

Like insertions or deletions, edge fusions and node fusions have symmetric counterparts, which are the edge split and the node split.

Fig. 6. Illustration of the one-to-one association problem with nodes. The two RNAs used here are RNase Ps from Pyrococcus furiosus and Metallosphaera sedula. Triangles stand for bulges, diamonds stand for internal loops, and squares for hairpin loops.

Fig. 7. (a) An example of edge fusion. (b) An example of node fusion.

Given two rooted, ordered, and labeled trees T and T', we define the edit distance with fusion between T and T' as $distance_{fusion}(T, T') = \min\{cost(S) \mid T \xrightarrow{S} T'\}$ with $cost(s)$ the cost associated to each of the seven edit operations now considered (relabeling, insertion, deletion, node fusion and split, edge fusion and split).
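
The effect of the two fusion operations on a tree labeled at both nodes and edges can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the paper: every node stores its own size and the size of the edge leading to it, the fused label is taken to be the sum of the sizes involved (one of the possibilities mentioned above), and the names `Node`, `edge_fusion`, and `node_fusion` are hypothetical. Costs are not computed here; `edge_fusion` simply returns the subtrees whose deletion would have to be charged.

```python
class Node:
    def __init__(self, size, edge=0, children=None):
        self.size = size    # numerical label of the node (e.g., loop size)
        self.edge = edge    # numerical label of the edge leading to the node
        self.children = list(children or [])
        self.parent = None
        for child in self.children:
            child.parent = self


def edge_fusion(u, i):
    """Fuse the edge leading to u with the edge from u to its i-th child.

    The child is attached to the father of u by a single new edge; the
    other subtrees of u are removed and returned to the caller.
    """
    father = u.parent
    if father is None:
        raise ValueError("u must have a father")
    child = u.children[i]
    deleted = u.children[:i] + u.children[i + 1:]
    child.edge = u.edge + u.size + child.edge  # assumed label function: sum
    father.children[father.children.index(u)] = child
    child.parent = father
    return deleted


def node_fusion(u, i):
    """Fuse u with its i-th child: u adopts the grandchildren in place."""
    child = u.children[i]
    u.size = u.size + child.edge + child.size  # assumed label function: sum
    u.children[i:i + 1] = child.children
    for grandchild in child.children:
        grandchild.parent = u


# A helix of 7 pairs, an internal loop of size 1, then a helix of 5 pairs.
hairpin = Node(size=6, edge=5)
loop = Node(size=1, edge=7, children=[hairpin])
root = Node(size=4, children=[loop])
edge_fusion(loop, 0)
print(hairpin.edge)  # 13
```

With these illustrative sizes, the fused edge has size 13, which mirrors the situation of Section 3 where a helix of 13 base pairs faces two helices separated by an internal loop.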
Proposition 1. If the following is verified:

- $cost_{match}(a, b)$ is a distance,
- $cost_{ins}(a) = cost_{del}(a) \geq 0$,
- $cost_{nodefusion}(a, b, c) = cost_{nodesplit}(a, b, c) \geq 0$, and
- $cost_{edgefusion}(a, b, c) = cost_{edgesplit}(a, b, c) \geq 0$,

then $distance_{fusion}$ is indeed a distance.

Proof. The positiveness of $distance_{fusion}$ is given by the fact that all elementary cost functions are positive. Its symmetry is guaranteed by the symmetry in the costs of the insertion/deletion and (node/edge) fusion/split operations. Finally, it is straightforward to see that $distance_{fusion}$ satisfies the triangular inequality. □

Besides the above properties that must be satisfied by the cost functions in order to obtain a distance, others may be introduced for specific purposes. Some will be discussed in Section 5.

We now present an algorithm to compute the tree edit distance between two trees using the classical tree edit operations plus the two operations just introduced.
4.2 Algorithm
The method we introduce is a dynamic programming
algorithm based on the one proposed by Zhang and Shasha. Their algorithm is divided in two parts: They first compute the edit distance between two trees (this part is denoted by TDist) and then the distance between two forests (this part is denoted by FDist). Fig. 8 illustrates in pictorial form the part TDist and Fig. 9 the FDist part of the computation.

In order to take our two new operations into account, we need to compute a few more things in the TDist part. Indeed, we must add the possibility for each tree to have a node fusion (inversely, node split) between the root and one of its children, or to have an edge fusion (inversely edge split) between the root and one of its children. These additional operations are indicated in the right box of Fig. 8.

We present now a formal description of the algorithm. Let T be an ordered rooted tree with $|T|$ nodes. We denote by $t_i$ the $i$th node in a postfix order. For each node $t_i$, $l(i)$ is the index of the leftmost leaf of the subtree rooted at $t_i$. Let $T[i \ldots j]$ denote the forest composed by the nodes $t_i \ldots t_j$ ($T \equiv T[0 \ldots |T|]$). To simplify notation, from now on, when there is no ambiguity, $i$ will refer to the node $t_i$. In this case, $distance(i_1 \ldots i_2, j_1 \ldots j_2)$ will be equivalent to $distance(T[i_1 \ldots i_2], T'[j_1 \ldots j_2])$.

The algorithm of Zhang and Shasha is fully described by
the following recurrence formula:
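if $i_1 = l(i_2)$ and $j_1 = l(j_2)$:
\[
\mathrm{distance}(i_1 \ldots i_2,\, j_1 \ldots j_2) = \min
\begin{cases}
\mathrm{distance}(i_1 \ldots i_2 - 1,\, j_1 \ldots j_2) + \mathrm{cost}_{del}(i_2) \\
\mathrm{distance}(i_1 \ldots i_2,\, j_1 \ldots j_2 - 1) + \mathrm{cost}_{ins}(j_2) \\
\mathrm{distance}(i_1 \ldots i_2 - 1,\, j_1 \ldots j_2 - 1) + \mathrm{cost}_{match}(i_2, j_2)
\end{cases}
\tag{1}
\]
else:
\[
\mathrm{distance}(i_1 \ldots i_2,\, j_1 \ldots j_2) = \min
\begin{cases}
\mathrm{distance}(i_1 \ldots i_2 - 1,\, j_1 \ldots j_2) + \mathrm{cost}_{del}(i_2) \\
\mathrm{distance}(i_1 \ldots i_2,\, j_1 \ldots j_2 - 1) + \mathrm{cost}_{ins}(j_2) \\
\mathrm{distance}(i_1 \ldots l(i_2) - 1,\, j_1 \ldots l(j_2) - 1) + \mathrm{distance}(l(i_2) \ldots i_2,\, l(j_2) \ldots j_2)
\end{cases}
\tag{2}
\]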
Part (1) of the formula corresponds to Fig. 8, while part (2) corresponds to Fig. 9. In practice, the algorithm stores in a matrix the score between each subtree of $T$ and $T'$. The space complexity is therefore $O(|T| \times |T'|)$. To reach this complexity, the computation must be done in a certain order (see Section 4.3). The time complexity of the algorithm is
\[
O\big(|T| \times \min(\mathrm{leaf}(T), \mathrm{height}(T)) \times |T'| \times \min(\mathrm{leaf}(T'), \mathrm{height}(T'))\big),
\]
where $\mathrm{leaf}(T)$ and $\mathrm{height}(T)$ represent, respectively, the number of leaves and the height of a tree $T$.
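To make the classical recurrence concrete, here is a minimal Python sketch of the Zhang and Shasha computation without fusions or splits. It is not the authors' C++ code: the tree encoding as `(label, [children])` pairs, the function name, and the three caller-supplied cost functions are assumptions made for illustration only.

```python
def zhang_shasha(t1, t2, cost_del, cost_ins, cost_match):
    """Classical tree edit distance between two ordered rooted trees.

    Trees are (label, [children]) pairs (an assumed encoding).
    """
    def index(tree):
        # Postfix numbering (1-based): labels[i] is the label of node i,
        # lml[i] is l(i), the leftmost leaf of the subtree rooted at i.
        labels, lml = [None], [None]

        def visit(node):
            label, children = node
            first = None
            for child in children:
                leaf = visit(child)
                if first is None:
                    first = leaf
            labels.append(label)
            i = len(labels) - 1
            lml.append(first if first is not None else i)
            return lml[i]

        visit(tree)
        # Keep, for each distinct value of l(k), the largest such k.
        last = {}
        for k in range(1, len(labels)):
            last[lml[k]] = k
        return labels, lml, sorted(last.values())

    lab1, l1, roots1 = index(t1)
    lab2, l2, roots2 = index(t2)
    n, m = len(lab1) - 1, len(lab2) - 1
    tdist = [[0.0] * (m + 1) for _ in range(n + 1)]  # subtree distances

    for i in roots1:
        for j in roots2:
            li, lj = l1[i], l2[j]
            # fd[a][b]: distance between forests li..li+a-1 and lj..lj+b-1.
            fd = [[0.0] * (j - lj + 2) for _ in range(i - li + 2)]
            for a in range(1, i - li + 2):
                fd[a][0] = fd[a - 1][0] + cost_del(lab1[li + a - 1])
            for b in range(1, j - lj + 2):
                fd[0][b] = fd[0][b - 1] + cost_ins(lab2[lj + b - 1])
            for a in range(1, i - li + 2):
                for b in range(1, j - lj + 2):
                    x, y = li + a - 1, lj + b - 1
                    delete = fd[a - 1][b] + cost_del(lab1[x])
                    insert = fd[a][b - 1] + cost_ins(lab2[y])
                    if l1[x] == li and l2[y] == lj:
                        # Case (1): both forests are single trees.
                        match = fd[a - 1][b - 1] + cost_match(lab1[x], lab2[y])
                        fd[a][b] = min(delete, insert, match)
                        tdist[x][y] = fd[a][b]
                    else:
                        # Case (2): split off the two rightmost subtrees.
                        rest = fd[l1[x] - li][l2[y] - lj]
                        fd[a][b] = min(delete, insert, rest + tdist[x][y])
    return tdist[n][m]
```

For instance, with unit costs, the trees f(d(a, c(b)), e) and f(c(d(a, b)), e) are at distance 2: one deletion and one insertion of the node labeled c.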
Fig. 8. Zhang and Shasha's dynamic programming algorithm: the tree distance part. The right box corresponds to the additional operations added to take fusion into account.
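The formula to compute the edit score allowing for both node and edge fusions follows.

if $i_1 \ge l(i_k)$ and $j_1 \ge l(j_{k'})$:
\[
\mathrm{distance}(\{i_1 \ldots i_k\}, \mathit{path}, \{j_1 \ldots j_{k'}\}, \mathit{path}') = \min
\begin{cases}
\mathrm{distance}(\{i_1 \ldots i_{k-1}\}, \emptyset, \{j_1 \ldots j_{k'}\}, \mathit{path}') + \mathrm{cost}_{del}(i_k) \\
\mathrm{distance}(\{i_1 \ldots i_k\}, \mathit{path}, \{j_1 \ldots j_{k'-1}\}, \emptyset) + \mathrm{cost}_{ins}(j_{k'}) \\
\mathrm{distance}(\{i_1 \ldots i_{k-1}\}, \emptyset, \{j_1 \ldots j_{k'-1}\}, \emptyset) + \mathrm{cost}_{match}(i_k, j_{k'}) \\[4pt]
\text{for each child } i_c \text{ of } i_k \text{ in } \{i_1, \ldots, i_k\}, \text{ set } i_l = l(i_c): \\
\quad \mathrm{distance}(\{i_1 \ldots i_{c-1}, i_{c+1} \ldots i_k\}, \mathit{path}.(u, i_c), \{j_1 \ldots j_{k'}\}, \mathit{path}') \\
\qquad + \mathrm{cost}_{nodefusion}(i_c, i_k) \quad (\text{obs.: } i_k \text{ data are changed}) \\
\quad \mathrm{distance}(\{i_l \ldots i_{c-1}, i_k\}, \mathit{path}.(e, i_c), \{j_1 \ldots j_{k'}\}, \mathit{path}') \\
\qquad + \mathrm{cost}_{edgefusion}(i_c, i_k) + \mathrm{distance}(\{i_1 \ldots i_{l-1}\}, \emptyset, \emptyset, \emptyset) \\
\qquad + \mathrm{distance}(\{i_{c+1} \ldots i_{k-1}\}, \emptyset, \emptyset, \emptyset) \quad (\text{obs.: } i_k \text{ data are changed}) \\[4pt]
\text{for each child } j_{c'} \text{ of } j_{k'} \text{ in } \{j_1, \ldots, j_{k'}\}, \text{ set } j_{l'} = l(j_{c'}): \\
\quad \mathrm{distance}(\{i_1 \ldots i_k\}, \mathit{path}, \{j_1 \ldots j_{c'-1}, j_{c'+1} \ldots j_{k'}\}, \mathit{path}'.(u, j_{c'})) \\
\qquad + \mathrm{cost}_{nodesplit}(j_{c'}, j_{k'}) \quad (\text{obs.: } j_{k'} \text{ data are changed}) \\
\quad \mathrm{distance}(\{i_1 \ldots i_k\}, \mathit{path}, \{j_{l'} \ldots j_{c'}, j_{k'}\}, \mathit{path}'.(e, j_{c'})) \\
\qquad + \mathrm{cost}_{edgesplit}(j_{c'}, j_{k'}) + \mathrm{distance}(\emptyset, \emptyset, \{j_1 \ldots j_{l'-1}\}, \emptyset) \\
\qquad + \mathrm{distance}(\emptyset, \emptyset, \{j_{c'+1} \ldots j_{k'-1}\}, \emptyset) \quad (\text{obs.: } j_{k'} \text{ data are changed})
\end{cases}
\tag{3}
\]
else, set $i_l = l(i_k)$ and $j_{l'} = l(j_{k'})$:
\[
\mathrm{distance}(\{i_1 \ldots i_k\}, \mathit{path}, \{j_1 \ldots j_{k'}\}, \mathit{path}') = \min
\begin{cases}
\mathrm{distance}(\{i_1 \ldots i_{k-1}\}, \emptyset, \{j_1 \ldots j_{k'}\}, \mathit{path}') + \mathrm{cost}_{del}(i_k) \\
\mathrm{distance}(\{i_1 \ldots i_k\}, \mathit{path}, \{j_1 \ldots j_{k'-1}\}, \emptyset) + \mathrm{cost}_{ins}(j_{k'}) \\
\mathrm{distance}(\{i_1 \ldots i_{l-1}\}, \emptyset, \{j_1 \ldots j_{l'-1}\}, \emptyset) + \mathrm{distance}(\{i_l \ldots i_k\}, \mathit{path}, \{j_{l'} \ldots j_{k'}\}, \mathit{path}')
\end{cases}
\tag{4}
\]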
Given two nodes $u$ and $v$ such that $v$ is a child of $u$, $\mathrm{node\_fusion}(u, v)$ is the fusion of node $v$ with $u$, and $\mathrm{edge\_fusion}(u, v)$ is the edge fusion between the edges leading to, respectively, nodes $u$ and $v$. The symmetric operations are denoted by, respectively, $\mathrm{node\_split}(u, v)$ and $\mathrm{edge\_split}(u, v)$.
The distance computation takes two new parameters $\mathit{path}$ and $\mathit{path}'$. These are sets of pairs $(e \text{ or } u, v)$ which indicate, for node $i_k$ (respectively, $j_{k'}$), the series of fusions that were done. Thus, a pair $(e, v)$ indicates that an edge fusion has been performed between $i_k$ and $v$, while a pair $(u, v)$ indicates that a node $v$ has been merged with node $i_k$.
The notation $\mathit{path}.(e, v)$ indicates that the operation $(e, v)$ has been performed in relation to node $i_k$ and the information is thus concatenated to the set $\mathit{path}$ of pairs currently linked with $i_k$.
4.3 Implementation and Complexity
The previous section gave the recurrence formulas for calculating the edit distance between two trees allowing for node and edge fusion and split. We now discuss the complexity of the algorithm. This requires paying attention to some high-level implementation details that, in the case of the tree edit distance problem, may have an important influence on the theoretical complexity of the algorithm. Such details were first observed by Zhang and Shasha. They concern the order in which to perform the operations indicated in (1) and (2) to obtain an algorithm that is time and space efficient.
Let us consider the last line of (2). We may observe that the computation of the distance between two forests refers to the computation of the distance between two trees $T[l(i_2) \ldots i_2]$ and $T'[l(j_2) \ldots j_2]$. We must therefore memorise the distance between any two subtrees of $T$ and $T'$. Furthermore, we have to carry out the computation from the leaves to the root because when we compute the distance between two subtrees $U$ and $U'$, the distance between any subtrees of $U$ and $U'$ must already have been computed. This explains the space complexity, which is in $O(|T| \times |T'|)$ and corresponds to the size of the table used for storing such distances in memory.
If we look at (1) now, we see that it is not necessary to calculate separately the distance between the subtrees rooted at $i'$ and $j'$ if $i'$ is on the path from $l(i)$ to $i$ and $j'$ is on the path from $l(j)$ to $j$, for $i$ and $j$ nodes of, respectively, $T$ and $T'$.
We define a set $LR(T)$ of the left roots of $T$ as follows:
\[
LR(T) = \{\, k \mid 1 \le k \le |T| \text{ and } \nexists\, k' > k \text{ such that } l(k') = l(k) \,\}
\]
Fig. 9. Zhang and Shasha's dynamic programming algorithm: the forest distance part.
The algorithm for computing the edit distance between $T$ and $T'$ consists then in computing the distance between each subtree rooted at a node in $LR(T)$ and each subtree rooted at a node in $LR(T')$. Such subtrees are considered from the leaves to the root of $T$ and $T'$, that is, in the order of their indexes.
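As a small worked example of the set of left roots, the sketch below computes it from the array of leftmost leaves in postfix numbering. The function name and the list-based encoding of $l$ are illustrative assumptions.

```python
def left_roots(l):
    """Return the left roots in increasing order.

    l[k] holds l(k) for k = 1..|T| in postfix numbering; l[0] is unused.
    A node k is a left root when no node k' > k has l(k') = l(k).
    """
    last = {}
    for k in range(1, len(l)):
        # Later nodes overwrite earlier ones with the same leftmost leaf,
        # so each entry ends up holding the largest such k.
        last[l[k]] = k
    return sorted(last.values())


# Tree f(d(a, c(b)), e): postfix order is a=1, b=2, c=3, d=4, e=5, f=6.
l_example = [None, 1, 2, 2, 1, 5, 1]
print(left_roots(l_example))  # [3, 5, 6]
```

Iterating over this list in increasing order visits the subtrees from the leaves to the root, which is the order required so that the stored subtree distances are available when needed.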
Zhang and Shasha proved that this algorithm has a time complexity in $O\big(|T| \times \min(\mathrm{leaf}(T), \mathrm{height}(T)) \times |T'| \times \min(\mathrm{leaf}(T'), \mathrm{height}(T'))\big)$, $\mathrm{leaf}(T)$ designating the number of leaves of $T$ and $\mathrm{height}(T)$ its height. In the worst case (fan tree), the complexity is in $O(|T|^2 \times |T'|^2)$.
Taking fusion and split operations into account does not change the above reasoning. However, we must now store in memory the distance between all subtrees $T[l(i_2) \ldots i_2]$ and $T'[l(j_2) \ldots j_2]$, and all the possible values of $\mathit{path}$ and $\mathit{path}'$.
We must therefore determine the number of values that $\mathit{path}$ can take. This amounts to determining the total number of successive fusions that could be applied to a given node. We recall that $\mathit{path}$ is a list of pairs $(e \text{ or } u, v)$. Let $\mathit{path} = \{(e \text{ or } u, v_1), (e \text{ or } u, v_2), \ldots, (e \text{ or } u, v_l)\}$ be the list for node $i$ of $T$. The first fusion can be performed only with a child $v_1$ of $i$. If $d$ is the maximum degree of $T$, there are $d$ possible choices for $v_1$. The second fusion can be done with one of the children of $i$ or with one of its grandchildren. Let $v_2$ be the node chosen. There are $d + d^2$ possible choices for $v_2$. Following the same reasoning, there are $\sum_{k=1}^{l} d^k$ possible choices for the $l$th node $v_l$ to be fused with $i$. Furthermore, we must take into account the fact that a fusion can concern a node or an edge. The total number of values possible for the variable $\mathit{path}$ is therefore:
\[
2^l \times \prod_{k=1}^{l} \sum_{j=1}^{k} d^j \;\le\; 2^l \times \prod_{k=1}^{l} \frac{d^{k+1} - 1}{d - 1},
\]
that is:
\[
2^l \times \left(\frac{1}{d - 1}\right)^{l} \times \prod_{k=1}^{l} \left(d^{k+1} - 1\right) \;<\; 2^l \times \left(\frac{1}{d - 1}\right)^{l} \times d^{\frac{(l+1)(l+2)}{2}}.
\]
A node $i$ may then be involved in $O((2d)^l)$ possible successive (node/edge) fusions.
As indicated, we must store in memory the distance between each subtree $T[l(i_2) \ldots i_2]$ and $T'[l(j_2) \ldots j_2]$ for all possible values of $\mathit{path}$ and $\mathit{path}'$. The space complexity of our algorithm is thus in $O((2d)^l \times (2d')^l \times |T| \times |T'|)$, with $d$ and $d'$ the maximum degrees of, respectively, $T$ and $T'$.
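The counting argument for the number of values of path can be checked numerically. The following sketch evaluates the product of sums and the closed-form upper bound for small maximum degree and small numbers of consecutive fusions; the function names are illustrative, and the bound is only evaluated for a maximum degree of at least 2, where the denominator is nonzero.

```python
from math import prod


def path_count(d, l):
    """Number of fusion sequences of length l for a node of a tree with
    maximum degree d: 2^l times the product over k of (d + d^2 + ... + d^k)."""
    return 2 ** l * prod(
        sum(d ** j for j in range(1, k + 1)) for k in range(1, l + 1)
    )


def path_bound(d, l):
    """Upper bound 2^l * (1 / (d - 1))^l * d^((l + 1)(l + 2) / 2), for d >= 2."""
    return 2 ** l * d ** ((l + 1) * (l + 2) // 2) / (d - 1) ** l


for d, l in [(2, 1), (2, 2), (3, 2), (3, 3)]:
    print(d, l, path_count(d, l), path_bound(d, l))
```

For a maximum degree of 3 and three consecutive fusions, the count already reaches 11,232, which illustrates why the number of consecutive fusions has to stay small in practice.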
The computation of the time complexity of our algorithm is done in a similar way as for the algorithm of Zhang and Shasha. For each node of $T$ and $T'$, one must compute the number of subtree distance computations the node will be involved in by considering all subtrees rooted in, respectively, a node of $LR(T)$ and a node of $LR(T')$. In our case, one must also take into account for each node the possibility of applying a fusion. This leads to a time complexity in
\[
O\big((2d)^l \times |T| \times \min(\mathrm{leaf}(T), \mathrm{height}(T)) \times (2d')^l \times |T'| \times \min(\mathrm{leaf}(T'), \mathrm{height}(T'))\big).
\]
This complexity suggests that the fusion operations may be used only for reasonable trees (typically, less than 100 nodes) and small values of $l$ (typically, less than 4). It is however important to observe that the overall number of fusions one may perform can be much greater than $l$ without affecting the worst-case complexity of the algorithm. Indeed, any number of fusions can be made while still retaining the bound of
\[
O\big((2d)^l \times |T| \times \min(\mathrm{leaf}(T), \mathrm{height}(T)) \times |T'| \times \min(\mathrm{leaf}(T'), \mathrm{height}(T'))\big)
\]
so long as one does not perform more than $l$ consecutive fusions for each node.
In general, also, most interesting tree representations of an RNA are of small enough size, as will be shown next, together with some initial results obtained in practice.
5 APPLICATION TO RNA SECONDARY STRUCTURES COMPARISON
The algorithm presented in the previous section has been coded using C++. An online version is available at http://www-igm.univ-mlv.fr/~allali/migal/.
We recall that RNAs are relatively small molecules with sizes limited to a few kilobases. For instance, the small ribosomal subunit of Sulfolobus acidocaldarius (D14876) is made up of 1,147 bases. Using the representation shown in Fig. 2b, the tree obtained contains 440 internal nodes and 567 leaves, that is 1,007 nodes overall. Using the representation in Fig. 2d, the tree is composed of 78 nodes. Finally, the tree obtained using the representation given in Fig. 2e contains only 48 nodes. We therefore see that even for large RNAs, any of the known abstract tree representations (that is, representations which take the elements of the secondary structure of an RNA into account) that we can use leads to a tree of manageable size for our algorithm. In fact, for small values of $l$ (2 or 3), the tree comparison takes reasonable time (a few minutes) and memory (less than 1 Gb).
As we already mentioned, a fusion (respectively, split) can be viewed as an alternative to a deletion (respectively, insertion) followed by a relabeling. Therefore, the cost function for a fusion must be chosen carefully.
To simplify, we reason on the cost of a node fusion without considering the label of the edges leading to the nodes that are fused with their parent. The formal definition of the cost functions takes the edges also into account.
Let us assume that the cost function returns a real value between zero and one. If we want to compute the cost of a fusion between two nodes $u$ and $v$, the aim is to give to such a fusion a cost slightly greater than the cost of deleting $v$ and relabeling $u$; that is, we wish to have $\mathrm{cost}_{nodefusion}(u, v) = \min(\mathrm{cost}_{del}(v) + t, 1)$. The parameter $t$ is a tuning parameter for the fusion.
Suppose that the new node $w$ resulting from the fusion of $u$ and $v$ matches with another node $z$. The cost of this match is $\mathrm{cost}_{match}(w, z)$. If we do not allow for node fusions, the algorithm will first match $u$ with $z$, then will delete $v$. If we compare the two possibilities, on one hand we have a total cost of $\mathrm{cost}_{nodefusion}(u, v) + \mathrm{cost}_{match}(w, z)$ for the fusion, that is, $\mathrm{cost}_{del}(v) + t + \mathrm{cost}_{match}(w, z)$; on the other hand, a cost of $\mathrm{cost}_{del}(v) + \mathrm{cost}_{match}(u, z)$. Thus, $t$ represents the gain that must be obtained by $\mathrm{cost}_{match}(w, z)$ with regard to $\mathrm{cost}_{match}(u, z)$, that is, by a match without fusion. This is illustrated in Fig. 10.
In this example, the cost associated with the path on the top is $\mathrm{cost}_{match}(5, 9) + \mathrm{cost}_{del}(3)$. The path at the bottom has a cost of $\mathrm{cost}_{nodefusion}(5, 3) = \mathrm{cost}_{del}(3) + t$ for the node fusion, to which is added a relabeling cost of $\mathrm{cost}_{match}(8, 9)$, leading to a total of $\mathrm{cost}_{match}(8, 9) + \mathrm{cost}_{del}(3) + t$. A node fusion will therefore be chosen if $\mathrm{cost}_{match}(8, 9) + t < \mathrm{cost}_{match}(5, 9)$, that is, if the score of a match with fusion is better (lower) by at least $t$ than that of a match without fusion.
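The comparison between the two paths can be written as a short sketch. It uses the simplified node fusion cost discussed in this section (deletion cost of the child plus the tuning parameter, capped at 1) and compares the two totals directly; the function and parameter names are hypothetical, and the numeric values in the example are made up for illustration.

```python
def simplified_node_fusion_cost(cost_del_v, t):
    """Simplified cost of fusing child v into its parent: the deletion cost
    of v plus the tuning parameter t, capped at 1 (edge labels ignored)."""
    return min(cost_del_v + t, 1.0)


def prefers_fusion(match_fused, match_plain, cost_del_v, t):
    """True if fusing v into u and matching the fused node is cheaper than
    matching u directly and deleting v.

    match_fused: cost of matching the fused node w with z.
    match_plain: cost of matching the original node u with z.
    """
    with_fusion = simplified_node_fusion_cost(cost_del_v, t) + match_fused
    without_fusion = match_plain + cost_del_v
    return with_fusion < without_fusion


# The fused match gains 0.3 over the plain match, which exceeds t = 0.1.
print(prefers_fusion(match_fused=0.2, match_plain=0.5, cost_del_v=0.3, t=0.1))
# The fused match gains only 0.05, which does not pay for t = 0.1.
print(prefers_fusion(match_fused=0.45, match_plain=0.5, cost_del_v=0.3, t=0.1))
```

Comparing totals rather than hard-coding the inequality keeps the sketch valid even when the cap at 1 is reached.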
We apply the same reasoning to the cost of an edge fusion. The cost functions for a node fusion and an edge fusion between a node $u$ and a node $v$, with $e_u$ denoting the edge leading to $u$ and $e_v$ the edge leading to $v$, are defined as follows:
\[
\mathrm{cost}_{nodefusion}(u, v) = \mathrm{cost}_{del}(v) + \mathrm{cost}_{del}(e_v) + t
\]
\[
\mathrm{cost}_{edgefusion}(u, v) = \mathrm{cost}_{del}(u) + \mathrm{cost}_{del}(e_u) + t + \sum_{c \text{ sibling of } v} \mathrm{cost}(\text{deleting subtree rooted at } c).
\]
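The two cost definitions translate directly into code. In the sketch below, deletion costs are passed in as plain numbers and the costs of deleting the sibling subtrees are given as a list; these parameter names and this calling convention are assumptions chosen for illustration, not the interface of the authors' implementation.

```python
def cost_node_fusion(del_v, del_edge_v, t):
    """Node fusion of child v into u: delete v and its incoming edge, plus t."""
    return del_v + del_edge_v + t


def cost_edge_fusion(del_u, del_edge_u, sibling_subtree_costs, t):
    """Edge fusion between the edges leading to u and to its child v:
    delete u and its incoming edge, plus t, plus the cost of deleting
    every subtree rooted at a sibling of v."""
    return del_u + del_edge_u + t + sum(sibling_subtree_costs)


# Made-up costs in [0, 1] with t = 0.05.
print(cost_node_fusion(del_v=0.3, del_edge_v=0.2, t=0.05))
print(cost_edge_fusion(del_u=0.4, del_edge_u=0.1,
                       sibling_subtree_costs=[0.6, 0.25], t=0.05))
```

The sibling term makes an edge fusion expensive whenever the fused node has other large subtrees, since those subtrees are lost in the operation.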
The tuning parameter $t$ is thus an important parameter that allows us to control fusions. Always considering a cost function that produces real values between 0 and 1, if $t$ is equal to 0.1, a fusion will be performed only if it improves the score by 0.1. In practice, we use values of $t$ between 0 and 0.2.
For practical considerations, we also set a further condition on the cost and relabeling functions related to a node or edge resulting from a fusion, which is as follows:
\[
\mathrm{cost}_{del}(a) + \mathrm{cost}_{del}(b) \ge \mathrm{cost}_{del}(c)
\]
with $c$ the label of the node/edge resulting from the fusion of the nodes/edges labeled $a$ and $b$. Indeed, if this condition is not fulfilled, the algorithm may systematically fuse the nodes or edges to reduce the overall cost.
An important consequence of the conditions seen above is that a node fusion cannot be followed by an edge fusion. Below, the node fusion followed by an edge fusion costs:
\[
\mathrm{cost}_{del}(b) + \mathrm{cost}_{del}(B) + t + \mathrm{cost}_{del}(AB) + \mathrm{cost}_{del}(a) + t.
\]
The alternative is to destroy node $B$ (together with edge $b$) and then to operate an edge fusion, the whole costing $\mathrm{cost}_{del}(b) + \mathrm{cost}_{del}(B) + \mathrm{cost}_{del}(A) + \mathrm{cost}_{del}(a) + t$. The difference between these two costs is $t + \mathrm{cost}_{del}(AB) - \mathrm{cost}_{del}(A)$, which is always positive.
This observation makes it possible to significantly improve the performance of the algorithm in practice.
We have applied the new algorithm to the two RNAs shown in Fig. 5 (these are eukaryotic nuclear P RNAs from Saccharomyces uvarum and Saccharomyces kluveri), coded using the same type of representation as in Fig. 2d. We have limited the number of consecutive fusions to one ($l = 1$). The computation of the edit distance between the two trees taking node and edge fusions into account besides deletions, insertions, and relabeling has required less than a second. The total cost allowing for fusions is 6.18 with $t = 0.05$ against 7.42 without fusions. As indicated in Fig. 11, the last two problems discussed in Section 3 disappear thanks to some edge fusions (represented by the boxes).
An example of node fusions required when comparing two real RNAs is given in Fig. 12. The RNAs are coded using the same type of representation as in Fig. 2d. The figure shows part of the mapping obtained between the small subunits of two ribosomal RNAs retrieved from [8] (from Bacillaria paxillifer and Calicophoron calicophorum). The node fusion has been circled.
Fig. 10. Illustration of the gain that must be obtained using a fusion
instead of a deletion/relabeling.
6 MULTILEVEL RNA STRUCTURE COMPARISON: SKETCH OF THE MAIN IDEA
We briefly discuss now an approach which addresses in part the scattering effect problem (see Section 2). This approach is currently being validated and will be more fully described in another paper. We therefore present here the main idea only.
To start with, it is important to understand the nature of this scattering effect. Let us consider first a trivial case: the cost functions are unitary (insertion, deletion, and relabeling each cost 1) and we compute the edit distance between two trees composed of a single node each. The obtained mapping will associate the single node in the first tree with the single one in the second tree, independently of the labels of the nodes. This example can be extended to the comparison of two trees whose node labels are all different. In this case, the obtained mapping corresponds to the maximum homeomorphic subtree common to both trees.
If the two RNA secondary structures compared using a tree representation which models both the base pairs and the nonpaired bases are globally similar but present some local dissimilarity, then an edit operation will almost always associate the nodes of the locally divergent regions that are located at the same positions relative to the global common structure. This is a normal, expected behavior in the context of an edit distance. However, it seems clear also when we look at Fig. 4 that the bases of a terminal loop should not be mapped to those of a multiple loop.
To reduce this problem, one possible solution consists of adding to the nodes corresponding to a base some information concerning the element of secondary structure to which the base belongs. The cost functions are then adapted to take this type of information into account. This solution, although producing interesting results, is not entirely satisfying. Indeed, the algorithm will tend to systematically put into correspondence nodes (and, thus, bases) belonging to structural elements of the same type, which is also not necessarily a good choice as these elements may not be related in the overall structure. It seems therefore preferable to have a structural approach first, mapping initially the elements of secondary structure to each other and taking
care of the nucleotides in a second step only.
The approach we have elaborated may be briefly described as follows: Given two RNA secondary structures, the first step consists in coding the RNAs by trees of type c in Fig. 2 (nodes represent bulges or multiple, internal or terminal loops while edges code for helices). We then compute the edit distance between these two trees using the two novel fusion operations described in this paper. This also produces a mapping between the two trees. Each node and edge of the trees, that is, each element of secondary structure, is then colored according to this mapping. Two elements are thus of the same color if they have been mapped in the first step. We now have at our disposal information concerning the structural similarity of the two RNAs. We can then code the RNAs using a tree of type b. To these trees, we add to each node the color of the structural element to which it belongs. We need now only to restrict the match operation to nodes of the same color. Two nodes can therefore match only if they belong to secondary elements that have been identified in the first step as being similar.
Fig. 11. Result of the editing between the two RNAs shown in Fig. 4 allowing for node and edge fusions.
Fig. 12. Part of a mapping between two rRNA small subunits. The node fusion is circled.
To illustrate the use of this algorithm, we have applied it to the two RNAs of Fig. 4. Fig. 13 presents the trees (of the type in Fig. 2c) coding for these structures, and the mapping produced by the computation of the edit distance with fusion. In particular, the noncolored fine dashed nodes and edges correspond, respectively, to deleted nodes/edges. One can see that in the left RNA, the two hairpin loops involved in the scattering effect problem in Fig. 4 (indicated by the arrows) have been destroyed and will not be mapped to one another anymore when the edit operations are applied to the trees of the type in Fig. 2b.
This approach yields interesting results. Furthermore, it considerably reduces the complexity of the algorithm for comparing two RNA structures coded with trees of the type in Fig. 2b. However, it is important to observe that the scattering effect problem is not specific to the tree representations of the type in Fig. 2b. Indeed, the same problem may be observed, to a lesser degree, with trees of the type in Fig. 2c. This is the reason why we generalize the process by adopting a modelling of RNA secondary structures at different levels of abstraction. This model, and the accompanying algorithm for comparing RNA structures, is in progress.
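The restriction of the match operation to nodes of the same color, used in the second step of this approach, can be sketched as a wrapper around an existing match cost. The encoding of a node as a `(label, color)` pair, the use of `None` for elements deleted in the first step, and the infinite cost for forbidden matches are assumptions made for illustration.

```python
import math


def color_restricted(cost_match):
    """Wrap a label-based match cost so that two nodes can match only if
    they carry the same color from the first (structural) comparison.

    Nodes are (label, color) pairs; color None marks a node whose structural
    element was deleted in the first step, so it never matches anything.
    """
    def restricted(a, b):
        (label_a, color_a), (label_b, color_b) = a, b
        if color_a is None or color_b is None or color_a != color_b:
            return math.inf  # forbidden match
        return cost_match(label_a, label_b)

    return restricted


unit = lambda x, y: 0 if x == y else 1
match = color_restricted(unit)
print(match(("A", 1), ("G", 1)))     # same color: ordinary relabeling cost
print(match(("A", 1), ("A", 2)))     # different colors: match forbidden
print(match(("A", None), ("A", 1)))  # uncolored node: match forbidden
```

With an infinite match cost, the dynamic programming falls back on deletion and insertion for nodes of different colors, which is one simple way to enforce the restriction without changing the recurrence itself.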
7 FURTHER WORK AND CONCLUSION
We have proposed an algorithm that addresses two main limitations of the classical tree edit operations for comparing RNA secondary structures. Its complexity is high in theory if many fusions are applied in succession to the same node, but the total number of fusions that may be performed is not limited. In practice, the algorithm is fast enough for most situations one is likely to meet.
To provide a more complete solution to the problem of the scattering effect, we also proposed a new multilevel approach for comparing two RNA secondary structures whose main idea was sketched in this paper. Further details and evaluation of this novel comparison scheme will be the subject of another paper.
REFERENCES
[1] D. Bouthinon and H. Soldano, "A New Method to Predict the Consensus Secondary Structure of a Set of Unaligned RNA Sequences," Bioinformatics, vol. 15, no. 10, pp. 785-798, 1999.
[2] J.W. Brown, "The Ribonuclease P Database," Nucleic Acids Research, vol. 24, no. 1, p. 314, 1999.
[3] N. el Mabrouk and F. Lisacek, "Very Fast Identification of RNA Motifs in Genomic DNA. Application to tRNA Search in the Yeast Genome," J. Molecular Biology, vol. 264, no. 1, pp. 46-55, 1996.
[4] I. Hofacker, "The Vienna RNA Secondary Structure Server," 2003.
[5] I. Hofacker, W. Fontana, P.F. Stadler, L. Sebastian Bonhoeffer, M. Tacker, and P. Schuster, "Fast Folding and Comparison of RNA Secondary Structures," Monatshefte fur Chemie, vol. 125, pp. 167-188, 1994.
[6] M. Hochsmann, T. Toller, R. Giegerich, and S. Kurtz, "Local Similarity in RNA Secondary Structures," Proc. IEEE Computer Soc. Conf. Bioinformatics, p. 159, 2003.
[7] M. Hochsmann, B. Voss, and R. Giegerich, "Pure Multiple RNA Secondary Structure Alignments: A Progressive Profile Approach," IEEE/ACM Trans. Computational Biology and Bioinformatics, vol. 1, no. 1, pp. 53-62, 2004.
[8] T. Winkelmans, J. Wuyts, Y. Van de Peer, and R. De Wachter, "The European Database on Small Subunit Ribosomal RNA," Nucleic Acids Research, vol. 30, no. 1, pp. 183-185, 2002.
[9] T. Jiang, L. Wang, and K. Zhang, "Alignment of Trees: An Alternative to Tree Edit," Proc. Fifth Ann. Symp. Combinatorial Pattern Matching, pp. 75-86, 1994.
[10] F. Lisacek, Y. Diaz, and F. Michel, "Automatic Identification of Group I Intron Cores in Genomic DNA Sequences," J. Molecular Biology, vol. 235, no. 4, pp. 1206-1217, 1994.
Fig. 13. Result of the comparison of the two RNAs of Fig. 4 using trees in Fig. 2c. The thick dash lines indicate some of the associations resulting
from the computation of the edit distance between these two trees. Triangular nodes stand for bulges, diamonds for internal loops, squares for
hairpin loops, and circles for multiloops. Noncolored fine dashed nodes and lines correspond, respectively, to deleted nodes/edges.
[11] B. Shapiro, "An Algorithm for Multiple RNA Secondary Structures," Computer Applications in the Biosciences, vol. 4, no. 3, pp. 387-393, 1988.
[12] B.A. Shapiro and K. Zhang, "Comparing Multiple RNA Secondary Structures Using Tree Comparisons," Computer Applications in the Biosciences, vol. 6, no. 4, pp. 309-318, 1990.
[13] K.-C. Tai, "The Tree-to-Tree Correction Problem," J. ACM, vol. 26, no. 3, pp. 422-433, 1979.
[14] K. Zhang and D. Shasha, "Simple Fast Algorithms for the Editing Distance between Trees and Related Problems," SIAM J. Computing, vol. 18, no. 6, pp. 1245-1262, 1989.
[15] M. Zuker, "Mfold Web Server for Nucleic Acid Folding and Hybridization Prediction," Nucleic Acids Research, vol. 31, no. 13, pp. 3406-3415, 2003.
Julien Allali studied at the University of Marne-la-Vallée (France), where he received the MSc degree in computer science and computational genomics. In 2001, he began his PhD in computational genomics at the Gaspard Monge Institute of the University of Marne-la-Vallée. His thesis focused on the study of RNA secondary structures and, in particular, their comparison using a tree distance. In 2004, he received the PhD degree.
Marie-France Sagot received the BSc degree in computer science from the University of São Paulo, Brazil, in 1991, the PhD degree in theoretical computer science and applications from the University of Marne-la-Vallée, France, in 1996, and the Habilitation from the same university in 2000. From 1997 to 2001, she worked as a research associate at the Pasteur Institute in Paris, France. In 2001, she moved to Lyon, France, as a research associate at the INRIA, the French National Institute for Research in Computer Science and Control. Since 2003, she has been the Director of Research at the INRIA. Her research interests are in computational biology, algorithmics, and combinatorics.
Topological Rearrangements and Local Search Method for Tandem Duplication Trees
Denis Bertrand and Olivier Gascuel
Abstract: The problem of reconstructing the duplication history of a set of tandemly repeated sequences was first introduced by Fitch
[4]. Many recent studies deal with this problem, showing the validity of the unequal recombination model proposed by Fitch, describing
numerous inference algorithms, and exploring the combinatorial properties of these new mathematical objects, which are duplication
trees. In this paper, we deal with the topological rearrangement of these trees. Classical rearrangements used in phylogeny (NNI, SPR,
TBR, ...) cannot be applied directly on duplication trees. We show that restricting the neighborhood defined by the SPR (Subtree
Pruning and Regrafting) rearrangement to valid duplication trees, allows exploring the whole duplication tree space. We use these
restricted rearrangements in a local search method which improves an initial tree via successive rearrangements. This method is
applied to the optimization of parsimony and minimum evolution criteria. We show through simulations that this method improves all
existing programs for both reconstructing the topology of the true tree and recovering its duplication events. We apply this approach to
tandemly repeated human Zinc finger genes and observe that a much better duplication tree is obtained by our method than using any
other program.
Index Terms: Tandem duplication trees, phylogeny, topological rearrangements, local search, parsimony, minimum evolution, Zinc
finger genes.
1 INTRODUCTION
REPEATED sequences constitute an important fraction of most genomes, from the well-studied Escherichia coli bacterial genome [1] to the Human genome [2]. For
example, it is estimated that more than 50 percent of the
Human genome consists of repeated sequences [2], [3].
There exist three major types of repeated sequences:
transposon-derived repeats, micro or minisatellites, and
large duplicated sequences, the last often containing one or several RNA or protein-coding genes. Micro or minisatellites arise through a mechanism called slipped-strand
mispairing, and are always arranged in tandem: copies of
a same basic unit are linearly ordered on the chromosome.
Large duplicated sequences are also often found in tandem
and, when this is the case, unequal recombination is widely
assumed to be responsible for their formation.
Both the linear order among tandemly repeated se-
quences, and the knowledge of the biological mechanisms
responsible for their generation, suggest a simple model of
evolution by duplication. This model, first described by
Fitch in 1977 [4], introduces tandem duplication trees as
phylogenies constrained by the unequal recombination
mechanism. Although being a completely different biologi-
cal mechanism, slipped-strand mispairing leads to the same
duplication model [5]. A formal recursive definition of this
model is provided in Section 2, but its main features can be
grasped from the examples of Fig. 1. Fig. 1a shows the
duplication history of the 13 Antennapedia-class homeobox
genes from the cognate group [6]. In this history, the
ancestral locus has undergone a series of simple duplica-
tion events where one of the genes has been duplicated into
two adjacent copies. Starting from the unique ancestral
gene, this series of events has produced the extant locus
containing the 13 linearly ordered contemporary genes. It is
easily seen [7] that trees only containing simple duplication
events are equivalent to binary search trees with labeled
leaves. They differ from standard phylogenies in that node
children have left/right orientation. Fig. 1b shows another
example corresponding to the nine variable genes of the
human T cell receptor Gamma (TRGV) locus [8]. In this
history, the most recent event involves a double duplica-
tion where two adjacent genes have been simultaneously
duplicated to produce four adjacent copies. Duplication
trees containing multiple duplication events differ from
binary search trees, but are less general than phylogenies.
The model proposed by Fitch [4] covers both simple and
multiple duplication trees.
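The duplication events described above can be illustrated by a small sketch that acts on an ordered list of segments. It is only a toy model of the events themselves, not of the tree reconstruction: the function name, the list encoding of a locus, and the "a"/"b" suffixes used to tell the two copies apart are all assumptions.

```python
def duplicate(locus, start, width):
    """Return a new locus in which the `width` adjacent segments starting at
    index `start` are replaced by two adjacent copies of that block.

    width = 1 models a simple duplication; width = 2 a double duplication.
    """
    if width < 1 or start < 0 or start + width > len(locus):
        raise ValueError("the duplicated block must lie inside the locus")
    block = locus[start:start + width]
    left = [f"{segment}a" for segment in block]   # first copy of the block
    right = [f"{segment}b" for segment in block]  # second, adjacent copy
    return locus[:start] + left + right + locus[start + width:]


print(duplicate(["g"], 0, 1))            # ['ga', 'gb']
print(duplicate(["x", "y", "z"], 1, 2))  # ['x', 'ya', 'za', 'yb', 'zb']
```

In the second call, two adjacent segments are duplicated simultaneously and yield four adjacent copies, in the same order as in the double duplication of the TRGV example.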
Fitch's paper [4] received relatively little attention at the
time of its publication probably due to the lack of available
sequence data. Rediscovered by Benson and Dong [9],
Tang et al. [10], and Elemento et al. [8], tandemly repeated
sequences and their suggested duplication model have
recently received much interest, providing several new
computational biology problems and challenges [11], [12].
The main challenge consists of creating algorithms
incorporating the model constraints to reconstruct the
• The authors are with Projet Méthodes et Algorithmes pour la Bioinformatique, LIRMM (UMR 5506, CNRS-Univ. Montpellier 2), 161 rue Ada, 34392 Montpellier Cedex 5, France. E-mail: [email protected].
Manuscript received 11 Oct. 2004; revised 17 Dec. 2004; accepted 20 Dec. 2004; published online 30 Mar. 2005.
For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCBBSI-0170-1004.
1545-5963/05/$20.00 © 2005 IEEE. Published by the IEEE CS, CI, and EMB Societies & the ACM
duplication history of tandemly repeated sequences.
Indeed, accurate reconstruction of duplication histories
will be useful to elucidate various aspects of genome
evolution. They will provide new insights into the
mechanisms and determinants of gene and protein domain
duplication, often recognized as major generators of
novelty [13]. Several important gene families, such as
immunity-related genes, are arranged in tandem; better
understanding their evolution should provide new insights
into their duplication dynamics and clues about their
functional specialization. Studying the evolution of micro
and minisatellites could resolve unanswered biological questions regarding human migrations or the evolution of
bacterial diseases [14].
Given a set of aligned and ordered sequences (DNA or
proteins), the aim is to find the duplication tree that best
explains these sequences, according to usual criteria in
phylogenetics, e.g., parsimony or minimum evolution. Few
studies have focused on the computational hardness of this
problem, and all of these studies only deal with the
restricted version where simultaneous duplication of multiple adjacent segments is not allowed. In this context, Jaitly et al. [15] show that finding the optimal single copy duplication tree with parsimony is NP-Hard and that this problem has a PTAS (Polynomial Time Approximation Scheme). Another closely related PTAS is given by Tang et al. [10] for the same problem. On the other hand, Elemento et al. [7] describe a polynomial distance-based algorithm that reconstructs optimal single copy tandem duplication trees with minimum evolution.
However, it is commonly believed, as in phylogeny, that most (especially multiple) duplication tree inference problems are NP-Hard. This explains the development of heuristic approaches. Benson and Dong [9] provide various parsimony-based heuristic reconstruction algorithms to infer duplication trees, especially from minisatellites. Elemento et al. [8] present an enumerative algorithm that computes the most parsimonious duplication tree; this algorithm (by its exhaustive approach) is limited to datasets of fewer than 15 repeats. Several distance-based methods have also been described. The WINDOW method [10] uses an agglomeration scheme similar to UPGMA [16] and NJ [17], but the cost function used to judge potential duplication is based on the assumption that the sequences follow a molecular clock mode of evolution. The DTSCORE method [18] uses the same scheme but corrects this limitation using a score criterion [19], like ADDTREE [20]. DTSCORE can be used with sequences that do not follow the molecular clock, which is, for example, essential when dealing with gene families containing pseudogenes that evolve much faster than functional genes. Finally, GREEDY SEARCH [21] corresponds to a different approach divided into two steps: First, a phylogeny is computed with a classical reconstruction method (NJ); then, with nearest neighbor interchange (NNI) rearrangements, a duplication tree close to this phylogeny is computed. This approach is noteworthy since it implements topological rearrangements, which are highly useful in phylogenetics [22], but it works blindly and does not ensure that good duplication trees will be found (cf. Section 5.2).
Topological rearrangements have an essential function in
phylogenetic inference, where they are used to improve an
initial phylogeny by subtree movement or exchange.
Rearrangements are very useful for all common criteria
(parsimony, distance, maximum likelihood) and are inte-
grated into all classical programs like PAUP* [23] or
PHYLIP [24]. Furthermore, they are used to define various
distances between phylogenies and are the foundation of
much mathematical work [25]. Unfortunately, they cannot
be directly used here, as shown by a simple example given
Fig. 1. (a) Rooted duplication tree describing the evolutionary history of the 13 Antennapedia-class homeobox genes from the cognate group [6].
(b) Rooted duplication tree describing the evolutionary history of the nine variable genes of the human T cell receptor Gamma (TRGV) locus [8]. In
both examples, the contemporary genes are adjacent and linearly ordered along the extant locus.
later. Indeed, when applied to a duplication tree, they do
not guarantee that another valid duplication tree will be
produced.
In this paper, we describe a set of topological rearrange-
ments to stay inside the duplication tree space and explore
the whole space from any of its elements. We then show the
advantages of this approach for duplication tree inference
from sequences. In Section 2, we describe the duplication
model introduced by [4], [8], [10], as well as an algorithm to
recognize duplication trees in linear time. Thanks to this
algorithm, we restrict the neighborhoods defined by
classical phylogeny rearrangements, namely, nearest neigh-
bor interchange (NNI) and subtree pruning and regrafting
(SPR), to valid duplication trees. We demonstrate (Section 3)
that for NNI moves this restricted neighborhood does not
allow the exploration of the whole duplication tree space.
On the other hand, we demonstrate that the restricted
neighborhood of SPR rearrangement allows the whole
space to be explored. In this way, we define a local search
method, applied here to parsimony and minimum evolu-
tion (Section 4). We compare this method to other existing
approaches using simulated and real data sets (Section 5).
We conclude by discussing the positive results obtained by
our method, and indicate directions for further research
(Section 6).
2 MODEL
2.1 Duplication History and Duplication Tree
The tandem duplication model used in this article was first
introduced by Fitch [4] then studied independently by [8],
[10]. It is based on unequal recombination which is assumed
to be the sole evolution mechanism (except point mutations)
acting on sequences. Although it is a completely different
biological mechanism, slipped-strand mispairing leads to
the same duplication model [5], [9].
Let $O = (1, 2, \ldots, n)$ be the ordered set of sequences
representing the extant locus. Initially containing a single
copy, the locus grew through a series of consecutive
duplications. As shown in Fig. 2a, a duplication history
may contain simple duplication events. When the dupli-
cated fragment contains two, three, or k repeats, we say that
it involves a multiple duplication event. Under this
duplication model, a duplication history is a rooted tree
with n labeled and ordered leaves, in which internal nodes
of degree 3 correspond to duplication events. In a real
duplication history (Fig. 2a), the time intervals between
consecutive duplications are completely known, and the
internal nodes are ordered from top to bottom according to
the moment they occurred in the course of evolution. Any ordered segment set of the same height then represents an
ancestral state of the locus. We call such a set a floor, and
we say that two nodes $i, j$ are adjacent ($i \prec j$) if there is a
floor where $i$ and $j$ are consecutive and $i$ is on the left of $j$.
However, in the absence of a molecular clock mode of
evolution (a typical problem), it is impossible to recover the
order between the duplication events of two different
lineages from the sequences. In this case, we are only able to
infer a duplication tree (DT) (Fig. 2b) or a rooted
duplication tree (RDT) (Fig. 2c).
A duplication tree is an unrooted phylogeny with
ordered leaves, whose topology is compatible with at least
one duplication history. Also, internal nodes of duplication
trees are partitioned into events (or blocks following
[10]), each containing one or more (ordered) nodes. We
distinguish simple duplication events that contain a
unique internal node (e.g., b and f in Fig. 2c) and multiple
duplication events which group a series of adjacent and
simultaneous duplications (e.g., c, d, and e in Fig. 2c). Let
$E = (s_i, s_{i+1}, \ldots, s_k)$ denote an event containing internal
nodes $s_i, s_{i+1}, \ldots, s_k$ in left to right order. We say that two
consecutive nodes of the same event are adjacent ($s_j \prec s_{j+1}$)
just like in histories, as any event belongs to a floor in all of
BERTRAND AND GASCUEL: TOPOLOGICAL REARRANGEMENTS AND LOCAL SEARCH METHOD FOR TANDEM DUPLICATION TREES 17
Fig. 2. (a) Duplication history; each segment represents a copy; extant segments are numbered. (b) Duplication tree (DT); the black points show the
possible root locations. (c) Rooted duplication tree (RDT) corresponding to history (a) and root position 1 on (b).
the histories that are compatible with the DT being
considered. The same notation will also be used for leaves
to express the segment order in the extant locus. When the
tree is rooted, every internal node $s_j$ is unambiguously
associated to one parent and two child nodes; moreover,
one child of $s_j$ is left and the other one is right, which are
denoted as $l_j$ and $r_j$, respectively. In this case, for any duplication history that is compatible with this tree, child
nodes of an event $(s_i, s_{i+1}, \ldots, s_k)$ are organized as follows:
$$l_i \prec l_{i+1} \prec \ldots \prec l_k \prec r_i \prec r_{i+1} \prec \ldots \prec r_k.$$
In [8], [26], [27], it was shown that rooting a
duplication tree is different from rooting a phylogeny:
the root of a duplication tree necessarily lies on the tree
path between the most distant repeats on the locus, i.e., 1
and $n$; moreover, the root is always located above all
multiple duplications, e.g., Fig. 1b shows that there are
only three valid root positions; the root cannot be a direct
ancestor of 12.
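The first rooting constraint can be illustrated with a minimal Python sketch. It assumes the unrooted tree is given as an adjacency dictionary (a hypothetical encoding, as are the function and node names) and returns the edges on the path between the two extreme leaves. It only covers the necessary condition stated above; it does not check the second constraint, that the root lies above all multiple duplications.

```python
from collections import deque

def candidate_root_edges(adjacency, first_leaf, last_leaf):
    """Return the edges on the tree path from first_leaf to last_leaf.

    adjacency maps every node of an unrooted tree to its neighbors.
    Only edges on this path can carry the root of a duplication tree.
    """
    previous = {first_leaf: None}
    queue = deque([first_leaf])
    # Breadth-first search; in a tree the path found is the unique path.
    while queue:
        node = queue.popleft()
        if node == last_leaf:
            break
        for neighbor in adjacency[node]:
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    # Walk back from last_leaf to first_leaf, then restore the order.
    path = []
    node = last_leaf
    while node is not None:
        path.append(node)
        node = previous[node]
    path.reverse()
    return list(zip(path, path[1:]))

# Four-leaf example: internal nodes "u" and "v" are illustrative names.
quartet = {
    1: ["u"], 2: ["u"], 3: ["v"], 4: ["v"],
    "u": [1, 2, "v"], "v": ["u", 3, 4],
}
root_edges = candidate_root_edges(quartet, 1, 4)
```

In this example the pendant edges of leaves 2 and 3 are excluded, since they do not lie between leaf 1 and leaf 4.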
2.2 Recursive Definition of Rooted and Unrooted Duplication Trees
A duplication tree is compatible with at least one duplica-
tion history. This suggests a recursive definition, which
progressively reconstructs a possible history, given a
phylogeny $T$ and a leaf ordering $O$. We define a cherry
$(l, s, r)$ as a pair of leaves ($l$ and $r$) separated by a single
node $s$ in $T$, and we call $C(T)$ the set of cherries of $T$. This
recursive definition reverses evolution: It searches for a
visible duplication event, agglomerates this event, and
checks whether the reduced tree is a duplication tree. In case of rooted trees, we have:
$(T, O)$ defines a duplication tree with root $\rho$ if and only if:
1. $(T, O)$ only contains $\rho$, or
2. there is in $C(T)$ a series of cherries $(l_i, s_i, r_i), (l_{i+1}, s_{i+1}, r_{i+1}), \ldots, (l_k, s_k, r_k)$ with $k \geq i$ and $l_i \prec l_{i+1} \prec \ldots \prec l_k \prec r_i \prec r_{i+1} \prec \ldots \prec r_k$ in $O$, such that $(T', O')$ defines a duplication tree with root $\rho$, where $T'$ is obtained from $T$ by removing $l_i, l_{i+1}, \ldots, l_k, r_i, r_{i+1}, \ldots, r_k$, and $O'$ is obtained by replacing $(l_i, l_{i+1}, \ldots, l_k, r_i, r_{i+1}, \ldots, r_k)$ by $(s_i, s_{i+1}, \ldots, s_k)$ in $O$.
The definition for unrooted trees is quite similar:
$(T, O)$ defines an unrooted duplication tree if and only if:
1. $(T, O)$ contains 1 segment, or
2. same as for rooted trees with $(T', O')$ now defining an
unrooted duplication tree.
Those definitions provide a recursive algorithm, RADT
(Recognition Algorithm for Duplication Trees), to check
whether any given phylogeny with ordered leaves is a
duplication tree. In case of success, this algorithm can also
be used to reconstruct duplication events: At each step, the
series of internal nodes above denoted as $(s_i, s_{i+1}, \ldots, s_k)$ is
a duplication event. When the tree is rooted, $l_j$ is the left
child of $s_j$ and $r_j$ its right child, for every $j$, $i \leq j \leq k$. This
algorithm can be implemented in $O(n)$ [26] where $n$ is the
number of leaves. Another linear algorithm is proposed by
Zhang et al. [21] using a top-down approach instead of a
bottom-up one, but applies only to rooted duplication trees.
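The recursive definition for rooted trees can be illustrated by a direct, exhaustive transcription in Python. This is a sketch under stated assumptions, not the linear-time RADT implementation of [26]: the tree is encoded as nested (left, right) tuples with distinct integer leaves, the function name is hypothetical, and every visible event is tried in turn, which may take exponential time on large inputs.

```python
from functools import lru_cache

def is_rooted_duplication_tree(tree, order):
    """Check the recursive definition on a rooted tree and a leaf ordering.

    tree  : nested (left, right) tuples whose leaves are distinct integers.
    order : sequence of the leaves as they appear along the locus.
    """
    parent = {}

    def index(node):
        # Record the parent of every subtree (subtrees are hashable tuples).
        if isinstance(node, tuple):
            for child in node:
                parent[child] = node
                index(child)

    index(tree)

    @lru_cache(maxsize=None)
    def reducible(current):
        if len(current) == 1:
            # Base case: only the root remains.
            return current[0] == tree
        n = len(current)
        for start in range(n):
            for m in range(1, (n - start) // 2 + 1):
                lefts = current[start:start + m]
                rights = current[start + m:start + 2 * m]
                # Visible event: m cherries whose left leaves are followed
                # immediately by their right leaves, in the same order.
                if all(parent.get(l) is not None
                       and parent.get(l) == parent.get(r)
                       for l, r in zip(lefts, rights)):
                    merged = tuple(parent[l] for l in lefts)
                    reduced = current[:start] + merged + current[start + 2 * m:]
                    if reducible(reduced):
                        return True
        return False

    return reducible(tuple(order))

simple = is_rooted_duplication_tree(((1, 2), 3), (1, 2, 3))
double = is_rooted_duplication_tree(((1, 3), (2, 4)), (1, 2, 3, 4))
invalid = is_rooted_duplication_tree(((1, 4), (2, 3)), (1, 2, 3, 4))
```

In the second example, leaves 1, 2 and 3, 4 form a double duplication event, which is agglomerated in a single step. The third tree is rejected because leaves 1 and 4 would have to be siblings while being separated by the cherry formed by leaves 2 and 3.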
3 TOPOLOGICAL REARRANGEMENTS FOR DUPLICATION TREES
This section shows how to explore the DT space using SPR
rearrangements. First, we describe some NNI, SPR, and
TBR rearrangement properties with standard phylogenies.
But, these rearrangements cannot be directly used to
explore the DT space. Indeed, when applied to a duplica-
tion tree, they do not guarantee that another valid
duplication tree will be produced. So, we have decided to
restrict the neighborhood defined by those rearrangements
to duplication trees. If we only used NNI rearrangements,
the neighborhood would be too restricted (as shown by a
simple example) and would not allow the whole DT space
to be explored. On the other hand, we can distinguish two
types of SPR rearrangements which, when applied to a
rooted duplication tree guarantee that another valid
duplication tree will be produced. Thanks to these specific
rearrangements, we demonstrate that restricting the neigh-
borhood of SPR rearrangements allows the whole space of
duplication trees to be explored.
Fig. 3. The tree obtained by applying an NNI move to a DT is not always a valid DT: $T$ whose $RT$ is a rooted version; $T'$ is obtained by
applying NNI(5,4) around the bold edge; none of the possible root positions of $T'$ (a, b, c, and d) leads to a valid RDT, cf. tree (b) which
corresponds to root b in $T'$.
3.1 Topological Rearrangements for Phylogeny
There are many ways of carrying out topological rearrange-
ments on phylogeny [22]. We only describe NNI (Nearest
Neighbor Interchange), SPR (Subtree Pruning Regrafting),
and TBR (Tree Bisection and Reconnection) rearrangements.
The NNI move is a simple rearrangement which
exchanges two subtrees adjacent to the same internal edge (Figs. 3 and 4). There are two possible NNIs for each
internal edge, so $2(n-3)$ neighboring trees for one tree
with $n$ leaves. This rearrangement allows the whole space of
phylogeny to be explored; i.e., there is a succession of NNI
moves making it possible to transform any phylogeny $P_1$ into any phylogeny $P_2$ [28].
The SPR move consists of pruning a subtree and
regrafting it, by its root, to an edge of the resulting tree
(Figs. 6 and 7). We note that the neighborhood of a tree
defined by the NNI rearrangements is included in the
neighborhood defined by SPRs. The latter rearrangement
defines a neighborhood of size $2(n-3)(2n-7)$ [25]. Finally, TBR generalizes SPR by allowing the pruned
subtree to be reconnected by any of its edges to the resulting
tree. These three rearrangements (NNI, SPR, and TBR) are
reversible, that is, if $T'$ is obtained from $T$ by a particular
rearrangement, then $T$ can be obtained from $T'$ using the
same type of rearrangement.
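To make the two neighborhood sizes concrete, the short Python sketch below evaluates them for unrooted binary trees with $n$ leaves. The function names are hypothetical, and the formulas are simply those quoted above: $2(n-3)$ for NNI and $2(n-3)(2n-7)$ for SPR.

```python
def nni_neighborhood_size(n):
    """Number of NNI neighbors of an unrooted binary tree with n leaves."""
    if n < 3:
        raise ValueError("an unrooted binary tree needs at least 3 leaves")
    # n - 3 internal edges, two possible exchanges around each of them.
    return 2 * (n - 3)

def spr_neighborhood_size(n):
    """Number of SPR neighbors of an unrooted binary tree with n leaves."""
    if n < 3:
        raise ValueError("an unrooted binary tree needs at least 3 leaves")
    return 2 * (n - 3) * (2 * n - 7)

sizes = {n: (nni_neighborhood_size(n), spr_neighborhood_size(n))
         for n in (4, 5, 10)}
```

For four leaves both neighborhoods contain the two other topologies, while for ten leaves the SPR neighborhood (182 trees) is already much larger than the NNI neighborhood (14 trees).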
3.2 NNI Rearrangements Do Not Stay in DT Space
The classical phylogenetic rearrangements (NNI, SPR,
TBR, ...) do not always stay in DT space. So, if we apply
an NNI to a DT (e.g., Fig. 3), the resulting tree is not always
a valid DT. This property is also true for SPR and TBR rearrangements since NNI rearrangements are included in
these two rearrangement classes.
3.3 Restricted NNI Does Not Allow the Whole DT Space to Be Explored
To restrict the neighborhood defined by NNI rearrange-
ments to duplication trees, each element of the neighbor-
hood is filtered thanks to the recognition algorithm (RADT).
But, this restricted neighborhood does not allow the whole
DT space to be explored. Fig. 4 gives an example of a
duplication tree, T, the neighborhood of which does not
contain any DT. So, its restricted neighborhood is empty,
and there is no succession of restricted NNIs allowing T to
be transformed into any other DT.
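The filtering step can be sketched generically in Python. The neighbor generator and the recognition predicate are passed in as parameters; both are hypothetical stand-ins (the toy versions below are not real NNI or RADT implementations), so the sketch only illustrates how a restricted neighborhood is obtained from an unrestricted one.

```python
def restricted_neighborhood(tree, leaf_order, neighbors, is_duplication_tree):
    """Keep only the neighbors of `tree` that are valid duplication trees.

    neighbors(tree) yields the trees of the unrestricted neighborhood;
    is_duplication_tree(candidate, leaf_order) plays the role of RADT.
    """
    return [candidate for candidate in neighbors(tree)
            if is_duplication_tree(candidate, leaf_order)]

# Toy stand-ins on three leaves, for illustration only.
def toy_neighbors(tree):
    return [t for t in [((1, 2), 3), ((1, 3), 2), (1, (2, 3))] if t != tree]

def toy_is_duplication_tree(tree, leaf_order):
    # On three ordered leaves, only the tree pairing leaves 1 and 3 is invalid.
    return tree != ((1, 3), 2)

kept = restricted_neighborhood(((1, 2), 3), (1, 2, 3),
                               toy_neighbors, toy_is_duplication_tree)
```

When every candidate is rejected, the returned list is empty, which is the situation described for the tree of Fig. 4.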
3.4 Restricted SPR Allows the Whole DT Space to Be Explored
As before, we restrict (using RADT) the neighborhood
defined by SPR rearrangements to duplication trees. We
call restricted SPRs those SPR moves that, starting from a
duplication tree, lead to another duplication tree.
Main Theorem. Let $T_1$ and $T_2$ be any given duplication trees; $T_1$
can be transformed into $T_2$ via a succession of restricted SPRs.
Proof. To demonstrate the Main Theorem, we define two
types of special SPR that ensure staying within the space
of rooted duplication trees (RDT). Given these two types
of SPRs, we demonstrate that it is possible to transform
any rooted duplication tree into a caterpillar, i.e., a
rooted tree in which all internal nodes belong to the tree
path between the leaf 1 and the tree root (cf. Fig. 5).
This result demonstrates the theorem. Indeed, let $T_1$ and $T_2$ be two RDTs. We can transform $T_1$ and $T_2$ into a
caterpillar by a succession of restricted SPRs. So, it is
possible to transform $T_1$ into $T_2$ by a succession of
restricted SPRs, with (possibly) a caterpillar as intermediate tree. This property holds since the reciprocal
movement of an SPR is an SPR. As the two SPR types
proposed ensure that we stay within the RDT space, we
have the desired result for rooted duplication trees. And,
this result extends to unrooted duplication trees since
two DTs can be arbitrarily rooted, transformed from one
to the other using restricted SPRs, then unrooted. □
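The structure of this argument can be sketched in Python under simple assumptions: trees are nested (left, right) tuples, a sequence of moves is represented by the list of trees it visits, and both function names are hypothetical. The first function builds the caterpillar on leaves 1 to $n$; the second joins two sequences ending at the same caterpillar, relying on the reversibility of SPR moves.

```python
from functools import reduce

def caterpillar(n):
    """Rooted caterpillar on leaves 1..n: every internal node is an ancestor of leaf 1."""
    if n < 1:
        raise ValueError("at least one leaf is required")
    return reduce(lambda subtree, leaf: (subtree, leaf), range(2, n + 1), 1)

def compose_via_caterpillar(path_from_t1, path_from_t2):
    """Join T1 -> ... -> C and T2 -> ... -> C into T1 -> ... -> C -> ... -> T2.

    Each path is the list of trees visited; the second path is traversed
    backwards, which is allowed because the reverse of an SPR is an SPR.
    """
    if path_from_t1[-1] != path_from_t2[-1]:
        raise ValueError("both paths must end at the same caterpillar")
    return path_from_t1 + path_from_t2[-2::-1]

six_leaf = caterpillar(6)
# Placeholder labels stand in for actual trees along each path.
joined = compose_via_caterpillar(["T1", "A", "C"], ["T2", "B", "C"])
```

The joined sequence starts at the first tree, passes through the caterpillar once, and ends at the second tree.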
The first special SPR allows multiple duplication
events to be destroyed. Let $E = (s_i, s_{i+1}, \ldots, s_k)$ be a
duplication event, $r_i$ and $l_k$ respectively right child of $s_i$
Fig. 5. A six-leaf caterpillar.
Fig. 4. The NNI neighborhood of a duplication tree does not always contain duplication trees: $T$ whose $RT$ is a rooted version; $T'$ is obtained by
exchanging subtrees 1 and (2 5); none of the possible root positions of $T'$ (a, b, and c) leads to a valid duplication tree, cf. tree (b) which corresponds
to root b in $T'$; and the same holds for every neighbor of $T$ being obtained by NNI.
and left child of $s_k$, and let $p_i$ be the father of $s_i$. The
DELETE rearrangement consists of pruning the subtree of
root $r_i$ and grafting this subtree on the edge $(s_k, l_k)$, while
$l_i$ is renamed $s_i$ and the edge $(l_i, s_i)$ is deleted. Fig. 6
demonstrates this rearrangement.
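One possible reading of this description is sketched below in Python. The encoding is an assumption: the rooted tree is a dictionary mapping each internal node to its (left, right) children, the function name is hypothetical, and the node created by the regrafting receives a caller-supplied name. Instead of renaming $l_i$ to $s_i$, the sketch splices $s_i$ out and attaches $l_i$ directly to $p_i$, which gives the same topology.

```python
def delete_event(children, event, new_node):
    """Apply DELETE to a multiple duplication event (s_i, ..., s_k).

    children : dict mapping each internal node to its (left, right) children.
    event    : sequence of the event's internal nodes in left to right order.
    new_node : name for the node created on the edge (s_k, l_k).
    Returns a new dictionary; the input is left unchanged.
    """
    if len(event) < 2:
        raise ValueError("DELETE is described for multiple duplication events")
    s_i, s_k = event[0], event[-1]
    l_i, r_i = children[s_i]
    l_k, r_k = children[s_k]
    result = dict(children)
    # Find p_i, the parent of s_i (assumed to exist: s_i is not the root).
    p_i = next(p for p, kids in children.items() if s_i in kids)
    left, right = result[p_i]
    # Prune r_i: s_i disappears and l_i takes its place below p_i.
    result[p_i] = (l_i, right) if left == s_i else (left, l_i)
    del result[s_i]
    # Regraft r_i on the edge (s_k, l_k): the new node has l_k on its left.
    result[new_node] = (l_k, r_i)
    result[s_k] = (new_node, r_k)
    return result

# Double duplication ((1,3),(2,4)) on the locus 1, 2, 3, 4.
before = {"root": ("s1", "s2"), "s1": (1, 3), "s2": (2, 4)}
after = delete_event(before, ("s1", "s2"), "x")
```

On this example the double duplication is replaced by simple duplications only: the resulting tree is (1, ((2, 3), 4)), in which leaves 2 and 3 form a cherry.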
Lemma 1. DELETE preserves the RDT property.
Proof. Let $T$ be the initial tree (Fig. 6a), $E = (s_i, s_{i+1}, \ldots, s_k)$
be an event of $T$, and $T'$ be the tree obtained from $T$ by
applying DELETE to $E$ (Fig. 6b). Children of any node $s_j$ ($i \leq j \leq k$) are denoted $l_j$ and $r_j$.
By definition, for any duplication history compatible
with $T$ we have
$$l_i \prec l_{i+1} \prec \ldots \prec l_k \prec r_i \prec r_{i+1} \prec \ldots \prec r_k.$$
Thus, there is a way to partially agglomerate $T$ (using an
RADT-like procedure) such that these nodes become
leaves. The same agglomeration can be applied to $T'$ as
only ancestors of the $l_j$s and $r_j$s are affected by DELETE.
Now, 1) agglomerate the event $E$ of $T$, and 2) reduce $T'$
by agglomerating the cherry $(l_k, r_i)$ and then agglomerating the