Exact Algorithms for the Reversal Median Problem
by
Adam C. Siepel
B.S., Cornell University, 1994
THESIS
Submitted in Partial Fulfillment of the
Requirements for the Degree of
Master of Science
Computer Science
The University of New Mexico
Albuquerque, New Mexico
December, 2001
© 2001, Adam C. Siepel
Acknowledgments
Many people have supported me, directly or indirectly, in the months of research, programming, and writing that went into this thesis, but several deserve special recognition. Of these, my adviser, Bernard Moret, has been the most directly involved. His encouragement has been more important than he is likely to realize; it has many times given me the confidence to face another long weekend of solitary writing or programming. The other members of my committee, David Sankoff and David Bader, also deserve special thanks. David Sankoff more than anyone else has created the field to which I hope to contribute with this work. Before I took his seminar at the University of New Mexico (taught while he was there on sabbatical), I knew neither what a reversal nor a median was, let alone that the two together posed a problem. David Bader has always been generous and encouraging with me, and I have benefited enormously from his classes. Without the use of his flawless (and very fast) code for computing inversion distance, my first naive attempts at the reversal median problem would likely have ended in frustration and failure. My wife, Amber, has been the least, yet the most, involved of all in this project. It is she who has kept the life we share moving along, when I have been too busy to do anything but work; who has endured my long periods of silent contemplation of matters that, to her humanist mind, were utterly devoid of interest; and who has still managed to feign believable enthusiasm when discussing a new species of hurdle. I cannot neglect to mention my colleagues at the National Center for Genome Resources, and my supervisor, Bill Beavis, who have been exceptionally accommodating of my sometimes irregular schedule. Finally, as I so rarely think to do, I acknowledge my parents, Timothy and Virginia Siepel; they are responsible for instilling in me the inquisitive nature and appetite for learning that are required to sustain even the humblest attempt at research.
Exact Algorithms for the Reversal Median Problem
by
Adam C. Siepel
ABSTRACT OF THESIS
Submitted in Partial Fulfillment of the
Requirements for the Degree of
Master of Science
Computer Science
The University of New Mexico
Albuquerque, New Mexico
December, 2001
Exact Algorithms for the Reversal Median Problem
by
Adam C. Siepel
B.S., Cornell University, 1994
M.S., Computer Science, University of New Mexico, 2001
Abstract
While most work in computational molecular biology since its inception in the
1970s has focused on problems involving DNA and amino acid sequences, there has
been growing interest during the past decade in the use of alternative models of
molecular evolution that are based on the order and content of genes in complete
genomes. Phylogeny reconstruction using gene order is of particular interest, and
promises to offer improved results in cases where sequence-based methods perform
poorly, such as when taxa are distant or sequences mutate rapidly. Evolutionary
distance between genomes can be estimated as the minimum number of inversions
required to transform one into the other, a measure that can be computed efficiently
as the reversal distance between signed permutations. Much progress has been made
in recent years on this problem and the related one of finding a minimum sequence
of sorting reversals, but the application of these techniques to phylogeny so far has
been limited to distance-based reconstruction methods. An alternative method of
reconstruction, presented by Sankoff and Blanchette, uses a Steinerization process to
establish an optimal tree in which internal nodes are labeled with the configurations
of ancestral genomes. This method relies on repeatedly finding “medians” of three
signed permutations. Medians have previously been computed using a heuristic
called “breakpoint distance” rather than the more precise reversal distance, largely
because no efficient solution existed to the reversal median problem. In this thesis, we
derive in three stages an efficient algorithm to find a reversal median of three signed
permutations. In the first stage, we develop a simple branch-and-bound algorithm
that uses only the metric properties of the distance measure; in the second stage, we
derive a solution to the problem of finding all sorting reversals of one permutation
with respect to another; and in the third stage, we synthesize from the results of the
first two a dramatically improved algorithm for the median problem.
Contents
List of Figures xi
1 Introduction 1
1.1 Computing Genomic Distance . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Phylogeny Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Reconstruction Using Medians of Three . . . . . . . . . . . . . . . . . 10
1.4 The Reversal Median Problem . . . . . . . . . . . . . . . . . . . . . . 12
2 A Simple Algorithm for Finding an Exact Median 14
2.1 Notation and Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4 Experimental Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.5.1 Performance of Bounds . . . . . . . . . . . . . . . . . . . . . . 24
2.5.2 Running Time and Parallel Speedup . . . . . . . . . . . . . . 26
2.5.3 Reversal Medians vs. Breakpoint Medians . . . . . . . . . . . 28
2.5.4 Preponderance of Perfect Medians . . . . . . . . . . . . . . . . 31
2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3 Finding All Sorting Reversals 33
3.1 Notation and Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2 Sorting Reversals in the Absence of Fortresses . . . . . . . . . . . . . 41
3.3 Accommodating Fortresses . . . . . . . . . . . . . . . . . . . . . . . . 48
3.4 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.5 Detecting New Unoriented Components . . . . . . . . . . . . . . . . . 68
3.5.1 A Simple Linear-Time Solution . . . . . . . . . . . . . . . . . 69
3.5.2 The Effect of a Reversal on the Overlap Graph . . . . . . . . 70
3.5.3 A Bitwise Algorithm . . . . . . . . . . . . . . . . . . . . . . . 73
3.6 Experimental Methods and Results . . . . . . . . . . . . . . . . . . . 79
3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4 An Improved Algorithm for Finding an Exact Median 86
4.1 Enumerating Neutral Reversals . . . . . . . . . . . . . . . . . . . . . 87
4.2 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.3 Experimental Methods and Results . . . . . . . . . . . . . . . . . . . 96
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
References 102
List of Figures
1.1 Representation of distance problem with signed permutations . . . . 6
1.2 Illustration of median-based phylogeny reconstruction . . . . . . . . 11
2.1 Illustration of reversal graph . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Illustration of global lower bound for median . . . . . . . . . . . . . 17
2.3 Illustration of bounds for intermediate vertex . . . . . . . . . . . . . 18
2.4 Number of vertices visited while finding a median . . . . . . . . . . 25
2.5 Statistical median of V for constant r . . . . . . . . . . . . . . . . . 26
2.6 Statistical median of V for constant n . . . . . . . . . . . . . . . . . 27
2.7 Sequential and parallel running times . . . . . . . . . . . . . . . . . 28
2.8 Comparison of various medians . . . . . . . . . . . . . . . . . . . . . 29
2.9 Distribution of number of medians . . . . . . . . . . . . . . . . . . . 30
2.10 Percentage of medians that are perfect . . . . . . . . . . . . . . . . . 31
3.1 Illustration of idea for sorting median . . . . . . . . . . . . . . . . . 34
3.2 Example breakpoint graph and overlap graph . . . . . . . . . . . . . 37
3.3 Ways to cut a hurdle and ways to merge hurdles . . . . . . . . . . . 41
3.4 Example of a double superhurdle . . . . . . . . . . . . . . . . . . . . 45
3.5 Example of a sorting reversal that merges two benign components . 47
3.6 Example hurdle graph . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.7 Example showing effect of a reversal on overlap of edges . . . . . . . 71
3.8 The effect of a reversal on the breakpoint and overlap graphs . . . . 72
3.9 The effect of an arbitrary reversal on the bit matrix . . . . . . . . . 78
3.10 Running times of programs find-all-bf and find-all-sr . . . . . 81
3.11 Speedup of find-all-sr . . . . . . . . . . . . . . . . . . . . . . . . 82
3.12 Average number of sorting reversals . . . . . . . . . . . . . . . . . . 83
4.1 Comparison of metric and sorting median algorithms . . . . . . . . . 100
4.2 Performance of sorting algorithm (detail) . . . . . . . . . . . . . . . 101
4.3 Speedup of sorting algorithm with respect to metric algorithm . . . 101
Chapter 1
Introduction
Recent years have seen numerous advancements in the mathematical and computational study of genome rearrangements—those processes that change the way segments of chromosomal DNA are organized into complete genomes (see summaries in [34, 24, 33, 30]). Because genome rearrangements become evident through a comparative analysis of contemporary genomes, this sub-field of computational molecular biology has been called “comparative genomics” [33]1. Comparative genomics uses differences in the gene order and gene content of the genomes of related organisms as a source of information about mechanisms of molecular change during evolutionary history and about the evolutionary relationships among organisms. Applications include constructing comparative genomic maps [16], reconstructing phylogenetic relationships among organisms [32, 27, 28, 22], estimating the lengths or boundaries of homologous chromosomal segments [23, 31, 16, 36], and estimating the relative frequencies of mechanisms of genome rearrangement [21]. As the emphasis of genomic science shifts from individual genes to whole genomes, the methods of comparative genomics are expected only to become of greater interest.

1Note that the same term has been used in a broader sense, to suggest any sort of comparison of genomic data from different organisms.
Computer scientists and mathematicians working in computational molecular biology since its inception in the early 1970s have focused largely on the analysis of DNA and amino acid sequences. Problems that have received the most attention include similarity searching (searching a database of sequences for the closest matches to a query sequence), multiple alignment (aligning homologous regions of a set of sequences, and placing “gaps” to indicate likely insertions or deletions), contig assembly (building a contiguous sequence from a set of overlapping fragments), gene prediction (finding segments of genomic DNA believed to code for proteins), and phylogeny reconstruction (hypothesizing the evolutionary relationships among organisms from comparisons of their sequences). In general, solutions to these problems are based on models of evolutionary change. For example, to compute a useful “distance” between two sequences (as one often does in similarity searching, multiple sequence alignment, and phylogeny reconstruction), one must assume a model of sequence mutation over evolutionary time. The mutation of sequences, however, is only one aspect of molecular evolution; at a level above the genes, chromosomes split and combine, whole genes are duplicated and lost, and genes are rearranged in order. Comparative genomics, by characterizing genomes in terms of gene content and order, provides an alternative framework for modeling evolution—one that allows for phenomena, such as gene loss, gene duplication, and genome rearrangement, that are not easily accommodated with sequence-based approaches, and one that provides the means to measure evolutionary change at a much greater time scale (because changes to gene order and content occur much less frequently than mutations at the sequence level).

Comparative genomics is being used both to address new problems and to enable new approaches to old problems. Perhaps the best example of the latter use is phylogeny reconstruction, which can be performed according to comparisons of gene content and gene order instead of according to sequence comparison. Performing phylogeny reconstruction at the genome, rather than the sequence, level promises to
be particularly valuable in cases where taxa are extremely distant or where sequences
are evolving rapidly [24] (in these cases, sequence-based methods perform poorly
because of the problem of “saturation” in edit distances).
The subject of this thesis, the “reversal median problem”, is related to a method
used to perform phylogeny reconstruction according to gene order. Below we will
briefly outline the background of the problem.
1.1 Computing Genomic Distance
One of the most fundamental computational problems in comparative genomics, which must be solved before many higher-level problems can be attacked, is to compute the “distance” between two genomes. The idea is to come up with a measure, based on gene order and gene content, that reflects as closely as possible the evolutionary distance of the given organisms. The challenge is to find a measurement that is biologically meaningful yet efficiently computable.

To be realistic, a measurement should reflect several known mechanisms of genomic rearrangement. In the case of single-chromosome genomes (such as those of prokaryotes, chloroplasts, and mitochondria), these mechanisms include the following:
• inversion: A section of a chromosome is excised, reversed in orientation, and re-inserted.

• transposition: A section of a chromosome is excised and inserted at a new position in the chromosome, without changing orientation.

• inverted transposition: Exactly like transposition, except that the transposed segment changes orientation.
• gene duplication: A section of a chromosome is duplicated, so that multiple
copies exist of every gene in that section.
• gene loss: A section of a chromosome is excised and lost, so that all of its
genes are effectively deleted from the organism.
If a genome has multiple chromosomes, then transposition and inverted transposition can occur between chromosomes. In this case, the following are also possible:
• translocation: The end of one chromosome is broken off and attached to the
end of another chromosome.
• fission: A chromosome splits and becomes two chromosomes.
• fusion: Two chromosomes combine and become one chromosome.
In early attempts to compute genomic distance (e.g., [32]), investigators found that the problem was quite difficult, even when one considered only a few mechanisms (in [32], transpositions, insertions, deletions, and inversions were considered). It became evident that the genomic distance problem was algorithmically much harder, and likely to be computationally more intensive, than computing edit distances between sequences. As a result, Sankoff, et al. began to use a heuristic for genomic distance called “breakpoint distance”, which applies when two genomes each contain a single instance of each of n genes. The breakpoint distance is the number of pairs of genes that are adjacent in one genome but not in the other (the measure is symmetric). Breakpoint distance is rapidly computable and appears to be a reasonable estimator of genomic distance, but it is not directly correlated to any mechanism of rearrangement (all mechanisms can create or remove breakpoints, but no rearrangement-based measure of distance can be determined precisely from breakpoint distance)2.
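As a concrete illustration, breakpoint distance can be computed in linear time. The sketch below is our own illustration for signed, linear genomes, using one common convention (sentinel genes 0 and n+1 at the ends, and sign-aware adjacencies), not necessarily the exact variant used by Sankoff, et al.:

```python
def breakpoint_distance(p, q):
    """Number of adjacencies of p absent from q (signed, linear genomes).

    p and q are lists of signed integers, each a permutation of 1..n up
    to sign.  Genomes are framed with sentinels 0 and n+1 so that the
    chromosome ends also count as adjacencies.
    """
    n = len(p)
    fp = [0] + list(p) + [n + 1]
    fq = [0] + list(q) + [n + 1]
    # An adjacency (a, b) of q is also present as (-b, -a) when the
    # segment containing it is read in the opposite orientation.
    adj = set()
    for a, b in zip(fq, fq[1:]):
        adj.add((a, b))
        adj.add((-b, -a))
    return sum((a, b) not in adj for a, b in zip(fp, fp[1:]))
```

For example, `breakpoint_distance([1, -3, -2, 4], [1, 2, 3, 4])` evaluates to 2, consistent with the fact that a single inversion creates at most two breakpoints.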
In the early 1990s, Pavel Pevzner, Vineet Bafna, and Sridhar Hannenhalli chose
instead to distill the distance problem to what may be its simplest useful formulation
directly based on known mechanisms of rearrangement: to find the minimum number
of inversions necessary to transform one genome into another. They worked also with
two genomes each containing a single instance of n genes, but in their formulation of
the distance problem, one configuration of genes must be transformed to the other
using only the mechanism of inversion. This problem can be restated mathematically
as that of finding the minimum number of reversals required to “sort” a permutation
of size n—i.e., to convert it to the identity permutation (note that two arbitrary
permutations can always be mapped such that one of them is represented as the
identity permutation). The problem of “sorting by reversals”3 can be seen as a
generalization of a problem known as “sorting by prefix reversals”, which incidentally
was studied (but not solved) in the late 1970s by a Harvard undergraduate named
W.H. Gates, who later discovered an interest in operating systems [15]. Circular
and non-circular versions of the distance problem can be defined, corresponding to
circular and non-circular genomes (it turns out to be easy to transform one version
into the other). Modeling rearrangement distance with reversal distance is supported
by reports in the biological literature that inversions are the primary mechanism
of genome rearrangement for many genomes [21, 5]. The idea of using a distance
measure based only on inversions was foreshadowed by observations as early as the
1930s that differences in gene order could be explained by sequences of chromosomal
inversions [11]. In the 1980s, Watterson, et al. explicitly proposed using inversion
distance as a measure of evolutionary distance that could be useful for phylogeny
2Note that some have considered this property to be an advantage, because little is known about the relative likelihoods of alternative mechanisms of rearrangement.

3Note that the problems of reversal distance and of sorting by reversals are subtly different; it turns out one can compute reversal distance without actually finding a sequence of sorting reversals.
reconstruction, noting that inversion distance was a true metric [37].

[Figure 1.1 here: two circular genomes with genes A–E, represented as π1 = (+1,+2,+3,+4,+5) and π2 = (−2,−1,−5,−4,−3).]

Figure 1.1: If the orientation of each gene in each genome is known, then the problem of finding the inversion distance between genomes is equivalent to the problem of finding the reversal distance between signed permutations.
If only the orders of genes are known, then the problem of finding inversion distance is equivalent to finding the reversal distance of unsigned permutations. In this case, for example, the distance between the permutations (1, 3, 2) and (1, 2, 3) is a single reversal. If the orientations of genes are also known, however, then the problem can be modeled with signed permutations (orientation is taken to be represented by direction of transcription). The distance between the signed permutations (+1,+3,+2) and (+1,+2,+3) is three [e.g., (+1,+3,+2) → (+1,−2,−3) → (+1,+2,−3) → (+1,+2,+3)]. With unsigned permutations, both reversal distance and sorting by reversals are NP-Hard [6], but Hannenhalli and Pevzner showed that the signed versions of the problems, surprisingly, can be solved in polynomial time [18]. Their solution is the capstone of a baroque edifice of combinatorial theory that has become known as the Hannenhalli-Pevzner cycle-decomposition theory. The Hannenhalli-Pevzner theory describes the relationship between two signed permutations with a particular kind of diagram (a “breakpoint graph”), captures relationships between “cycles” in that diagram in an interleaving graph, classifies certain connected components in the interleaving graph (“oriented” and “unoriented” components) and relationships among connected components (“hurdles”, “superhurdles”, and “fortresses”), and establishes numerous useful properties about these diagrams and graphs.
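The worked example above can be checked by brute force. The sketch below is our own illustration, not an algorithm from this thesis: it enumerates all n(n + 1)/2 reversals of a signed permutation and finds the distance by breadth-first search. This is feasible only for tiny n; the Hannenhalli-Pevzner result cited above gives a polynomial-time algorithm for the general case.

```python
from collections import deque

def all_reversals(p):
    """Yield every permutation obtained from p by one signed reversal."""
    n = len(p)
    for i in range(n):
        for j in range(i, n):
            q = list(p)
            q[i:j + 1] = [-x for x in reversed(q[i:j + 1])]
            yield tuple(q)

def reversal_distance_bfs(p, target):
    """Exact reversal distance by breadth-first search (tiny n only:
    the state space contains n! * 2**n signed permutations)."""
    p, target = tuple(p), tuple(target)
    seen = {p}
    queue = deque([(p, 0)])
    while queue:
        cur, d = queue.popleft()
        if cur == target:
            return d
        for nxt in all_reversals(cur):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
```

Here `reversal_distance_bfs([1, 3, 2], [1, 2, 3])` returns 3, matching the example, while the unsigned analog of the same pair is a single reversal away.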
It is possible that no class of problems in computational biology has exerted a stronger pull on theoretically-inclined computer scientists than those involving sorting by reversals of signed permutations. During the past six years, several improved algorithms have been produced for both the sorting problem and the distance problem [4, 20, 1, 2]. The fastest standing algorithms are those of Kaplan, et al. [20] for the sorting problem (O(n²) time, where n is the permutation size) and of Bader, et al. [1] for the distance problem (O(n) time). Recently, Bergeron has shown an alternative method for sorting by reversals [2], also in O(n²) time, which is in many ways simpler than that of Kaplan, et al.
In addition, some progress has been made on computing distances that take into account other mechanisms of rearrangement. Hannenhalli and Pevzner published an algorithm that finds the distance between multiple-chromosome genomes in terms of equally-weighted translocations, fissions, fusions, and inversions, essentially by reducing this problem to that of finding the inversion distance between two single-chromosome genomes [19]. In addition, Sankoff has developed a method that accommodates multiple members of gene families [26], and thus allows for duplication. No fast algorithm, however, has yet been published to compute exact distance in terms of transpositions.
1.2 Phylogeny Reconstruction
Most methods for phylogeny reconstruction were originally developed for use with sequence data (including sequences of morphological “characters”, as well as DNA and amino acid sequences). These methods generally begin with a multiple alignment of N sequences (representing N taxa) and produce a binary tree describing the evolutionary relationships among the sequences—that is, a branching pattern through evolution that is likely to have allowed an ancestral sequence to give rise to all of the observed, contemporary sequences (the observed sequences appear as leaves of the tree). Note that, strictly speaking, the directed graph of evolutionary history is not a tree; to the contrary, we know of many processes, such as horizontal transfer and hybridization, that can allow a given leaf (an observed species) to have more than one path to the root (the hypothetical common ancestor). Nevertheless, a tree is believed to provide a reasonable approximation of the evolutionary history of most sets of species. A reconstructed tree is important both in its topology and in its branch lengths.
In general, algorithms for phylogeny reconstruction return one or more trees
that are optimal according to some appropriate cost function. The algorithms differ
primarily in the nature of their cost functions. The most widely-used methods can
be classified into three categories:
• Distance-based methods build trees that best fit the pairwise distances of all sequences (usually in the sense of minimizing the sum of the lengths of all tree edges). Distance between sequences is generally computed as a kind of “edit distance” (e.g., a minimum cost of insertions/deletions and substitutions required to transform one sequence into the other). The internal nodes of distance-based trees have no biological meaning associated with them; they simply represent abstract points in a high-dimensional space. Trees can be computed much faster than with other methods (generally in polynomial time). Distance-based methods are currently dominated by the “neighbor-joining” algorithm of Saitou and Nei [25].
• Maximum parsimony methods build trees by attempting to find the least
costly pathways connecting the “character states” represented at nodes in the
tree (internal and external). These methods are preferable to distance-based
methods for some applications because they associate actual sequences with
internal tree nodes (so-called “hypothetical ancestors”). Maximum parsimony
was pioneered by Eck and Dayhoff [12] and adapted for DNA sequences by
Fitch [14].
• Maximum likelihood methods strive to find the most likely of all possible
trees according to a well-defined probabilistic model. Like maximum parsimony
methods, these methods label (or can be made to label) internal nodes with
ancestral sequences. They tend to be highly computational, however, and as
yet are only feasible for relatively small sets of sequences. Maximum likelihood
methods were first applied to phylogeny reconstruction by Cavalli-Sforza and
Edwards [9], and were first adapted for use with DNA and amino acid sequences
by Felsenstein [13].
Analogs based on gene order4, rather than sequences, have been developed for all
three of these classes. Distance-based methods can be used without alteration for
gene-order data, since they separate the computation of a distance matrix from the
construction of a tree; one only needs to compute pairwise distances using a measure
based on gene order. Maximum parsimony methods do not exist by the same name,
4Most phylogeny reconstruction at the genome, rather than the sequence, level has focused on comparison of gene order, rather than of gene content. The study of Sankoff and El-Mabrouk [29] is an exception, and in its synthesis of methods for phylogeny reconstruction with Sankoff’s approach to handling multiple gene copies [26], suggests a promising avenue for further exploration.
but we will discuss an approach below—called “median-based reconstruction”—that
is effectively analogous to them. At least one maximum likelihood method has been
proposed [10], but it requires enormous computation time, and appears to have to
resort to fairly drastic pruning strategies to solve problems of reasonable size. As
a result, tree-building methods for gene-order data are effectively limited, for the
present, to distance- and median-based reconstruction.
1.3 Reconstruction Using Medians of Three
The median-based method for phylogeny reconstruction was first proposed by Sankoff and Blanchette [27]. Given a set of N signed permutations, each of size n, they sought to construct an optimal tree such that each node was labeled with a signed permutation, and the input permutations appeared as leaves of the tree. In this way, as with maximum parsimony, they would ensure that internal nodes retained biological meaning, and that edges between nodes represented transitions between actually achievable states of the genome (note, in contrast, that with a distance-based method, there is no guarantee of the existence of an internal node having distances to its neighbors as hypothesized). The problem, then, was to find Steiner points in the space of genome rearrangements (the algorithm has been described as the “Steinerization algorithm” [30]). Their idea was to build a global solution by aggregating local solutions for the simplest possible version of the problem: to find a Steiner point of three genomes—that is, a permutation π such that the sum of the distances between π and each of the starting genomes is minimized. They called such a point a “median of three”, or simply a “median”. After an initialization step (which can be executed in various ways), their algorithm iterates over a tree, repeatedly resetting the permutations of internal nodes to medians of their three neighbors (the tree is always binary). It continues until convergence occurs. The algorithm guarantees only a locally optimal solution, but with multiple executions and with various initialization configurations, appears effectively to approximate a global optimum.

Figure 1.2: The median-based reconstruction algorithm of Sankoff and Blanchette iterates over a tree, resetting the permutations at internal nodes to the medians of their three neighbors, until convergence occurs. [The figure shows a binary tree with leaves π1, π2, . . . , πN, and an internal node being set to the median of its three neighbors.]
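The iteration itself is independent of the particular median routine. The sketch below is our own deliberately abstract rendering of the relabeling loop, not Sankoff and Blanchette's implementation: the tree representation is an assumption, and `median_fn` is a placeholder for any median-of-three solver (breakpoint, reversal, or otherwise).

```python
def steinerize(labels, neighbors, internal, median_fn, max_iters=100):
    """Median-based (Steinerization) refinement of internal node labels.

    labels    : dict mapping node -> current label (e.g., a permutation)
    neighbors : dict mapping each internal node -> its three neighbors
    internal  : iterable of internal nodes (tree is binary; leaf labels
                are the input genomes and are never changed)
    median_fn : placeholder for any median-of-three solver
    Relabels internal nodes until no label changes, so it guarantees
    only a locally optimal labeling.
    """
    for _ in range(max_iters):
        changed = False
        for v in internal:
            a, b, c = (labels[u] for u in neighbors[v])
            m = median_fn(a, b, c)
            if m != labels[v]:
                labels[v] = m
                changed = True
        if not changed:
            break
    return labels
```

As a toy check of the control flow, running it with numbers as labels and the statistical median of three as `median_fn` converges in one pass.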
Sankoff and Blanchette computed medians using breakpoint distance rather than inversion distance. They discovered a straightforward reduction of the breakpoint median problem to a special case of the Traveling Salesman Problem, and were able to compute medians relatively efficiently. Finding a median based on inversion distance, in contrast, was believed to be prohibitively expensive to perform as frequently as required by the median-based reconstruction algorithm. (At the time, no algorithm had been reported to find inversion medians, but in at least one study, they had been obtained for a particular data set using a bounded exhaustive search [17].)
Breakpoint medians have drawbacks, however. While breakpoint distance is useful as a heuristic measure, because breakpoints do not correlate directly to any actual mechanism of rearrangement, a breakpoint median has no straightforward biological interpretation (an inversion median, on the other hand, represents precisely a most-parsimonious ancestor of the genomes in question under an inversions-only model of evolution). In addition, breakpoint medians tend to be far from unique—that is, a large number of permutations often score equally well as medians—so that the median-based reconstruction algorithm must choose arbitrarily among many candidates (some of which are likely better than others at advancing the search toward a global minimum). We will also show in this thesis that breakpoint medians score poorly compared to inversion medians when evaluated in terms of inversion distance.
1.4 The Reversal Median Problem
For these reasons, we seek a solution to the median problem in terms of inversion distance. This problem is known alternatively as “multiple sorting by reversals”, the “inversion median problem”, and the “reversal median problem”. We will refer to it using the last of these names, because our goal is to enable the median-based algorithm for phylogeny reconstruction, but our solution can be developed in general mathematical terms. The reversal median problem has been shown to be NP-Hard [7], and until the present study began, no efficient algorithm for it had been reported (during the course of this study, two algorithms were reported: one by Caprara [8], and another based on preliminary work of ours [35]). Note that, while a solution to the reversal median problem directly addresses only the rearrangement mechanism of inversion, it opens the door to median-based phylogeny reconstruction for equally-weighted inversions, translocations, fissions, and fusions, through the methods of [19].
In this thesis, we develop an algorithm for the reversal median problem in three stages. First (Chapter 2), we develop a simple branch-and-bound algorithm based on distance computations from intermediate permutations; this algorithm does not use the Hannenhalli-Pevzner cycle-decomposition theory, and depends only on the availability of a rapidly computable distance metric (thus, it could be used for other measures of distance, if fast algorithms were available to compute them). Next (Chapter 3), we develop a solution, using Hannenhalli-Pevzner theory, to the previously unsolved problem of finding all sorting reversals of one permutation with respect to another, with the goal of navigating more efficiently the space that the algorithm of Chapter 2 must explore. Finally (Chapter 4), we synthesize the work of the first two chapters, and develop a dramatically more efficient solution to the median problem.
Chapter 2
A Simple Algorithm for Finding
an Exact Median
You will be safest in the middle.
–Ovid
In this chapter, we derive a simple branch-and-bound algorithm to find an exact
reversal median of three signed permutations. This algorithm does not depend on
properties specific to reversals, but can be used with any rapidly computable metric
(in applying it to the case of reversals, therefore, we depend heavily on an efficient
routine to compute reversal distance [1]). We also provide results from an exper-
imental study showing (1) that our algorithm performs surprisingly efficiently for
a range of parameters, but that it has a greater tendency to “bog down” as the distances between input permutations become large; (2) that reversal medians score significantly better than breakpoint medians and are far more often unique; and
(3) that an unexpectedly large number of reversal medians are “perfect medians”
(having a score equal to the global lower bound). The simple algorithm presented
here is the basis of a more complicated algorithm developed in subsequent chapters.
2.1 Notation and Definitions
We consider the case where all genomes have identical sets of n genes and inver-
sion is the single mechanism of rearrangement. We represent each genome Gi as
a permutation πi of size n, and we let all pairs of genomes Gi = (gi,1 . . . gi,n) and
Gj = (gj,1 . . . gj,n), in a set of genomes G, be represented by πi = (πi,1 . . . πi,n) and
πj = (πj,1 . . . πj,n) such that πi,k = πj,l iff gi,k = gj,l, and πi,k = −1 · πj,l iff gi,k is the
reverse complement of gj,l.
We will model inversions of genomes as reversals of permutations. A reversal
acting on permutation π from i to j, for i ≤ j, is that operation which transforms π
into φ = (π1, π2, . . . , πi−1,−πj,−πj−1, . . . ,−πi, πj+1, . . . , πn). The minimal number
of reversals required to change one permutation πi into another permutation πj is
the reversal distance, which we denote by d(πi, πj) (sometimes abbreviated as di,j).
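As an illustration, the reversal operation defined above can be written directly (a Python sketch for brevity, though the thesis's implementation is in C; the helper name is ours, with positions 1-indexed as in the definition):

```python
def apply_reversal(pi, i, j):
    """Apply the reversal from position i to j (1-indexed, inclusive) to a
    signed permutation: the segment is reversed and every sign is flipped."""
    p = list(pi)
    p[i - 1:j] = [-g for g in reversed(p[i - 1:j])]
    return tuple(p)

# For example, reversing positions 2..4 of (+1,+2,+3,+4,+5) turns the
# segment (+2,+3,+4) into (-4,-3,-2).
```

Note that applying the same reversal twice restores the original permutation, which is why the reversal graph defined below can be taken to be undirected.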
Let the reversal median M of a set of N permutations Π = {π1, π2, . . . , πN} be the signed permutation that minimizes the sum S(M,Π) = ∑_{i=1}^{N} d(M,πi). Let this sum S(M,Π) = S(Π) be called the median score of M with respect to Π.
For a given number of genes n, we can construct an undirected graph Gn = (V,E)
such that each vertex in V corresponds to a signed permutation of size n and two
vertices are connected by an edge if and only if one of the corresponding permutations
can be obtained from the other through a single reversal; formally, E = {{vi, vj} | vi, vj ∈ V and d(πi, πj) = 1}. We will call Gn the reversal graph of size n. In this
graph, the length of the shortest path between any two vertices, vi and vj, is the
same as the reversal distance between the corresponding permutations, πi and πj.
Furthermore, finding the median of a set of permutations Π is equivalent to finding
the minimum unweighted Steiner tree of the corresponding vertices in Gn. Note that
Gn is extremely large (|V| = n! · 2^n), so this representation does not immediately suggest a feasible graph-search algorithm, even for small n.
Figure 2.1: The reversal graph for n = 3. For clarity of presentation, edges have been drawn only for the first column of vertices.
Definition 2.1 A shortest path between two permutations of size n, π1 and π2,
is a connected subgraph of the reversal graph Gn containing only the vertices v1 and
v2 corresponding to π1 and π2, and the vertices and edges on a single shortest path
between v1 and v2.
Definition 2.2 A median path of a set of permutations Π each of size n is a connected subgraph of the reversal graph Gn containing only the vertices corresponding
to permutations in Π, the vertex corresponding to a median M of Π, and a shortest
path between M and each π ∈ Π.
Definition 2.3 A trivial median of a set of permutations Π is a median M such
that M ∈ Π.
Definition 2.4 A trivial median path of a set of permutations Π is a median
path that includes only the elements of Π and shortest paths between elements of Π.
Figure 2.2: Let vertices v1, v2, and v3 correspond to permutations π1, π2, and π3, and let vertex vM correspond to a median M. The lowest possible median score occurs when d1,2 = d1,M + d2,M; d1,3 = d1,M + d3,M; and d2,3 = d2,M + d3,M.
2.2 Bounds
Lemma 2.1 The median score S(Π) of a set of equally sized permutations Π = {π1, π2, π3}, separated by pairwise distances d1,2, d1,3, and d2,3, obeys these bounds:

⌈(d1,2 + d1,3 + d2,3)/2⌉ ≤ S(Π) ≤ min{(d1,2 + d2,3), (d1,2 + d1,3), (d2,3 + d1,3)}
Proof: The upper bound follows directly from the possibility of a trivial median, and
the lower bound from properties of metric spaces (a median of lower score would
necessarily violate the triangle inequality with respect to two of π1, π2, and π3; see
Figure 2.2).
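These bounds use only the three pairwise distances, so once the distances are known they can be computed in constant time; a sketch with a hypothetical helper name:

```python
import math

def median_score_bounds(d12, d13, d23):
    """Lemma 2.1: the median score of three permutations with pairwise
    distances d12, d13, d23 lies between the half-perimeter lower bound
    and the score of the best trivial median (the upper bound)."""
    lower = math.ceil((d12 + d13 + d23) / 2)
    upper = min(d12 + d23, d12 + d13, d23 + d13)
    return lower, upper
```

A median whose score equals the lower bound is exactly a perfect median in the sense of Definition 2.5 below.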
Definition 2.5 A perfect median of a set of equally sized permutations Π = {π1, π2, π3}, separated by pairwise distances d1,2, d1,3, and d2,3, is a median having median score S(Π) = ⌈(d1,2 + d1,3 + d2,3)/2⌉.
The following Lemma will be useful in proving Theorem 2.1.
Lemma 2.2 If three permutations π1, π2, and π3 have a median M that is part of a
trivial median path, then M must be a trivial median.
Figure 2.3: A median path including vφ can be constructed using a shortest path from v1 to vφ and any median path of vφ, v2, and v3.
Proof: Assume to the contrary that π1, π2, and π3 have a trivial median path and
have a median M that is not trivial. By Definition 2.4, M must be on a shortest
path between two of π1, π2, and π3. If M is not trivial, it must be distance d > 0
from the closest of π1, π2, and π3. But then its median score must be greater by d
than the score of a trivial median, so M cannot be a median.
Theorem 2.1 Let π1, π2, and π3 be permutations such that π2 and π3 are separated
by distance d2,3, and let φ be another permutation separated from π1, π2, and π3 by
distances d1,φ, d2,φ, and d3,φ, respectively (see Figure 2.3). Suppose that φ is on a
median path PM of π1, π2, and π3 such that φ is on a shortest path between π1 and
a median M . Then the score S(M) of M obeys these bounds:
d1,φ + ⌈(d2,φ + d3,φ + d2,3)/2⌉ ≤ S(M) ≤ d1,φ + min{(d2,φ + d2,3), (d2,φ + d3,φ), (d2,3 + d3,φ)}
Proof: Let v1, v2, and v3 be vertices corresponding to π1, π2, and π3, in the reversal
graph of the appropriate size. In addition, let there be a vertex vφ corresponding to
φ, as illustrated in Figure 2.3. We claim that a median path PM including vφ and M ,
such that vφ is on a shortest path from v1 to M , can be constructed by combining a
shortest path between v1 and vφ and a median path of vφ, v2, and v3. Assume to the
contrary that there exists a shorter median path Pshort, which also includes vφ and
M , but does not include the shortest path between v1 and vφ or does not include
a median path of vφ, v2, and v3. Pshort has to include v1 via a vertex other than vφ
and consequently other than M (because vφ is on a shortest path between v1 and
M). By Definition 2.2, Pshort must consist only of v1, v2, v3,M , and vertices between
them (including vφ), so v1 must be connected to Pshort via v2 or v3. Consequently,
M must be on a shortest path between v2 and v3; otherwise including M in Pshort
would result in a score greater than that of a trivial median. Therefore, M is part
of a trivial median path, which means that by Lemma 2.2, M is a trivial median. In
particular, M must be the vertex vi ∈ {v2, v3} to which v1 is connected. Furthermore,
our assumptions about φ require that vφ be on the shortest path between v1 and vi.
Then Pshort includes both the shortest path between v1 and vφ and the median path
of vφ, v2, and v3, and we obtain the desired contradiction.
Because PM can be constructed by combining a shortest path between v1 and vφ,
and a median path of vφ, v2, and v3, its score is equivalent to the sum of the distance
between v1 and vφ (d1,φ), and the score of the median of vφ, v2, and v3. By applying
Lemma 2.1 to the latter, we obtain the desired bound.
2.3 The Algorithm
Algorithm 2.1 is a branch-and-bound search for an optimal reversal median that
uses Theorem 2.1 to prune regions of the reversal graph from its search tree. The
algorithm also uses Theorem 2.1 to prioritize among search branches.
Prioritization is managed using a priority stack—which always returns an item of
highest priority, but returns items of equal priority in last-in-first-out order. Because
the range of possible priorities is small, we use a fixed array of priority values, each
pointing to a stack, and so can execute push and pop operations in fast constant
time. Using stacks rather than the more conventional queues in this application is not
required for correctness, but, by inducing depth-first searching among alternatives
of equal cost, rapidly produces a good upper bound for the search.
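The priority stack can be sketched as an array of stacks indexed by score (illustrative Python; the thesis's implementation is in C and the class name is ours):

```python
class PriorityStack:
    """Array-of-stacks priority structure over a small fixed range of
    scores: pop returns an item with the best (lowest) score, and items
    of equal score come back in last-in-first-out order."""

    def __init__(self, lo, hi):
        self.lo = lo
        self.stacks = [[] for _ in range(hi - lo + 1)]

    def push(self, item, score):
        self.stacks[score - self.lo].append(item)

    def pop(self):
        for stack in self.stacks:  # the range of scores is small, so this
            if stack:              # scan is effectively constant time
                return stack.pop()
        raise IndexError("pop from empty priority stack")

    def __bool__(self):
        return any(self.stacks)
```

The LIFO behavior among equal scores is what induces the depth-first tendency mentioned above: the search quickly drives deep along one alternative and thereby establishes a good upper bound.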
Input: Three signed permutations of size n: π1, π2, and π3. Assume a function distance(πi, πj) that returns the reversal distance between πi and πj in linear time.
Output: An optimal reversal median M.

begin
    d1,2 ← distance(π1, π2); d1,3 ← distance(π1, π3); d2,3 ← distance(π2, π3);
1   Mmin ← ⌈(d1,2 + d1,3 + d2,3)/2⌉;
2   Mmax ← min{(d1,2 + d2,3), (d1,2 + d1,3), (d2,3 + d1,3)};
    Initialize priority stack s for range Mmin to Mmax;
    (ψorig, ψ1, ψ2) ← (πi, πj, πk) such that {πi, πj, πk} = {π1, π2, π3} and di,j + di,k = Mmax;
3   create vertex v with vlabel = ψorig, vdist = 0, vbest = Mmin, vworst = Mmax;
4   push(s, v);
5   M ← ψorig;
    dsep ← distance(ψ1, ψ2);
    stop ← false;
6   while s is not empty and stop = false do
        pop(s, v);
7       if vbest ≥ Mmax then stop ← true;
        else
8           foreach w | w is an unmarked neighbor of v do
                wdist ← distance(wlabel, ψorig);
9               if wdist ≤ vdist then continue;
                mark w;
                dψ1 ← distance(wlabel, ψ1);
                dψ2 ← distance(wlabel, ψ2);
10              wbest ← wdist + ⌈(dψ1 + dψ2 + dsep)/2⌉;
11              wworst ← wdist + min{(dψ1 + dsep), (dψ1 + dψ2), (dsep + dψ2)};
12              if wworst = Mmin then M ← wlabel; stop ← true;
                else if wbest < Mmax then
                    push(s, w, wbest);
13                  if wworst < Mmax then M ← wlabel; Mmax ← wworst;
end

Algorithm 2.1: find reversal median
The algorithm begins by establishing upper and lower bounds for the solution
using Lemma 2.1 (steps 1 and 2) and priming the priority stack with a best-scoring
vertex (steps 3 and 4). Then it enters a main loop (step 6) in which it repeatedly pops
the “most promising” vertex from the priority stack, finds all of its as-yet-unvisited
neighbors (step 8), and evaluates each one for feasibility. Neighbors are obtained by
generating all (n+1 choose 2) possible permutations that can be produced from a vertex by a
single reversal. Neighbors of a vertex v can be ignored if they are not farther from
the origin than is v (step 9); such vertices will be examined as neighbors of another
vertex if they can feasibly belong to a median path. The best possible score (i.e.,
lower bound) for a vertex w is used as the basis for prioritization. Best and worst
possible scores are calculated using the bounds of Theorem 2.1 (steps 10 and 11)
and maintained for all vertices present in the priority stack. Vertices can be pruned
when their best possible scores exceed the current global upper bound. The global
upper bound can be lowered when a vertex is found that has a lesser upper bound
(step 13). The search ends when no vertex in the queue has a best-possible score
lower than the upper bound (step 7) or a score equal to the global lower bound is
found (step 12).
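The shape of the search can be illustrated with a small self-contained sketch in the spirit of Algorithm 2.1 (Python for brevity; the thesis's implementation is in C). The sketch deviates from the pseudocode in a few labeled ways: a binary heap replaces the priority stack, π1 is always taken as the origin (the choice of origin affects speed, not correctness), the incumbent is tracked by its actual score, and reversal distance is computed by brute-force breadth-first search, which is workable only for very small n:

```python
import heapq
import math
from collections import deque
from functools import lru_cache

def neighbors(p):
    """All signed permutations one reversal away from p."""
    n = len(p)
    for i in range(n):
        for j in range(i, n):
            q = list(p)
            q[i:j + 1] = [-g for g in reversed(q[i:j + 1])]
            yield tuple(q)

@lru_cache(maxsize=None)
def distance(a, b):
    """Reversal distance by breadth-first search: a stand-in, workable only
    for tiny n, for the linear-time routine the thesis relies on."""
    if a == b:
        return 0
    seen, frontier = {a}, deque([(a, 0)])
    while frontier:
        p, d = frontier.popleft()
        for q in neighbors(p):
            if q == b:
                return d + 1
            if q not in seen:
                seen.add(q)
                frontier.append((q, d + 1))

def find_reversal_median(p1, p2, p3):
    """Best-first branch and bound using the bounds of Theorem 2.1."""
    orig, a, b = p1, p2, p3
    dsep = distance(a, b)
    Mmin = math.ceil((distance(orig, a) + distance(orig, b) + dsep) / 2)
    M = orig
    Mmax = distance(orig, a) + distance(orig, b)  # actual score of the origin
    marked = {orig}
    heap = [(Mmin, orig, 0)]  # (lower bound, vertex, distance from origin)
    while heap:
        best, v, vdist = heapq.heappop(heap)
        if best >= Mmax:
            break  # no unexplored vertex can lead to a better median
        for w in neighbors(v):
            if w in marked:
                continue
            wdist = distance(w, orig)
            if wdist <= vdist:
                continue  # only walk away from the origin; leave unmarked
            marked.add(w)
            da, db = distance(w, a), distance(w, b)
            score = wdist + da + db
            if score < Mmax:
                M, Mmax = w, score
                if Mmax == Mmin:
                    return M  # perfect median; no better score exists
            wbest = wdist + math.ceil((da + db + dsep) / 2)  # Theorem 2.1
            if wbest < Mmax:
                heapq.heappush(heap, (wbest, w, wdist))
    return M
```

On permutations of three or four elements this is fast; anything larger requires the linear-time distance routine that the real implementation reuses from GRAPPA.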
Theorem 2.2 Algorithm 2.1 will return a permutation M only if M is a true median
of the inputs π1, π2, and π3.
Proof: Assume to the contrary that a permutation M ′ returned by the algorithm is
not a true median. Because the algorithm returns the permutation having the lowest
median score of all of the permutations (vertices) it visits (steps 5 and 13), it must not
have visited some median. If the algorithm did not visit some median, then either
it pruned all paths to medians or it exited before reaching any median. Suppose
the algorithm pruned all paths to medians. It only prunes vertices when their best possible scores are at least as large as the current global upper bound, Mmax. Note that the
global upper bound always corresponds to the actual median score of a vertex that
has been visited (steps 2 and 13), so it cannot be wrong. Consider a median M
with at least one median path PM . By Definition 2.2, PM must include at least one
path between M and each of the vertices v1, v2, and v3 corresponding to π1, π2, and
π3. The algorithm proceeds by examining neighbors of an origin ψorig ∈ π1, π2, π3.
Therefore, if the algorithm pruned all paths to M , then it must have pruned a vertex
on the path between ψorig and M . But the best scores of such vertices are calculated
using the lower bound of Theorem 2.1 (step 10), which we have shown to be correct.
Therefore, the algorithm cannot have pruned the shortest paths to medians.
Suppose instead that the algorithm exited before reaching a median. The algorithm
can exit for one of three reasons:
1. The priority stack s becomes empty (step 6);
2. The next item returned from s has a best possible score greater than or equal
to the current global upper bound (step 7);
3. A vertex w is found with a worst possible score equal to the global lower bound (step 12).
Case 1 can occur only if all vertices have been visited, or if all remaining neighbors
have been pruned (because, except when the algorithm stops for another reason, each
new neighbor is either pruned or pushed onto s). If all vertices have been visited,
then a median must have been visited. We have shown above that all neighbors on
paths to a median cannot have been pruned. Because s always returns a vertex v
such that no other vertex in s has a lower best-possible score than v, and because
all neighbors that are not pruned are added to s, case 2 can only occur if a median
has been visited or if all paths to medians have been pruned. We have shown that
all paths to medians cannot have been pruned. Therefore, if case 2 occurs, a median
must have been visited. In case 3, w must be a median, since the global lower bound
is set directly according to Lemma 2.1 (step 1), which we have shown to be correct.
Thus, none of these three cases can arise before a median has been found, and the
algorithm must return a median.
The worst-case running time of Algorithm 2.1 is O(n^(3d)), with d = min{d1,2, d2,3, d1,3}, but as would be expected with a branch-and-bound algorithm, the average running time appears to be much better.
2.4 Experimental Method
We implemented find reversal median in C, reusing the linear-time distance rou-
tine (as well as some auxiliary code) from GRAPPA [1], and we evaluated its perfor-
mance on simulated data. All test data was generated by a simple program that
creates multiple sets of three permutations by applying random reversals to the
identity permutation, such that each set of three permutations represents three taxa
derived from a common ancestor under an inversions-only model of evolution. In
addition to the number of genes n to model and the number of sets s to create,
this program accepts a parameter r that determines how many random reversals to
apply in obtaining the permutation for each taxon. Thus, if n = 100, r = 10, and
s = 10, the program generates ten sets of three signed permutations, each of size 100,
and obtains each permutation by applying ten random reversals to the permutation
(+1,+2, . . . ,+100). A random reversal is defined as a reversal between two random
positions i and j such that 1 ≤ i, j ≤ n (if i = j, a single gene simply changes
its sign). When r is small compared to n, the permutations in a set tend to be separated from one another by a distance of 2r.
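The generator can be sketched as follows (an illustrative Python version of the procedure just described; the thesis's tool is a separate C program, and the function name and seed parameter are ours):

```python
import random

def random_reversal_taxa(n, r, num_taxa=3, seed=None):
    """Create num_taxa permutations, each obtained by applying r random
    reversals to the identity (+1, ..., +n); when the two chosen positions
    coincide, the 'reversal' simply flips the sign of one gene."""
    rng = random.Random(seed)
    taxa = []
    for _ in range(num_taxa):
        p = list(range(1, n + 1))
        for _ in range(r):
            i, j = sorted((rng.randrange(n), rng.randrange(n)))
            p[i:j + 1] = [-g for g in reversed(p[i:j + 1])]
        taxa.append(tuple(p))
    return taxa
```

Each call simulates one set of three taxa derived from a common ancestor under an inversions-only model; repeating it s times reproduces a full test suite.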
We used several algorithmic engineering techniques to improve the efficiency of
find reversal median. For example, we avoided dynamic memory allocation and
reused records representing graph vertices. We were able to gain a significant speedup
by optimizing the hash table used for marking vertices: a custom hash table offered
a fourfold increase in the overall speed of the program, as compared with UNIX’s
db implementation. With circular genomes, we achieved a further improvement in
performance by hashing on the circular identity of each permutation rather than
on the permutation itself. We define the circular identity of a permutation as that
equivalent permutation that begins with the gene labeled +1. By hashing on circular
identities, we reduced the number of vertices to visit and the number of permutations
to mark by approximately a factor of 2n.
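Hashing on circular identities can be sketched as follows (illustrative Python; the function name is ours). Of the 2n readings of a circular signed permutation (n rotations in each of the two orientations), exactly one begins with +1, and that reading serves as the canonical representative:

```python
def circular_identity(p):
    """Canonical form of a circular signed permutation: among its n
    rotations and the n rotations of its reflection (read backward with
    signs flipped), return the unique reading that begins with +1."""
    n = len(p)
    reflection = tuple(-g for g in reversed(p))
    for q in (p, reflection):
        for s in range(n):
            rotation = q[s:] + q[:s]
            if rotation[0] == 1:
                return rotation
    raise ValueError("permutation does not contain gene 1")
```

All 2n equivalent readings hash to the same key, which is what cuts the number of distinct vertices to track by roughly a factor of 2n.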
To improve performance further, we adapted our sequential implementation to
run in parallel on shared memory architectures. Two steps in the algorithm are read-
ily parallelizable: the major loop (step 6), during each iteration of which a new vertex
is popped from the priority stack, and the minor loop (step 8), in which the neighbors
of a vertex v are generated, examined for marks, and evaluated for feasibility as me-
dians. We enabled parallel processing at both levels, using pthreads for maximum
portability across shared-memory architectures. With careful use of semaphores and
pthreads mutex functions, we were able to reduce the cost of synchronization among
threads to an acceptable level.
2.5 Experimental Results
2.5.1 Performance of Bounds
Being especially concerned with the effectiveness of the pruning strategy, we have
chosen as a measure of performance the number of vertices V of the reversal graph
that the algorithm visited. In particular, we have taken V to be the number of times
the program executed the loop at step 8 of the algorithm. Note that the number of
calls to distance is approximately 3V .
Figure 2.4: Number of vertices visited while finding a median, in the course of 500 experiments with n = 50 and r = 7.

After observing that the program occasionally took much longer to find a median than it did on average, we recorded the distribution of V over many experiments.
We used various values of the number of genes n and the number of reversals per
tree edge r. Figure 2.4 is typical of our results. It summarizes 500 experiments with
n = 50 and r = 7 and shows a roughly exponential distribution, with high relative
frequencies in a few intervals having small V : in 87% of the experiments, fewer than
10,000 vertices were visited, and in 95%, fewer than 20,000 were visited. This figure
demonstrates that the algorithm generally finds a median rapidly, but occasionally
becomes mired in an unprofitable region of the search space. We have observed that
the tail of the exponential distribution becomes more substantial as r grows larger
with respect to n.
In order to characterize typical performance, we recorded the statistical medians
of V as n and r varied independently. The results are shown in Figures 2.5 and
2.6. For comparison, we have also plotted the mean values of V and, in Figure 2.5,
a theoretical quadratic curve.

Figure 2.5: Statistical median of V for r = 5 and 10 ≤ n ≤ 100, plotted with mean of V and the curve f(n) = 2.1 · n^2, over 50 experiments for each value of n. For this value of r, growth of the median and the mean of V appears to be quadratic in n over a large range of genome sizes.

Note that, at least for r = 5, median and mean V
appear to grow quadratically over a considerable range of values for n, and
that, for n = 50, median V grows approximately linearly with r, at least as long
as r remains small (mean V grows somewhat faster than median V ). To put the
observed rate of growth into perspective, note that in the theoretical worst case of O(n^(3d)), because d ≈ 2r and V = O(n^(3d)/n) = O(n^(6r−1)), one would see (given r = 5 and n = 50) growth of V with n as n^29 and with r as 50^(6r−1).
2.5.2 Running Time and Parallel Speedup
We have tested program find reversal median sequentially on a 700 MHz Intel
Pentium III with 128MB of memory, and using various levels of parallelism on a Sun
E10000 with 64 333 MHz UltraSPARC II processors and 64GB of memory. Figure
2.7 shows average running times for r = 5 and n between 50 and 125.
Figure 2.6: Statistical median of V for n = 50 and 1 ≤ r ≤ 8, plotted with mean of V. The number of experiments for each value of r is 50.
Sequential running times are shown for the Sun and Intel processors and parallel
running times for the Sun with the number of processors p ∈ {1, 2, 4, 6}. In all
cases, the average time to find a median is about 12 seconds or less. Observe that
for n = 100 (a realistic size for chloroplast or mitochondrial genomes) medians can
generally be found in an average of about 2 seconds using a reasonably fast computer.
We should note that the memory requirements for the program are considerable, and
that the level of performance shown here is partly a consequence of the large amount
of RAM available on the Sun.
It is evident from Figure 2.7 that we achieve a good parallel speedup for small p,
but that the benefits of parallelization begin to erode between p = 4 and p = 6 (this
tendency becomes more pronounced at p = 8, which we have not plotted here for
clarity of presentation). Anecdotal evidence suggests that the cause of this trend is
a combination of the overhead of synchronization and uneven load balancing among
the computing threads.

Figure 2.7: Sequential and parallel running times for r = 5 and n ∈ {50, 75, 100, 125}. Each data point represents an average taken over 10 experiments. Parallel configurations used parallelism only in the minor loop of the algorithm.

We also observed that parallelism in the minor loop of the
algorithm was far more effective than parallelism in the major loop, presumably
because the heuristic for prioritization is sufficiently effective that the latter strategy
results in a large amount of unnecessary work.
2.5.3 Reversal Medians vs. Breakpoint Medians
Using program find reversal median, we evaluated the significance of reversal me-
dians, by comparing them with breakpoint medians, trivial medians, and “actual”
medians (i.e., the ancestral permutations from which observed taxa actually arose—
in this case, always equal to the identity permutation). Figure 2.8, which shows results over 1 ≤ r ≤ 5 for n = 25, is typical of what we observed. It illustrates that true reversal medians achieve comparable scores to actual medians¹ and that breakpoint medians, when scored in terms of reversal distance, perform significantly more poorly. A comparison in terms of reversal median scores is clearly biased in favor of reversal medians; however, if it is true that reversal distances are (in at least some cases) more meaningful than breakpoint distances, then these results suggest that reversal medians are worth obtaining.

¹Reversal medians are slightly better than actual medians when r becomes large with respect to n, because saturation begins to cause convergence between taxa.

Figure 2.8: Comparison of reversal medians with breakpoint medians, trivial medians, and actual medians, for n = 25. Averages were taken over 50 experiments.
By adapting it slightly, we were able to use program find reversal median to
find all medians, and thus to characterize the extent to which reversal medians are
unique. An example of our results is shown in Figure 2.9, which describes the number
of reversal medians for n = 15 and 1 ≤ r ≤ 5, over 50 experiments for each value
of r. Observe that when r is small compared to n (roughly r ≤ 0.15n), the reversal
median is virtually always unique; and even when r is moderately large with respect
to n (roughly 0.15n < r ≤ 0.3n)², the reversal median is unique or nearly unique most of the time.

²Recall that the distance between permutations is approximately 2r, and that random permutations tend to be separated by a distance of approximately n. We have observed that the effects of saturation are evident at r = 0.2n and are pronounced by r = 0.3n.

Figure 2.9: Distribution of number of medians in the course of 50 experiments for n = 15 and 1 ≤ r ≤ 5. The histogram for r = 2 is not shown because it is indistinguishable from that for r = 1.

In addition, we observed a strong relationship between unique reversal medians and actual medians. For example, with n = 15 and r = 1, for which all reversal medians were unique, 49 out of 50 reversal medians were identical to actual medians; similarly, for n = 15 and r = 2, 48 out of 50 were identical to actual medians (in both cases the exceptional reversal medians differed from actual medians by a single reversal). As r becomes greater compared to n, this relationship weakens but remains significant. For example, with n = 15 and r = 4, 38 out of 50 reversal medians were unique, and 22 of those 38 were identical to actual medians (an additional 10 non-unique reversal medians equaled actual medians).
Figure 2.10: Percentage of medians that are perfect, for n = 50 and n = 100 over 1 ≤ r ≤ 8. Each data point reflects 100 trials. Experiments did not complete for n = 100, r > 6.
2.5.4 Preponderance of Perfect Medians
We made one additional discovery, in the course of our experiments, that will be par-
ticularly important to the remainder of this thesis: we found that the vast majority
of medians were perfect medians—that is, medians having a score equal to the lower
bound of Lemma 2.1. After noting this surprising phenomenon accidentally, we per-
formed several experiments to quantify it. Figure 2.10 illustrates results for n = 50
and n = 100, over 1 ≤ r ≤ 8. In this figure, each data point indicates the percentage
of times (in 100 trials) that the reversal median of the input permutations was a
perfect median. In all cases, the percentage was 96% or higher, and for r ≤ 3, the
percentage was 100%. In these experiments, every imperfect median had a score of
exactly one greater than that of a perfect median. The incidence of perfect medians
decreases as r grows, but slowly. Note that the rate of decrease is slower at n = 100
than at n = 50, presumably because it is the ratio of r to n that is important.
2.6 Summary
In this chapter, we have derived a branch-and-bound algorithm to find an optimal
reversal median. The algorithm depends on bounds that are computed using only
the metric property of reversal distance, and thus could be used for many other
types of measurements (including ones not related to genome rearrangements). The
algorithm requires many distance computations, however, and will only be practical
for measurements that can be computed very efficiently. When applied to the reversal
median problem, it performs surprisingly well, considering the enormous size of the
search space. It finds a median in time that is distributed roughly exponentially,
with the tail of the distribution becoming more substantial as the pairwise reversal
distances among input permutations grow relative to the sizes of the permutations.
The excellent performance of the algorithm when input permutations are close in
distance is likely related to the high incidence of “perfect medians”—medians having
scores equal to the global lower bound of the search—because when such a median
is found, a search can terminate immediately. When distances are larger, perfect medians become less prevalent, but still appear in the vast majority of cases.
Reversal medians appear to have several useful properties that are likely to make them preferable to breakpoint medians for many applications, even though they are more costly to compute. Reversal medians are very often unique (especially when input permutations are close), often equal “actual medians” (under an inversions-only model of evolution), and score significantly better than breakpoint medians when evaluated in terms of reversal distance.
Chapter 3
Finding All Sorting Reversals
Untwisting all the chains that tie
The hidden soul of harmony.
–John Milton
The preponderance of perfect medians leads to the following idea: Suppose we take
permutation π1 as our origin as we search for a median of π1, π2, and π3. If there exists
a perfect median M , then M is on a shortest path from π1 to π2 and from π1 to π3
(as well as from π2 to π3). We could restrict our search to such paths by considering
at each intermediate permutation φ only those neighbors of φ that are closer than
φ to both π2 and π3 (see Figure 3.1). This simple idea provides the motivation for Chapter 3 and is the basis of the improved median algorithm introduced in Chapter 4.
The branching step of the previous algorithm involves generating all (n+1 choose 2) neighbors of an intermediate permutation φ and testing them against the current bounds
of the search. Furthermore, each test requires two de novo linear-time distance calcu-
lations. Hence, the branching step takes Ω(n3) time, and turns out to be a bottleneck
for the algorithm.
Figure 3.1: Suppose φ is an intermediate permutation encountered during a “walk” from π1 toward a perfect median M of π1, π2, and π3. Suppose further that A represents the set of all sorting reversals of φ with respect to π2, and B represents the set of all sorting reversals with respect to π3. Then we need only consider as viable neighbors of φ those permutations induced by the intersection of these sets, A ∩ B.
We would prefer to generate neighbors in a more efficient way by taking advantage
of the unique structure of the problem of sorting by reversals, as described by the
powerful cycle-decomposition theory of Hannenhalli and Pevzner. Suppose at an
intermediate permutation φ we could directly enumerate the set A of all sorting
reversals with respect to π2 and the set B of all sorting reversals with respect to π3,
using the breakpoint graphs of φ with respect to π2 and φ with respect to π3. Then
A ∩ B would induce exactly those neighbors of φ to consider in pursuit of a perfect
median. Thus, an efficient solution to the problem of finding all sorting reversals
of one permutation with respect to another might enable us to improve the median
algorithm markedly, with minimal increase in its complexity1.
We will refer to this problem as the “all sorting reversals” (ASR) problem.
1That is, if we assume that the complexity of finding all sorting reversals would be encapsulated in another algorithm.
Note that a solution to ASR also immediately induces an algorithm to find all
sequences of reversals that sort one permutation with respect to another—that is,
the problem of finding “all sequences of sorting reversals” (ASSR) reduces easily to
ASR. While several authors have presented fast algorithms that find some sequence
of sorting reversals [18, 4, 20, 2, 3], no algorithm has been published that efficiently
addresses ASSR. As will be seen later in this chapter, however, for most problem
instances there exist many sequences of sorting reversals; therefore, for many appli-
cations, finding only one is of limited usefulness. Some such applications may be
relatively far-removed from the median problem. For example, a biologist studying
the permutations that describe hypothesized ancestral genomes in a phylogenetic
tree may wish to consider the merits of various alternative sequences of reversals
separating those permutations.
In this chapter, we will derive an efficient solution to ASR. The algorithm is
developed as follows: we begin by outlining a straightforward classification of all
possible reversals; then we introduce a simplified version of the problem, which we
call the "Fortress-Free Model" (FFM), and using the FFM, we prove exactly which
classes of reversals can be sorting reversals, and under what conditions they sort;
finally, we adapt our results for the general case by re-introducing fortresses. Using
principles developed in this way, we can easily describe an algorithm that solves
ASR. Our solution to ASR requires an efficient algorithm to solve a critical sub-
problem: detecting whether a candidate reversal introduces into the breakpoint graph
a new unoriented component (and potentially a new hurdle). We also derive a new
algorithm to solve this sub-problem efficiently. After presenting our algorithms,
we report experimental results that demonstrate their efficiency and affirm their
correctness.
3.1 Notation and Definitions
Let π and φ be signed permutations of size n, such that π = (π1, π2, . . . , πn) and
φ = (φ1, φ2, . . . , φn). Let the unsigned permutation π′ = (π′0, π′1, . . . , π′2n+1) be defined
such that π′0 = 0, π′2n+1 = 2n + 1, and for all i, 1 ≤ i ≤ n, π′2i = 2πi, π′2i−1 = 2πi − 1
(if πi > 0) or π′2i = 2|πi| − 1, π′2i−1 = 2|πi| (if πi < 0); let the unsigned permutation
φ′ = (φ′0, φ′1, . . . , φ′2n+1) be defined exactly the same way with respect to φ. We say
two elements πi and πi+1 are adjacent in π, and we say the corresponding elements
π′2i and π′2i+1 are adjacent in π′; similarly for φ and φ′. Let the breakpoint graph B
of π with respect to φ be defined as follows (see Figure 3.2)2. B contains a sequence
of 2n + 2 vertices labeled with the elements of π′. Every two of these vertices that
reflect an adjacency in π′ are connected with a black edge (a “reality” edge), and
every two that reflect an adjacency in φ′ are connected with a gray edge (a “desire”
edge, often depicted as a dashed line). Let the overlap graph O = (V, E) for B be
defined such that there exists a distinct ve ∈ V for every gray edge e in B, and two
vertices ve and ve′ are connected by an edge ({ve, ve′} ∈ E) iff gray edges e and
e′ overlap in B (see Figure 3.2). A cycle in B is a sequence of connected vertices
(v0, v1, . . . , v2i, v2i+1, . . . , v2n, v2n+1, v0) (n ≥ 0), such that, for all i, 0 ≤ i ≤ n, v2i
and v2i+1 are connected with a black edge, and v2i+1 and v2i+2 (or v2i+1 and v0, if
i = n) are connected with a gray edge. A connected component in O has the usual
meaning, and is sometimes called simply a “component”.
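As a concrete illustration of this construction, the following sketch (the function name is ours) maps a signed permutation to its unsigned counterpart π′:

```python
def unsign(pi):
    """Map a signed permutation (pi_1, ..., pi_n) to the unsigned
    permutation (pi'_0, ..., pi'_{2n+1}) defined above: each positive
    element x expands to the pair (2x - 1, 2x), each negative element
    -x to the pair (2x, 2x - 1), framed by sentinels 0 and 2n + 1."""
    n = len(pi)
    out = [0]
    for x in pi:
        out += [2 * x - 1, 2 * x] if x > 0 else [2 * -x, 2 * -x - 1]
    out.append(2 * n + 1)
    return out
```

For example, π = (+2, −1) yields π′ = (0, 3, 4, 2, 1, 5).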
Every gray edge is said to be oriented if it spans an odd number of vertices in B,
and unoriented otherwise. A cycle in B and a connected component in O are each
said to be oriented if they contain at least one oriented gray edge. We call cycles
and components unoriented if they are not oriented, except when they are trivial. A
trivial cycle consists of a single gray edge and a single black edge, and corresponds
2The breakpoint graph in [34] is given the more euphonious but unwieldy name, "the Diagram of Reality and Desire".
Figure 3.2: Breakpoint graph B and overlap graph O for the permutation
π = −5, −3, −4, −2, +1, +6, +8, −9, +10, +13, +12, +11, +14, +15, +16, +7 with
respect to the identity permutation of size n = 16. Connected components t and w are
oriented, y and z are trivial, and u, v, and x are unoriented. Unoriented components
u and x are hurdles, but unoriented component v is a protected nonhurdle because it
separates u and x. Oriented edges in O are represented by solid circles.
to an adjacency shared in permutations π and φ. A trivial cycle will always create
a trivial connected component—that is, a component consisting of a single, isolated
vertex in O—and a trivial component can only arise from a trivial cycle. Note that
the gray edges of a cycle always belong to the same connected component, so we can
say that the cycle belongs to that component. For convenience, we will refer to a
component that is either oriented or trivial as a benign component3.
3We differ from the literature also in the way we have distinguished and named trivial cycles and components.
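For the common case in which φ is the identity (so each gray edge joins the vertices labeled 2i and 2i+1), the orientation test for gray edges can be sketched as follows; the function names are ours, and the position-based odd-span test is the one stated above:

```python
def gray_edges(pi_prime):
    """Endpoint positions of the gray edges of the breakpoint graph of
    pi_prime with respect to the identity: one edge per desired
    adjacency (2i, 2i + 1), for i = 0 .. n."""
    pos = {v: p for p, v in enumerate(pi_prime)}
    n = len(pi_prime) // 2 - 1
    return [(pos[2 * i], pos[2 * i + 1]) for i in range(n + 1)]

def is_oriented(edge):
    """A gray edge is oriented iff it spans an odd number of vertices
    (endpoints included)."""
    p, q = sorted(edge)
    return (q - p + 1) % 2 == 1
```

For π = (+2, +1), with π′ = (0, 3, 4, 1, 2, 5), every gray edge spans four vertices, so its single cycle is unoriented (in fact the component is a hurdle); for π = (−1), with π′ = (0, 2, 1, 3), both gray edges span three vertices and are oriented.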
Every unoriented component can be classified as either a hurdle or a protected
nonhurdle. A hurdle is an unoriented component that does not separate other un-
oriented components, and a protected nonhurdle is one that does. A component u is
said to separate two other components v and w if, in a traversal of the vertices of B,
it is impossible to pass (in either circular direction) from a vertex belonging to v to a
vertex belonging to w without encountering a vertex belonging to u. Note that, while
separation is primarily used with respect to unoriented components, the definition
applies as well to oriented and trivial ones4. A hurdle is called a superhurdle if, were
it eliminated, a protected nonhurdle would emerge as a hurdle; otherwise it is called
a simple hurdle.
By Hannenhalli and Pevzner's duality theorem [18], the distance d(π, φ) between
π and φ is given by d(π, φ) = n + 1 − c + h + f, where c is the number of cycles and h
is the number of hurdles in B. The parameter f is equal to one if there is a fortress
in B and zero otherwise. A fortress exists iff there are an odd number of hurdles and
all are superhurdles [34].
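As a small worked instance of the duality theorem: for π = (+2, +1) with respect to the identity, the breakpoint graph has a single cycle (c = 1) that forms one simple hurdle (h = 1) and no fortress (f = 0), so d = 2 + 1 − 1 + 1 + 0 = 3, matching the three reversals needed to sort π. A one-line sketch (the cycle and hurdle counts above were determined by hand for this illustration):

```python
def hp_distance(n, c, h, f):
    """Hannenhalli-Pevzner duality: d(pi, phi) = n + 1 - c + h + f."""
    return n + 1 - c + h + f
```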
A reversal ρ(i, j) (1 ≤ i ≤ j ≤ n) applied to π = (π1, . . . , πn) transforms π
into ρ(π) = (π1, . . . , −πj, . . . , −πi, . . . , πn). A reversal ρ is a sorting reversal iff
d(ρ(π), φ) = d(π, φ) − 1. Note that d(ρ(π), φ) < d(π, φ) iff d(ρ(π), φ) = d(π, φ) − 1.
We will use the term ∆d to indicate the quantity d(ρ(π), φ) − d(π, φ). The endpoints
i and j of a reversal ρ(i, j) correspond to the ith and (j + 1)st black edges of B. We
say that ρ acts on these edges. Note that ∆d ∈ {−1, 0, 1} for every reversal.
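A reversal as defined here can be expressed directly (a sketch; the function name is ours, and indices are 1-based as in the text):

```python
def apply_reversal(pi, i, j):
    """Apply rho(i, j) to the signed permutation pi: reverse the block
    pi_i .. pi_j and flip the sign of each element in it."""
    return pi[:i - 1] + [-x for x in reversed(pi[i - 1:j])] + pi[j:]
```

For example, ρ(2, 3) transforms (+1, +2, +3) into (+1, −3, −2).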
Setubal and Meidanis [34] have presented an alternative distinction to the one
between oriented and unoriented gray edges, based on black edges. They define
two black edges of the same cycle as convergent if in a traversal of the cycle, these
edges induce the same (circular) ordering of the vertices of B; otherwise the edges
4Separation was introduced in [34]. Our definition is stated so as not to require a circular representation of B.
are divergent. It can be shown easily that an oriented gray edge always connects
divergent black edges and an unoriented gray edge always connects convergent black
edges (if the unoriented gray edge is part of a trivial cycle, it connects a black edge
to itself, and the correspondence holds trivially). Because every pair of black edges
in a cycle can be said to be convergent or divergent, however, and there are more
pairs of black edges than there are gray edges, not every divergent pair of black edges
corresponds to an oriented gray edge and not every convergent pair to an unoriented
gray edge. Setubal and Meidanis have shown that any reversal that acts on divergent
black edges will split the cycle to which the edges belong, and any reversal that acts
on convergent black edges will not split the cycle to which they belong. Furthermore,
any cycle that contains two divergent black edges must contain an oriented gray edge,
and hence is oriented (intuitively, an oriented cycle is a “splittable” cycle). Therefore,
a connected component is oriented iff at least one of its cycles has divergent black
edges (an oriented component is one that contains a splittable cycle).
Let M be the set of connected components in O, and let Ci be the set of cycles
that belong to mi ∈ M . Every ci,j ∈ Ci has a set of black edges Bi,j. If there
exist bi,j,k, bi,j,l ∈ Bi,j such that bi,j,k and bi,j,l are divergent, then ci,j is oriented;
otherwise ci,j is unoriented, unless |Bi,j| = 1, in which case ci,j is trivial. If there
exists ci,j belonging to mi such that ci,j is oriented, then mi is oriented; otherwise,
mi is unoriented, unless mi = ci,j and ci,j is trivial, in which case mi also is trivial.
If mi is unoriented, then mi is either a hurdle or a protected nonhurdle, depending
on whether it separates other unoriented components.
The following definitions will enable us to define precisely the effect an arbitrary
reversal has on the orientation of edges and components.
Definition 3.1 A reversal ρ acting on black edges i and j induces a bipartitioning
of the vertices in the breakpoint graph, with one set R containing the vertices between
i and j, and the other set R′ containing all other vertices. We say that R and R′ are
the ranges of ρ.
In the general case of circular genomes, the labels R and R′ are arbitrary (they
depend on how one draws the breakpoint graph). Indeed, the relationship between
the two sets is symmetric, in that what we think of as a reversal to the elements of
one set can as well be modeled as a reversal to the elements of the other. What is
important is that the bipartitioning is unambiguously defined by ρ.
Definition 3.2 We say that a reversal ρ with ranges R and R′ affects a gray edge
iff the edge has one vertex in R and one vertex in R′.
We say that ρ affects a component iff ρ affects at least one gray edge belonging to
the component. It can be shown easily that a component is affected by a reversal
with ranges R and R′ iff at least one vertex of the component (in the breakpoint
graph) belongs to each of R and R′.
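In the expanded unsigned representation, ρ(i, j) reverses the π′ positions 2i − 1 through 2j, so whether ρ affects a gray edge reduces to checking that exactly one endpoint position falls inside the reversed range. A sketch under that indexing assumption (names ours):

```python
def affects_edge(i, j, p, q):
    """True iff the reversal rho(i, j) affects the gray edge whose
    endpoints sit at breakpoint-graph positions p and q, i.e. iff
    exactly one endpoint lies in the reversed range R = [2i-1, 2j]."""
    in_R = lambda x: 2 * i - 1 <= x <= 2 * j
    return in_R(p) != in_R(q)
```

For π = (+2, +1), with π′ = (0, 3, 4, 1, 2, 5), the reversal ρ(1, 2) affects the gray edges at positions (0, 3) and (2, 5) but not the one at positions (4, 1), consistent with Lemma 3.1: only the first two change orientation.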
Lemma 3.1 A reversal ρ causes a gray edge e to change orientation iff ρ affects e.
Proof: The edge e is oriented iff the black edges adjacent to e are divergent, and
unoriented iff the black edges adjacent to e are convergent. If ρ affects e, then one
of its adjoining black edges will be “reversed” with respect to the other (in terms of
the direction it induces in a traversal of the cycle); and if ρ does not affect e, then
e’s black edges will remain unchanged with respect to one another. Thus, if ρ affects
e its adjoining black edges will change from convergent to divergent or vice versa,
and if ρ does not affect e, its adjoining black edges will remain as they were.
Corollary 3.1 A reversal ρ causes an unoriented component u to become oriented
iff ρ affects u.
Figure 3.3: Ways to cut a simple hurdle (a) and ways to merge two mergeable hurdles
(b). Any reversal that acts on black edges of a single cycle in a hurdle will orient at
least one edge of the cycle, and thus cut the hurdle. Any reversal that acts on black
edges from different hurdles will combine them into a single oriented component, and
thus merge the hurdles.
Proof: At least one gray edge belonging to u will be affected by ρ iff ρ affects u; a
gray edge will change from unoriented to oriented iff it is affected; and u will become
oriented iff at least one of its gray edges becomes oriented.
Corollary 3.1 provides us with a simple way to explain the phenomena Hannen-
halli and Pevzner have called “hurdle cutting” and “hurdles merging” [18]. In hurdle
cutting, a single hurdle is affected and rendered oriented, and in hurdles merging,
two hurdles are affected and oriented, as are any protected nonhurdles that separate
them (see Figure 3.3). Not all reversals that affect a single hurdle cut the hurdle,
however, and not all reversals that affect two hurdles merge them, as will be shown
in the next section.
3.2 Sorting Reversals in the Absence of Fortresses
Based on the definitions above, we can describe an exhaustive classification scheme
for reversals.
Lemma 3.2 Suppose ρ is an arbitrary reversal, such that ρ acts on black edges bi,j,k
and bi′,j′,k′, belonging respectively to cycles ci,j and ci′,j′ and to connected components
mi and mi′. Then one of the following must be true:
1. i = i′ and j = j′ (bi,j,k and bi′,j′,k′ belong to the same cycle, ci,j)
(a) ci,j is oriented and mi is oriented
i. bi,j,k and bi′,j′,k′ are convergent
ii. bi,j,k and bi′,j′,k′ are divergent
(b) ci,j is unoriented and mi is either oriented or unoriented
2. i = i′ and j ≠ j′ (bi,j,k and bi′,j′,k′ belong to different cycles of the same component, mi). Here, mi may be oriented or unoriented.
3. i ≠ i′ (bi,j,k and bi′,j′,k′ belong to different components, mi and mi′). Each of mi and mi′ may be benign or unoriented.
Proof: Either i = i′ or i ≠ i′ (the edges are part of the same or different components);
if i = i′ then either j = j′ or j ≠ j′ (the edges are part of the same or different
cycles); if i = i′ and j = j′ then bi,j,k and bi′,j′,k′ are either convergent or divergent.
Furthermore, each of ci,j, ci′,j′ , mi, and mi′ by definition is oriented, unoriented, or
trivial. Trivial cycles or components are not possible in cases 1 and 2, and case
1 need not consider the possibility of an oriented cycle belonging to an unoriented
component (which is prohibited by definition).
We define the "Fortress-Free Model" (FFM) of ASR to be a simplified version
of the problem in which it is assumed that a fortress does not exist. Thus, under
the FFM, d(π, φ) = n + 1 − c + h. The FFM allows us to introduce a simple but
powerful rule we will call "conservation of distance".
Lemma 3.3 (Conservation of Distance) Under the FFM , a reversal is a sort-
ing reversal iff: ∆c = −1 and ∆h = −2; or ∆c = 0 and ∆h = −1; or ∆c = 1 and
∆h = 0.
Proof: We know that ∆c ∈ {−1, 0, 1} (because a reversal can only merge cycles,
be neutral with respect to cycle number, or split a cycle). In the case of a sorting
reversal, ∆d = −1. Because d = n + 1 − c + h, it must be true that ∆h − ∆c = −1.
The lemmata below address Cases 1a, 1b, 2, and 3 of Lemma 3.2. They rely
heavily on the idea of conservation of distance.
Lemma 3.4 (Case 1a) Under the FFM , a reversal ρ that acts on two black edges
belonging to the same oriented cycle is a sorting reversal iff the edges are divergent
and ρ does not increase the number of hurdles.
Proof: Either the black edges are divergent and ρ splits the cycle, or the black edges
are convergent and ρ is neutral with respect to cycle number [34]. If ρ splits the cycle
(∆c = 1), then it is a sorting reversal iff ∆h = 0 (conservation of distance). If ρ is
neutral (∆c = 0), then it is a sorting reversal iff ∆h = −1 (conservation of distance).
But if ρ is neutral, we know that ∆h = 0, because the reversal acts on black edges of
an oriented cycle, and therefore of an oriented component. A reversal that acts only
on edges of a single component cannot affect any other component (otherwise the
two components would overlap, and thus would not be separate components), so in
this case, ρ cannot change the number of hurdles. Therefore, to be a sorting reversal,
ρ must split the cycle in question and avoid increasing the number of hurdles.
Lemma 3.5 (Case 1b) Under the FFM , a reversal ρ that acts on two black edges
belonging to the same unoriented cycle c is a sorting reversal iff c belongs to a simple
hurdle.
Proof: Because c is unoriented, all of its black edges are convergent; therefore ρ
must result in ∆c = 0. The component m to which c belongs is either oriented, a
hurdle, or a protected nonhurdle. If m is oriented, then ∆h = 0, as described in the
proof of Lemma 3.4. If m is a protected nonhurdle, ρ cannot orient a hurdle, and
∆h ≥ 0 (orienting a protected nonhurdle cannot decrease the number of hurdles). If
m is a hurdle, however, then ρ cuts m. But if m is a superhurdle, then when it is
cut another hurdle will emerge, and ∆h = 0; thus, ρ will not be a sorting reversal.
Only if m is a simple hurdle will ∆h = −1, and only in this case will ρ be a sorting
reversal.
Lemma 3.6 (Case 2) Under the FFM , a reversal ρ cannot be a sorting reversal
if ρ acts on two black edges belonging to different cycles of the same component.
Proof: Because ρ acts on different cycles, ∆c = −1 [34]. Therefore, by conservation
of distance, ρ is a sorting reversal only if ∆h = −2. It is impossible, however, for ρ
to orient two hurdles, because it affects only a single component.
To address Case 3 of Lemma 3.2, we must introduce several new ideas.
Definition 3.3 Suppose in a permutation π there exist four unoriented components
u, v, w, and p, such that p is a protected nonhurdle that separates hurdles u and v
from w but does not separate u and v from each other (Figure 3.4). Further suppose
that every other unoriented component in π either separates u and v or is separated
by p from both u and v, and that p does not separate any two of the components
that it separates from u and v. Then we say that hurdles u and v form a double
superhurdle.
Definition 3.4 A hurdle that separates a benign component from all other unori-
ented components is said to be the separating hurdle of the benign component.
Figure 3.4: Breakpoint graph of the permutation π = +2, +1, +3, +6, +16, +15, +7,
+10, +12, +11, +13, +9, +8, +14, +17, +19, +18, +20, +5, +4, +21, +23, +22, in which
hurdles u and v form a double superhurdle with respect to protected nonhurdle p.
Notice that all unoriented components besides u, v, and p either separate u and v
(y and z) or are separated from both u and v by p (x and w). We call this construct a
"double superhurdle" because a reversal that destroys u and v will cause p to become
a hurdle; thus, the pair acts as a kind of superhurdle of p, even though neither of
the individuals is a superhurdle of p. Notice in this example that x and w also form
a double superhurdle with respect to p.
It follows from the definition of a separating hurdle that a benign component may
have at most one separating hurdle, but a hurdle may be the separator of multiple
benign components. All separating hurdles and the benign components that they
separate can be found easily in a single pass of the breakpoint graph, as shown in
Algorithm 3.1.
Lemma 3.7 (Case 3) Under the FFM , a reversal ρ that acts on black edges be-
longing to different components u and v is a sorting reversal iff all of the following
are true:
1. Each of u and v is a hurdle or a benign component that has a separating hurdle.
2. u and v are not benign components sharing the same separating hurdle.
3. u and v, or their separating hurdles, do not form a double superhurdle.
Input: A breakpoint graph B; a corresponding array lab such that lab[i] is the label of the connected component to which the ith vertex, vi ∈ B, belongs; whether each connected component is "benign", a "hurdle", or a "protected nonhurdle"
Output: A list of pairs (x, y) such that x is a benign component and y is its separating hurdle.
begin
    initialize lists L, M;
    initialize array mark;
    start ← −1;
    for i ← 0 to 2n + 1 do
        if lab[i] is a hurdle then start ← i; break;
    if start = −1 then return L;
    currenth ← lab[start];
    for i ← start + 1 to start + 2n + 2 do
        comp ← lab[i mod (2n + 2)];
        if comp = currenth then
            /* another instance of the current hurdle: save all members of M */
            foreach c ∈ M do
                mark[c] ← 0;
                append(L, (c, currenth));
            end
            empty M;
        else if comp is a hurdle then
            /* new hurdle: empty M */
            currenth ← comp;
            foreach c ∈ M do mark[c] ← 0;
            empty M;
        else if comp is benign and mark[comp] = 0 then
            /* new benign component: add to M */
            mark[comp] ← 1;
            append(M, comp);
        end
    end
    return L;
end
Algorithm 3.1: find separating hurdles
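A runnable rendering of the same scan (Python, with illustrative names; we resolve one detail the pseudocode leaves implicit, namely that M is emptied after its members are saved):

```python
def find_separating_hurdles(lab, kind):
    """lab[i]: component label of the i-th breakpoint-graph vertex, in
    circular order (length 2n + 2); kind[c]: 'hurdle', 'benign', or
    'protected' for each label. Returns (benign component, separating
    hurdle) pairs: benign components whose vertices lie between two
    visits to the same hurdle, with no other hurdle in between."""
    N = len(lab)
    hurdle_starts = [i for i in range(N) if kind[lab[i]] == 'hurdle']
    if not hurdle_starts:
        return []
    start = hurdle_starts[0]
    L, M, seen = [], [], set()
    currenth = lab[start]
    for k in range(start + 1, start + N + 1):
        comp = lab[k % N]
        if comp == currenth:
            # Back at the current hurdle: it separates everything in M.
            L.extend((c, currenth) for c in M)
            M, seen = [], set()
        elif kind[comp] == 'hurdle':
            # A different hurdle intervenes: discard M and restart.
            currenth = comp
            M, seen = [], set()
        elif kind[comp] == 'benign' and comp not in seen:
            seen.add(comp)
            M.append(comp)
    return L
```

On a toy labeling with a benign component B enclosed by hurdle H and a second hurdle K elsewhere, the scan reports H as B's separating hurdle.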
Figure 3.5: A reversal ρ that acts on black edges belonging to two benign components
(u and v) can still cause the destruction of two hurdles (p and q), if the benign
components have distinct separating hurdles. The effect is similar to what would be
observed if the hurdles were merged.
Proof: As in Lemma 3.6, ∆c = −1 and ρ is a sorting reversal only if ∆h = −2.
If ∆h = −2, then ρ must orient at least two hurdles; and it is impossible for a
single reversal to orient more than two hurdles, so ρ must orient exactly two hurdles.
To orient two hurdles, ρ must act on two black edges, each of which belongs to a
hurdle or to a benign component separated from the other by a hurdle (a protected
nonhurdle cannot be separated from a hurdle by a hurdle; otherwise a hurdle would
separate unoriented components). Furthermore, if each black edge belongs to a
benign component, the two benign components cannot be separated by a single
hurdle (if they were, ρ would orient only one hurdle). If a benign component is
separated by a hurdle from another hurdle, the former hurdle must be its separating
hurdle (another separating hurdle could not exist; if it did a hurdle would separate
unoriented components). Thus, each of u and v must be a hurdle or a benign
component that has a separating hurdle; and if both are benign components, they
cannot share a separating hurdle. For it to be true that ∆h = −2, however, ρ
must avoid introducing any new hurdles. We claim that a reversal that destroys two
hurdles causes a new hurdle to emerge iff those hurdles form a double superhurdle.
It is clear that, if each of u and v is a hurdle or a benign component separated
by a hurdle, and they do not share a separating hurdle, then ρ will orient exactly
two hurdles; thus, if we prove our claim about double superhurdles, we will have
completed both directions of the proof of Lemma 3.7.
A reversal ρ that destroys two hurdles u and v will cause a new hurdle to emerge
iff u and v form a double superhurdle. It follows directly from the definition of a
double superhurdle that, if u and v form a double superhurdle, then a reversal that
destroys them must cause a new hurdle to emerge; therefore, we will focus on the
converse claim. Suppose ρ causes a protected nonhurdle p to emerge as a hurdle.
Then p must separate each member of a nonempty set U of unoriented components
from each member of a nonempty set V of unoriented components, and ρ must orient
all members of U or all members of V (it cannot orient members of both, because if ρ
spanned U and V it would orient p). Assume that ρ orients all members of U . Then
u, v ∈ U , and u and v are separated from all members of V . Furthermore, u and
v must be separated from one another by all other members of U ; otherwise either
they could not be hurdles, or a reversal that destroyed them could not also destroy
the other members of U . Thus, u and v meet the definition of a double superhurdle.
3.3 Accommodating Fortresses
Now we shall abandon the simplification of the FFM and re-introduce fortresses.
With fortresses, we have two major cases to consider:
1. Before a reversal ρ, there exists no fortress (f = 0). Here we must consider the
possibility that ρ introduces a fortress, and thus is not a sorting reversal in the
general model.
2. Before a reversal ρ, there exists a fortress (f = 1). Here there is no danger
of introducing a fortress (there can be only one fortress). Instead, however,
we must consider the possibility that a reversal ρ that is not a sorting reversal
under the FFM eliminates the fortress, thus is a sorting reversal in the general
model.
Case 1 is addressed by the following lemma.
Lemma 3.8 A reversal ρ that meets the criteria for a sorting reversal under the
FFM will introduce a fortress iff one of the following is true:
1. ρ acts on divergent black edges of the same oriented cycle, and ρ introduces at
least one unoriented component so that there are an odd number of hurdles all
of which are superhurdles.
2. ρ cuts the only simple hurdle and there are an odd number of superhurdles.
3. ρ acts on two black edges belonging to different components such that two hur-
dles are destroyed, and the set of hurdles is reduced so that there remain an
odd number and all are superhurdles.
Proof: We can introduce a fortress only by changing the set of hurdles such that it
has an odd size and consists only of superhurdles. Three classes of sorting reversals
under the FFM are capable of changing the set of hurdles: (a) those that introduce
unoriented components as they split cycles (Lemma 3.4); (b) those that cut simple
hurdles (Lemma 3.5); and (c) those that destroy pairs of hurdles (Lemma 3.7). A
reversal of any of these three classes is capable of changing the set of hurdles so as
to create a fortress. A reversal of class (b), however, can only remove a single simple
hurdle, so for it to create a fortress, it must cut the sole simple hurdle and there
must already exist an odd number of superhurdles.
The second case is more difficult and requires the introduction of several new
concepts.
Definition 3.5 We say two unoriented components u and v are adjacent in a break-
point graph B iff at least one pair of vertices from u and v have between them, in
Figure 3.6: The hurdle graph for the breakpoint graph of Figure 3.4. Every unoriented
component is represented by a vertex, and adjacencies between unoriented components
are represented by edges. Notice here that unoriented components y, z, and u form a
hurdle chain for hurdle u, with y as the anchor. Also notice that each of the double
superhurdles has a corresponding 3-vertex cycle.
a circular traversal of the vertices of the breakpoint graph, no vertex belonging to
another unoriented component.
Definition 3.6 Let U be the set of unoriented components in a breakpoint graph B.
We define the hurdle graph for B to be a graph H = (V, E) such that V = U and
E = {{vi, vj} | vi, vj ∈ V and vi, vj are adjacent in B}.
We can easily construct the hurdle graph in a single traversal of the breakpoint
graph, after we have labeled B with the connected components in O, and identified
all unoriented components. Figure 3.6 shows the hurdle graph for the breakpoint
graph of Figure 3.4.
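The single-pass construction can be sketched as follows (names ours): discard the vertices of benign components, then connect every pair of circularly consecutive distinct labels.

```python
def hurdle_graph_edges(labels):
    """labels: for each breakpoint-graph vertex in circular order, the
    label of its unoriented component, or None for vertices of benign
    components. Two unoriented components are adjacent iff some pair of
    their vertices are circularly consecutive once benign vertices are
    ignored; each adjacency becomes an undirected edge."""
    seq = [l for l in labels if l is not None]
    edges = set()
    for a, b in zip(seq, seq[1:] + seq[:1]):
        if a != b:
            edges.add(frozenset((a, b)))
    return edges
```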
The hurdle graph has a number of useful properties. We will describe several of
them here without providing detailed proofs. For convenience of discussion, we will
use the same name to refer to an unoriented component and the vertex in the hurdle
graph that represents it; e.g., unoriented component u is represented by vertex u.
1. Two unoriented components u and v are separated by a third unoriented com-
ponent w iff there exists no path in the hurdle graph between vertices u and v
that does not pass through vertex w.
2. A vertex in the hurdle graph cannot separate other vertices if it belongs to a
cycle and has degree 2, or if it has degree 1. Such a vertex u appears in the
hurdle graph iff u is a hurdle.
3. A node in the hurdle graph can belong to multiple cycles, but an edge cannot.
The reason is that the rules of separation are such that multiple cycles can
occur only if there exists at least one vertex v that separates all members
(other than v) of one cycle from all members (other than v) of the other. As a
result, a vertex in the hurdle graph must separate other vertices if it belongs
to a cycle and has degree greater than 2, or if it does not belong to a cycle and
has degree 2. Such a vertex v appears in the hurdle graph iff v is a protected
nonhurdle.
4. A hurdle u is a superhurdle iff u has degree 1 and the single neighbor v of u
either has degree 3 and belongs to a cycle, or has degree 2 and does not belong
to a cycle.
Definition 3.7 A hurdle chain is a chain of vertices in the hurdle graph that
consists of a hurdle and zero or more other vertices, such that every vertex v in the
chain is a hurdle, has degree 2 and does not belong to a cycle, or is the last vertex in
the chain and either belongs to a cycle or has degree greater than 2. A superhurdle
chain is a hurdle chain that contains a superhurdle.
If a hurdle chain has one end that is not a hurdle, we call the vertex at that end
the anchor of the chain. A hurdle chain that has hurdles at both ends is said to
be “unanchored”. If there exists an unanchored chain, it must encompass the entire
hurdle graph (that the hurdle graph must be connected is implicit in its definition);
therefore, if there exists at least one anchored chain, all chains must be anchored.
The anchors of anchored chains always belong to cycles. We say that the chain is
“anchored by” the cycle to which its anchor belongs.
Lemma 3.9 Two hurdles u and v form a double superhurdle iff u and v belong to
hurdle chains anchored by w and x, such that w and x belong to a 3-vertex cycle,
the third vertex y of that cycle (y ≠ w, y ≠ x) has degree 3, and each of w and x has
degree of at most 3.
Proof: If w, x, and y form a 3-vertex cycle as described, and if y has degree 3,
there must exist a vertex z such that y separates z from both u and v, yet y does
not separate u and v from each other. Furthermore, any other vertices must belong
to the hurdle chains of u and v (in which case they separate u and v) or must be
separated from u and v by y. Finally, because y has degree of 3 (and not more), it
cannot separate any two of the components that it separates from u and v. Thus, u
and v meet the definition of a double superhurdle. To prove the converse, construct
a hurdle graph according to the definition of a double superhurdle as follows. Begin
with four components, u, v, w, and p. Place r − 1 vertices and r edges (r ≥ 1)
between u and p, and place s − 1 vertices and s edges (s ≥ 1) between v and p.
Place the vertex w on the other side of p from u and v, and connect it to p, using
any number of intermediate nodes and edges, making sure that only a single edge
connects this sub-graph to p (otherwise p would separate vertices that it separates
from u and v). Recall that u and v must not be separated from each other by p, and
all vertices that separate them from p must separate them from one another. The
only way to meet these criteria is to connect with a new edge the closest to p of the
nodes that separate u and p, and the closest to p of the nodes that separate v and p.
Call these nodes y and z. Thus, p, y, and z form a 3-vertex cycle, y and z are the
anchors of the hurdle chains of u and v, y and z have degrees of at most three, and
p has degree greater than 2.
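For concreteness, the structural condition of Lemma 3.9 can be checked directly on an adjacency-set encoding of the hurdle graph. The sketch below is illustrative only: the dictionary encoding, the `anchor` map, and the vertex names are assumptions for this example, not part of the algorithms developed in this chapter.

```python
def is_double_superhurdle(adj, anchor, u, v):
    """Test the condition of Lemma 3.9 on a hurdle graph.

    adj:    dict mapping each vertex to the set of its neighbors
            (a hypothetical encoding of the hurdle graph)
    anchor: dict mapping each hurdle to the anchor of its chain
    """
    w, x = anchor[u], anchor[v]
    if x not in adj[w]:                    # w and x must be adjacent on the 3-cycle
        return False
    if len(adj[w]) > 3 or len(adj[x]) > 3:
        return False                       # each anchor has degree at most 3
    # the third vertex y of the 3-vertex cycle must have degree exactly 3
    return any(len(adj[y]) == 3 for y in (adj[w] & adj[x]) - {u, v})

# The construction from the proof, with r = s = 1: hurdles u and v anchored
# by y and z, the 3-cycle y-z-p, and a further hurdle w hanging off p.
adj = {'u': {'y'}, 'v': {'z'},
       'y': {'u', 'z', 'p'}, 'z': {'v', 'y', 'p'},
       'p': {'y', 'z', 'w'}, 'w': {'p'}}
anchor = {'u': 'y', 'v': 'z'}
```

On this example the test succeeds; if the third vertex of the cycle acquired a fourth neighbor, the degree condition would fail and the pair would no longer form a double superhurdle.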
Definition 3.8 We say a superhurdle is a single protector if it belongs to an
anchored hurdle chain of length 2. We call the neighbor of a single protector a
pseudohurdle.
The important implication of this definition is that if a single protector were oriented,
then the corresponding pseudohurdle would become a simple hurdle; and if a pseudo-
hurdle were oriented, the corresponding superhurdle would become a simple hurdle.
The reason for the name pseudohurdle will become apparent below. Essentially, when
there is a fortress, such a component can be treated much like a hurdle.
Lemma 3.10 A reversal ρ that affects a single unoriented component eliminates a
fortress iff it affects a single protector or a pseudohurdle.
Proof: First we will show that if ρ eliminates a fortress, it must affect a single
protector or a pseudohurdle. Suppose to the contrary that ρ eliminates a fortress
but affects a component u that is (1) a superhurdle that is not a single protector, or (2)
a protected nonhurdle that is not a pseudohurdle (u cannot be a simple hurdle if there
is a fortress). If u is a superhurdle that is not a single protector, then it protects
at least two nonhurdles. Therefore, if ρ orients u, it will simply create another
superhurdle in its place, and the fortress will not be removed. Similarly, if u is a
protected nonhurdle that is not a pseudohurdle, then when ρ orients u, no superhurdle
will be converted into a simple hurdle, and the fortress will remain. In either case our
assumption is impossible; thus, if ρ eliminates a fortress, it must affect and orient a
single protector or a pseudohurdle. Now we will show the converse: that if ρ affects
a single protector or a pseudohurdle, it must eliminate a fortress. If ρ affects and
orients a single protector or a pseudohurdle, it will create a simple hurdle. Because
we have assumed that ρ does not affect or create any other unoriented component,
the simple hurdle must remain a simple hurdle. A fortress cannot exist if there is a
simple hurdle, so ρ must eliminate the fortress.
Lemma 3.11 A sorting reversal ρ that orients a set U of two or more unoriented
components eliminates a fortress iff one of the following is true:
1. U includes all members of exactly one superhurdle chain.
2. U includes the members of a double superhurdle.
Proof: Let us show first that if ρ meets one of the two criteria listed, it must elim-
inate a fortress. Suppose U includes all members of superhurdle chain k. Then ρ
must remove k from the hurdle graph. Furthermore, U includes all members of no
other superhurdle chain, so ρ removes no other hurdle chain, and the number of
superhurdles must decrease by one; thus, there can no longer be an odd number
of superhurdles, and ρ eliminates the fortress. Suppose instead that U includes a
double superhurdle. In this case, U must contain exactly the members of the double
superhurdle and all unoriented components that separate them. Thus, ρ will destroy
two superhurdles and cause a new hurdle to emerge (the protected nonhurdle of the
double superhurdle). Therefore, an even number of hurdles will remain, and the
fortress can no longer exist.
Now we must show that if ρ eliminates a fortress, it must meet one of the criteria
listed. Assume to the contrary that ρ eliminates a fortress and neither includes all
members of exactly one superhurdle chain nor includes a double superhurdle. The
first criterion is the fundamental one; we will focus on it and address the other in
passing. U must include all members of exactly two superhurdle chains. We know this
because U cannot include all members of no chain (because ρ must affect multiple
cycles, it must cause ∆c = −1; thus, to be a sorting reversal, with ∆f = −1,
ρ must achieve ∆h = −1; but decreasing the number of hurdles is only possible
by orienting all members of at least one hurdle chain), and no single reversal can
orient all members of more than two chains. Therefore, ρ must eliminate exactly two
superhurdles. However, ρ cannot remove a fortress by eliminating two superhurdles
unless the fortress is a 3-fortress (the smallest possible fortress) or the superhurdles
form a double superhurdle (the only way that eliminating the two superhurdles could
cause a new hurdle to emerge). In a 3-fortress, every two superhurdles form a double
superhurdle (if any two were merged, the anchor of the third’s chain would emerge
as a hurdle), so it is enough to say that U must contain a double superhurdle.
This, however, is prohibited by the second part of our assumption, so we have a
contradiction. Thus, if ρ eliminates a fortress, it must meet one of the two criteria
of the Lemma.
Lemma 3.12 A reversal ρ eliminates all members of exactly one superhurdle chain
iff it acts on edges belonging to two components u and v such that:
1. One of u and v is either a superhurdle, or a benign component with a separating
hurdle. Without loss of generality, assume that u is this component. Let w be
either u (if u is a superhurdle) or the separating hurdle of u (if u is a benign
component).
2. One of the following is true for v:
(a) v is the anchor of w’s superhurdle chain.
(b) v is a protected nonhurdle that does not belong to w’s superhurdle chain.
(c) v is a benign component that has no separating hurdle, and v is not
separated from one unoriented component in w's chain by another.
Proof: First we will show that if ρ meets the criteria of the Lemma, it eliminates all
members of exactly one superhurdle chain. Suppose that u and w are defined as in
part 1. Suppose further that k is the hurdle chain of w. Then, if v is the anchor of
k, all members of k, and no members of any other hurdle chain, will be affected by
ρ. If instead v is a protected nonhurdle not belonging to k, then either v belongs to
another hurdle chain, or v belongs to no hurdle chain. In both cases, all members
of k are affected by ρ, and all members are affected of no other hurdle chain (if v
belongs to another hurdle chain k′, then at least one member of k′ [its hurdle] is not
affected by ρ). Finally, if v is a benign component that has no separating hurdle, and
v is not separated from one unoriented component in k by another, then no hurdle
other than w can separate u and v, and again, ρ eliminates all members of exactly
one hurdle chain.
Now we shall show that if ρ eliminates all members of exactly one superhurdle chain,
then ρ meets the criteria of the Lemma. If ρ eliminates all members of superhurdle
chain k, and does not eliminate all members of any other superhurdle chain, then ρ
must act on the black edges of two components u and v such that (without loss of
generality) u is equal to, or separated from v by, the superhurdle w of k, and v is
equal to, or separated from u by, the anchor a of k. Thus, u must either equal w or
be a benign component separated by w, and we have proved the necessity of the first
criterion. If v equals the anchor a, then the second criterion is satisfied via option
(a). Otherwise, v must not be another superhurdle (i.e., a superhurdle besides w),
a benign component separated by another superhurdle, or a member of k. If v were
another superhurdle or a benign component separated by another superhurdle, then
ρ would destroy all members of two superhurdle chains; and if v were a member of
k other than a, then ρ would not eliminate all members of k. All that remains is
for v to be a protected nonhurdle that does not belong to k (option (b)) or a benign
component that has no separating hurdle. If the latter, that benign component also
cannot be separated from one unoriented component in w by another; otherwise,
ρ would not orient all components of k. Thus, we obtain option (c). Therefore, if
ρ eliminates all members of exactly one superhurdle chain, then ρ meets the first
criterion and one of the three options of the second criterion, and the Lemma is
proved.
Finally, we are prepared to enumerate the ways in which a fortress can be eliminated.
Lemma 3.13 If there exists a fortress, then a sorting reversal ρ will eliminate the
fortress iff one of the following is true:
1. ρ splits a cycle and increases the number of hurdles.
2. ρ cuts a single protector or a pseudohurdle.
3. ρ acts on edges of two components u and v such that one of u and v is a
superhurdle or a benign component that has a separating superhurdle. Let u be
this component, and let the hurdle w be either u or u’s separating hurdle. Then
one of the following is true:
(a) v is the anchor of w’s chain.
(b) v is a protected nonhurdle not belonging to w’s chain.
(c) v is a benign component that is not separated by a hurdle, and v is not
separated by one component in w’s chain from another.
(d) v is a superhurdle or a benign component with a separating superhurdle,
and v or its separating hurdle forms a double superhurdle with w.
Proof: Any reversal that eliminates a fortress must make the number of hurdles even
or cause there to arise at least one simple hurdle. Such a change in the set of hurdles
can occur only by the elimination or creation of hurdles, or by the elimination of
pseudohurdles (it cannot occur through the creation of a protected nonhurdle). We
know that d = n+ 1− c+ h+ f and, because ρ is a sorting reversal, that ∆d = −1;
therefore, if ∆f = −1, then ∆h = ∆c. Furthermore, ∆h = ∆c ∈ {−1, 0, 1}. Let us
consider these three possibilities:
1. ∆h = ∆c = 1 (a cycle is split and the number of hurdles increases by one).
This case will occur only if at least one portion of the split cycle forms a new
unoriented component, and if this new unoriented component neither separates
existing unoriented components nor protects an existing superhurdle. The
increase in the number of hurdles will cause an odd number to become even,
and thus will eliminate the fortress.
2. ∆h = ∆c = 0 (neither the number of cycles nor the number of hurdles changes).
This case can occur only if ρ acts on convergent edges of the same cycle (otherwise
∆c ≠ 0). In addition, ∆f = −1 and ∆h = 0 will be true iff the affected
cycle belongs to a single protector or a pseudohurdle (Lemma 3.10).
3. ∆h = ∆c = −1 (two cycles are merged and the number of hurdles decreases
by one). Here ρ must act on edges of different components (if it acted on
different cycles of the same component, it could not decrease the number of
hurdles). Because all hurdles are superhurdles, the number of hurdles cannot
change unless multiple unoriented components are affected; thus Lemma
3.11 applies here. By combining Lemma 3.11 with Lemma 3.12 we obtain the
characterization presented in Lemma 3.13.
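The ∆-bookkeeping underlying these three cases can be checked mechanically from the distance formula d = n + 1 − c + h + f, which gives ∆d = −∆c + ∆h + ∆f. The snippet below is merely a sanity check of that arithmetic, not part of any algorithm in this chapter.

```python
# For a sorting reversal, Δd = -1.  If it also eliminates a fortress
# (Δf = -1), then Δd = -Δc + Δh + Δf forces Δh = Δc, with both in {-1, 0, 1}.
surviving = set()
for dc in (-1, 0, 1):
    for dh in (-1, 0, 1):
        dd = -dc + dh + (-1)          # Δd with Δf = -1
        if dd == -1:                  # sorting reversal
            assert dh == dc           # only Δh = Δc survives
            surviving.add((dc, dh))
```

Exactly the three pairs (−1, −1), (0, 0), and (1, 1) survive, matching the case analysis above.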
Now we can synthesize everything developed so far by generalizing Lemmata 3.4,
3.5, 3.6, and 3.7 to accommodate fortresses.
Lemma 3.14 (Generalization of Lemma 3.4) A reversal ρ that acts on two
black edges belonging to the same oriented cycle is a sorting reversal iff the edges are
divergent and one of the following is true:
1. ρ does not introduce an unoriented component.
2. There exists no fortress (f = 0) and ρ introduces an unoriented component, but
that unoriented component does not cause there to be an odd number of hurdles
all of which are superhurdles.
3. There exists a fortress (f = 1) and ρ increases the number of hurdles.
Proof: In case 1, the set of hurdles cannot be changed, so a fortress cannot be
introduced. Case 2 follows from Lemma 3.8, and case 3 from Lemma 3.13.
Lemma 3.15 (Generalization of Lemma 3.5) A reversal ρ that acts on two
black edges belonging to the same unoriented cycle c is a sorting reversal iff one of
the following is true:
1. There exists no fortress (f = 0), c belongs to a simple hurdle u, and either u
is not the only simple hurdle or the number of superhurdles is even.
2. There exists a fortress (f = 1) and c belongs to a single protector or a pseudo-
hurdle.
Proof: Case 1 follows from Lemma 3.8 and case 2 from Lemma 3.13.
Lemma 3.16 (Generalization of Lemma 3.6) A reversal ρ cannot be a sorting
reversal if ρ acts on two black edges belonging to different cycles of the same compo-
nent.
Proof: We have shown in Lemma 3.13 that, even if a fortress exists, a reversal that
acts on different cycles of the same component cannot remove the fortress. Thus
Lemma 3.6 applies without change to the general case.
Lemma 3.17 (Generalization of Lemma 3.7) A reversal ρ that acts on black
edges belonging to different components u and v is a sorting reversal iff one of the
following is true:
1. There exists no fortress (f = 0) and all of the following are true:
(a) Each of u and v is a hurdle or a benign component that has a separating
hurdle.
(b) u and v are not benign components sharing the same separating hurdle.
(c) u and v or their separating hurdles do not form a double superhurdle.
(d) The elimination of the hurdles associated with u and v will not leave an
odd number of hurdles all of which are superhurdles.
2. There exists a fortress (f = 1) and either:
(a) Each of u and v is a superhurdle or a benign component that has a sep-
arating superhurdle, and u and v are not benign components sharing the
same separating hurdle; or
(b) Case 3 of Lemma 3.13 applies.
Proof: Case 1 comes directly from Lemma 3.7 and case 2 from Lemma 3.13.
3.4 The Algorithm
Lemmata 3.14, 3.15, 3.16, and 3.17 lead directly to an algorithm to address ASR
(Algorithm 3.2). For clarity of presentation, we have broken out as separate
subroutines the steps that find sorting reversals that split cycles (Algorithm 3.3), that
cut hurdles (or pseudohurdles; Algorithm 3.4), and that merge separate components
(Algorithms 3.5 and 3.6). For the moment, we shall assume in Algorithm 3.3 the
existence of a routine to detect whether a candidate reversal introduces new
unoriented components; the details of deriving such a routine will be the subject of the
next section.
Input: Two signed permutations of size n, π and φ; assume functions
    get revs split cycles, get revs cut hurdles, get revs merge nofort, and
    get revs merge fort that find all sorting reversals that respectively split
    cycles, cut hurdles (or pseudohurdles), merge components (when there is no
    fortress), and merge components (when there is a fortress).
Output: A list L of all sorting reversals of π with respect to φ.
begin
    Construct the breakpoint graph B of π with respect to φ;
    Identify all black edges, cycles, and connected components in B. Let e_{i,j,k}
        represent the ith black edge belonging to the jth cycle of the kth component;
    For each (i, j, k), let o_{i,j,k} be defined such that o_{i,j,k} ∈ {−1, +1} and
        o_{i1,j,k} · o_{i2,j,k} = −1 iff e_{i1,j,k} and e_{i2,j,k} are divergent;
    Label each component as oriented, trivial, or unoriented;
    Build hurdle graph H; use H to label each unoriented component as a simple
        hurdle, a single protector, a pseudohurdle, a superhurdle (i.e., a
        non-single-protector), or a protected nonhurdle (i.e., a non-pseudohurdle).
        Also label each member of a double superhurdle with its partner (a single
        hurdle can have at most two);
    Let c be the number of cycles in B, h be the number of hurdles, and s be the
        number of superhurdles; let f = 1 if h = s and s is odd; otherwise let f = 0;
    Initialize list L;
    append all(L, get revs split cycles());
    append all(L, get revs cut hurdles());
    if f = 0 then append all(L, get revs merge nofort());
    else append all(L, get revs merge fort());
    return L;
end
Algorithm 3.2: find all sorting reversals
Input: All e_{i,j,k}, o_{i,j,k}, and the values of h, s, and f from
    find all sorting reversals; assume the existence of a function
    detect new unoriented components that returns a list of new unoriented
    components introduced by a reversal (or ∅ if none is introduced).
Output: A list M of all sorting reversals that split cycles.
begin
    Initialize list M;
    foreach e_{i1,j,k} and e_{i2,j,k} such that o_{i1,j,k} · o_{i2,j,k} = −1 do
        P ← detect new unoriented components(ρ(e_{i1,j,k}, e_{i2,j,k}));
        if P = ∅ then    /* no new unoriented component */
            append(M, ρ(e_{i1,j,k}, e_{i2,j,k}));
        end
    1   else    /* at least one new unoriented component */
            Add components represented by the elements of P to hurdle graph;
            Label types of new components, and relabel neighbors in hurdle graph
                as needed;
            Compute new number of hurdles (h′) and whether a fortress exists in
                the new permutation (f′);
    2       if h′ + f′ ≤ h + f then
                append(M, ρ(e_{i1,j,k}, e_{i2,j,k}));
            end
        end
    end
    return M;
end
Algorithm 3.3: get revs split cycles
Input: All e_{i,j,k}, o_{i,j,k}, and the values of h, s, and f from
    find all sorting reversals.
Output: A list M of all sorting reversals that cut hurdles (or pseudohurdles).
begin
    Initialize list M;
    H ← {k | component k is a simple hurdle};
 1  if f = 1 then H ← H ∪ {k | component k is a pseudohurdle or a single
        protector};
 2  if f = 0 and s = 2a + 1, a ∈ Z+ and h = 2a + 2 then
        do not cut hurdles;    /* avoid a fortress! */
    end
 3  else foreach e_{i1,j,k}, e_{i2,j,k} such that k ∈ H and i1 ≠ i2 do
        append(M, ρ(e_{i1,j,k}, e_{i2,j,k}));
    end
    return M;
end
Algorithm 3.4: get revs cut hurdles
Input: All e_{i,j,k}, o_{i,j,k}, and the values of h, s, and f from
    find all sorting reversals.
Output: A list M of sorting reversals that merge separate components, assuming
    there is not a fortress.
begin
    Initialize list M;
    Find all separating hurdles (Algorithm 3.1). Let S_k ∈ S be defined such that
        S_k = {j | component j is a benign component whose separating hurdle is
        component k};
    H ← {k | component k is a hurdle};
    foreach i ∈ H do
        foreach k ∈ H such that k ≠ i do
            /* avoid a fortress */
 1          if (s = 2a + 1, a ∈ Z+ and h = 2a + 3 and hurdles i and k are both
                simple hurdles) or (s = 2a + 2, a ∈ Z+ and h = 2a + 3 and one of
                hurdles i and k is a superhurdle) then
                Do not merge i and k;
            end
            /* avoid a double superhurdle */
 2          else if hurdles i and k form a double superhurdle then
                Do not merge i and k;
            end
            else
 3              foreach j ∈ {i} ∪ S_i do
 4                  foreach e_{x1,y1,z1}, e_{x2,y2,z2} such that z1 = j and
                        z2 ∈ {k} ∪ S_k do
                        append(M, ρ(e_{x1,y1,z1}, e_{x2,y2,z2}));
                    end
                end
            end
        end
    end
    return M;
end
Algorithm 3.5: get revs merge nofort
Input: All e_{i,j,k}, o_{i,j,k}, and the values of h, s, and f from
    find all sorting reversals.
Output: A list M of sorting reversals that merge separate components, assuming
    there is a fortress.
begin
    Initialize list M;
    Find all separating hurdles (Algorithm 3.1). Let S_k ∈ S be defined such that
        S_k = {j | component j is a benign component whose separating hurdle is
        component k}, and let S_¬ ∈ S be defined such that S_¬ = {j | component j
        is a benign component that has no separating hurdle};
    H ← {k | component k is a hurdle};
    U ← {k | component k is an unoriented component};
    foreach k ∈ H do
 1      foreach j ∈ U such that j belongs to hurdle chain k do
            γ_j ← k;
 2          if j is the anchor of the chain then α_k ← j;
        end
    end
 3  foreach j ∈ U such that j belongs to no hurdle chain do γ_j ← −1;
    foreach i ∈ H do
        V ← {k | k ∈ U and k ∉ H and γ_k ≠ i};
        P ← {k | k is a double superhurdle partner of i};
        W ← {k | (k ∈ H and k ≠ i) or (k ∈ S_j such that j ∈ H and j ≠ i)};
        S′_¬ ← {k | k ∈ S_¬ and k is not separated by one component of chain γ_i
            from another};
        foreach j ∈ {i} ∪ S_i do
 4          foreach e_{x1,y1,z1}, e_{x2,y2,z2} such that z1 = j and
                z2 ∈ {α_i} ∪ V ∪ P ∪ W ∪ S′_¬ do
                append(M, ρ(e_{x1,y1,z1}, e_{x2,y2,z2}));
            end
        end
    end
    return M;
end
Algorithm 3.6: get revs merge fort
Let us briefly discuss a few subtleties in obtaining Algorithms 3.3, 3.4, 3.5, and
3.6 from Lemmata 3.14, 3.15, and 3.17. Note that the “else” clause of Algorithm
3.3 (step 1) covers cases 2 and 3 of Lemma 3.14. Because the introduction of a new
unoriented component can disrupt the separation relationships among unoriented
components in various ways, it turns out to be simplest to add the new unoriented
component to the hurdle graph, to relabel neighbors as necessary, and to recompute
the sum of the numbers of hurdles and fortresses. In Algorithm 3.4, note that, in the
case of a fortress, pseudohurdles and single protectors can be handled exactly as are
simple hurdles (step 1; case 2 of Lemma 3.15). In this algorithm, step 2 explicitly
avoids walking into a fortress (case 1). Once an unoriented component is identified
as able to be cut, a separate sorting reversal is defined by every pair of black edges
that belong to the same cycle in that component (step 3).
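Step 3 of Algorithm 3.4 can be phrased compactly: given the cycles of a cuttable component, every unordered pair of black edges on the same cycle yields a distinct cutting reversal. The sketch below assumes a hypothetical encoding of a component as a list of cycles, each a list of black-edge identifiers.

```python
from itertools import combinations

def cut_reversals(cycles):
    """All reversal-defining edge pairs within one cuttable component.

    cycles: list of cycles, each a list of black-edge ids (hypothetical
    encoding); pairs drawn from different cycles are excluded.
    """
    return [pair for cyc in cycles for pair in combinations(cyc, 2)]
```

For a component with one 3-edge cycle and one 2-edge cycle, this yields 3 + 1 = 4 reversals.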
In Algorithm 3.5, benign components with separating hurdles are essentially handled
like hurdles themselves (step 3; case 1a of Lemma 3.17), except that checks to
avoid double superhurdles (step 2; case 1c) and to avoid walking into a fortress (step
1; case 1d) are executed with respect to the separating hurdles of such benign
components. The algorithm implicitly avoids merging two benign components that share the
same separating hurdle (case 1b) by the way its loops are nested. Note that, once
two components are found that can be merged, a sorting reversal is defined by every
pair of black edges such that one belongs to each component (step 4).
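Step 4 of Algorithm 3.5 thus amounts to a Cartesian product of black-edge sets: once two components (together with the benign components they separate) are known to be mergeable, every pair consisting of one black edge from each side defines a sorting reversal. The edge lists in this sketch are hypothetical placeholders.

```python
from itertools import product

def merge_reversals(edges_u, edges_v):
    """Edge pairs defining the reversals that merge two mergeable components.

    edges_u, edges_v: black-edge ids drawn from each side of the merge
    (hypothetical inputs); each cross pair defines one sorting reversal.
    """
    return list(product(edges_u, edges_v))
```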
In Algorithm 3.6, it is necessary to label every unoriented component with its
hurdle chain (step 1), or to mark it appropriately if it belongs to no hurdle chain
(step 3). It is also necessary to associate each hurdle with the anchor of its chain
(step 2). Once these steps are accomplished, four sets are identified with respect to
every hurdle i: V contains all protected nonhurdles not belonging to i’s chain; P
contains every double superhurdle partner of i; W contains all hurdles besides i and
the benign components that they separate; and S′_¬ contains all members of S_¬ (the
set of benign components without separating hurdles) except those that are separated
by one component in i’s chain from another. The union of the set containing the
anchor of i's chain (α_i), V, P, W, and S′_¬ represents the set of all components to
merge with i and with all of the benign components that i separates (step 4). As
before, a sorting reversal is defined by every pair of black edges drawn from two
mergeable components.
Theorem 3.1 Algorithm 3.2 will correctly find all sorting reversals of one permu-
tation π with respect to another permutation φ (π and φ both of size n).
Proof: We have shown that Lemmata 3.14, 3.15, 3.16, and 3.17 are correct; further-
more, because they are tied directly to the exhaustive classification of reversals of
Lemma 3.2, these Lemmata describe all possible sorting reversals. Algorithm 3.2
is constructed directly from these Lemmata, with the subroutine of Algorithm 3.3
representing Lemma 3.14, the subroutine of Algorithm 3.4 representing Lemma 3.15,
and the subroutine of Algorithms 3.5 and 3.6 representing Lemma 3.17. Lemma 3.16
is represented implicitly by the absence of any steps that consider edges belonging
to different cycles of the same component.
Let us comment briefly on the asymptotic running time of Algorithm 3.2. The
initial setup (finding and labeling black edges, cycles, and components; building the
hurdle graph⁵) can all be done in average time linear in the number of genes, n. The
number of divergent black edges is O(n²) (consider the case of a breakpoint graph
consisting of one oriented cycle with (n+1)/2 edges of each orientation), so Algorithm
3.3 requires O(n² · f(n)) time, if f(n) is the time required to run detect new
unoriented components. The number of ways to cut hurdles is of size O(n²) (consider
the case of a breakpoint graph consisting of a single unoriented cycle); thus, Algorithm
3.4 takes O(n²). Algorithms 3.5 and 3.6 also take O(n²) time, because although there
can be O(n²) pairs of components and O(n) black edges per component, there can
only be O(n²) pairs of black edges in total. Thus, Algorithm 3.2 is dominated by
Algorithm 3.3, and its running time is O(n² · f(n)). In the next section, we shall see
that detect new unoriented components can be performed in O(n) time, yielding a
running time of O(n³) for Algorithm 3.2. We shall also see, however, that the fastest
implementation of detect new unoriented components is one that takes O(n²) time
(implying O(n⁴) for Algorithm 3.2).

⁵Note that while the number of nodes in the hurdle graph is O(n), the number of
edges is also limited to O(n), rather than the O(n²) that one might expect.
3.5 Detecting New Unoriented Components
The purpose of the routine we have called detect new unoriented components
(which we will now abbreviate “detect”) is to find whether a reversal that splits
a cycle introduces unoriented components. If it does not, we know that the number
of hurdles stays the same, and hence that the reversal is a sorting reversal; if it does,
we must examine any new unoriented components and their effect on the hurdle
graph to see whether the number of hurdles increases.⁶ Because oriented cycles are
far more common than unoriented ones, and because detect must be executed anew
for every pair of divergent edges of every oriented cycle, the efficiency of this routine
is critical for the efficiency of Algorithm 3.2. Note that detect can facilitate the ad-
ditional examination required when new unoriented components are introduced by
returning the vertices in the breakpoint graph that compose such components. Thus,
we will define detect to return not simply true or false, but a list of sets of
vertices; if this list is empty, then it is understood that no new unoriented components
were created.

⁶If the reader is in doubt of the possibility of introducing an unoriented component
while splitting a cycle, consider a reversal that merges hurdles: performing the "same"
reversal once more (that is, to "undo" the merge) will split an oriented component into
two unoriented components (both hurdles).
It is worth mentioning why the problem of detect has not been studied, since
one might imagine that solving it would be a prerequisite to any algorithm that sorts
signed permutations by reversals. Algorithms that seek only one sequence of sorting
reversals, however, are able to avoid checking for new unoriented components by
carefully selecting reversals that are guaranteed not to introduce them. Because we
seek all sorting reversals, we do not enjoy this luxury.
We will begin this section by discussing briefly a simple linear-time algorithm for
detect; then we will introduce a more complicated O(n²) algorithm that turns out
to be more efficient in practice.
3.5.1 A Simple Linear-Time Solution
The most straightforward way to implement detect is simply to have it rerun the
linear-time connected components routine of [1], and then test whether that routine
yields a greater number of unoriented components after the reversal than before.
This approach can be improved slightly by noting that a candidate reversal can
alter only the connected component that it affects. As a result, it is possible to run
connected components on only the portion of the breakpoint graph consisting of
vertices that belong to that component.
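The brute-force strategy can be sketched as a plain graph search. For illustration we operate on an overlap-graph encoding rather than the breakpoint-graph routine of [1]; the dict-of-sets representation and the restriction to the affected component's vertices are assumptions of this sketch.

```python
def count_unoriented(overlap, oriented, vertices):
    """Count unoriented components among `vertices` by depth-first search.

    overlap:  dict vertex -> set of overlapping vertices (hypothetical)
    oriented: dict vertex -> True iff the corresponding gray edge is oriented
    vertices: set of vertices of the one component a reversal can alter
    """
    seen, count = set(), 0
    for start in vertices:
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:                       # explore one connected component
            v = stack.pop()
            comp.append(v)
            for w in (overlap[v] & vertices) - seen:
                seen.add(w)
                stack.append(w)
        # a nontrivial component with no oriented vertex is unoriented
        if len(comp) > 1 and not any(oriented[v] for v in comp):
            count += 1
    return count
```

Running this before and after a candidate reversal, and comparing the counts, realizes the re-examination described above.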
It might seem unlikely that one could improve on this strategy. As it turns out,
however, many reversals alter only very slightly the component that they affect,
and as a result, re-examining that entire component is wasteful. We will see below
how, by studying the effect of a reversal on the overlap graph, we can derive an
algorithm that performs better in practice (although worse in asymptotic terms)
than the simple one described here.
3.5.2 The Effect of a Reversal on the Overlap Graph
Bergeron [2] and Bergeron and Strasbourg [3] have introduced an elegant and sim-
ple technique for sorting by reversals that sidesteps much of the complexity of
Hannenhalli-Pevzner theory. An important part of their method is to model sim-
ply and efficiently the effect on the overlap graph of a reversal corresponding to
an oriented gray edge in the breakpoint graph. Their technique uses the following
Lemma:
Lemma 3.18 ([20, 2]) If one performs the reversal corresponding to an oriented
vertex v, the effect on the overlap graph will be to complement the sub-graph of v and
its adjacent vertices.
Unfortunately, Lemma 3.18 is too restrictive for our purposes, because it considers
only reversals that correspond to oriented gray edges (which are equivalent to oriented
vertices in the overlap graph). Without too much trouble, however, we can generalize
it to characterize the effect on the overlap graph of any reversal.
Lemma 3.19 (Generalization of Lemma 3.18) The effect on the overlap graph
of any reversal will be to complement the sub-graph of all vertices corresponding to
affected gray edges.
Proof: Let u and v be two gray edges affected by a reversal ρ. Then exactly as
described in [2], if u and v overlap, ρ will cause them not to overlap, and if u and
v do not overlap, ρ will cause them to overlap (the effect is as if there existed an
oriented vertex corresponding to ρ, in Bergeron’s scenario; see Figure 3.7). Suppose
instead that one or both of u and v are unaffected by ρ. If both are unaffected,
their overlapping relationship cannot change. Suppose (without loss of generality)
that u is affected and v is unaffected. If R and R′ are the ranges of ρ, then one of
Figure 3.7: Some examples illustrating that whether two edges overlap changes if
and only if both are affected by a reversal.
R and R′ must contain both vertices of v and one of u, and the other must contain
the other vertex of u. Without loss of generality, call the former R and the latter
R′. The reversal will simply reverse the order in the breakpoint graph of vertices
that belong to R with respect to those that belong to R′ (and vice versa), without
changing the relative order of the vertices within each set. If u and v overlap, then in
a circular traversal of the vertices of R in the breakpoint graph, one will encounter
first a vertex of v, then one of u, then the second vertex of v. For this reason, we say
that u’s vertex is “between” those of v. The reversal cannot alter this arrangement;
that is, after ρ occurs, u’s vertex will still be between those of v, and thus, u and
v will still overlap. If, on the other hand, u and v do not overlap, then the vertex
of u in R will not occur between the vertices of v. The reversal cannot alter this
arrangement either. Thus, the overlapping relationship between two gray edges is
negated if and only if both edges are affected by a reversal, and consequently, any
reversal will complement the sub-graph of the overlap graph that includes exactly
those vertices corresponding to affected edges.
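Lemmata 3.1 and 3.19 together give a complete update rule, which can be sketched on a dict-of-sets overlap graph before turning to the bitwise representation of Section 3.5.3. The encoding below is an assumption of this sketch: flip the orientation of every affected gray edge, and complement the sub-graph they induce.

```python
def apply_reversal(overlap, oriented, affected):
    """Update an overlap graph for a reversal affecting the given gray edges.

    overlap:  dict vertex -> set of overlapping vertices (hypothetical)
    oriented: dict vertex -> orientation flag
    affected: set of comparable vertex ids for the affected gray edges
    """
    for u in affected:
        oriented[u] = not oriented[u]      # Lemma 3.1: orientations flip
    for u in affected:                     # Lemma 3.19: complement sub-graph
        for v in affected:
            if u < v:                      # visit each pair once
                if v in overlap[u]:
                    overlap[u].remove(v)
                    overlap[v].remove(u)
                else:
                    overlap[u].add(v)
                    overlap[v].add(u)
```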
Recall that we have already established (Lemma 3.1) that a reversal changes the
orientation of exactly those gray edges that it affects. This finding, together with
Lemma 3.19, allows us to characterize completely the effect on the overlap graph of
any given reversal. Figure 3.8 illustrates this effect using a simple example.
Figure 3.8: The effect of a reversal on the breakpoint and overlap graphs. Note
that gray edges (4,5), (8,9), (12,13), and (14,15) are affected; as a result, these edges
change orientation and their sub-graph is complemented in the overlap graph. This
particular reversal splits a cycle and introduces a hurdle.
3.5.3 A Bitwise Algorithm
Algorithm 3.7 is an implementation of detect based on Lemmata 3.1 and 3.19. It
makes use of several techniques introduced by Bergeron and Strasbourg, including
the following:
1. The overlap graph for k gray edges is represented as a “bit matrix” composed
of k bit vectors v0 . . . vk−1, each of length k, such that the ith bit of the jth
vector and the jth bit of the ith vector are set to 1 if edges i and j overlap,
and set to 0 otherwise. An example of a bit matrix is shown in Figure 3.9.
2. An auxiliary bit vector p indicates the “polarity”, or orientation, of each edge.
That is, the ith bit of p is set to 1 if the ith edge is oriented, and set to 0
otherwise.
3. Bitwise operators are used to model efficiently the changes to the overlap graph
induced by a candidate reversal. The “exclusive or” operator is particularly
useful for “flipping” selected bits to reflect a change in edge orientation, or to
find the complement of a sub-graph of the overlap graph.
Our algorithm introduces two additional bit vectors, a and l, whose purposes
will be seen below. Moreover, we take advantage of our previous observation that
connected components can be examined separately by representing each component
with a separate bit matrix (which allows considerable savings of time and space, since
the bit matrix takes O(k²) time to build). Algorithm 3.7 assumes an initialization
step in Algorithm 3.2 that constructs a bit matrix for each oriented component; bit
matrices need not be built for unoriented and trivial components.
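For components of at most 32 gray edges, the update steps described above can be sketched in C with one machine word per row of the bit matrix. The struct and function names below are ours, not the thesis code, and the sketch covers only the bitwise updates (sub-graph complement, parity flip, and the computation of l), not the component search:

```c
#include <stdint.h>

/* Illustrative sketch (not the thesis code): an overlap graph on k <= 32
 * vertices, one 32-bit word per row of the bit matrix.  Given the mask a of
 * vertices affected by a candidate reversal, complement the sub-graph they
 * induce, flip their orientation bits, and return l, the mask of affected
 * vertices that are now unoriented. */
typedef struct {
    uint32_t row[32];  /* row[i] has bit j set iff edges i and j overlap */
    uint32_t p;        /* bit i set iff edge i is oriented               */
    int k;             /* number of vertices (gray edges)                */
} OverlapGraph;

static uint32_t apply_reversal(OverlapGraph *g, uint32_t a)
{
    for (int i = 0; i < g->k; i++)
        if (a & (1u << i)) {
            g->row[i] |= 1u << i;  /* set the diagonal bit ...             */
            g->row[i] ^= a;        /* ... so the XOR leaves it clear again */
        }
    g->p ^= a;                     /* affected edges change orientation    */
    return ~g->p & a;              /* l: affected and now unoriented       */
}
```

Because each row is a machine word, complementing the sub-graph induced by the affected vertices costs only one OR and one XOR per affected row.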
The algorithm alters the overlap graph to reflect the candidate reversal, then
searches the graph for unoriented components (if there exist unoriented components,
they must be new, because the original overlap graph represented only a single
oriented component). It begins by constructing a bitwise representation of which
vertices are affected by the reversal (bit vector a), then uses this representation to
compute updated versions of the bit matrix (step 1; Lemma 3.19) and the orientation
bit vector, p (step 2; Lemma 3.1). In addition, the algorithm uses bitwise operators
to construct a bit vector (l) that indicates which affected vertices are rendered un-
oriented by the reversal (step 3). It uses this vector to limit the possible starting
points for its search of the graph (step 4), because any new unoriented component
must contain at least one such vertex. Any search starting from such a vertex can be
aborted as soon as an oriented vertex is found (step 5), because a single oriented ver-
tex is sufficient to ensure a component is oriented. Because only an oriented vertex
can cause a search to abort early, however, the presence of a mark from a previous
search indicates that an oriented vertex must also be present (step 8). Note that the
algorithm must distinguish trivial components from unoriented ones (steps 7 and 9),
since neither contains oriented edges.
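As a concrete (and simplified) illustration, the search phase just described (steps 4 through 9) might look as follows in C, again assuming components of at most 32 vertices with each row of the bit matrix packed into a 32-bit word; detect_unoriented and its conventions are ours, not the thesis implementation. Unlike the pseudocode, this sketch marks vertices when they are pushed rather than popped, which avoids duplicate stack entries but preserves the same abort conditions:

```c
#include <stdint.h>

/* Sketch only: rows[] is the bit matrix, p the orientation bits, l the
 * start-vertex bits (affected and now unoriented).  Each new unoriented
 * component found is written to out[] as a vertex mask; the count is
 * returned.  A search aborts as soon as it sees an oriented vertex or a
 * mark left by an earlier search. */
static int detect_unoriented(const uint32_t row[], uint32_t p, uint32_t l,
                             int k, uint32_t out[])
{
    int mark[32], stack[32], n = 0;
    for (int i = 0; i < k; i++) mark[i] = -1;
    for (int i = 0; i < k; i++) {
        if (!(l & (1u << i)) || mark[i] != -1) continue;
        uint32_t comp = 0;
        int top = 0, oriented = 0, trivial = 1;
        stack[top++] = i;
        mark[i] = i;                         /* mark on push, not on pop */
        while (top > 0 && !oriented) {
            int j = stack[--top];
            comp |= 1u << j;
            for (int m = 0; m < k; m++) {
                if (!(row[j] & (1u << m))) continue;
                trivial = 0;
                if ((p & (1u << m)) || (mark[m] != -1 && mark[m] < i)) {
                    oriented = 1;            /* early abort: component is oriented */
                    break;
                }
                if (mark[m] == -1) { mark[m] = i; stack[top++] = m; }
            }
        }
        if (!oriented && !trivial) out[n++] = comp;
    }
    return n;
}
```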
Figure 3.9 shows the changes the algorithm will make in the bit matrix and
auxiliary bit vectors that correspond to the example of Figure 3.8.
Theorem 3.2 Given inputs as specified, Algorithm 3.7 will return a nonempty list
if and only if a candidate reversal ρ creates at least one new unoriented component.
Furthermore, if ρ introduces p new unoriented components, then the elements Ui of
L, 1 ≤ i ≤ p, will contain the vertices that compose these new components.
Proof: First we must show that, if ρ introduces no new unoriented component,
then Algorithm 3.7 will return an empty list, and if ρ introduces p new unoriented
components, Algorithm 3.7 will return a list of p sets each containing the vertices
of a new unoriented component. Steps 1 and 2 ensure that the overlap graph will
be altered exactly as described by Lemmata 3.19 and 3.1. Suppose ρ splits a cycle
without introducing an unoriented component. Then every vertex in the altered
overlap graph will be part of an oriented or trivial component, so every search at
step 5 will result in either the flag oriented being set, or the flag trivial being
left as true, and the algorithm will never append any set to the list L; thus, L will
be empty when the algorithm exits. Suppose instead that ρ introduces p unoriented
components. For each one of these components, ci, there must exist a vertex v in ci
that was affected and rendered unoriented by ρ. We know this is true because the only
changes ρ can cause to the overlap graph are in the orientation of affected vertices
and in the presence or absence of edges between affected vertices; and ci cannot
contain an affected, oriented vertex (if it did, it would be an oriented component).
Steps 3 and 4 ensure that we begin a search at every affected, unoriented vertex,
so eventually we will begin a search with v, or with some other affected, unoriented
vertex u that is also part of ci. Because ci is an unoriented component, a search from
v or u cannot abort until the entire component has been traversed. When the search
finishes, no oriented vertex will have been found, and no mark from another search
will have been encountered, so the oriented flag will not have been set. In addition,
the trivial flag will have been changed to false, so the algorithm will append V
to L at step 9. Furthermore, V will correctly contain all of the vertices in ci. Thus,
when the algorithm exits, the list L will contain a set of vertices representing each
new unoriented component.
Now we must show that, if Algorithm 3.7 returns an empty list, then ρ must have
introduced no new unoriented component, and if Algorithm 3.7 returns a list of p
sets, then each set must contain the vertices of a separate new unoriented component.
Suppose the algorithm returns an empty list. Then the test at step 9 must always
have failed, so that no set was ever appended to the list. If this test failed, then
each search at step 8 must have encountered an oriented vertex or a mark from
a previous search, or have been executed on a vertex with no neighbors. Because
a search was begun at each affected, unoriented vertex, a search was initiated of
every component that could possibly be unoriented. If in a search of a component
c, the algorithm encountered an oriented vertex, then clearly c must be oriented.
If instead the algorithm encountered a mark from a previous search, then c also
must be oriented, because the previous search could have aborted before marking
all vertices of the component only if it encountered an oriented vertex or a mark
from a previous search (the first in a sequence of such searches must have actually
encountered an oriented vertex). If instead c consisted of a vertex with no neighbors,
c must be trivial. In any case, c could not be unoriented. Therefore, in this case, all
components that could have been unoriented must not have been unoriented,
and the algorithm must have correctly returned the empty list. Suppose instead
that the algorithm returns a list of p sets. Each set Ui, 1 ≤ i ≤ p, must have
been appended to the list at step 9. Furthermore, each such set must include all
of the vertices in a nontrivial connected component, because if the algorithm had
encountered an oriented vertex or a previous mark, and aborted the search early, it
would have set the oriented flag, and if the component had consisted of a single
vertex, the trivial flag would have remained true; in either case, the set could
not have been appended to the list. Thus, each set Ui must consist of all vertices
of a component that has no oriented vertex, and properly represents an unoriented
component.
Algorithm 3.7 takes Ω(k²) time, but runs very fast in practice. Reasons for
its speed include that k tends to be much smaller than n when distances between
permutations are modest, that the steps taking Ω(k²) time are very efficient (copying
the bit matrix and executing the exclusive-or computations), and that if oriented
vertices are present (as they usually are), the graph search often can abort quickly.
Input: (1) A candidate reversal ρ; (2) the overlap graph for the k-vertex component affected by ρ, represented (as described in [2]) as a k × k bit-matrix with rows and columns corresponding to vertices v0 . . . vk−1; and (3) a bit-vector p indicating the “polarity” or orientation of each vertex in the overlap graph (pi = 1 iff vi is oriented). Assume bit-vector operators for “exclusive or” (⊕), “and” (∧), and “negation” (¬).

Output: A list L of sets (U1, U2, . . . , Up) such that each set contains those vertices composing a new unoriented component; L = ∅ if no such component is detected.

begin
    /* identify affected vertices */
    a ← (a0 . . . ak−1 | ∀i, 0 ≤ i ≤ k − 1, ai ∈ {0, 1}, ai = 1 iff vi is affected by ρ)
    /* complement sub-graph of affected vertices */
1   foreach i | ai = 1 do vii ← 1; vi ← vi ⊕ a
    /* negate parity of affected vertices */
2   p ← p ⊕ a
    /* identify affected vertices that are now unoriented */
3   l ← ¬p ∧ a
    /* search for unoriented components in the graph */
    Initialize mark, an array of k integers, with values of −1
    Initialize list L
4   foreach i | li = 1 do
        if mark[i] ≠ −1 then continue
        /* starting with vi, traverse the overlap graph until exhaustion
           or evidence of an oriented vertex */
        Initialize stack S; V ← ∅; oriented ← false; trivial ← true
        push(S, i)
5       while oriented = false and S not empty do
            pop(S, j); V ← V ∪ {vj}; mark[j] ← i
6           foreach m | vjm = 1 do
7               trivial ← false
8               if pm = 1 or −1 < mark[m] < i then oriented ← true; break
                if mark[m] = −1 then push(S, m)
            end
        end
9       if oriented = false and trivial = false then append(L, V )
    end
    return L
end

Algorithm 3.7: detect new unoriented components
Before reversal

            (4,5) (6,7) (8,9) (10,11) (12,13) (14,15)
              v0    v1    v2     v3      v4      v5
 (4,5)  v0    0     1     1      0       1       1
 (6,7)  v1    1     0     0      0       0       0
 (8,9)  v2    1     0     0      1       1       0
(10,11) v3    0     0     1      0       1       0
(12,13) v4    1     0     1      1       0       0
(14,15) v5    1     0     0      0       0       0

        p     0     1     1      0       1       1
        a     1     0     1      0       1       1

After reversal

            (4,5) (6,7) (8,9) (10,11) (12,13) (14,15)
              v0    v1    v2     v3      v4      v5
 (4,5)  v0    0     1     0      0       0       0
 (6,7)  v1    1     0     0      0       0       0
 (8,9)  v2    0     0     0      1       0       1
(10,11) v3    0     0     1      0       1       0
(12,13) v4    0     0     0      1       0       1
(14,15) v5    0     0     1      0       1       0

        p     1     1     0      0       0       0
        l     0     0     1      0       1       1
Figure 3.9: The bit matrix for the affected component of Figure 3.8, before and after the reversal. The bit vectors p, a, and l are also shown (only p is relevant both before and after the reversal). Algorithm 3.7 will detect the new unoriented component during a graph traversal beginning at v2.
3.6 Experimental Methods and Results
We implemented Algorithm 3.2 in C and tested it for correctness and speed. Our
implementation, program find-all-sr, includes code for both detect algorithms
discussed in the previous section, and allows either one to be selected at compile
time. Program find-all-sr also re-implements the connected components routine
of [1], instead of using their existing implementation. We chose this approach for two
reasons: so that we could adapt the routine to use the edge-overlap rather than the
cycle-overlap formulation of the overlap graph, the former being more compatible
with the rest of Algorithm 3.2; and so that we could parameterize the routine for the
option to consider only a single connected component, as is useful for the linear-time
implementation of detect. Program find-all-sr comprises about 1600 lines of C
code.
Test data fell into three classes. The largest class consisted of pairs of signed
permutations, such that one member of each pair had been “scrambled” with respect
to the other by a specified number of reversals. These pairs were generated by a
program that took three parameters: the permutation size n, the number of pairs
to generate p, and the number of random reversals r to execute on a member of
each pair. For each of the p pairs, this program would generate a random signed
permutation of size n, copy the permutation, execute r random reversals on the
copy, and output the original and the scrambled copy as a pair. A random reversal
was defined as a reversal from position i to position j, such that i and j are two
random integers, 0 ≤ i < j ≤ n + 1. The second class of test data was similar to
the first except that permutations were scrambled not with random reversals, but
with a procedure designed to introduce as many unoriented components as possible.
This class was intended to stress the many parts of the algorithm (and of the code)
that are exercised only when multiple unoriented components are present.⁷
⁷ Random reversals only very rarely create an unoriented component, and virtually never create multiple unoriented components.
The third
class of data consisted of hand-picked pairs of permutations representing special cases
unlikely to appear in the other two classes (configurations involving fortresses, long
hurdle chains, double superhurdles, and the like). A total of several thousand pairs
of permutations were produced.
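The scrambling procedure for the first class of test data can be sketched in C. The function names and the exact reversal convention (reverse the order of the chosen span and flip every sign) are ours, chosen for illustration:

```c
#include <stdlib.h>

/* Apply a reversal to positions [i, j) of a signed permutation: the order
 * of the elements is reversed and every sign is flipped. */
static void do_reversal(int *pi, int i, int j)
{
    for (--j; i < j; i++, --j) {
        int t = pi[i]; pi[i] = -pi[j]; pi[j] = -t;
    }
    if (i == j) pi[i] = -pi[i];   /* middle element of an odd-length span */
}

/* Fisher-Yates shuffle of 1..n, then assign random signs. */
static void random_signed_perm(int *pi, int n)
{
    for (int i = 0; i < n; i++) pi[i] = i + 1;
    for (int i = n - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        int t = pi[i]; pi[i] = pi[j]; pi[j] = t;
    }
    for (int i = 0; i < n; i++)
        if (rand() & 1) pi[i] = -pi[i];
}

/* Scramble: copy pi into phi and apply r random reversals. */
static void scramble(const int *pi, int *phi, int n, int r)
{
    for (int i = 0; i < n; i++) phi[i] = pi[i];
    while (r-- > 0) {
        int i = rand() % n, j = rand() % n;
        if (i > j) { int t = i; i = j; j = t; }
        do_reversal(phi, i, j + 1);
    }
}
```

A generator program would call random_signed_perm once per pair, scramble the copy, and print both permutations.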
Correctness testing was performed by comparing the output of program find-
all-sr with that of a control called program find-all-bf. Program find-all-bf
finds all sorting reversals of a permutation π with respect to a permutation φ by brute
force—that is, by considering all (n+1 choose 2) reversals that can be executed on π, finding
the neighbor of π corresponding to each reversal, calculating that neighbor’s distance
from φ, and outputting descriptions of only those reversals that produce neighbors
closer to φ than is π (this approach takes Ω(n³) time). This program directly uses
the well-tested code for inversion distance by Bader, et al. [1], and because it is
also very simple, is believed to be highly reliable. Program find-all-sr was not
confirmed to be correct until on all test cases it produced identical results to program
find-all-bf (aside from the order in which reversals appeared).
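The structure of the brute-force control can be sketched as follows. To keep the sketch self-contained we measure distance to the identity permutation and substitute a simple breakpoint count for the true inversion-distance routine of Bader et al., so bp_dist, apply_rev, and find_all_bf below are illustrative stand-ins only, not the actual find-all-bf code:

```c
#include <string.h>

/* Stand-in metric: breakpoints of a signed permutation of 1..n relative to
 * the identity, using the usual frame elements 0 and n+1.  The real program
 * uses true inversion distance. */
static int bp_dist(const int *pi, int n)
{
    int d = 0, prev = 0;
    for (int i = 0; i <= n; i++) {
        int cur = (i < n) ? pi[i] : n + 1;
        if (cur != prev + 1) d++;
        prev = cur;
    }
    return d;
}

/* Reverse positions i..j (inclusive), flipping signs. */
static void apply_rev(int *pi, int i, int j)
{
    for (; i < j; i++, j--) { int t = pi[i]; pi[i] = -pi[j]; pi[j] = -t; }
    if (i == j) pi[i] = -pi[i];
}

/* Try every reversal of pi and record those that move pi strictly closer;
 * quadratically many candidates, each costing O(n) to evaluate, mirroring
 * the cubic cost of the brute-force approach described in the text. */
static int find_all_bf(const int *pi, int n, int out_i[], int out_j[])
{
    int d = bp_dist(pi, n), cnt = 0, nbr[64];  /* sketch: assume n <= 64 */
    for (int i = 0; i < n; i++)
        for (int j = i; j < n; j++) {
            memcpy(nbr, pi, n * sizeof nbr[0]);
            apply_rev(nbr, i, j);
            if (bp_dist(nbr, n) < d) { out_i[cnt] = i; out_j[cnt] = j; cnt++; }
        }
    return cnt;
}
```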
Performance testing focused on two types of comparisons: the performance of
find-all-sr versus that of find-all-bf, and the performance of find-all-sr
when using the “connected-components” version of the detect algorithm versus that
when using the “bitwise” version. All testing was performed on a SONY laptop with
a 700 MHz Pentium III processor and 128 MB of RAM, running the Linux operating
system (RedHat 7.0). The most extensive testing was performed using test data of
the first class (the one with permutations that had been scrambled by reversals), as
the presence of multiple unoriented components did not seem to change performance
significantly (if anything, the performance of both versions of find-all-sr improved
[relative to reversal distance] for test data of the second class). We ran tests for values
of n between 25 and 100 and values of r between 0% and 100% of n; for each value
[Figure: plot of t (sec) versus r for find-all-bf, find-all-sr1, and find-all-sr2]
Figure 3.10: Running times of programs find-all-bf and find-all-sr for n = 100 and eleven values of r between 0 and 100. Plotted are total times required to process 500 pairs of permutations. Times are plotted for find-all-sr using both the connected-components (find-all-sr1) and bitwise (find-all-sr2) implementations of detect.
of n and r, we recorded the cumulative times required to find all sorting reversals
for each of 500 pairs of permutations.
Figure 3.10 shows results for n = 100 and 0 < r ≤ 100, which are typical of what
we observed. Plots are shown for find-all-bf and both versions of find-all-sr.
As would be expected, the brute-force algorithm shows approximately constant per-
formance⁸ for all values of r, and the implementations of Algorithm 3.2 perform
significantly better⁹ at small r than at large r. What one might not have expected,
however, is that Algorithm 3.2 still performs three to four times as fast as the brute
force approach even when r = n, when one would expect Algorithm 3.2 to be closest
⁸ The slightly better performance seen at small r is probably the result of improvements in the speed of distance calculations due to the presence of numerous trivial components.
⁹ As r increases, so does the average number of pairs of divergent black edges of the same cycle, the dominant factor in the running time of the algorithm.
[Figure: plot of the speedup ratio t_bf/t_sr versus r]
Figure 3.11: Speedup of find-all-sr2 (the version with bitwise detect) with respect to find-all-bf for the same experiment shown in Figure 3.10.
to its worst-case performance of O(n³) or O(n⁴) time (depending on which implemen-
tation of detect is in use). What is also surprising is that the bitwise implementation
of detect outperforms the connected-components implementation consistently, for
all values of r. Indeed, for r ≥ 10, the bitwise implementation results in total ex-
ecution times between 67% and 77% of those achieved with the other. One might
have expected, to the contrary, that the bitwise implementation would become less
efficient as r became large, when the average size of a connected component also
becomes large.
Figure 3.11 plots the speedup of find-all-sr with respect to find-all-bf, so
that the advantage of the former at small r can be observed in more detail. In this
experiment, a speedup of over 400 times is achieved for r ≤ 10. These results are
particularly encouraging because it is small values of r that are most of interest with
respect to our larger goal of solving the reversal median problem.
[Figure: plot of the mean number of sorting reversals versus r]
Figure 3.12: Average number of sorting reversals for the experiment shown in Figure 3.10. Error bars indicate one standard deviation.
Figure 3.12 shows the number of sorting reversals for 0 < r ≤ 100. Notice that
the number begins to rise steeply at about r = 0.4n. When r is close to n, as one
might expect, a significant fraction of all possible reversals are sorting reversals (here
the fraction is about 1/5); but even when r = 0.5n, hundreds of sorting reversals are
possible.
3.7 Summary
In this chapter, we have temporarily abandoned the reversal median problem, and
derived a solution to the problem of finding all sorting reversals of one permutation
with respect to another (which we call ASR). Our hope is that a solution to this
problem will allow us to explore more efficiently the search space of the median
problem. The ASR problem is of considerable interest independent of the median
problem—both theoretical interest, because it requires extending the Hannenhalli-
Pevzner theory in new ways, and practical interest, because an efficient solution
provides a general-purpose tool for exploring the space of genome rearrangements.
The problem is complex and requires a deliberate, step-by-step approach. We have
approached it by classifying all possible reversals, proving under a simplified version
of the problem (the “Fortress-Free Model”) which classes can be sorting reversals
and under what conditions they sort, then reintroducing fortresses and adapting
our results to accommodate them. In this way, we have derived an algorithm that,
while nontrivial, is reasonably understandable, and appears to perform well. We
have shown experimental results indicating a speedup compared to a brute-force
algorithm (the only reasonable alternative known to us) of between about 4 (when
pairwise distances are large) and about 400 (when distances are small). We have
also shown results indicating that the number of sorting reversals is very large—
much larger than one might think from a naive interpretation of Hannenhalli-Pevzner
theory, which focuses on particular subsets of the set of all sorting reversals. When
pairwise distances are large, a significant fraction of all reversals are sorting reversals,
and even when distances are modest, many sorting reversals exist (for n = 100 and
r = 50 the mean number is approximately 100).
As we have derived our solution to ASR, we have introduced several new ideas that
may be useful in solving related problems. For example, we have introduced a simple
graph representation of the separation relationships among unoriented components—
the “hurdle graph”—that allows complex relationships to be described easily and
precisely, in terms of the degrees of corresponding vertices, and whether those vertices
belong to cycles. In part using the hurdle graph, we have characterized several
new classes of unoriented component, including double superhurdles, pseudohurdles,
single protectors, and anchors of hurdle chains. We have also derived an efficient
solution to the problem of detecting whether a reversal introduces an unoriented
component, which turns out to be an important sub-problem of ASR. Our solution
involves extending the method previously described by Bergeron for modeling the
effects of reversals using bit-vector operators and a bitwise representation of the
overlap graph.
Chapter 4
An Improved Algorithm for
Finding an Exact Median
...the abysmal dark
of the unfathomed center...
–David Hartley Coleridge
Now that we have at hand an algorithm to enumerate all sorting reversals, we can
apply it to the median problem in the way described at the beginning of chapter 3:
by restricting our search to the neighbors of intermediate vertices that correspond to
sorting reversals. In this way, we hope to grope our way steadily toward the “abysmal
dark of the unfathomed center”, improving on the much blinder exploration of
the algorithm of chapter 2.
One problem remains, however: what if there is no perfect median? In such a
case, we might simply fall back on the algorithm of chapter 2, but we would prefer
not to do so, especially now that we have seen how inefficient its method is for
finding feasible neighbors of an intermediate vertex. To extend our new method to
the case of non-perfect medians, however, we must be able to enumerate other classes
of reversals besides sorting reversals. As we have noted, in all cases ∆d ∈ {−1, 0, 1};
thus, only three classes of reversals are possible: sorting reversals (∆d = −1), what
we call neutral reversals (∆d = 0), and what we call anti-sorting reversals (∆d = 1).
We will see below that, without too much effort, we can extend our framework from
the previous chapter to enumerate neutral reversals as well as sorting reversals. In
this way, we will enable ourselves to enumerate all three sets, because the set of
anti-sorting reversals is simply the difference between the set of all reversals (which
is easily enumerable) and the union of the sets of neutral and sorting reversals.
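The bookkeeping this implies is simple. With reversals indexed and the two enumerable sets stored as bit masks, the classification by ∆d and the complement construction can be sketched as follows (the enum, helper names, and mask representation are ours):

```c
#include <stdint.h>

/* Every reversal changes d(pi, phi) by exactly -1, 0, or +1, so the class
 * of a reversal is determined by the distances before and after it. */
typedef enum { SORTING = -1, NEUTRAL = 0, ANTI_SORTING = 1 } RevClass;

static RevClass classify(int d_before, int d_after)
{
    return (RevClass)(d_after - d_before);
}

/* With reversals indexed 0..m-1 (m <= 64 in this sketch) and the sorting
 * and neutral sets kept as bit masks, the anti-sorting set needs no
 * enumeration of its own: it is the complement of their union within the
 * set of all reversals. */
static uint64_t anti_sorting(uint64_t all, uint64_t sorting, uint64_t neutral)
{
    return all & ~(sorting | neutral);
}
```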
In this chapter we will use these insights to develop a dramatically improved
algorithm to find an exact reversal median. We will call this algorithm the “sorting”
algorithm, and contrast it with the algorithm of chapter 2, which we will call the
“metric” algorithm. We will begin by precisely characterizing neutral reversals; then
we will overlay our new methods for enumerating reversals on the basic branch-and-
bound framework of the metric algorithm; and finally, we will show experimental
results demonstrating the performance gain of the sorting algorithm with respect to
the metric algorithm.
4.1 Enumerating Neutral Reversals
Let us begin with a precise definition of a neutral reversal.
Definition 4.1 A neutral reversal of a permutation π with respect to a permuta-
tion φ is a reversal ρ such that d(π, φ) = d(ρ(π), φ).
We can establish exactly which reversals are neutral reversals in the same way
that we established which reversals are sorting reversals. As before, we will begin
with the FFM , and we will follow the classification system of Lemma 3.2. We can
use the idea of conservation of distance in this case as well. Here, ∆d = 0, so it must
be true that ∆c = ∆h. As before, ∆c ∈ {−1, 0, 1}.
Lemma 4.1 (Case 1a) Under the FFM , a reversal ρ that acts on two black edges
belonging to the same oriented cycle is a neutral reversal iff either the black edges
are divergent and ρ increases the number of hurdles by one, or the black edges are
convergent and ρ does not change the number of hurdles.
Proof: Either the black edges are divergent and ρ splits the cycle (∆c = 1) or the
black edges are convergent and ρ is neutral with respect to cycle number (∆c = 0).
In the former case, ρ is neutral iff ∆h = 1, and in the latter case ρ is neutral iff
∆h = 0.
Lemma 4.2 (Case 1b) Under the FFM , a reversal ρ that acts on two black edges
belonging to the same unoriented cycle c is a neutral reversal iff c does not belong to
a simple hurdle.
Proof: Because c is unoriented, all of its black edges are convergent; therefore, ∆c =
0, and to be a neutral reversal, ρ must cause ∆h = 0. The component m to which
c belongs is oriented, a protected nonhurdle, or a hurdle; and if it is a hurdle, it
is a superhurdle or a simple hurdle. If m is a simple hurdle, then ∆h = −1, as
shown in Lemma 3.5. In all of the other cases, however, ∆h = 0, and ρ is neutral.
Let us consider each case. If m is oriented, ρ cannot remove a hurdle (because it
only affects m), and it also cannot create a hurdle (ρ cannot create new components
because it acts on convergent black edges; and ρ affects and orients c, so m must
remain oriented); thus, ∆h = 0. If m is a protected nonhurdle, then ρ orients m but
does not change the number of hurdles. Finally, if m is a superhurdle, ρ cuts m and
causes another hurdle to emerge in its place, so ∆h = 0.
Lemma 4.3 (Case 2) Under the FFM , a reversal ρ that acts on two black edges
belonging to different cycles of the same component m is a neutral reversal iff m is
a simple hurdle.
Proof: Because ρ acts on different cycles, ∆c = −1; therefore, to be a neutral reversal,
ρ must cause ∆h = −1. The only way that ρ can reduce the number of hurdles is if
it affects a hurdle; and because it affects only m, we know that m must be a hurdle.
However, m cannot be a superhurdle, or another hurdle would emerge when it was
oriented. Only when m is a simple hurdle will ρ be a neutral reversal.
Lemma 4.4 (Case 3) Under the FFM , a reversal ρ that acts on black edges be-
longing to different components u and v is a neutral reversal iff the following are
true:
1. One of u and v is either a hurdle or a benign component with a separating
hurdle. Without loss of generality, assume that u is this component. Let w
be either u (if u is a hurdle) or the separating hurdle of u (if u is a benign
component).
2. One of the following is true of v:
(a) If w belongs to an anchored hurdle chain, then:
i. v is the anchor of w’s hurdle chain; or
ii. v is a protected nonhurdle that does not belong to w’s hurdle chain; or
iii. v is a benign component that has no separating hurdle, and v is sep-
arated from w by all components in w’s chain; or
iv. v is either a hurdle or a benign component separated by a hurdle, and
v or its separating hurdle forms a double superhurdle with w.
(b) If w does not belong to an anchored hurdle chain (in which case there
exists a single, unanchored hurdle chain), then:
i. w’s chain is of length one, v is a benign component separated by w,
and either u is w or u is a benign component separated from v by
w; or
ii. w’s chain is of length at least two, and v is a protected nonhurdle
adjacent to the other hurdle besides w; or
iii. w’s chain is of length at least two, v is a benign component that has
no separating hurdle, and v is separated from w by every unoriented
component but by no hurdle.
Proof: Because ρ acts on different cycles, ∆c = −1; therefore, to be a neutral reversal,
ρ must cause ∆h = −1. Let us show first that, if u and v meet the criteria of the
Lemma, then ρ results in ∆h = −1 and the reversal is neutral. The hurdle chain of w
must either be anchored or unanchored. If it is anchored, and v meets any of the first
three criteria of case (a)—that is, it is the anchor of w’s chain, a protected nonhurdle
that does not belong to w’s chain, or a benign component as described—then ρ will
eliminate w’s entire hurdle chain but will not completely eliminate any other hurdle
chain; thus, exactly one hurdle will be removed and none will emerge. If the chain is
anchored and v or its separating hurdle forms a double superhurdle with w, then as
shown previously, exactly two hurdles will be removed and exactly one will emerge.
On the other hand, if w’s chain is unanchored, then there is a single, unanchored
hurdle chain, and that chain is either of length one or has a hurdle at either end. If
the chain is of length one and v meets the first criterion under case (b), then ρ will
eliminate the only hurdle. If the chain is of length at least two, then there must exist
a hurdle x opposite w in the chain. If v meets either of the remaining criteria under
case (b) then ρ will eliminate all of the hurdle chain except for x. Thus, in all cases
∆h = −1.
Now let us show that, if ρ is neutral, then u and v must meet the criteria of the
Lemma. To cause ∆h = −1, ρ must orient one hurdle and avoid causing another
hurdle to emerge, or it must orient two hurdles and cause another hurdle to emerge
(it is impossible for ρ to orient more than two hurdles). Suppose that ρ orients
one hurdle and avoids causing another to emerge. Either there is or is not a single,
unanchored hurdle chain. If there is, then ρ must eliminate the whole chain if it is
of length one, or all but one component of the chain if it is of length greater than
one (if more than one unoriented component remained, there would still exist two
hurdles). If ρ eliminates an entire chain of length one, then w must be the only
unoriented component; therefore, v must be a benign component, and u must either
be w or another benign component separated from v by w. In this case, every benign
component will have w as its separating hurdle, so all aspects of the first criterion
under case (b) must be met. If ρ eliminates all members but one of an unanchored
hurdle chain, then v must be separated from w by all unoriented components besides
the hurdle opposite w. If v is unoriented, it must meet the second criterion under
case (b), and if it is benign, it must meet the third. If on the other hand, there is
not a single, unanchored chain, then ρ must eliminate all members of exactly one
hurdle chain (if it eliminated all members of no chain, it would either fail to orient a
hurdle or cause another to emerge; and if it eliminated all members of two chains, it
would orient two hurdles). If ρ eliminates all members of one chain, then, as shown in
Lemma 3.12¹, it must act on one component that meets the criterion for u, and another
that meets one of the first three criteria under case (a) for v. Suppose instead that
ρ orients two hurdles and causes another to emerge. We showed in Lemma 3.7 that
this will occur iff u and v or their separating hurdles form a double superhurdle.
Furthermore, if there is a double superhurdle, there cannot be a single, unanchored
hurdle chain (because a double superhurdle always corresponds to a cycle). Thus we
¹ The arguments of Lemma 3.12 hold as well for a hurdle chain as for a superhurdle chain.
address the fourth criterion under case (a) for v.
The re-introduction of fortresses also can be handled much the same way as in
section 3.3. We will not present results on fortresses here, however, as addressing
them is tedious, and for all practical purposes they are irrelevant to the median
problem (we have never seen a fortress in a real or synthetic data set, unless it was
manually inserted).
Using Lemmata 4.1, 4.2, 4.3, and 4.4, we can easily extend Algorithm 3.2 to
enumerate neutral reversals as well as sorting reversals (notice that we must use
the detect routine for Lemma 4.1). We will not present the details of this exercise
either, as it is straightforward and would only require several pages much like those
of the last chapter. Instead, for the remainder of this section, we will simply assume
a version of find all sorting reversals that returns a pair (S,N), where S is the
set of all sorting reversals and N is the set of all neutral reversals. We will call this
the “augmented version” of Algorithm 3.2.
4.2 The Algorithm
We are now prepared to present the sorting algorithm for the reversal median prob-
lem. Notice that this algorithm (see Algorithm 4.1) is very similar to the metric
algorithm (cf. Algorithm 2.1). The essential differences are that the sorting algo-
rithm finds appropriate neighbors of each vertex using the augmented version of
Algorithm 3.2, and that it runs in two passes; on the first pass it tries quickly to
find a perfect median, and on the second pass it thoroughly searches for an optimal
median.
A few details deserve comment. As mentioned, the algorithm runs in two passes;
when pass = 1 it seeks only a perfect median, and when pass = 2, it visits exactly
the vertices visited by Algorithm 2.1 (but identifies them more efficiently). The set
R contains all reversals that are initial candidates for determining viable neighbors
of an intermediate vertex v. When pass = 1, R contains only reversals that sort
with respect to both ψ1 and ψ2 (step 1), and when pass = 2, R contains all possible
reversals except neutral or sorting reversals with respect to the origin, ψorig (step 2;
such reversals can be ignored in the same way that vertices not farther from the
origin were ignored in step 9 of Algorithm 2.1). A candidate neighbor w is generated
from each candidate reversal ρ; its distance from the origin is known to be one more
than that of v (step 3). The distances of w from ψ1 and ψ2 can be determined from
those of v by examining the membership of ρ in the sets S1, S2, N1, and N2—which
include all sorting and neutral reversals with respect to ψ1 and ψ2 (steps 4 and 5;
note that, during pass 1, ρ will always belong to S1 and S2). Unlike in Algorithm 2.1,
the distances from ψ1 and ψ2 must now be stored with each vertex v in the priority
stack. A flag is set when a perfect median is found, and that flag is checked at the
end of pass 1 (step 6), so that if a perfect median has been found, the algorithm can
avoid its second pass. Note that the algorithm can also exit if a median has been
found that has a score of one more than a perfect median (step 7). The reason is
that, in this case, there can exist no perfect median, so the candidate that has been
found must be minimal. If pass 1 does fail, all marks of vertices must be cleared
before pass 2 begins (step 8; note that, in this case, the priority stack must already
be empty).
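The distance updates of steps 4 and 5 and the two bounds computed for each candidate vertex can be sketched in C as follows. This is a minimal illustration, not the actual implementation; the function and type names are ours.

```c
#include <assert.h>

/* Kind of a reversal rho with respect to a target permutation psi,
 * as determined by membership in the sets S_i and N_i. */
typedef enum { SORTING, NEUTRAL, ANTISORTING } reversal_kind;

/* Steps 4 and 5: derive the distance of neighbor w from that of v. */
static int update_dist(int v_dist_to_psi, reversal_kind k)
{
    switch (k) {
    case SORTING: return v_dist_to_psi - 1; /* rho sorts toward psi  */
    case NEUTRAL: return v_dist_to_psi;     /* distance unchanged    */
    default:      return v_dist_to_psi + 1; /* anti-sorting reversal */
    }
}

/* Lower bound on a median score through w:
 * w_dist + ceil((wd1 + wd2 + dsep) / 2). */
static int w_best(int w_dist, int wd1, int wd2, int dsep)
{
    return w_dist + (wd1 + wd2 + dsep + 1) / 2; /* integer ceiling */
}

/* Upper bound: w_dist plus the cheapest of the three pairwise path sums. */
static int w_worst(int w_dist, int wd1, int wd2, int dsep)
{
    int a = wd1 + dsep, b = wd1 + wd2, c = dsep + wd2;
    int m = a < b ? a : b;
    return w_dist + (m < c ? m : c);
}
```

During pass 1 every candidate reversal belongs to both S1 and S2, so both stored distances decrease by one; only on pass 2 do the neutral and anti-sorting cases arise.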
Input: Three signed permutations of size n: π1, π2, and π3.
Output: An optimal reversal median M .
begin
    Set up d1,2, d1,3, d2,3, Mmin, Mmax, ψorig, ψ1, ψ2, and priority stack s as in Algorithm 2.1;
    Create vertex v with v_label = ψorig, v_dist = 0, v_d1 = d(ψorig, ψ1), v_d2 = d(ψorig, ψ2), v_best = Mmin, v_worst = Mmax;
    push(s, v); M ← ψorig; dsep ← d(ψ1, ψ2); pmed ← false;
    for pass ← 1 to 2 do
        while s is not empty do
            pop(s, v);
            if v_best ≥ Mmax then break;
            (S1, N1) ← find all sorting reversals(v_label, ψ1);
            (S2, N2) ← find all sorting reversals(v_label, ψ2);
1           if pass = 1 then R ← S1 ∩ S2;
            else
                (Sorig, Norig) ← find all sorting reversals(v_label, ψorig);
2               R ← {ρ | ρ is a possible reversal on v_label and ρ ∉ (Sorig ∪ Norig)};
            end
            foreach ρ ∈ R do
3               create vertex w with w_label = ρ(v_label), w_dist = v_dist + 1;
                if w is marked then continue else mark w;
4               if ρ ∈ S1 then w_d1 ← v_d1 − 1; else if ρ ∈ N1 then w_d1 ← v_d1; else w_d1 ← v_d1 + 1;
5               if ρ ∈ S2 then w_d2 ← v_d2 − 1; else if ρ ∈ N2 then w_d2 ← v_d2; else w_d2 ← v_d2 + 1;
                w_best ← w_dist + ⌈(w_d1 + w_d2 + dsep)/2⌉;
                w_worst ← w_dist + min((w_d1 + dsep), (w_d1 + w_d2), (dsep + w_d2));
                if w_worst = Mmin then M ← w_label; pmed ← true; break;
                if w_best < Mmax then push(s, w, w_best);
                if w_worst < Mmax then M ← w_label; Mmax ← w_worst;
            end
        end
        if pass = 1 then
6           if pmed = true then break;
7           else if Mmax = (d1,2 + d1,3 + d2,3)/2 + 1 then break;
8           else clear all marks;
        end
    end
end

Algorithm 4.1: find reversal median sorting
One subtlety is important about step 7. By definition, a perfect median must
have score ⌈(d1,2 + d1,3 + d2,3)/2⌉. If the sum d1,2 + d1,3 + d2,3 is even, then
⌈(d1,2 + d1,3 + d2,3)/2⌉ = (d1,2 + d1,3 + d2,3)/2, and a perfect median indeed must
fall (as we have assumed) on a shortest path between every two of the three
permutations π1, π2, and π3. If that sum is odd, however, then a perfect median
must fall on a shortest path between the members of two pairs of the starting
permutations, but not on a shortest path between the members of the third pair
(otherwise its median score would not be an integer, which is impossible).
Therefore, we can apply the assumption inherent in step 7, namely that there
exists no perfect median, only if the sum is even (hence the use of the term
(d1,2 + d1,3 + d2,3)/2 in step 7, rather than the term ⌈(d1,2 + d1,3 + d2,3)/2⌉).
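This parity condition can be made concrete with a small sketch in C. The helper below is hypothetical (it does not come from the actual program); it tests whether pass 1 may exit early at step 7, given the three pairwise distances and the best median score found so far.

```c
#include <assert.h>

/* Step 7: pass 1 may stop early only when the pairwise-distance sum is
 * even; in that case a candidate scoring one more than a perfect median
 * is provably optimal.  d12, d13, d23 are the pairwise reversal
 * distances and m_max is the best median score found so far. */
static int pass1_can_stop(int d12, int d13, int d23, int m_max)
{
    int sum = d12 + d13 + d23;
    if (sum % 2 != 0)
        return 0;                 /* odd sum: the assumption does not apply */
    return m_max == sum / 2 + 1;  /* one above the perfect-median score */
}
```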
Now we shall formally prove the correctness of the algorithm.
Theorem 4.1 Algorithm 4.1 will return a median M only if M is a median of the
input permutations π1, π2, and π3.
Proof: The algorithm either returns after the first pass or after the second pass. If the
algorithm returns after the first pass, it has found a candidate median M that either
has the lowest possible score (step 6), in which case M is clearly a median, or has a
score one greater than that of a perfect median while the sum of the distances
between every pair of π1, π2, and π3 is even (step 7). In the latter case, a perfect
median would have score (d1,2 + d1,3 + d2,3)/2, and hence would have to lie on a shortest
path between every two of π1, π2, and π3 (otherwise the triangle inequality would
be violated). The algorithm has examined every potential median (according to the
bounds of Theorem 2.1) on shortest paths from ψorig to ψ1, and from ψorig to ψ2, and
has not found a perfect median. Therefore, there can exist no perfect median, and
the best possible median has a score of one greater than that of a perfect median.
Thus, M has the lowest possible score and must be a median.
On the other hand, if the algorithm returns after the second pass, then it has exactly
emulated Algorithm 2.1, which we have shown to be correct. On pass 2, the algo-
rithm differs from Algorithm 2.1 only in the way it computes the distance of each
intermediate vertex w from π1, π2, and π3. It derives the distances of w from those
of v, the neighbor of w. The distance of w with respect to each ψ ∈ {ψorig, ψ1, ψ2}
is set to be one less than that of v iff w is obtained by a sorting reversal of v with
respect to ψ, equal to that of v iff w is obtained by a neutral reversal of v with respect
to ψ, and one more than that of v iff w is obtained by an anti-sorting reversal of v
with respect to ψ. We have shown that the routine find all sorting reversals
correctly enumerates sorting and neutral reversals, and we know that all other re-
versals are anti-sorting reversals. Thus, the distances of w with respect to ψorig, ψ1,
and ψ2 must be correct, and the sorting algorithm on pass 2 must be equivalent to
the metric algorithm.
4.3 Experimental Methods and Results
We implemented Algorithm 4.1 in C, re-using much of the code of program find-reversal median from chapter 2, and calling program find-all-sr from chapter 3 as a subroutine (we augmented program find-all-sr to enumerate neutral reversals). The hashing mechanism and priority stack from program find-reversal median were re-used directly. One other implementation detail is worthy
of comment: we used a fixed array, rather than a hash table, to record the set
membership of each candidate reversal. This approach allows steps 4 and 5 of the
algorithm to be executed in constant time (and is also useful in determining the mem-
bership of R during pass 2); it is possible because the number of reversals cannot
exceed (n+1)², and for our purposes n is rarely more than 100, and certainly never
more than 1000. Because we need to track membership only in sets S1, S2, N1, N2,
and (Sorig ∪ Norig), we can use a single array of 8-bit elements, and accomplish
marking using bit masks.
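A sketch of this lookup structure follows, under the assumption that a reversal is identified by its two endpoints. The names and bit assignments are illustrative, not taken from the actual implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One 8-bit entry per reversal; a single byte records membership in
 * all five sets of interest. */
enum {
    BIT_S1   = 0x01,  /* sorting with respect to psi_1 */
    BIT_N1   = 0x02,  /* neutral with respect to psi_1 */
    BIT_S2   = 0x04,  /* sorting with respect to psi_2 */
    BIT_N2   = 0x08,  /* neutral with respect to psi_2 */
    BIT_ORIG = 0x10   /* in S_orig or N_orig           */
};

typedef struct {
    uint8_t *flags;   /* (n+1)^2 entries, indexed by endpoints */
    int n;
} reversal_sets;

static reversal_sets rs_new(int n)
{
    reversal_sets rs;
    rs.n = n;
    rs.flags = calloc((size_t)(n + 1) * (n + 1), sizeof(uint8_t));
    return rs;
}

/* Map reversal endpoints (i, j), 0 <= i <= j <= n, to an array slot. */
static int rs_index(const reversal_sets *rs, int i, int j)
{
    return i * (rs->n + 1) + j;
}

static void rs_mark(reversal_sets *rs, int i, int j, int bit)
{
    rs->flags[rs_index(rs, i, j)] |= (uint8_t)bit;
}

/* Constant-time membership test used in steps 4 and 5. */
static int rs_member(const reversal_sets *rs, int i, int j, int bit)
{
    return (rs->flags[rs_index(rs, i, j)] & bit) != 0;
}
```

Because the array is zeroed before each vertex expansion and addressed directly by endpoints, each membership query in steps 4 and 5 is a single indexed load and mask.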
Using test data of the same type as described in section 2.4, we compared the
performance of the sorting and metric algorithms. Figure 4.1 shows results for n = 50
and n = 100, with the mean and median execution times plotted over 100 experi-
ments for various values of r. The sorting median clearly is much faster for all values
of r. In addition, its running time appears to grow more slowly as r becomes larger.
Figure 4.2 shows a detailed view of the running time of the sorting algorithm. Notice
that the median time grows extremely slowly and approximately linearly; the mean
time, on the other hand, suddenly shoots up at r = 9 for n = 50 and at r = 12
for n = 100. At these values, the sorting algorithm appears to have to resort to its
second pass more frequently, and when executing that pass it becomes “mired” in
the same way as reported in chapter 2.
Figure 4.3 shows the speedup of the sorting algorithm with respect to the metric
algorithm. A speedup of more than two orders of magnitude is achieved over a range
of values. It is notable that the speedup becomes larger as r increases, although at
a much greater rate for n = 100 than for n = 50. The median and mean speedups
appear to be approximately equivalent.
4.4 Summary
In this chapter, we combine the metric median algorithm of chapter 2 with the
solution of chapter 3 for the “all sorting reversals” problem, and arrive at a much
faster algorithm to find a reversal median. Before addressing the median problem,
however, we must adapt our results of the previous chapter to enumerate neutral
reversals as well as sorting reversals; otherwise, we could not address the case where
a perfect median does not exist. Our final algorithm uses two passes, attempting
on the first to find a perfect median rapidly, and then if the first pass fails, falling
back on a strategy that effectively emulates the metric algorithm, but performs its
branching step much more efficiently. Experimental results indicate that the new
“sorting” algorithm performs dramatically better than the metric algorithm over a
range of parameters, achieving a speedup of roughly one to more than two
orders of magnitude. The new algorithm still has difficulty, however, when distances
between permutations become moderately large (when r reaches 14-18% of n, or
pairwise distances reach about 2r, or 28-36%, of n). At these larger distances, the
algorithm appears to find most medians rapidly, but occasionally becomes “stuck”
(presumably by problems that do not have perfect medians). Interestingly, in these
problematic cases, the algorithm appears usually to find quite good medians on its
first pass; it simply cannot prove that they are optimal. This behavior suggests that
the first pass of the algorithm could serve as an excellent heuristic solution for the
median problem.
Finally, let us comment briefly on the reversal median algorithm of Caprara [8].
This algorithm also uses a branch-and-bound strategy², but one of a completely
different nature: the algorithm performs edge contraction on a multi-breakpoint
graph (a version of the breakpoint graph that captures relationships among more
than two permutations), rather than directly exploring the space of all possible signed
permutations. Unfortunately, at the time of submission of this thesis, a thorough
experimental comparison of our sorting algorithm with Caprara's algorithm has
not yet been possible. We have, however, run preliminary tests that indicate that the
two algorithms behave quite differently. Caprara’s algorithm appears to be much
more sensitive to n than is ours, and much less sensitive to r. As a result, it appears
to be preferable when n is small and r is large, and ours appears to be preferable
when n is large and r is small (note that the latter is perhaps a more interesting
case for phylogenetic analysis). We hope to compare the two comprehensively in
the coming months, and to explore ways in which these different strategies might be
combined.

²Caprara presents both an exact algorithm and a heuristic algorithm for the problem; the exact algorithm uses branch-and-bound.
[Plots omitted: execution time t (msec) versus r, with one panel for n = 50 (t up to 1400 msec, r = 1 to 9) and one for n = 100 (t up to 4000 msec, r = 2 to 14), each showing four curves: metric (mean), metric (median), sorting (mean), and sorting (median).]

Figure 4.1: Comparison of metric and sorting median algorithms for n = 50 and n = 100 over various values of r. Plotted are mean and median execution time in milliseconds. Each point corresponds to 100 experiments. The metric algorithm did not complete for n = 50, r > 8 or n = 100, r > 6.
[Plot omitted: execution time t (msec, up to 120) versus r (2 to 14), showing mean and median curves for n = 50 and n = 100.]

Figure 4.2: Detailed view of performance of sorting algorithm. Experiments are as described in Figure 4.1.
[Plot omitted: speedup t_metric/t_sorting (up to 250) versus r (1 to 8), showing mean and median curves for n = 50 and n = 100.]

Figure 4.3: Speedup of sorting algorithm with respect to metric algorithm. Experiments are as described in Figure 4.1.
References
[1] D.A. Bader, B.M.E. Moret, and M. Yan, A linear-time algorithm for computing inversion distance between signed permutations with an experimental study, Algorithms and Data Structures: Seventh International Workshop, WADS 2001, Brown University, Providence, RI, August 8-10, 2001, Proceedings (F. Dehne, J.-R. Sack, and R. Tamassia, eds.), Lecture Notes in Computer Science, vol. 2125, Springer, 2001, pp. 365–376.

[2] A. Bergeron, A very elementary presentation of the Hannenhalli-Pevzner theory, Combinatorial Pattern Matching, 12th Annual Symposium, CPM 2001, Jerusalem, Israel, July 1-4, 2001, Proceedings (Amihood Amir and Gad M. Landau, eds.), Lecture Notes in Computer Science, vol. 2089, Springer, 2001, pp. 106–117.

[3] A. Bergeron and F. Strasbourg, Experiments in computing sequences of reversals, Algorithms in Bioinformatics, First International Workshop, WABI 2001, Aarhus, Denmark, August 28-31, 2001, Proceedings (Olivier Gascuel and Bernard M. E. Moret, eds.), Lecture Notes in Computer Science, vol. 2149, Springer, 2001, pp. 164–174.

[4] P. Berman and S. Hannenhalli, Fast sorting by reversal, Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching (D. Hirschberg and E. Myers, eds.), 1996, pp. 168–185.

[5] M. Blanchette, T. Kunisawa, and D. Sankoff, Parametric genome rearrangement, Gene 172 (1996), GC11–GC17.

[6] A. Caprara, Sorting by reversals is difficult, Proceedings of the First Annual International Conference on Computational Molecular Biology (RECOMB-97) (S. Istrail, P.A. Pevzner, and M.S. Waterman, eds.), 1997, pp. 75–83.
[7] A. Caprara, Formulations and complexity of multiple sorting by reversals, Proceedings of the Third Annual International Conference on Computational Molecular Biology (RECOMB-99), Lyon, France (S. Istrail, P.A. Pevzner, and M.S. Waterman, eds.), April 1999, pp. 84–93.
[8] A. Caprara, On the practical solution of the reversal median problem, Algorithms in Bioinformatics, First International Workshop, WABI 2001, Aarhus, Denmark, August 28-31, 2001, Proceedings (Olivier Gascuel and Bernard M. E. Moret, eds.), Lecture Notes in Computer Science, vol. 2149, Springer, 2001, pp. 238–251.

[9] L.L. Cavalli-Sforza and A.W.F. Edwards, Phylogenetic analysis: models and estimation procedures, Am. J. Hum. Genet. 19 (1967), 233–257.

[10] J. Dicks, Chromtree: Maximum likelihood estimation of chromosomal phylogenies, Comparative Genomics (D. Sankoff and J.H. Nadeau, eds.), Kluwer Academic Press, 2000.

[11] T. Dobzhansky and A.H. Sturtevant, Inversions in the chromosomes of Drosophila pseudoobscura, Genetics 23 (1938), 28–64.

[12] R.V. Eck and M.O. Dayhoff, Atlas of protein sequence and structure, Natl. Biomed. Res. Found., Washington, DC, 1966.

[13] J. Felsenstein, Maximum-likelihood and minimum-step methods for estimating evolutionary trees from data on discrete characters, Syst. Zool. 22 (1973), 240–249.

[14] W. Fitch, On the problem of discovering the most parsimonious tree, Am. Nat. 111 (1977), 223–257.

[15] W.H. Gates and C.H. Papadimitriou, Bounds for sorting by prefix reversals, Discrete Mathematics 27 (1979), 47–57.

[16] D.S. Goldberg, S. McCouch, and J. Kleinberg, Algorithms for constructing comparative maps, Comparative Genomics (D. Sankoff and J.H. Nadeau, eds.), Kluwer Academic Press, 2000.

[17] S. Hannenhalli, C. Chappey, E.V. Koonin, and P.A. Pevzner, Genome sequence comparison and scenarios for gene rearrangements: A test case, Genomics 30 (1995), 299–311.

[18] S. Hannenhalli and P.A. Pevzner, Transforming cabbage into turnip (polynomial algorithm for sorting signed permutations by reversals), Proceedings of the 27th Annual ACM Symposium on the Theory of Computing, 1995, pp. 178–189.
[19] S. Hannenhalli and P.A. Pevzner, Transforming men into mice (polynomial algorithm for genomic distance problem), Proceedings of the 36th Annual IEEE Symposium on Foundations of Computer Science, 1995, pp. 581–592.

[20] H. Kaplan, R. Shamir, and R.E. Tarjan, A faster and simpler algorithm for sorting signed permutations by reversals, SIAM Journal of Computing 29 (1999), no. 3, 880–892.

[21] A. McLysaght, C. Seoighe, and K.H. Wolfe, High frequency of inversions during eukaryote gene order evolution, Comparative Genomics (D. Sankoff and J.H. Nadeau, eds.), Kluwer Academic Press, 2000, pp. 47–58.

[22] B.M.E. Moret, L.-S. Wang, T. Warnow, and S.K. Wyman, New approaches for reconstructing phylogenies from gene-order data, Bioinformatics 17 (2001), S165–S173. Presented at the Ninth International Conference on Intelligent Systems for Molecular Biology, ISMB-2001.

[23] J.H. Nadeau and B.A. Taylor, Lengths of chromosomal segments conserved since divergence of man and mouse, Proceedings of the National Academy of Sciences 81 (1984), 814–818.

[24] Pavel A. Pevzner, Computational molecular biology: An algorithmic approach, MIT Press, 2000.

[25] N. Saitou and M. Nei, The neighbor-joining method: a new method for reconstructing phylogenetic trees, Journal of Molecular Biology 4 (1987), 406–425.

[26] D. Sankoff, Genome rearrangements with gene families, Bioinformatics 15 (1999), 909–917.

[27] D. Sankoff and M. Blanchette, Multiple genome rearrangement and breakpoint phylogeny, Journal of Computational Biology 5 (1998), no. 3, 555–570.

[28] D. Sankoff, D. Bryant, M. Deneault, B.F. Lang, and G. Burger, Early eukaryote evolution based on mitochondrial gene order breakpoints, Journal of Computational Biology 7 (2000), no. 3/4, 521–535.

[29] D. Sankoff and N. El-Mabrouk, Duplication, rearrangement and reconciliation, Comparative Genomics (D. Sankoff and J.H. Nadeau, eds.), Kluwer Academic Press, 2000, pp. 537–550.

[30] D. Sankoff and N. El-Mabrouk, Genome rearrangement, Topics in Computational Biology (T. Jiang, Y. Xu, and M. Zhang, eds.), MIT Press, 2001.
[31] D. Sankoff, V. Ferretti, and J.H. Nadeau, Conserved segment identification, Journal of Computational Biology 4 (1997), no. 4, 559–565.

[32] D. Sankoff, G. Leduc, N. Antoine, B. Paquin, and B.F. Lang, Gene order comparisons for phylogenetic inference: Evolution of the mitochondrial genome, Proceedings of the National Academy of Sciences 89 (1992), 6575–6579.

[33] D. Sankoff and J.H. Nadeau (eds.), Comparative genomics, Kluwer Academic Press, 2000.

[34] J. Setubal and J. Meidanis, Introduction to computational molecular biology, PWS Publishing, 1997.

[35] A.C. Siepel and B.M.E. Moret, Finding an optimal inversion median: Experimental results, Algorithms in Bioinformatics, First International Workshop, WABI 2001, Aarhus, Denmark, August 28-31, 2001, Proceedings (Olivier Gascuel and Bernard M. E. Moret, eds.), Lecture Notes in Computer Science, vol. 2149, Springer, 2001, pp. 189–203.

[36] D. Waddington, Estimating the number of conserved segments between species using a chromosome-based model, Comparative Genomics (D. Sankoff and J.H. Nadeau, eds.), Kluwer Academic Press, 2000.

[37] G.A. Watterson, W.J. Ewens, and T.E. Hall, The chromosome inversion problem, Journal of Theoretical Biology 99 (1982), 1–7.