Exact Algorithms for the Reversal Median Problem

by

Adam C. Siepel

B.S., Cornell University, 1994

THESIS

Submitted in Partial Fulfillment of the

Requirements for the Degree of

Master of Science

Computer Science

The University of New Mexico

Albuquerque, New Mexico

December, 2001

© 2001, Adam C. Siepel

Acknowledgments

Many people have supported me, directly or indirectly, in the months of research, programming, and writing that went into this thesis, but several deserve special recognition. Of these, my adviser, Bernard Moret, has been the most directly involved. His encouragement has been more important than he is likely to realize; it has many times given me the confidence to face another long weekend of solitary writing or programming. The other members of my committee, David Sankoff and David Bader, also deserve special thanks. David Sankoff more than anyone else has created the field to which I hope to contribute with this work. Before I took his seminar at the University of New Mexico (taught while he was there on sabbatical), I knew neither what a reversal nor a median was, let alone that the two together posed a problem. David Bader has always been generous and encouraging with me, and I have benefited enormously from his classes. Without the use of his flawless (and very fast) code for computing inversion distance, my first naive attempts at the reversal median problem would likely have ended in frustration and failure. My wife, Amber, has been the least, yet the most, involved of all in this project. It is she who has kept the life we share moving along, when I have been too busy to do anything but work; who has endured my long periods of silent contemplation of matters that, to her humanist mind, were utterly devoid of interest; and who has still managed to feign believable enthusiasm when discussing a new species of hurdle. I cannot neglect to mention my colleagues at the National Center for Genome Resources, and my supervisor, Bill Beavis, who have been exceptionally accommodating of my sometimes irregular schedule. Finally, as I so rarely think to do, I acknowledge my parents, Timothy and Virginia Siepel; they are responsible for instilling in me the inquisitive nature and appetite for learning that are required to sustain even the humblest attempt at research.


by

Adam C. Siepel

B.S., Cornell University, 1994

M.S., Computer Science, University of New Mexico, 2001

Abstract

While most work in computational molecular biology since its inception in the

1970s has focused on problems involving DNA and amino acid sequences, there has

been growing interest during the past decade in the use of alternative models of

molecular evolution that are based on the order and content of genes in complete

genomes. Phylogeny reconstruction using gene order is of particular interest, and

promises to offer improved results in cases where sequence-based methods perform

poorly, such as when taxa are distant or sequences mutate rapidly. Evolutionary

distance between genomes can be estimated as the minimum number of inversions

required to transform one into the other, a measure that can be computed efficiently

as the reversal distance between signed permutations. Much progress has been made

in recent years on this problem and the related one of finding a minimum sequence

of sorting reversals, but the application of these techniques to phylogeny so far has

been limited to distance-based reconstruction methods. An alternative method of

reconstruction, presented by Sankoff and Blanchette, uses a Steinerization process to

establish an optimal tree in which internal nodes are labeled with the configurations

of ancestral genomes. This method relies on repeatedly finding “medians” of three

signed permutations. Medians have previously been computed using a heuristic

called “breakpoint distance” rather than the more precise reversal distance, largely

because no efficient solution existed to the reversal median problem. In this thesis, we

derive in three stages an efficient algorithm to find a reversal median of three signed

permutations. In the first stage, we develop a simple branch-and-bound algorithm

that uses only the metric properties of the distance measure; in the second stage, we

derive a solution to the problem of finding all sorting reversals of one permutation

with respect to another; and in the third stage, we synthesize from the results of the

first two a dramatically improved algorithm for the median problem.

Contents

List of Figures

1 Introduction

1.1 Computing Genomic Distance

1.2 Phylogeny Reconstruction

1.3 Reconstruction Using Medians of Three

1.4 The Reversal Median Problem

2 A Simple Algorithm for Finding an Exact Median

2.1 Notation and Definitions

2.2 Bounds

2.3 The Algorithm

2.4 Experimental Method

2.5 Experimental Results

2.5.1 Performance of Bounds

2.5.2 Running Time and Parallel Speedup

2.5.3 Reversal Medians vs. Breakpoint Medians

2.5.4 Preponderance of Perfect Medians

2.6 Summary

3 Finding All Sorting Reversals

3.1 Notation and Definitions

3.2 Sorting Reversals in the Absence of Fortresses

3.3 Accommodating Fortresses

3.4 The Algorithm

3.5 Detecting New Unoriented Components

3.5.1 A Simple Linear-Time Solution

3.5.2 The Effect of a Reversal on the Overlap Graph

3.5.3 A Bitwise Algorithm

3.6 Experimental Methods and Results

3.7 Summary

4 An Improved Algorithm for Finding an Exact Median

4.1 Enumerating Neutral Reversals

4.2 The Algorithm

4.3 Experimental Methods and Results

4.4 Summary

References

List of Figures

1.1 Representation of distance problem with signed permutations

1.2 Illustration of median-based phylogeny reconstruction

2.1 Illustration of reversal graph

2.2 Illustration of global lower bound for median

2.3 Illustration of bounds for intermediate vertex

2.4 Number of vertices visited while finding a median

2.5 Statistical median of V for constant r

2.6 Statistical median of V for constant n

2.7 Sequential and parallel running times

2.8 Comparison of various medians

2.9 Distribution of number of medians

2.10 Percentage of medians that are perfect

3.1 Illustration of idea for sorting median

3.2 Example breakpoint graph and overlap graph

3.3 Ways to cut a hurdle and ways to merge hurdles

3.4 Example of a double superhurdle

3.5 Example of a sorting reversal that merges two benign components

3.6 Example hurdle graph

3.7 Example showing effect of a reversal on overlap of edges

3.8 The effect of a reversal on the breakpoint and overlap graphs

3.9 The effect of an arbitrary reversal on the bit matrix

3.10 Running times of programs find-all-bf and find-all-sr

3.11 Speedup of find-all-sr

3.12 Average number of sorting reversals

4.1 Comparison of metric and sorting median algorithms

4.2 Performance of sorting algorithm (detail)

4.3 Speedup of sorting algorithm with respect to metric algorithm

Chapter 1

Introduction

Recent years have seen numerous advancements in the mathematical and computa-

tional study of genome rearrangements—those processes that change the way seg-

ments of chromosomal DNA are organized into complete genomes (see summaries in

[34, 24, 33, 30]). Because genome rearrangements become evident through a compar-

ative analysis of contemporary genomes, this sub-field of computational molecular

biology has been called “comparative genomics” [33]¹. Comparative genomics uses

differences in the gene order and gene content of the genomes of related organisms

as a source of information about mechanisms of molecular change during evolution-

ary history and about the evolutionary relationships among organisms. Applications

include constructing comparative genomic maps [16], reconstructing phylogenetic re-

lationships among organisms [32, 27, 28, 22], estimating the lengths or boundaries of

homologous chromosomal segments [23, 31, 16, 36], and estimating the relative fre-

quencies of mechanisms of genome rearrangement [21]. As the emphasis of genomic

science shifts from individual genes to whole genomes, the methods of comparative

genomics are expected only to become of greater interest.

¹ Note that the same term has been used in a broader sense, to suggest any sort of comparison of genomic data from different organisms.

Computer scientists and mathematicians working in computational molecular bi-

ology since its inception in the early 1970s have focused largely on the analysis

of DNA and amino acid sequences. Problems that have received the most atten-

tion include similarity searching (searching a database of sequences for the closest

matches to a query sequence), multiple alignment (aligning homologous regions of a

set of sequences, and placing “gaps” to indicate likely insertions or deletions), con-

tig assembly (building a contiguous sequence from a set of overlapping fragments),

gene prediction (finding segments of genomic DNA believed to code for proteins),

and phylogeny reconstruction (hypothesizing the evolutionary relationships among

organisms from comparisons of their sequences). In general, solutions to these prob-

lems are based on models of evolutionary change. For example, to compute a useful

“distance” between two sequences (as one often does in similarity searching, multi-

ple sequence alignment, and phylogeny reconstruction), one must assume a model of

sequence mutation over evolutionary time. The mutation of sequences, however, is

only one aspect of molecular evolution; at a level above the genes, chromosomes split

and combine, whole genes are duplicated and lost, and genes are rearranged in order.

Comparative genomics, by characterizing genomes in terms of gene content and or-

der, provides an alternative framework for modeling evolution—one that allows for

phenomena, such as gene loss, gene duplication, and genome rearrangement, that are

not easily accommodated with sequence-based approaches, and one that provides the

means to measure evolutionary change at a much greater time scale (because changes

to gene order and content occur much less frequently than mutations at the sequence

level).

Comparative genomics is being used both to address new problems and to enable

new approaches to old problems. Perhaps the best example of the latter use is

phylogeny reconstruction, which can be performed according to comparisons of gene

content and gene order instead of according to sequence comparison. Performing

phylogeny reconstruction at the genome, rather than the sequence, level promises to

be particularly valuable in cases where taxa are extremely distant or where sequences

are evolving rapidly [24] (in these cases, sequence-based methods perform poorly

because of the problem of “saturation” in edit distances).

The subject of this thesis, the “reversal median problem”, is related to a method

used to perform phylogeny reconstruction according to gene order. Below we will

briefly outline the background of the problem.

1.1 Computing Genomic Distance

One of the most fundamental computational problems in comparative genomics,

which must be solved before many higher level problems can be attacked, is to com-

pute the “distance” between two genomes. The idea is to come up with a measure,

based on gene order and gene content, that reflects as closely as possible the evo-

lutionary distance of the given organisms. The challenge is to find a measurement

that is biologically meaningful yet efficiently computable.

To be realistic, a measurement should reflect several known mechanisms of ge-

nomic rearrangement. In the case of single-chromosome genomes (such as those of

prokaryotes, chloroplasts, and mitochondria), these mechanisms include the following:

• inversion: A section of a chromosome is excised, reversed in orientation, and

re-inserted.

• transposition: A section of a chromosome is excised and inserted at a new

position in the chromosome, without changing orientation.

• inverted transposition: Exactly like transposition, except that the trans-

posed segment changes orientation.

• gene duplication: A section of a chromosome is duplicated, so that multiple

copies exist of every gene in that section.

• gene loss: A section of a chromosome is excised and lost, so that all of its

genes are effectively deleted from the organism.

If a genome has multiple chromosomes, then transposition and inverted transpo-

sition can occur between chromosomes. In this case, the following are also possible:

• translocation: The end of one chromosome is broken off and attached to the

end of another chromosome.

• fission: A chromosome splits and becomes two chromosomes.

• fusion: Two chromosomes combine and become one chromosome.
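The single-chromosome mechanisms listed above can be illustrated on a signed gene list. The following is a minimal sketch, assuming a genome is a Python list of nonzero integers whose sign encodes orientation; the function names and the 0-based, inclusive index convention are illustrative choices, not notation from this thesis.

```python
def inversion(genome, i, j):
    """Excise genome[i..j], reverse its order and orientation, and re-insert it."""
    segment = [-g for g in reversed(genome[i:j + 1])]
    return genome[:i] + segment + genome[j + 1:]


def transposition(genome, i, j, k):
    """Excise genome[i..j] and insert it, unchanged, at index k of what remains."""
    segment = genome[i:j + 1]
    rest = genome[:i] + genome[j + 1:]
    return rest[:k] + segment + rest[k:]


def inverted_transposition(genome, i, j, k):
    """Like transposition, except that the moved segment changes orientation."""
    segment = [-g for g in reversed(genome[i:j + 1])]
    rest = genome[:i] + genome[j + 1:]
    return rest[:k] + segment + rest[k:]


genome = [1, 2, 3, 4, 5]
print(inversion(genome, 1, 3))                  # [1, -4, -3, -2, 5]
print(transposition(genome, 1, 2, 2))           # [1, 4, 2, 3, 5]
print(inverted_transposition(genome, 1, 2, 2))  # [1, 4, -3, -2, 5]
```

Gene duplication and gene loss change gene content, so they fall outside the equal-content permutation model used in the rest of this sketch.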

In early attempts to compute genomic distance (e.g., [32]), investigators found

that the problem was quite difficult, even when one considered only a few mechanisms

(in [32], transpositions, insertions, deletions, and inversions were considered). It be-

came evident that the genomic distance problem was algorithmically much harder,

and likely to be computationally more intensive, than computing edit distances be-

tween sequences. As a result, Sankoff, et al. began to use a heuristic for genomic

distance called “breakpoint distance”, which applies when two genomes each con-

tain a single instance of each of n genes. The breakpoint distance is the number

of pairs of genes that are adjacent in one genome but not in the other (the mea-

sure is symmetric). Breakpoint distance is rapidly computable and appears to be

a reasonable estimator of genomic distance, but it is not directly correlated to any

mechanism of rearrangement (all mechanisms can create or remove breakpoints, but

no rearrangement-based measure of distance can be determined precisely from

breakpoint distance)².
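As an illustration of why breakpoint distance is rapidly computable, the sketch below counts breakpoints between two signed permutations in linear time. It rests on assumptions that are not spelled out in the text above: genomes are linear, each permutation is framed by 0 and n+1, and an adjacency (a, b) is treated as identical to (−b, −a), because reading a chromosome from the other strand does not change which genes are neighbors. The function names are hypothetical.

```python
def framed_pairs(perm):
    """Consecutive pairs of a permutation framed by 0 and n + 1."""
    framed = [0] + list(perm) + [len(perm) + 1]
    return list(zip(framed, framed[1:]))


def adjacency_set(perm):
    """All adjacencies of perm, stored in both reading directions."""
    adjacencies = set()
    for a, b in framed_pairs(perm):
        adjacencies.add((a, b))
        adjacencies.add((-b, -a))
    return adjacencies


def breakpoint_distance(p, q):
    """Number of adjacencies of p that do not occur in q."""
    q_adjacencies = adjacency_set(q)
    return sum(1 for pair in framed_pairs(p) if pair not in q_adjacencies)


pi1 = (1, 2, 3, 4, 5)
pi2 = (-2, -1, -5, -4, -3)
print(breakpoint_distance(pi1, pi2))  # 3
print(breakpoint_distance(pi2, pi1))  # 3 (the measure is symmetric)
```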

In the early 1990s, Pavel Pevzner, Vineet Bafna, and Sridhar Hannenhalli chose

instead to distill the distance problem to what may be its simplest useful formulation

directly based on known mechanisms of rearrangement: to find the minimum number

of inversions necessary to transform one genome into another. They worked also with

two genomes each containing a single instance of n genes, but in their formulation of

the distance problem, one configuration of genes must be transformed to the other

using only the mechanism of inversion. This problem can be restated mathematically

as that of finding the minimum number of reversals required to “sort” a permutation

of size n—i.e., to convert it to the identity permutation (note that two arbitrary

permutations can always be mapped such that one of them is represented as the

identity permutation). The problem of “sorting by reversals”³ can be seen as a

generalization of a problem known as “sorting by prefix reversals”, which incidentally

was studied (but not solved) in the late 1970s by a Harvard undergraduate named

W.H. Gates, who later discovered an interest in operating systems [15]. Circular

and non-circular versions of the distance problem can be defined, corresponding to

circular and non-circular genomes (it turns out to be easy to transform one version

into the other). Modeling rearrangement distance with reversal distance is supported

by reports in the biological literature that inversions are the primary mechanism

of genome rearrangement for many genomes [21, 5]. The idea of using a distance

measure based only on inversions was foreshadowed by observations as early as the

1930s that differences in gene order could be explained by sequences of chromosomal

inversions [11]. In the 1980s, Watterson, et al. explicitly proposed using inversion

distance as a measure of evolutionary distance that could be useful for phylogeny

reconstruction, noting that inversion distance was a true metric [37].

² Note that some have considered this property to be an advantage, because little is known about the relative likelihoods of alternative mechanisms of rearrangement.

³ Note that the problems of reversal distance and of sorting by reversals are subtly different; it turns out one can compute reversal distance without actually finding a sequence of sorting reversals.

[Figure 1.1 (diagram not reproduced): two genomes containing genes labeled A through E, represented as π1 = (+1,+2,+3,+4,+5) and π2 = (−2,−1,−5,−4,−3).]

Figure 1.1: If the orientation of each gene in each genome is known, then the problem of finding the inversion distance between genomes is equivalent to the problem of finding the reversal distance between signed permutations.

If only the orders of genes are known, then the problem of finding inversion dis-

tance is equivalent to finding the reversal distance of unsigned permutations. In

this case, for example, the distance between the permutations (1, 3, 2) and (1, 2, 3)

is a single reversal. If the orientations of genes are also known, however, then the

problem can be modeled with signed permutations (orientation is taken to be rep-

resented by direction of transcription). The distance between the signed permuta-

tions (+1,+3,+2) and (+1,+2,+3) is three [e.g., (+1,+3,+2) → (+1,−2,−3) →

(+1,+2,−3) → (+1,+2,+3)]. With unsigned permutations, both reversal distance

and sorting by reversals are NP-Hard [6], but Hannenhalli and Pevzner showed that

the signed versions of the problems, surprisingly, can be solved in polynomial time

[18]. Their solution is the capstone of a baroque edifice of combinatorial theory that

has become known as the Hannenhalli-Pevzner cycle-decomposition theory. The

Hannenhalli-Pevzner theory describes the relationship between two signed permuta-

tions with a particular kind of diagram (a “breakpoint graph”), captures relation-

ships between “cycles” in that diagram in an interleaving graph, classifies certain

connected components in the interleaving graph (“oriented” and “unoriented” com-

ponents) and relationships among connected components (“hurdles”, “superhurdles”,

and “fortresses”), and establishes numerous useful properties about these diagrams

and graphs.
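The small examples above (distance one for the unsigned permutations, distance three for the signed ones) can be checked by exhaustive search. The sketch below is a brute-force breadth-first search over permutations, feasible only for very small n; it is emphatically not the polynomial-time Hannenhalli-Pevzner method, and the function names and 0-based indices are assumptions made for illustration.

```python
from collections import deque


def reverse_signed(perm, i, j):
    """Reverse perm[i..j] (0-based, inclusive) and flip the sign of each element."""
    return perm[:i] + tuple(-x for x in reversed(perm[i:j + 1])) + perm[j + 1:]


def reverse_unsigned(perm, i, j):
    """Reverse perm[i..j] (0-based, inclusive) without changing any sign."""
    return perm[:i] + tuple(reversed(perm[i:j + 1])) + perm[j + 1:]


def bfs_distance(source, target, reverse):
    """Minimum number of reversals from source to target, by exhaustive search."""
    source, target = tuple(source), tuple(target)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        if current == target:
            return dist[current]
        n = len(current)
        for i in range(n):
            for j in range(i, n):
                nxt = reverse(current, i, j)
                if nxt not in dist:
                    dist[nxt] = dist[current] + 1
                    queue.append(nxt)
    raise ValueError("target is not reachable from source")


print(bfs_distance((1, 3, 2), (1, 2, 3), reverse_unsigned))  # 1
print(bfs_distance((1, 3, 2), (1, 2, 3), reverse_signed))    # 3
```

The search space grows as n! for unsigned and 2^n · n! for signed permutations, which is why the efficient distance algorithms cited below matter in practice.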

It is possible that no class of problems in computational biology has exerted a

stronger pull on theoretically-inclined computer scientists than those involving sort-

ing by reversals of signed permutations. During the past six years, several improved

algorithms have been produced for both the sorting problem and the distance prob-

lem [4, 20, 1, 2]. The fastest standing algorithms are those of Kaplan, et al. [20]

for the sorting problem (O(n²) time, where n is the permutation size) and of Bader,

et al. [1] for the distance problem (O(n) time). Recently, Bergeron has shown an

alternative method for sorting by reversals [2], also in O(n²) time, which is in many

ways simpler than that of Kaplan, et al.

In addition, some progress has been made on computing distances that take

into account other mechanisms of rearrangement. Hannenhalli and Pevzner pub-

lished an algorithm that finds the distance between multiple-chromosome genomes

in terms of equally-weighted translocations, fissions, fusions, and inversions, essen-

tially by reducing this problem to that of finding the inversion distance between two

single-chromosome genomes [19]. In addition, Sankoff has developed a method that

accommodates multiple members of gene families [26], and thus allows for duplica-

tion. No fast algorithm, however, has yet been published to compute exact distance

in terms of transpositions.

1.2 Phylogeny Reconstruction

Most methods for phylogeny reconstruction were originally developed for use with

sequence data (including sequences of morphological “characters”, as well as DNA

and amino acid sequences). These methods generally begin with a multiple align-

ment of N sequences (representing N taxa) and produce a binary tree describing

the evolutionary relationships among the sequences—that is, a branching pattern

through evolution that is likely to have allowed an ancestral sequence to give rise

to all of the observed, contemporary sequences (the observed sequences appear as

leaves of the tree). Note that, strictly speaking, the directed graph of evolutionary

history is not a tree; to the contrary, we know of many processes, such as horizontal

transfer and hybridization, that can allow a given leaf (an observed species) to have

more than one path to the root (the hypothetical common ancestor). Nevertheless,

a tree is believed to provide a reasonable approximation of the evolutionary history

of most sets of species. A reconstructed tree is important both in its topology and

in its branch lengths.

In general, algorithms for phylogeny reconstruction return one or more trees

that are optimal according to some appropriate cost function. The algorithms differ

primarily in the nature of their cost functions. The most widely-used methods can

be classified into three categories:

• Distance-based methods build trees that best fit the pairwise distances of

all sequences (usually in the sense of minimizing the sum of the lengths of all

tree edges). Distance between sequences is generally computed as a kind of

“edit distance” (e.g., a minimum cost of insertions/deletions and substitutions

required to transform one sequence into the other). The internal nodes of

distance-based trees have no biological meaning associated with them; they

simply represent abstract points in a high-dimensional space. Trees can be

computed much faster than with other methods (generally in polynomial time).

Distance-based methods are currently dominated by the “neighbor-joining”

algorithm of Saitou and Nei [25].

• Maximum parsimony methods build trees by attempting to find the least

costly pathways connecting the “character states” represented at nodes in the

tree (internal and external). These methods are preferable to distance-based

methods for some applications because they associate actual sequences with

internal tree nodes (so-called “hypothetical ancestors”). Maximum parsimony

was pioneered by Eck and Dayhoff [12] and adapted for DNA sequences by

Fitch [14].

• Maximum likelihood methods strive to find the most likely of all possible

trees according to a well-defined probabilistic model. Like maximum parsimony

methods, these methods label (or can be made to label) internal nodes with

ancestral sequences. They tend to be computationally intensive, however, and as

yet are only feasible for relatively small sets of sequences. Maximum likelihood

methods were first applied to phylogeny reconstruction by Cavalli-Sforza and

Edwards [9], and were first adapted for use with DNA and amino acid sequences

by Felsenstein [13].

Analogs based on gene order⁴, rather than sequences, have been developed for all

three of these classes. Distance-based methods can be used without alteration for

gene-order data, since they separate the computation of a distance matrix from the

construction of a tree; one only needs to compute pairwise distances using a measure

based on gene order. Maximum parsimony methods do not exist by the same name,

but we will discuss an approach below, called “median-based reconstruction”, that

is effectively analogous to them. At least one maximum likelihood method has been

proposed [10], but it requires enormous computation time, and appears to have to

resort to fairly drastic pruning strategies to solve problems of reasonable size. As

a result, tree-building methods for gene-order data are effectively limited, for the

present, to distance- and median-based reconstruction.

⁴ Most phylogeny reconstruction at the genome, rather than the sequence, level has focused on comparison of gene order, rather than of gene content. The study of Sankoff and El-Mabrouk [29] is an exception, and in its synthesis of methods for phylogeny reconstruction with Sankoff’s approach to handling multiple gene copies [26], suggests a promising avenue for further exploration.

1.3 Reconstruction Using Medians of Three

The median-based method for phylogeny reconstruction was first proposed by Sank-

off and Blanchette [27]. Given a set of N signed permutations, each of size n, they

sought to construct an optimal tree such that each node was labeled with a signed

permutation, and the input permutations appeared as leaves of the tree. In this

way, as with maximum parsimony, they would ensure that internal nodes retained

biological meaning, and that edges between nodes represented transitions between

actually achievable states of the genome (note, in contrast, that with a distance-

based method, there is no guarantee of the existence of an internal node having

distances to its neighbors as hypothesized). The problem, then, was to find Steiner

points in the space of genome rearrangements (the algorithm has been described

as the “Steinerization algorithm” [30]). Their idea was to build a global solution

by aggregating local solutions for the simplest possible version of the problem: to

find a Steiner point of three genomes, that is, a permutation π such that the sum of

the distances between π and each of the starting genomes is minimized. They called

such a point a “median of three”, or simply a “median”. After an initialization

step (which can be executed in various ways), their algorithm iterates over a tree,

repeatedly resetting the permutations of internal nodes to medians of their three

neighbors (the tree is always binary). It continues until convergence occurs. The

algorithm guarantees only a locally optimal solution, but with multiple executions

and with various initialization configurations, appears effectively to approximate a

global optimum.

[Figure 1.2 (diagram not reproduced): a tree whose leaves are labeled π1, π2, ..., πN; each internal node is set to the median of its neighbors.]

Figure 1.2: The median-based reconstruction algorithm of Sankoff and Blanchette iterates over a tree, resetting the permutations at internal nodes to the medians of their three neighbors, until convergence occurs.
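To make the definition of a median of three concrete, the sketch below finds one by exhaustive search under reversal distance: it computes the distance from each input to every signed permutation and returns a candidate with the smallest sum. This is a brute-force illustration for tiny n only, not the reduction used by Sankoff and Blanchette and not the algorithms developed later in this thesis; all names are hypothetical.

```python
from collections import deque


def neighbors(perm):
    """All signed permutations one reversal away from perm."""
    n = len(perm)
    for i in range(n):
        for j in range(i, n):
            yield perm[:i] + tuple(-x for x in reversed(perm[i:j + 1])) + perm[j + 1:]


def all_distances(source):
    """Reversal distance from source to every signed permutation of the same size."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for nxt in neighbors(current):
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    return dist


def brute_force_median(p1, p2, p3):
    """Return (median, score), where score is the minimized sum of the three distances."""
    tables = [all_distances(tuple(p)) for p in (p1, p2, p3)]

    def score(candidate):
        return sum(table[candidate] for table in tables)

    best = min(tables[0], key=score)
    return best, score(best)


print(brute_force_median((1, 2, 3), (1, 2, 3), (1, 3, 2)))  # ((1, 2, 3), 3)
```

When several permutations share the minimum score, this sketch simply returns the first one it encounters.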

Sankoff and Blanchette computed medians using breakpoint distance rather than

inversion distance. They discovered a straightforward reduction of the breakpoint

median problem to a special case of the Traveling Salesman Problem, and were able to

compute medians relatively efficiently. Finding a median based on inversion distance,

in contrast, was believed to be too expensive to be performed as frequently as

required by the median-based reconstruction algorithm. (At the time, no algorithm

had been reported to find inversion medians, but in at least one study, they had been

obtained for a particular data set using a bounded exhaustive search [17].)

Breakpoint medians have drawbacks, however. While breakpoint distance is use-

ful as a heuristic measure, because breakpoints do not correlate directly to any actual

mechanism of rearrangement, a breakpoint median has no straightforward biological

interpretation (an inversion median, on the other hand, represents precisely a most-

parsimonious ancestor of the genomes in question under an inversions-only model

of evolution). In addition, breakpoint medians tend to be far from unique—that

is, a large number of permutations often score equally well as medians—so that the

median-based reconstruction algorithm must choose arbitrarily among many candi-

dates (some of which are likely better than others at advancing the search toward

a global minimum). We will also show in this thesis that breakpoint medians score

poorly compared to inversion medians when evaluated in terms of inversion distance.

1.4 The Reversal Median Problem

For these reasons, we seek a solution to the median problem in terms of inversion

distance. This problem is known alternatively as “multiple sorting by reversals”,

the “inversion median problem”, and the “reversal median problem”. We will re-

fer to it using the last of these names, because our goal is to enable the median-

based algorithm for phylogeny reconstruction, but our solution can be developed in

general mathematical terms. The reversal median problem has been shown to be

NP-Hard [7], and until the present study began, no efficient algorithm for it had

been reported (during the course of this study, two algorithms were reported: one by

Caprara [8], and another based on preliminary work of ours [35]). Note that, while

a solution to the reversal median problem directly addresses only the rearrangement

mechanism of inversion, it opens the door to median-based phylogeny reconstruc-

tion for equally-weighted inversions, translocations, fissions, and fusions, through

the methods of [19].

In this thesis, we develop an algorithm for the reversal median problem in three

stages. First (Chapter 2), we develop a simple branch-and-bound algorithm based

on distance computations from intermediate permutations; this algorithm does not

use the Hannenhalli-Pevzner cycle-decomposition theory, and depends only on the

availability of a rapidly computable distance metric (thus, it could be used for other

measures of distance, if fast algorithms were available to compute them). Next

(Chapter 3), we develop a solution, using Hannenhalli-Pevzner theory, to the pre-

viously unsolved problem of finding all sorting reversals of one permutation with

respect to another, with the goal of navigating more efficiently the space that the

algorithm of Chapter 2 must explore. Finally (Chapter 4), we synthesize the work

of the first two chapters, and develop a dramatically more efficient solution to the

median problem.


Chapter 2

A Simple Algorithm for Finding

an Exact Median

You will be safest in the middle.

–Ovid

In this chapter, we derive a simple branch-and-bound algorithm to find an exact

reversal median of three signed permutations. This algorithm does not depend on

properties specific to reversals, but can be used with any rapidly computable metric

(in applying it to the case of reversals, therefore, we depend heavily on an efficient

routine to compute reversal distance [1]). We also provide results from an exper-

imental study showing (1) that our algorithm performs surprisingly efficiently for

a range of parameters, but that it has a greater tendency to “bog down” as the

distances become large between input permutations; (2) that reversal medians score

significantly better than breakpoint medians, and tend to be far more unique; and

(3) that an unexpectedly large number of reversal medians are “perfect medians”

(having a score equal to the global lower bound). The simple algorithm presented

here is the basis of a more complicated algorithm developed in subsequent chapters.


2.1 Notation and Definitions

We consider the case where all genomes have identical sets of n genes and inver-

sion is the single mechanism of rearrangement. We represent each genome Gi as

a permutation πi of size n, and we let all pairs of genomes Gi = (gi,1 . . . gi,n) and

Gj = (gj,1 . . . gj,n), in a set of genomes G, be represented by πi = (πi,1 . . . πi,n) and

πj = (πj,1 . . . πj,n) such that πi,k = πj,l iff gi,k = gj,l, and πi,k = −1 · πj,l iff gi,k is the

reverse complement of gj,l.

We will model inversions to genomes with reversals to permutations. A reversal

acting on permutation π from i to j, for i ≤ j, is that operation which transforms π

into φ = (π1, π2, . . . , πi−1,−πj,−πj−1, . . . ,−πi, πj+1, . . . , πn). The minimal number

of reversals required to change one permutation πi into another permutation πj is

the reversal distance, which we denote by d(πi, πj) (sometimes abbreviated as di,j).
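The definition of a reversal can be sketched in a few lines of Python. This is only an illustration: the function name `apply_reversal` and the choice of tuples to hold signed permutations are assumptions made here, not part of the thesis implementation. Positions are 1-based and inclusive, as in the definition above.

```python
def apply_reversal(perm, i, j):
    """Reverse positions i..j (1-based, inclusive) of a signed permutation,
    negating every element of the reversed segment."""
    if not 1 <= i <= j <= len(perm):
        raise ValueError("need 1 <= i <= j <= n")
    segment = tuple(-x for x in reversed(perm[i - 1:j]))
    return perm[:i - 1] + segment + perm[j:]


pi = (1, 2, 3, 4, 5)
phi = apply_reversal(pi, 2, 4)        # segment (2, 3, 4) becomes (-4, -3, -2)
single = apply_reversal(pi, 3, 3)     # i = j only flips one sign
```

Applying the same reversal twice restores the original permutation, which is why the reversal graph defined below can be undirected.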

Let the reversal median M of a set of N permutations Π = {π1, π2, . . . , πN} be

the signed permutation that minimizes the sum $S(M,\Pi) = \sum_{i=1}^{N} d(M,\pi_i)$. Let this

sum S(M,Π) = S(Π) be called the median score of M with respect to Π.

For a given number of genes n, we can construct an undirected graph Gn = (V,E)

such that each vertex in V corresponds to a signed permutation of size n and two

vertices are connected by an edge if and only if one of the corresponding permutations

can be obtained from the other through a single reversal; formally, E = {{vi, vj} |

vi, vj ∈ V and d(πi, πj) = 1}. We will call Gn the reversal graph of size n. In this

graph, the length of the shortest path between any two vertices, vi and vj, is the

same as the reversal distance between the corresponding permutations, πi and πj.

Furthermore, finding the median of a set of permutations Π is equivalent to finding

the minimum unweighted Steiner tree of the corresponding vertices in Gn. Note that

Gn is extremely large ($|V| = n! \cdot 2^n$), so this representation does not immediately

suggest a feasible graph-search algorithm, even for small n.
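For very small n the reversal graph can still be built explicitly, which makes the definitions concrete. The sketch below is a brute-force illustration only; the helper names `neighbors` and `bfs_distances` are assumptions introduced here. It enumerates the graph for n = 3 by breadth-first search, so that shortest-path lengths coincide with reversal distances.

```python
from collections import deque
from math import comb, factorial


def neighbors(perm):
    """Yield every permutation that is one reversal away from perm."""
    n = len(perm)
    for i in range(n):
        for j in range(i, n):
            yield perm[:i] + tuple(-x for x in reversed(perm[i:j + 1])) + perm[j + 1:]


def bfs_distances(source):
    """Map each vertex of the reversal graph to its distance from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in neighbors(v):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


n = 3
dist = bfs_distances(tuple(range(1, n + 1)))
num_vertices = len(dist)                    # n! * 2^n vertices
degree = len(set(neighbors((1, 2, 3))))     # C(n+1, 2) neighbors per vertex
```

Even at n = 10 the vertex count exceeds 3.7 billion, so this exhaustive approach serves only to check small cases.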



Figure 2.1: The reversal graph for n = 3. For clarity of presentation, edges have been drawn only for the first column of vertices.

Definition 2.1 A shortest path between two permutations of size n, π1 and π2,

is a connected subgraph of the reversal graph Gn containing only the vertices v1 and

v2 corresponding to π1 and π2, and the vertices and edges on a single shortest path

between v1 and v2.

Definition 2.2 A median path of a set of permutations Π each of size n is a connected

subgraph of the reversal graph Gn containing only the vertices corresponding

to permutations in Π, the vertex corresponding to a median M of Π, and a shortest

path between M and each π ∈ Π.

Definition 2.3 A trivial median of a set of permutations Π is a median M such

that M ∈ Π.

Definition 2.4 A trivial median path of a set of permutations Π is a median

path that includes only the elements of Π and shortest paths between elements of Π.



Figure 2.2: Let vertices v1, v2, and v3 correspond to permutations π1, π2, and π3, and let vertex vM correspond to a median M. The lowest possible median score occurs when d1,2 = d1,M + d2,M; d1,3 = d1,M + d3,M; and d2,3 = d2,M + d3,M.

2.2 Bounds

Lemma 2.1 The median score S(Π) of a set of equally sized permutations Π = {π1, π2, π3}, separated by pairwise distances d1,2, d1,3, and d2,3, obeys these bounds:

$$\left\lceil \frac{d_{1,2} + d_{1,3} + d_{2,3}}{2} \right\rceil \le S(\Pi) \le \min\{(d_{1,2} + d_{2,3}),\ (d_{1,2} + d_{1,3}),\ (d_{2,3} + d_{1,3})\}$$

Proof: The upper bound follows directly from the possibility of a trivial median, and

the lower bound from properties of metric spaces (a median of lower score would

necessarily violate the triangle inequality with respect to two of π1, π2, and π3; see

Figure 2.2).
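As a small numeric illustration of Lemma 2.1, the bounds can be computed directly from the three pairwise distances. The function name `median_bounds` is hypothetical and chosen here for illustration.

```python
def median_bounds(d12, d13, d23):
    """Return (lower, upper) bounds on the median score of three permutations."""
    lower = (d12 + d13 + d23 + 1) // 2               # ceiling of half the sum
    upper = min(d12 + d23, d12 + d13, d23 + d13)     # best trivial median
    return lower, upper


bounds_a = median_bounds(4, 5, 6)   # sum 15, so the lower bound is ceil(7.5) = 8
bounds_b = median_bounds(2, 2, 2)
```

When the two bounds coincide, a trivial median is already optimal and no search is needed.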

Definition 2.5 A perfect median of a set of equally sized permutations Π = {π1,

π2, π3}, separated by pairwise distances d1,2, d1,3, and d2,3, is a median having median

score $S(\Pi) = \left\lceil \frac{d_{1,2} + d_{1,3} + d_{2,3}}{2} \right\rceil$.

The following Lemma will be useful in proving Theorem 2.1.

Lemma 2.2 If three permutations π1, π2, and π3 have a median M that is part of a

trivial median path, then M must be a trivial median.



Figure 2.3: A median path including vφ can be constructed using a shortest path from v1 to vφ and any median path of vφ, v2, and v3.

Proof: Assume to the contrary that π1, π2, and π3 have a trivial median path and

have a median M that is not trivial. By Definition 2.4, M must be on a shortest

path between two of π1, π2, and π3. If M is not trivial, it must be distance d > 0

from the closest of π1, π2, and π3. But then its median score must be greater by d

than the score of a trivial median, so M cannot be a median.

Theorem 2.1 Let π1, π2, and π3 be permutations such that π2 and π3 are separated

by distance d2,3, and let φ be another permutation separated from π1, π2, and π3 by

distances d1,φ, d2,φ, and d3,φ, respectively (see Figure 2.3). Suppose that φ is on a

median path PM of π1, π2, and π3 such that φ is on a shortest path between π1 and

a median M . Then the score S(M) of M obeys these bounds:

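$$d_{1,\phi} + \left\lceil \frac{d_{2,\phi} + d_{3,\phi} + d_{2,3}}{2} \right\rceil \le S(M) \le d_{1,\phi} + \min\{(d_{2,\phi} + d_{2,3}),\ (d_{2,\phi} + d_{3,\phi}),\ (d_{2,3} + d_{3,\phi})\}$$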

Proof: Let v1, v2, and v3 be vertices corresponding to π1, π2, and π3, in the reversal

graph of the appropriate size. In addition, let there be a vertex vφ corresponding to

φ, as illustrated in Figure 2.3. We claim that a median path PM including vφ and M ,

such that vφ is on a shortest path from v1 to M , can be constructed by combining a

shortest path between v1 and vφ and a median path of vφ, v2, and v3. Assume to the

contrary that there exists a shorter median path Pshort, which also includes vφ and

M , but does not include the shortest path between v1 and vφ or does not include



a median path of vφ, v2, and v3. Pshort has to include v1 via a vertex other than vφ

and consequently other than M (because vφ is on a shortest path between v1 and

M). By Definition 2.2, Pshort must consist only of v1, v2, v3,M , and vertices between

them (including vφ), so v1 must be connected to Pshort via v2 or v3. Consequently,

M must be on a shortest path between v2 and v3; otherwise including M in Pshort

would result in a score greater than that of a trivial median. Therefore, M is part

of a trivial median path, which means that by Lemma 2.2, M is a trivial median. In

particular, M must be the vertex vi ∈ {v2, v3} to which v1 is connected. Furthermore,

our assumptions about φ require that vφ be on the shortest path between v1 and vi.

Then Pshort includes both the shortest path between v1 and vφ and the median path

of vφ, v2, and v3, and we obtain the desired contradiction.

Because PM can be constructed by combining a shortest path between v1 and vφ,

and a median path of vφ, v2, and v3, its score is equivalent to the sum of the distance

between v1 and vφ (d1,φ), and the score of the median of vφ, v2, and v3. By applying

Lemma 2.1 to the latter, we obtain the desired bound.
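The bounds of Theorem 2.1 are what make pruning possible: an intermediate permutation φ can be discarded as soon as its lower bound is no better than a score already achieved. A minimal sketch follows, with hypothetical function names (`bounds_through`, `can_prune`) and made-up distances.

```python
def bounds_through(d1_phi, d2_phi, d3_phi, d23):
    """Bounds on the score of a median reached from pi_1 by way of phi."""
    lower = d1_phi + (d2_phi + d3_phi + d23 + 1) // 2
    upper = d1_phi + min(d2_phi + d23, d2_phi + d3_phi, d23 + d3_phi)
    return lower, upper


def can_prune(d1_phi, d2_phi, d3_phi, d23, global_upper):
    """True if no median through phi can beat the current global upper bound."""
    lower, _ = bounds_through(d1_phi, d2_phi, d3_phi, d23)
    return lower >= global_upper


example = bounds_through(2, 3, 4, 5)   # 2 + ceil(12 / 2) and 2 + min(8, 7, 9)
```

With these example distances, φ would be pruned against a global upper bound of 8 but kept against a bound of 9.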

2.3 The Algorithm

Algorithm 2.1 is a branch-and-bound search for an optimal reversal median that

uses Theorem 2.1 to prune regions of the reversal graph from its search tree. The

algorithm also uses Theorem 2.1 to prioritize among search branches.

Prioritization is managed using a priority stack—which always returns an item of

highest priority, but returns items of equal priority in last-in-first-out order. Because

the range of possible priorities is small, we use a fixed array of priority values, each

pointing to a stack, and so can execute push and pop operations in fast constant

time. Using stacks rather than the more conventional queues in this application is not

required for correctness, but, by inducing depth-first searching among alternatives

of equal cost, rapidly produces a good upper bound for the search.
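A priority stack of this kind can be sketched as an array of ordinary stacks, one per possible score. The class below is an illustrative assumption rather than the data structure used in the thesis code. It treats a lower best-possible score as a higher priority, and it scans the fixed-size array on each pop, which costs constant time when the range of priorities is treated as a small constant.

```python
class PriorityStack:
    """Array of LIFO stacks indexed by score; lower score means higher priority."""

    def __init__(self, lowest, highest):
        self.lowest = lowest
        self.buckets = [[] for _ in range(highest - lowest + 1)]

    def push(self, item, score):
        self.buckets[score - self.lowest].append(item)

    def pop(self):
        # Scan from the best score upward; within a bucket, last in is first out.
        for bucket in self.buckets:
            if bucket:
                return bucket.pop()
        raise IndexError("pop from empty priority stack")

    def __bool__(self):
        return any(self.buckets)


s = PriorityStack(3, 5)
s.push("a", 4)
s.push("b", 3)
s.push("c", 3)
s.push("d", 5)
order = [s.pop() for _ in range(4)]
```

Items "b" and "c" share the best score, and "c" is returned first because it was pushed last, which is the depth-first behavior described above.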



Input: Three signed permutations of size n: π1, π2, and π3. Assume a function distance(πi, πj) that returns the reversal distance between πi and πj in linear time.

Output: An optimal reversal median M.

    begin
        d1,2 ← distance(π1, π2); d1,3 ← distance(π1, π3); d2,3 ← distance(π2, π3);
 1      Mmin ← ⌈(d1,2 + d1,3 + d2,3) / 2⌉;
 2      Mmax ← min{(d1,2 + d2,3), (d1,2 + d1,3), (d2,3 + d1,3)};
        Initialize priority stack s for range Mmin to Mmax;
        (ψorig, ψ1, ψ2) ← (πi, πj, πk) such that {πi, πj, πk} = {π1, π2, π3} and di,j + di,k = Mmax;
 3      create vertex v with vlabel = ψorig, vdist = 0, vbest = Mmin, vworst = Mmax;
 4      push(s, v);
 5      M ← ψorig;
        dsep ← dψ1,ψ2;
        stop ← false;
 6      while s is not empty and stop = false do
            pop(s, v);
 7          if vbest ≥ Mmax then stop ← true;
            else
 8              foreach w | w is an unmarked neighbor of v do
                    wdist ← distance(wlabel, ψorig);
 9                  if wdist ≤ vdist then continue;
                    mark w;
                    dψ1 ← distance(wlabel, ψ1);
                    dψ2 ← distance(wlabel, ψ2);
10                  wbest ← wdist + ⌈(dψ1 + dψ2 + dsep) / 2⌉;
11                  wworst ← wdist + min{(dψ1 + dsep), (dψ1 + dψ2), (dsep + dψ2)};
12                  if wworst = Mmin then M ← wlabel; stop ← true;
                    else
                        if wbest < Mmax then push(s, w, wbest);
13                      if wworst < Mmax then
                            M ← wlabel; Mmax ← wworst;
                        end
                    end
                end
            end
        end
    end

Algorithm 2.1: find reversal median



The algorithm begins by establishing upper and lower bounds for the solution

using Lemma 2.1 (steps 1 and 2) and priming the priority stack with a best-scoring

vertex (steps 3 and 4). Then it enters a main loop (step 6) in which it repeatedly pops

the “most promising” vertex from the priority stack, finds all of its as-yet-unvisited

neighbors (step 8), and evaluates each one for feasibility. Neighbors are obtained by

generating all $\binom{n+1}{2}$ possible permutations that can be produced from a vertex by a

single reversal. Neighbors of a vertex v can be ignored if they are not farther from

the origin than is v (step 9); such vertices will be examined as neighbors of another

vertex if they can feasibly belong to a median path. The best possible score (i.e.,

lower bound) for a vertex w is used as the basis for prioritization. Best and worst

possible scores are calculated using the bounds of Theorem 2.1 (steps 10 and 11)

and maintained for all vertices present in the priority stack. Vertices can be pruned

when their best possible scores exceed the current global upper bound. The global

upper bound can be lowered when a vertex is found that has a lesser upper bound

(step 13). The search ends when no vertex in the priority stack has a best-possible score

lower than the upper bound (step 7) or a score equal to the global lower bound is

found (step 12).
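The whole search can be sketched in Python for very small n. This is an illustration under stated assumptions, not the thesis implementation (which was written in C): the linear-time `distance` routine is replaced here by a brute-force breadth-first search over the entire reversal graph (`distances_to`), the priority stack is a dictionary of lists, and step 12 returns immediately rather than setting a flag. Comments give the step numbers of Algorithm 2.1.

```python
from collections import deque


def neighbors(perm):
    """Yield all C(n+1, 2) permutations one reversal away from perm."""
    n = len(perm)
    for i in range(n):
        for j in range(i, n):
            yield perm[:i] + tuple(-x for x in reversed(perm[i:j + 1])) + perm[j + 1:]


def distances_to(target):
    """Brute-force stand-in for `distance`: BFS over the whole reversal graph."""
    dist = {target: 0}
    queue = deque([target])
    while queue:
        v = queue.popleft()
        for w in neighbors(v):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def find_reversal_median(p1, p2, p3):
    """Return (median, score) for three signed permutations given as tuples."""
    table = {p: distances_to(p) for p in (p1, p2, p3)}

    def distance(a, b):              # b is always one of the three inputs
        return table[b][a]

    d12, d13, d23 = distance(p1, p2), distance(p1, p3), distance(p2, p3)
    m_min = (d12 + d13 + d23 + 1) // 2                                  # step 1
    # Step 2 and the choice of origin: start from the best trivial median.
    m_max, origin, psi1, psi2 = min(
        [(d12 + d13, p1, p2, p3), (d12 + d23, p2, p1, p3), (d13 + d23, p3, p1, p2)],
        key=lambda t: t[0])
    d_sep = distance(psi1, psi2)
    stacks = {b: [] for b in range(m_min, m_max + 1)}    # priority stack
    stacks[m_min].append((origin, 0, m_min))                            # steps 3-4
    median, marked = origin, set()                                      # step 5
    while True:                                                         # step 6
        bucket = next((s for s in stacks.values() if s), None)
        if bucket is None:
            break
        v, v_dist, v_best = bucket.pop()
        if v_best >= m_max:                                             # step 7
            break
        for w in neighbors(v):                                          # step 8
            if w in marked:
                continue
            w_dist = distance(w, origin)
            if w_dist <= v_dist:                                        # step 9
                continue
            marked.add(w)
            d1, d2 = distance(w, psi1), distance(w, psi2)
            w_best = w_dist + (d1 + d2 + d_sep + 1) // 2                # step 10
            w_worst = w_dist + min(d1 + d_sep, d1 + d2, d_sep + d2)     # step 11
            if w_worst == m_min:                                        # step 12
                return w, w_worst
            if w_best < m_max:
                stacks[w_best].append((w, w_dist, w_best))
            if w_worst < m_max:                                         # step 13
                median, m_max = w, w_worst
    return median, m_max
```

Because the stand-in distance routine visits all n! · 2^n vertices, this sketch is practical only for n of about 5 or less; it is meant for checking the control flow against an exhaustive search, not for measuring performance.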

Theorem 2.2 Algorithm 2.1 will return a permutation M only if M is a true median

of the inputs π1, π2, and π3.

Proof: Assume to the contrary that a permutation M ′ returned by the algorithm is

not a true median. Because the algorithm returns the permutation having the lowest

median score of all of the permutations (vertices) it visits (steps 5 and 13), it must not

have visited some median. If the algorithm did not visit some median, then either

it pruned all paths to medians or it exited before reaching any median. Suppose

the algorithm pruned all paths to medians. It only prunes vertices when their best

possible scores are at least as large as the current global upper bound, Mmax. Note that the

global upper bound always corresponds to the actual median score of a vertex that



has been visited (steps 2 and 13), so it cannot be wrong. Consider a median M

with at least one median path PM . By Definition 2.2, PM must include at least one

path between M and each of the vertices v1, v2, and v3 corresponding to π1, π2, and

π3. The algorithm proceeds by examining neighbors of an origin ψorig ∈ {π1, π2, π3}.

Therefore, if the algorithm pruned all paths to M , then it must have pruned a vertex

on the path between ψorig and M . But the best scores of such vertices are calculated

using the lower bound of Theorem 2.1 (step 10), which we have shown to be correct.

Therefore, the algorithm cannot have pruned the shortest paths to medians.

Suppose instead that the algorithm exited before reaching a median. The algorithm

can exit for one of three reasons:

1. The priority stack s becomes empty (step 6);

2. The next item returned from s has a best possible score greater than or equal

to the current global upper bound (step 7);

3. A vertex w is found with a worst possible score equal to the global lower bound

(step 12).

Case 1 can occur only if all vertices have been visited, or if all remaining neighbors

have been pruned (because, except when the algorithm stops for another reason, each

new neighbor is either pruned or pushed onto s). If all vertices have been visited,

then a median must have been visited. We have shown above that all neighbors on

paths to a median cannot have been pruned. Because s always returns a vertex v

such that no other vertex in s has a lower best-possible score than v, and because

all neighbors that are not pruned are added to s, case 2 can only occur if a median

has been visited or if all paths to medians have been pruned. We have shown that

all paths to medians cannot have been pruned. Therefore, if case 2 occurs, a median

must have been visited. In case 3, w must be a median, since the global lower bound

is set directly according to Lemma 2.1 (step 1), which we have shown to be correct.



Thus, none of these three cases can arise before a median has been found, and the

algorithm must return a median.

The worst-case running time of Algorithm 2.1 is $O(n^{3d})$, with d = min{d1,2,

d2,3, d1,3}, but as would be expected with a branch-and-bound algorithm, the average

running time appears to be much better.

2.4 Experimental Method

We implemented find reversal median in C, reusing the linear-time distance rou-

tine (as well as some auxiliary code) from GRAPPA [1], and we evaluated its perfor-

mance on simulated data. All test data was generated by a simple program that

creates multiple sets of three permutations by applying random reversals to the

identity permutation, such that each set of three permutations represents three taxa

derived from a common ancestor under an inversions-only model of evolution. In

addition to the number of genes n to model and the number of sets s to create,

this program accepts a parameter r that determines how many random reversals to

apply in obtaining the permutation for each taxon. Thus, if n = 100, r = 10, and

s = 10, the program generates ten sets of three signed permutations, each of size 100,

and obtains each permutation by applying ten random reversals to the permutation

(+1,+2, . . . ,+100). A random reversal is defined as a reversal between two random

positions i and j such that 1 ≤ i, j ≤ n (if i = j, a single gene simply changes

its sign). When r is small compared to n, the permutations in a set tend to be

separated from one another by a distance of 2r.
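The data generator described above might look roughly as follows in Python. This is a sketch only: the function names, the `seed` parameter, and the use of Python's `random` module are assumptions made for illustration, and the original generator was a separate program whose details are not given here.

```python
import random


def random_reversal(perm, rng):
    """Apply one reversal between two uniformly random positions."""
    n = len(perm)
    i, j = sorted((rng.randrange(n), rng.randrange(n)))   # i == j flips one sign
    return perm[:i] + tuple(-x for x in reversed(perm[i:j + 1])) + perm[j + 1:]


def make_test_sets(n, r, s, seed=0):
    """Create s sets of three permutations, each r random reversals from the identity."""
    rng = random.Random(seed)
    identity = tuple(range(1, n + 1))
    sets = []
    for _ in range(s):
        taxa = []
        for _ in range(3):
            perm = identity
            for _ in range(r):
                perm = random_reversal(perm, rng)
            taxa.append(perm)
        sets.append(tuple(taxa))
    return sets


sets = make_test_sets(n=100, r=10, s=10, seed=1)
```

Fixing the seed makes each batch of experiments reproducible, which is convenient when comparing variants of the search.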

We used several algorithmic engineering techniques to improve the efficiency of

find reversal median. For example, we avoided dynamic memory allocation and

reused records representing graph vertices. We were able to gain a significant speedup

by optimizing the hash table used for marking vertices: a custom hash table offered



a fourfold increase in the overall speed of the program, as compared with UNIX’s

db implementation. With circular genomes, we achieved a further improvement in

performance by hashing on the circular identity of each permutation rather than

on the permutation itself. We define the circular identity of a permutation as that

equivalent permutation that begins with the gene labeled +1. By hashing on circular

identities, we reduced the number of vertices to visit and the number of permutations

to mark by approximately a factor of 2n.
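One way to compute a circular identity is sketched below. It assumes that two permutations are equivalent as circular genomes when one is a rotation of the other or of its reverse complement, which is consistent with the factor of 2n mentioned above; the function name `circular_identity` is chosen here for illustration.

```python
def circular_identity(perm):
    """Return the equivalent circular permutation that begins with +1."""
    if 1 not in perm:
        # Gene 1 appears as -1: read the circle in the opposite direction.
        perm = tuple(-x for x in reversed(perm))
    k = perm.index(1)
    return perm[k:] + perm[:k]


p = (2, -4, 1, 3)
rotations = [p[k:] + p[:k] for k in range(len(p))]
flipped = [tuple(-x for x in reversed(q)) for q in rotations]
equivalents = set(rotations + flipped)                     # 2n = 8 permutations
canonical = {circular_identity(q) for q in equivalents}    # a single hash key
```

All 2n equivalent permutations map to the same key, so marking one of them marks the whole equivalence class.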

To improve performance further, we adapted our sequential implementation to

run in parallel on shared memory architectures. Two steps in the algorithm are read-

ily parallelizable: the major loop (step 6), during each iteration of which a new vertex

is popped from the priority stack, and the minor loop (step 8), in which the neighbors

of a vertex v are generated, examined for marks, and evaluated for feasibility as me-

dians. We enabled parallel processing at both levels, using pthreads for maximum

portability across shared-memory architectures. With careful use of semaphores and

pthreads mutex functions, we were able to reduce the cost of synchronization among

threads to an acceptable level.

2.5 Experimental Results

2.5.1 Performance of Bounds

Being especially concerned with the effectiveness of the pruning strategy, we have

chosen as a measure of performance the number of vertices V of the reversal graph

that the algorithm visited. In particular, we have taken V to be the number of times

the program executed the loop at step 8 of the algorithm. Note that the number of

calls to distance is approximately 3V.

After observing that the program occasionally took much longer to find a median than it did on average, we recorded the distribution of V over many experiments. We used various values of the number of genes n and the number of reversals per tree edge r. Figure 2.4 is typical of our results. It summarizes 500 experiments with n = 50 and r = 7 and shows a roughly exponential distribution, with high relative frequencies in a few intervals having small V: in 87% of the experiments, fewer than 10,000 vertices were visited, and in 95%, fewer than 20,000 were visited. This figure demonstrates that the algorithm generally finds a median rapidly, but occasionally becomes mired in an unprofitable region of the search space. We have observed that the tail of the exponential distribution becomes more substantial as r grows larger with respect to n.

Figure 2.4: Number of vertices visited while finding a median, in the course of 500 experiments with n = 50 and r = 7.

In order to characterize typical performance, we recorded the statistical medians of V as n and r varied independently. The results are shown in Figures 2.5 and 2.6. For comparison, we have also plotted the mean values of V and, in Figure 2.5, a theoretical quadratic curve. Note that, at least for r = 5, median and mean V appear to grow quadratically over a considerably large range of values for n, and that, for n = 50, median V grows approximately linearly with r, at least as long as r remains small (mean V grows somewhat faster than median V). To put the observed rate of growth into perspective, note that in the theoretical worst case of $O(n^{3d})$, because d ≈ 2r and $V = O(n^{3d}/n) = O(n^{6r-1})$, one would see (given r = 5 and n = 50) growth of V with $n^{29}$ and $50^{6r-1}$.

Figure 2.5: Statistical median of V for r = 5 and 10 ≤ n ≤ 100, plotted with mean of V and the curve $f(n) = 2.1 \cdot n^2$, over 50 experiments for each value of n. For this value of r, growth of the median and the mean of V appears to be quadratic in n over a large range of genome sizes.

2.5.2 Running Time and Parallel Speedup

We have tested program find reversal median sequentially on a 700 MHz Intel

Pentium III with 128MB of memory, and using various levels of parallelism on a Sun

E10000 with 64 333 MHz UltraSPARC II processors and 64GB of memory. Figure

2.7 shows average running times for r = 5 and n between 50 and 125.

Figure 2.6: Statistical median of V for n = 50 and 1 ≤ r ≤ 8, plotted with mean of V. The number of experiments for each value of r is 50.

Sequential running times are shown for the Sun and Intel processors and parallel

running times for the Sun with the number of processors p ∈ {1, 2, 4, 6}. In all

cases, the average time to find a median is about 12 seconds or less. Observe that

for n = 100 (a realistic size for chloroplast or mitochondrial genomes) medians can

generally be found in an average of about 2 seconds using a reasonably fast computer.

We should note that the memory requirements for the program are considerable, and

that the level of performance shown here is partly a consequence of the large amount

of RAM available on the Sun.

It is evident from Figure 2.7 that we achieve a good parallel speedup for small p, but that the benefits of parallelization begin to erode between p = 4 and p = 6 (this tendency becomes more pronounced at p = 8, which we have not plotted here for clarity of presentation). Anecdotal evidence suggests that the cause of this trend is a combination of the overhead of synchronization and uneven load balancing among the computing threads. We also observed that parallelism in the minor loop of the algorithm was far more effective than parallelism in the major loop, presumably because the heuristic for prioritization is sufficiently effective that the latter strategy results in a large amount of unnecessary work.

Figure 2.7: Sequential and parallel running times for r = 5 and n ∈ {50, 75, 100, 125}. Each data point represents an average taken over 10 experiments. Parallel configurations used parallelism only in the minor loop of the algorithm.

2.5.3 Reversal Medians vs. Breakpoint Medians

Using program find reversal median, we evaluated the significance of reversal medians by comparing them with breakpoint medians, trivial medians, and “actual” medians (i.e., the ancestral permutations from which observed taxa actually arose, which in this case are always equal to the identity permutation). Figure 2.8, which shows results over 1 ≤ r ≤ 5 for n = 25, is typical of what we observed. It illustrates that true reversal medians achieve comparable scores to actual medians¹ and that breakpoint medians, when scored in terms of reversal distance, perform significantly more poorly. A comparison in terms of reversal median scores is clearly biased in favor of reversal medians; however, if it is true that reversal distances are (in at least some cases) more meaningful than breakpoint distances, then these results suggest that reversal medians are worth obtaining.

¹ Reversal medians are slightly better than actual medians when r becomes large with respect to n, because saturation begins to cause convergence between taxa.

Figure 2.8: Comparison of reversal medians with breakpoint medians, trivial medians, and actual medians, for n = 25. Averages were taken over 50 experiments.

By adapting it slightly, we were able to use program find reversal median to find all medians, and thus to characterize the extent to which reversal medians are unique. An example of our results is shown in Figure 2.9, which describes the number of reversal medians for n = 15 and 1 ≤ r ≤ 5, over 50 experiments for each value of r. Observe that when r is small compared to n (roughly r ≤ 0.15n), the reversal median is virtually always unique; and even when r is moderately large with respect to n (roughly 0.15n < r ≤ 0.3n)², the reversal median is unique or nearly unique most of the time.

² Recall that the distance between permutations is approximately 2r, and that random permutations tend to be separated by a distance of approximately n. We have observed that the effects of saturation are evident at r = 0.2n and are pronounced by r = 0.3n.

Figure 2.9: Distribution of number of medians in the course of 50 experiments for n = 15 and 1 ≤ r ≤ 5. The histogram for r = 2 is not shown because it is indistinguishable from that for r = 1.

In addition, we observed a strong relationship between unique reversal medians and actual medians. For example, with n = 15 and r = 1, for which all reversal medians were unique, 49 out of 50 reversal medians were identical to actual medians; similarly, for n = 15 and r = 2, 48 out of 50 were identical to actual medians (in both cases the exceptional reversal medians differed from actual medians by a single reversal). As r becomes greater compared to n, this relationship weakens but remains significant. For example, with n = 15 and r = 4, 38 out of 50 reversal medians were unique, and 22 of those 38 were identical to actual medians (an additional 10 non-unique reversal medians equaled actual medians).


Figure 2.10: Percentage of medians that are perfect, for n = 50 and n = 100 over 1 ≤ r ≤ 8. Each data point reflects 100 trials. Experiments did not complete for n = 100, r > 6.

2.5.4 Preponderance of Perfect Medians

We made one additional discovery, in the course of our experiments, that will be par-

ticularly important to the remainder of this thesis: we found that the vast majority

of medians were perfect medians—that is, medians having a score equal to the lower

bound of Lemma 2.1. After noting this surprising phenomenon accidentally, we per-

formed several experiments to quantify it. Figure 2.10 illustrates results for n = 50

and n = 100, over 1 ≤ r ≤ 8. In this figure, each data point indicates the percentage

of times (in 100 trials) that the reversal median of the input permutations was a

perfect median. In all cases, the percentage was 96% or higher, and for r ≤ 3, the

percentage was 100%. In these experiments, every imperfect median had a score of

exactly one greater than that of a perfect median. The incidence of perfect medians

decreases as r grows, but slowly. Note that the rate of decrease is slower at n = 100

than at n = 50, presumably because it is the ratio of r to n that is important.



2.6 Summary

In this chapter, we have derived a branch-and-bound algorithm to find an optimal

reversal median. The algorithm depends on bounds that are computed using only

the metric property of reversal distance, and thus could be used for many other

types of measurements (including ones not related to genome rearrangements). The

algorithm requires many distance computations, however, and will only be practical

for measurements that can be computed very efficiently. When applied to the reversal

median problem, it performs surprisingly well, considering the enormous size of the

search space. It finds a median in time that is distributed roughly exponentially,

with the tail of the distribution becoming more substantial as the pairwise reversal

distances among input permutations grow relative to the sizes of the permutations.

The excellent performance of the algorithm when input permutations are close in

distance is likely related to the high incidence of “perfect medians”—medians having

scores equal to the global lower bound of the search—because when such a median

is found, a search can terminate immediately. When distances are larger, perfect

medians become less prevalent, but they still occur in the vast majority of cases.

Reversal medians appear to have several useful properties that are likely to make

them preferable to breakpoint medians for many applications, despite being

more costly to compute. Reversal medians appear to be highly unique (especially

when input permutations are close), often equal “actual medians” (under an

inversions-only model of evolution), and score significantly better than breakpoint

medians when evaluated in terms of reversal distance.


Chapter 3

Finding All Sorting Reversals

Untwisting all the chains that tie

The hidden soul of harmony.

–John Milton

The preponderance of perfect medians leads to the following idea: Suppose we take

permutation π1 as our origin as we search for a median of π1, π2, and π3. If there exists

a perfect median M , then M is on a shortest path from π1 to π2 and from π1 to π3

(as well as from π2 to π3). We could restrict our search to such paths by considering

at each intermediate permutation φ only those neighbors of φ that are closer than

φ to both π2 and π3 (see Figure 3.1). This simple idea provides the motivation for

Chapter 3 and is the basis of the improved median algorithm introduced in Chapter 4.

The branching step of the previous algorithm involves generating all $\binom{n+1}{2}$ neighbors of an intermediate permutation φ and testing them against the current bounds of the search. Furthermore, each test requires two de novo linear-time distance calculations. Hence, the branching step takes $\Omega(n^3)$ time, and turns out to be a bottleneck for the algorithm.


Figure 3.1: Suppose φ is an intermediate permutation encountered during a “walk” from π1 toward a perfect median M of π1, π2, and π3. Suppose further that A represents the set of all sorting reversals of φ with respect to π2, and B represents the set of all sorting reversals with respect to π3. Then we need only consider as viable neighbors of φ those permutations induced by the intersection of these sets, A ∩ B.

We would prefer to generate neighbors in a more efficient way by taking advantage

of the unique structure of the problem of sorting by reversals, as described by the

powerful cycle-decomposition theory of Hannenhalli and Pevzner. Suppose at an

intermediate permutation φ we could directly enumerate the set A of all sorting

reversals with respect to π2 and the set B of all sorting reversals with respect to π3,

using the breakpoint graphs of φ with respect to π2 and φ with respect to π3. Then

A ∩ B would induce exactly those neighbors of φ to consider in pursuit of a perfect

median. Thus, an efficient solution to the problem of finding all sorting reversals

of one permutation with respect to another might enable us to improve the median

algorithm markedly, with minimal increase in complexity to our median algorithm.¹

We will refer to this problem as the “all sorting reversals” (ASR) problem.

¹ That is, if we assume that the complexity of finding all sorting reversals would be encapsulated in another algorithm.



Note that a solution to ASR also immediately induces an algorithm to find all

sequences of reversals that sort one permutation with respect to another—that is,

the problem of finding “all sequences of sorting reversals” (ASSR) reduces easily to

ASR. While several authors have presented fast algorithms that find some sequence

of sorting reversals [18, 4, 20, 2, 3], no algorithm has been published that efficiently

addresses ASSR. As will be seen later in this chapter, however, for most problem

instances there exist many sequences of sorting reversals; therefore, for many applications, finding only one is of limited usefulness. Some such applications may be

relatively far-removed from the median problem. For example, a biologist studying

the permutations that describe hypothesized ancestral genomes in a phylogenetic

tree may wish to consider the merits of various alternative sequences of reversals

separating those permutations.

In this chapter, we will derive an efficient solution to ASR. The algorithm is

developed as follows: we begin by outlining a straightforward classification of all possible reversals; then we introduce a simplified version of the problem, which we call the “Fortress-Free Model” (FFM), and using the FFM, we prove exactly which

classes of reversals can be sorting reversals, and under what conditions they sort;

finally, we adapt our results for the general case by re-introducing fortresses. Using

principles developed in this way, we can easily describe an algorithm that solves

ASR. Our solution to ASR requires an efficient algorithm to solve a critical sub-problem: detecting whether a candidate reversal introduces into the breakpoint graph

a new unoriented component (and potentially a new hurdle). We also derive a new

algorithm to solve this sub-problem efficiently. After presenting our algorithms,

we report experimental results that demonstrate their efficiency and affirm their

correctness.



3.1 Notation and Definitions

Let π and φ be signed permutations of size n, such that $\pi = (\pi_1, \pi_2, \ldots, \pi_n)$ and $\phi = (\phi_1, \phi_2, \ldots, \phi_n)$. Let the unsigned permutation $\pi' = (\pi'_0, \pi'_1, \ldots, \pi'_{2n+1})$ be defined such that $\pi'_0 = 0$, $\pi'_{2n+1} = 2n + 1$, and for all i, 1 ≤ i ≤ n, $\pi'_{2i} = 2\pi_i$, $\pi'_{2i-1} = 2\pi_i - 1$ (if $\pi_i > 0$) or $\pi'_{2i} = 2|\pi_i| - 1$, $\pi'_{2i-1} = 2|\pi_i|$ (if $\pi_i < 0$); let the unsigned permutation $\phi' = (\phi'_0, \phi'_1, \ldots, \phi'_{2n+1})$ be defined exactly the same way with respect to φ. We say two elements $\pi_i$ and $\pi_{i+1}$ are adjacent in π, and we say the corresponding elements $\pi'_{2i}$ and $\pi'_{2i+1}$ are adjacent in π′; similarly for φ and φ′. Let the breakpoint graph B of π with respect to φ be defined as follows (see Figure 3.2).² B contains a sequence of 2n + 2 vertices labeled with the elements of π′. Every two of these vertices that reflect an adjacency in π′ are connected with a black edge (a “reality” edge), and every two that reflect an adjacency in φ′ are connected with a gray edge (a “desire” edge, often depicted as a dashed line). Let the overlap graph O = (V, E) for B be defined such that there exists a distinct $v_e \in V$ for every gray edge e in B, and two vertices $v_e$ and $v_{e'}$ are connected by an edge ($\{v_e, v_{e'}\} \in E$) iff gray edges e and e′ overlap in B (see Figure 3.2). A cycle in B is a sequence of connected vertices $(v_0, v_1, \ldots, v_{2i}, v_{2i+1}, \ldots, v_{2n}, v_{2n+1}, v_0)$ (n ≥ 0), such that, for all i, 0 ≤ i ≤ n, $v_{2i}$ and $v_{2i+1}$ are connected with a black edge, and $v_{2i+1}$ and $v_{2i+2}$ (or $v_{2i+1}$ and $v_0$, if i = n) are connected with a gray edge. A connected component in O has the usual meaning, and is sometimes called simply a “component”.
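The construction above can be made concrete with a minimal sketch, assuming that φ is the identity permutation, so that the gray edges join the values 2k and 2k + 1. The function names are hypothetical, and a general φ would first require relabeling π with respect to φ, which is not shown.

```python
def unsigned_extension(pi):
    """Map a signed permutation (tuple of nonzero ints) to pi' of length 2n + 2."""
    ext = [0]
    for x in pi:
        if x > 0:
            ext += [2 * x - 1, 2 * x]
        else:
            ext += [2 * -x, 2 * -x - 1]
    ext.append(2 * len(pi) + 1)
    return ext


def black_edges(ext):
    """Adjacencies in pi': the pairs (pi'_{2i}, pi'_{2i+1}) for 0 <= i <= n."""
    return [(ext[k], ext[k + 1]) for k in range(0, len(ext), 2)]


def gray_edges(n):
    """Adjacencies in phi' when phi is the identity: (2k, 2k+1) for 0 <= k <= n."""
    return [(2 * k, 2 * k + 1) for k in range(n + 1)]


def count_cycles(pi):
    """Count alternating black/gray cycles of B with respect to the identity."""
    ext = unsigned_extension(pi)
    black = {}
    for a, b in black_edges(ext):
        black[a], black[b] = b, a
    seen, cycles = set(), 0
    for start in ext:
        if start in seen:
            continue
        cycles += 1
        v = start
        while v not in seen:
            seen.add(v)
            w = black[v]      # follow the black edge out of v
            seen.add(w)
            v = w ^ 1         # follow the gray edge: 2k <-> 2k + 1
    return cycles
```

For the permutation of Figure 3.2, `unsigned_extension` begins 0, 10, 9, 6, 5, 8, 7, 4, 3, 1, 2 and ends with 33. The identity permutation of size 2 yields three (trivial) cycles, whereas (+2, +1) yields a single cycle through all six vertices.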

Every gray edge is said to be oriented if it spans an odd number of vertices in B,

and unoriented otherwise. A cycle in B and a connected component in O are each

said to be oriented if they contain at least one oriented gray edge. We call cycles

and components unoriented if they are not oriented, except when they are trivial. A

trivial cycle consists of a single gray edge and a single black edge, and corresponds

² The breakpoint graph in [34] is given the more euphonious but unwieldy name, “the Diagram of Reality and Desire”.



[Figure 3.2: breakpoint graph B, with gray edges labeled (0,1), (2,3), ..., (32,33), and overlap graph O, with components t, u, v, w, x, y, and z]

Figure 3.2: Breakpoint graph B and overlap graph O for the permutation π = −5, −3, −4, −2, +1, +6, +8, −9, +10, +13, +12, +11, +14, +15, +16, +7 with respect to the identity permutation of size n = 16. Connected components t and w are oriented, y and z are trivial, and u, v, and x are unoriented. Unoriented components u and x are hurdles, but unoriented component v is a protected nonhurdle because it separates u and x. Oriented edges in O are represented by solid circles.

to an adjacency shared in permutations π and φ. A trivial cycle will always create

a trivial connected component—that is, a component consisting of a single, isolated

vertex in O—and a trivial component can only arise from a trivial cycle. Note that

the gray edges of a cycle always belong to the same connected component, so we can

say that the cycle belongs to that component. For convenience, we will refer to a

component that is either oriented or trivial as a benign component.³

³ We differ from the literature also in the way we have distinguished and named trivial cycles and components.



Every unoriented component can be classified as either a hurdle or a protected

nonhurdle. A hurdle is an unoriented component that does not separate other unoriented components, and a protected nonhurdle is one that does. A component u is

said to separate two other components v and w if, in a traversal of the vertices of B,

it is impossible to pass (in either circular direction) from a vertex belonging to v to a

vertex belonging to w without encountering a vertex belonging to u. Note that, while

separation is primarily used with respect to unoriented components, the definition

applies as well to oriented and trivial ones.⁴ A hurdle is called a superhurdle if, were

it eliminated, a protected nonhurdle would emerge as a hurdle; otherwise it is called

a simple hurdle.
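The separation test lends itself to a short sketch. It assumes a hypothetical array `lab` that gives, for each vertex of B in order, the label of the component to which the vertex belongs, and it reads the definition as follows: u separates v and w iff no arc between consecutive vertices of u (taken circularly) contains vertices of both v and w. The labels in the usage note are toy values, not derived from an actual permutation.

```python
from itertools import combinations


def separates(lab, u, v, w):
    """True if component u separates components v and w.

    lab[k] is the component label of the kth vertex of B; the sequence is
    treated as circular. v and w are assumed to occur in lab and to differ
    from u.
    """
    m = len(lab)
    u_positions = [k for k, c in enumerate(lab) if c == u]
    if not u_positions:
        return False
    start = u_positions[0]
    seen = set()                      # labels met since the last vertex of u
    for step in range(1, m + 1):      # one full lap, ending back at `start`
        c = lab[(start + step) % m]
        if c == u:
            if v in seen and w in seen:
                return False          # v and w share an arc free of u
            seen = set()
        else:
            seen.add(c)
    return True


def classify_unoriented(lab, unoriented):
    """Label each unoriented component a hurdle or a protected nonhurdle."""
    kinds = {}
    for u in unoriented:
        others = [c for c in unoriented if c != u]
        is_separator = any(separates(lab, u, v, w)
                           for v, w in combinations(others, 2))
        kinds[u] = "protected nonhurdle" if is_separator else "hurdle"
    return kinds
```

With the toy labeling `lab = ['u', 'u', 'v', 'x', 'x', 'v']` and all three components unoriented, `classify_unoriented` reports u and x as hurdles and v as a protected nonhurdle, since every path from u to x passes a vertex of v.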

By Hannenhalli and Pevzner’s duality theorem [18], the distance d(π, φ) between

π and φ is given by d(π, φ) = n + 1 − c + h + f, where c is the number of cycles and h

is the number of hurdles in B. The parameter f is equal to one if there is a fortress

in B and zero otherwise. A fortress exists iff there are an odd number of hurdles and

all are superhurdles [34].
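As a small worked illustration of the duality theorem, the sketch below evaluates the formula from quantities that are assumed to be known already: the size n, the cycle count c, and one boolean per hurdle saying whether it is a superhurdle. The function names are hypothetical, and finding the hurdles themselves is not shown.

```python
def has_fortress(hurdle_is_super):
    """hurdle_is_super holds one boolean per hurdle (True for a superhurdle)."""
    h = len(hurdle_is_super)
    return h % 2 == 1 and all(hurdle_is_super)


def reversal_distance(n, c, hurdle_is_super):
    """Evaluate d = n + 1 - c + h + f for given n, c, and hurdle flags."""
    h = len(hurdle_is_super)
    f = 1 if has_fortress(hurdle_is_super) else 0
    return n + 1 - c + h + f
```

For π = (+2, +1) with respect to the identity, the breakpoint graph has a single cycle, and that cycle forms one simple hurdle, so `reversal_distance(2, 1, [False])` gives 3. Three superhurdles and no simple hurdle would form a fortress and add 4 to n + 1 − c.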

A reversal ρ(i, j) (1 ≤ i < j ≤ n) applied to $\pi = (\pi_1, \ldots, \pi_n)$ transforms π into $\rho(\pi) = (\pi_1, \ldots, -\pi_j, \ldots, -\pi_i, \ldots, \pi_n)$. A reversal ρ is a sorting reversal iff d(ρ(π), φ) = d(π, φ) − 1. Note that d(ρ(π), φ) < d(π, φ) iff d(ρ(π), φ) = d(π, φ) − 1. We will use the term ∆d to indicate the quantity d(ρ(π), φ) − d(π, φ). The end-points i and j of a reversal ρ(i, j) correspond to the ith and (j + 1)st black edges of B. We say that ρ acts on these edges. Note that ∆d ∈ {−1, 0, 1} for every reversal.

Setubal and Meidanis [34] have presented an alternative distinction to the one

between oriented and unoriented gray edges, based on black edges. They define

two black edges of the same cycle as convergent if in a traversal of the cycle, these

edges induce the same (circular) ordering of the vertices of B; otherwise the edges

⁴ Separation was introduced in [34]. Our definition is stated so as not to require a circular representation of B.



are divergent. It can be shown easily that an oriented gray edge always connects

divergent black edges and an unoriented gray edge always connects convergent black

edges (if the unoriented gray edge is part of a trivial cycle, it connects a black edge

to itself, and the correspondence holds trivially). Because every pair of black edges

in a cycle can be said to be convergent or divergent, however, and there are more

pairs of black edges than there are gray edges, not every divergent pair of black edges

corresponds to an oriented gray edge and not every convergent pair to an unoriented

gray edge. Setubal and Meidanis have shown that any reversal that acts on divergent

black edges will split the cycle to which the edges belong, and any reversal that acts

on convergent black edges will not split the cycle to which they belong. Furthermore,

any cycle that contains two divergent black edges must contain an oriented gray edge,

and hence is oriented (intuitively, an oriented cycle is a “splittable” cycle). Therefore,

a connected component is oriented iff at least one of its cycles has divergent black

edges (an oriented component is one that contains a splittable cycle).

Let M be the set of connected components in O, and let $C_i$ be the set of cycles that belong to $m_i \in M$. Every $c_{i,j} \in C_i$ has a set of black edges $B_{i,j}$. If there exist $b_{i,j,k}, b_{i,j,l} \in B_{i,j}$ such that $b_{i,j,k}$ and $b_{i,j,l}$ are divergent, then $c_{i,j}$ is oriented; otherwise $c_{i,j}$ is unoriented, unless $|B_{i,j}| = 1$, in which case $c_{i,j}$ is trivial. If there exists $c_{i,j}$ belonging to $m_i$ such that $c_{i,j}$ is oriented, then $m_i$ is oriented; otherwise, $m_i$ is unoriented, unless $m_i = c_{i,j}$ and $c_{i,j}$ is trivial, in which case $m_i$ also is trivial. If $m_i$ is unoriented, then $m_i$ is either a hurdle or a protected nonhurdle, depending on whether it separates other unoriented components.

The following definitions will enable us to define precisely the effect an arbitrary

reversal has on the orientation of edges and components.

Definition 3.1 A reversal ρ acting on black edges i and j induces a bipartitioning

of the vertices in the breakpoint graph, with one set R containing the vertices between

i and j, and the other set R′ containing all other vertices. We say that R and R′ are



the ranges of ρ.

In the general case of circular genomes, the labels R and R′ are arbitrary (they

depend on how one draws the breakpoint graph). Indeed, the relationship between

the two sets is symmetric, in that what we think of as a reversal to the elements of

one set can as well be modeled as a reversal to the elements of the other. What is

important is that the bipartitioning is unambiguously defined by ρ.

Definition 3.2 We say that a reversal ρ with ranges R and R′ affects a gray edge

iff the edge has one vertex in R and one vertex in R′.

We say that ρ affects a component iff ρ affects at least one gray edge belonging to

the component. It can be shown easily that a component is affected by a reversal

with ranges R and R′ iff at least one vertex of the component (in the breakpoint

graph) belongs to each of R and R′.

Lemma 3.1 A reversal ρ causes a gray edge e to change orientation iff ρ affects e.

Proof: The edge e is oriented iff the black edges adjacent to e are divergent, and

unoriented iff the black edges adjacent to e are convergent. If ρ affects e, then one

of its adjoining black edges will be “reversed” with respect to the other (in terms of

the direction it induces in a traversal of the cycle); and if ρ does not affect e, then

e’s black edges will remain unchanged with respect to one another. Thus, if ρ affects

e its adjoining black edges will change from convergent to divergent or vice versa,

and if ρ does not affect e, its adjoining black edges will remain as they were.
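Lemma 3.1 can be checked empirically on small examples. The sketch below assumes φ is the identity, reads “spans an odd number of vertices” as an inclusive count of the positions between the two endpoints of a gray edge, takes the range R of ρ(i, j) to be positions 2i − 1 through 2j of π′ (the vertices between the ith and (j + 1)st black edges), and allows i = j. All function names are hypothetical.

```python
from itertools import combinations_with_replacement


def unsigned_extension(pi):
    """Map a signed permutation (tuple of nonzero ints) to pi' of length 2n + 2."""
    ext = [0]
    for x in pi:
        ext += [2 * x - 1, 2 * x] if x > 0 else [2 * -x, 2 * -x - 1]
    ext.append(2 * len(pi) + 1)
    return ext


def positions(pi):
    """Position in pi' of every vertex label."""
    return {v: k for k, v in enumerate(unsigned_extension(pi))}


def orientation(pi):
    """Gray edge k, joining 2k and 2k+1, maps to True iff it is oriented."""
    pos = positions(pi)
    return {k: (abs(pos[2 * k] - pos[2 * k + 1]) + 1) % 2 == 1
            for k in range(len(pi) + 1)}


def apply_reversal(pi, i, j):
    """Reverse pi[i..j] (1-based, inclusive) and negate every element in it."""
    return pi[:i - 1] + tuple(-x for x in reversed(pi[i - 1:j])) + pi[j:]


def affected(pi, i, j):
    """Gray edges with exactly one endpoint in the range R of rho(i, j)."""
    pos = positions(pi)
    lo, hi = 2 * i - 1, 2 * j
    def in_r(p):
        return lo <= p <= hi
    return {k for k in range(len(pi) + 1)
            if in_r(pos[2 * k]) != in_r(pos[2 * k + 1])}


def check_lemma_3_1(pi):
    """True if, for every reversal, exactly the affected edges flip orientation."""
    before = orientation(pi)
    n = len(pi)
    for i, j in combinations_with_replacement(range(1, n + 1), 2):
        after = orientation(apply_reversal(pi, i, j))
        flipped = {k for k in before if before[k] != after[k]}
        if flipped != affected(pi, i, j):
            return False
    return True
```

Under these assumptions the check succeeds for every permutation, because reversing a segment of even length in π′ changes the parity of the position of each vertex inside R and of no vertex outside it.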

Corollary 3.1 A reversal ρ causes an unoriented component u to become oriented

iff ρ affects u.



[Figure 3.3: two breakpoint graphs, panels (a) and (b)]

Figure 3.3: Ways to cut a simple hurdle (a) and ways to merge two mergeable hurdles (b). Any reversal that acts on black edges of a single cycle in a hurdle will orient at least one edge of the cycle, and thus cut the hurdle. Any reversal that acts on black edges from different hurdles will combine them into a single oriented component, and thus merge the hurdles.

Proof: At least one gray edge belonging to u will be affected by ρ iff ρ affects u; a

gray edge will change from unoriented to oriented iff it is affected; and u will become

oriented iff at least one of its gray edges becomes oriented.

Corollary 3.1 provides us with a simple way to explain the phenomena Hannenhalli and Pevzner have called “hurdle cutting” and “hurdles merging” [18]. In hurdle

cutting, a single hurdle is affected and rendered oriented, and in hurdles merging,

two hurdles are affected and oriented, as are any protected nonhurdles that separate

them (see Figure 3.3). Not all reversals that affect a single hurdle cut the hurdle,

however, and not all reversals that affect two hurdles merge them, as will be shown

in the next section.

3.2 Sorting Reversals in the Absence of Fortresses

Based on the definitions above, we can describe an exhaustive classification scheme

for reversals.



Lemma 3.2 Suppose ρ is an arbitrary reversal, such that ρ acts on black edges $b_{i,j,k}$ and $b_{i',j',k'}$, belonging respectively to cycles $c_{i,j}$ and $c_{i',j'}$ and to connected components $m_i$ and $m_{i'}$. Then one of the following must be true:

1. $i = i'$ and $j = j'$ ($b_{i,j,k}$ and $b_{i',j',k'}$ belong to the same cycle, $c_{i,j}$)

   (a) $c_{i,j}$ is oriented and $m_i$ is oriented

       i. $b_{i,j,k}$ and $b_{i',j',k'}$ are convergent

       ii. $b_{i,j,k}$ and $b_{i',j',k'}$ are divergent

   (b) $c_{i,j}$ is unoriented and $m_i$ is either oriented or unoriented

2. $i = i'$ and $j \neq j'$ ($b_{i,j,k}$ and $b_{i',j',k'}$ belong to different cycles of the same component, $m_i$). Here, $m_i$ may be oriented or unoriented.

3. $i \neq i'$ ($b_{i,j,k}$ and $b_{i',j',k'}$ belong to different components, $m_i$ and $m_{i'}$). Each of $m_i$ and $m_{i'}$ may be benign or unoriented.

Proof: Either $i = i'$ or $i \neq i'$ (the edges are part of the same or different components); if $i = i'$ then either $j = j'$ or $j \neq j'$ (the edges are part of the same or different cycles); if $i = i'$ and $j = j'$ then $b_{i,j,k}$ and $b_{i',j',k'}$ are either convergent or divergent. Furthermore, each of $c_{i,j}$, $c_{i',j'}$, $m_i$, and $m_{i'}$ by definition is oriented, unoriented, or trivial. Trivial cycles or components are not possible in cases 1 and 2, and case 1 need not consider the possibility of an oriented cycle belonging to an unoriented component (which is prohibited by definition).

We define the “Fortress-Free Model” (FFM) of ASR to be a simplified version

of the problem in which it is assumed that a fortress does not exist. Thus, under

the FFM, d(π, φ) = n + 1 − c + h. The FFM allows us to introduce a simple but

powerful rule we will call “conservation of distance”.



Lemma 3.3 (Conservation of Distance) Under the FFM, a reversal is a sorting reversal iff: ∆c = −1 and ∆h = −2; or ∆c = 0 and ∆h = −1; or ∆c = 1 and ∆h = 0.

Proof: We know that ∆c ∈ {−1, 0, 1} (because a reversal can only merge cycles, be neutral with respect to cycle number, or split a cycle). In the case of a sorting reversal, ∆d = −1. Because d = n + 1 − c + h and n is fixed, ∆d = ∆h − ∆c, so it must be true that ∆h − ∆c = −1. Substituting each of the three possible values of ∆c gives the three cases of the lemma.
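The rule amounts to a one-line predicate, sketched here with a hypothetical function name and on the assumption that ∆c and ∆h have already been determined for the reversal in question.

```python
def is_sorting_under_ffm(delta_c, delta_h):
    """Conservation of distance: sorting iff delta_h - delta_c == -1."""
    if delta_c not in (-1, 0, 1):
        raise ValueError("a reversal changes the cycle count by -1, 0, or 1")
    return delta_h - delta_c == -1


# The only (delta_c, delta_h) combinations that a sorting reversal can have.
SORTING_CASES = [(dc, dc - 1) for dc in (-1, 0, 1)]
```

`SORTING_CASES` evaluates to `[(-1, -2), (0, -1), (1, 0)]`, which are the three cases listed in the lemma.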

The lemmata below address Cases 1a, 1b, 2, and 3 of Lemma 3.2. They rely

heavily on the idea of conservation of distance.

Lemma 3.4 (Case 1a) Under the FFM, a reversal ρ that acts on two black edges

belonging to the same oriented cycle is a sorting reversal iff the edges are divergent

and ρ does not increase the number of hurdles.

Proof: Either the black edges are divergent and ρ splits the cycle, or the black edges

are convergent and ρ is neutral with respect to cycle number [34]. If ρ splits the cycle

(∆c = 1), then it is a sorting reversal iff ∆h = 0 (conservation of distance). If ρ is

neutral (∆c = 0), then it is a sorting reversal iff ∆h = −1 (conservation of distance).

But if ρ is neutral, we know that ∆h = 0, because the reversal acts on black edges of

an oriented cycle, and therefore of an oriented component. A reversal that acts only

on edges of a single component cannot affect any other component (otherwise the

two components would overlap, and thus would not be separate components), so in

this case, ρ cannot change the number of hurdles. Therefore, to be a sorting reversal,

ρ must split the cycle in question and avoid increasing the number of hurdles.

Lemma 3.5 (Case 1b) Under the FFM, a reversal ρ that acts on two black edges

belonging to the same unoriented cycle c is a sorting reversal iff c belongs to a simple

hurdle.



Proof: Because c is unoriented, all of its black edges are convergent; therefore ρ

must result in ∆c = 0. The component m to which c belongs is either oriented, a

hurdle, or a protected nonhurdle. If m is oriented, then ∆h = 0, as described in the

proof of Lemma 3.4. If m is a protected nonhurdle, ρ cannot orient a hurdle, and

∆h ≥ 0 (orienting a protected nonhurdle cannot decrease the number of hurdles). If

m is a hurdle, however, then ρ cuts m. But if m is a superhurdle, then when it is

cut another hurdle will emerge, and ∆h = 0; thus, ρ will not be a sorting reversal.

Only if m is a simple hurdle will ∆h = −1, and only in this case will ρ be a sorting

reversal.

Lemma 3.6 (Case 2) Under the FFM, a reversal ρ cannot be a sorting reversal

if ρ acts on two black edges belonging to different cycles of the same component.

Proof: Because ρ acts on different cycles, ∆c = −1 [34]. Therefore, by conservation

of distance, ρ is a sorting reversal only if ∆h = −2. It is impossible, however, for ρ

to orient two hurdles, because it affects only a single component.

To address Case 3 of Lemma 3.2, we must introduce several new ideas.

Definition 3.3 Suppose in a permutation π there exist four unoriented components

u, v, w, and p, such that p is a protected nonhurdle that separates hurdles u and v

from w but does not separate u and v from each other (Figure 3.4). Further suppose

that every other unoriented component in π either separates u and v or is separated

by p from both u and v, and that p does not separate any two of the components

that it separates from u and v. Then we say that hurdles u and v form a double

superhurdle.

Definition 3.4 A hurdle that separates a benign component from all other unori-

ented components is said to be the separating hurdle of the benign component.



[Figure 3.4: breakpoint graph with unoriented components u, v, w, x, y, z, and p]

Figure 3.4: Breakpoint graph of the permutation π = +2, +1, +3, +6, +16, +15, +7, +10, +12, +11, +13, +9, +8, +14, +17, +19, +18, +20, +5, +4, +21, +23, +22, in which hurdles u and v form a double superhurdle with respect to protected nonhurdle p. Notice that all unoriented components besides u, v, and p either separate u and v (y and z) or are separated from both u and v by p (x and w). We call this construct a “double superhurdle” because a reversal that destroys u and v will cause p to become a hurdle; thus, the pair acts as a kind of superhurdle of p, even though neither of the individuals is a superhurdle of p. Notice in this example that x and w also form a double superhurdle with respect to p.

It follows from the definition of a separating hurdle that a benign component may

have at most one separating hurdle, but a hurdle may be the separator of multiple

benign components. All separating hurdles and the benign components that they

separate can be found easily in a single pass of the breakpoint graph, as shown in

Algorithm 3.1.

Lemma 3.7 (Case 3) Under the FFM, a reversal ρ that acts on black edges belonging to different components u and v is a sorting reversal iff all of the following

are true:

1. Each of u and v is a hurdle or a benign component that has a separating hurdle.

2. u and v are not benign components sharing the same separating hurdle.

3. u and v or their separating hurdles do not form a double superhurdle.



Input: A breakpoint graph B; a corresponding array lab such that lab[i] is the label of the connected component to which the ith vertex, v_i ∈ B, belongs; whether each connected component is “benign”, a “hurdle”, or a “protected nonhurdle”

Output: A list of pairs (x, y) such that x is a benign component and y is its separating hurdle.

    begin
        initialize lists L, M;
        initialize array mark;
        start ← −1;
        for i ← 0 to 2n + 1 do
            if lab[i] is a hurdle then start ← i; break;
        if start = −1 then return L;
        currenth ← lab[start];
        for i ← start + 1 to start + 2n + 2 do
            comp ← lab[i mod (2n + 2)];
            if comp = currenth then
                /* another instance of current hurdle: save all members of M */
                foreach c ∈ M do
                    mark[c] ← 0;
                    append(L, (c, currenth));
                end
                empty M;
            end
            else if comp is a hurdle then
                /* new hurdle: empty M */
                currenth ← comp;
                foreach c ∈ M do mark[c] ← 0;
                empty M;
            end
            else if comp is benign and mark[comp] = 0 then
                /* new benign component: add to M */
                mark[comp] ← 1;
                append(M, comp);
            end
        end
        return L;
    end

Algorithm 3.1: find separating hurdles



[Figure 3.5: breakpoint graph showing a reversal ρ, benign components u and v, and hurdles p and q]

Figure 3.5: A reversal ρ that acts on black edges belonging to two benign components (u and v) can still cause the destruction of two hurdles (p and q), if the benign components have distinct separating hurdles. The effect is similar to what would be observed if the hurdles were merged.

Proof: As in Lemma 3.6, ∆c = −1 and ρ is a sorting reversal only if ∆h = −2.

If ∆h = −2, then ρ must orient at least two hurdles; and it is impossible for a

single reversal to orient more than two hurdles, so ρ must orient exactly two hurdles.

To orient two hurdles, ρ must act on two black edges, each of which belongs to a

hurdle or to a benign component separated from the other by a hurdle (a protected

nonhurdle cannot be separated from a hurdle by a hurdle; otherwise a hurdle would

separate unoriented components). Furthermore, if each black edge belongs to a

benign component, the two benign components cannot be separated by a single

hurdle (if they were, ρ would orient only one hurdle). If a benign component is

separated by a hurdle from another hurdle, the former hurdle must be its separating

hurdle (another separating hurdle could not exist; if it did a hurdle would separate

unoriented components). Thus, each of u and v must be a hurdle or a benign

component that has a separating hurdle; and if both are benign components, they

cannot share a separating hurdle. For it to be true that ∆h = −2, however, ρ

must avoid introducing any new hurdles. We claim that a reversal that destroys two

hurdles causes a new hurdle to emerge iff those hurdles form a double superhurdle.

It is clear that, if each of u and v is a hurdle or a benign component separated

by a hurdle, and they do not share a separating hurdle, then ρ will orient exactly

two hurdles; thus, if we prove our claim about double superhurdles, we will have

completed both directions of the proof of Lemma 3.7.



A reversal ρ that destroys two hurdles u and v will cause a new hurdle to emerge

iff u and v form a double superhurdle. It follows directly from the definition of a

double superhurdle that, if u and v form a double superhurdle, then a reversal that

destroys them must cause a new hurdle to emerge; therefore, we will focus on the

converse claim. Suppose ρ causes a protected nonhurdle p to emerge as a hurdle.

Then p must separate each member of a nonempty set U of unoriented components

from each member of a nonempty set V of unoriented components, and ρ must orient

all members of U or all members of V (it cannot orient members of both, because if ρ

spanned U and V it would orient p). Assume that ρ orients all members of U . Then

u, v ∈ U , and u and v are separated from all members of V . Furthermore, u and

v must be separated from one another by all other members of U ; otherwise either

they could not be hurdles, or a reversal that destroyed them could not also destroy

the other members of U . Thus, u and v meet the definition of a double superhurdle.

3.3 Accommodating Fortresses

Now we shall abandon the simplification of the FFM and re-introduce fortresses.

With fortresses, we have two major cases to consider:

1. Before a reversal ρ, there exists no fortress (f = 0). Here we must consider the

possibility that ρ introduces a fortress, and thus is not a sorting reversal in the

general model.

2. Before a reversal ρ, there exists a fortress (f = 1). Here there is no danger

of introducing a fortress (there can be only one fortress). Instead, however,

we must consider the possibility that a reversal ρ that is not a sorting reversal

under the FFM eliminates the fortress, thus is a sorting reversal in the general

model.



Case 1 is addressed by the following lemma.

Lemma 3.8 A reversal ρ that meets the criteria for a sorting reversal under the

FFM will introduce a fortress iff one of the following is true:

1. ρ acts on divergent black edges of the same oriented cycle, and ρ introduces at

least one unoriented component so that there are an odd number of hurdles all

of which are superhurdles.

2. ρ cuts the only simple hurdle and there are an odd number of superhurdles.

3. ρ acts on two black edges belonging to different components such that two hurdles are destroyed, and the set of hurdles is reduced so that there remain an odd number and all are superhurdles.

Proof: We can introduce a fortress only by changing the set of hurdles such that it

has an odd size and consists only of superhurdles. Three classes of sorting reversals

under the FFM are capable of changing the set of hurdles: (a) those that introduce

unoriented components as they split cycles (Lemma 3.4); (b) those that cut simple

hurdles (Lemma 3.5); and (c) those that destroy pairs of hurdles (Lemma 3.7). A

reversal of any of these three classes is capable of changing the set of hurdles so as

to create a fortress. A reversal of class (b), however, can only remove a single simple

hurdle, so for it to create a fortress, it must cut the sole simple hurdle and there must already exist an odd number of superhurdles.

The second case is more difficult and requires the introduction of several new

concepts.

Definition 3.5 We say two unoriented components u and v are adjacent in a break-

point graph B iff at least one pair of vertices from u and v have between them, in



[Figure 3.6: hurdle graph with vertices x, w, p, z, u, y, and v]

Figure 3.6: The hurdle graph for the breakpoint graph of Figure 3.4. Every unoriented component is represented by a vertex, and adjacencies between unoriented components are represented by edges. Notice here that unoriented components y, z, and u form a hurdle chain for hurdle u, with y as the anchor. Also notice that each of the double superhurdles has a corresponding 3-vertex cycle.

a circular traversal of the vertices of the breakpoint graph, no vertex belonging to

another unoriented component.

Definition 3.6 Let U be the set of unoriented components in a breakpoint graph B.

We define the hurdle graph for B to be a graph H = (V, E) such that V = U and $E = \{\{v_i, v_j\} \mid v_i, v_j \in V \text{ and } v_i, v_j \text{ are adjacent in } B\}$.

We can easily construct the hurdle graph in a single traversal of the breakpoint

graph, after we have labeled B with the connected components in O, and identified

all unoriented components. Figure 3.6 shows the hurdle graph for the breakpoint

graph of Figure 3.4.
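A single-pass construction along these lines might look as follows. The sketch assumes the same hypothetical `lab` array of component labels used by Algorithm 3.1, together with the set of labels that are unoriented. It relies on the observation that, once benign vertices are dropped, two unoriented components are adjacent exactly when their labels occur consecutively somewhere in the circular sequence.

```python
def hurdle_graph(lab, unoriented):
    """Return the hurdle graph as a dict: component -> set of adjacent components.

    lab[k]     -- label of the component owning the kth vertex of B
    unoriented -- collection of labels of the unoriented components
    """
    unoriented = set(unoriented)
    seq = [c for c in lab if c in unoriented]   # benign vertices never block adjacency
    adj = {c: set() for c in unoriented}
    for k, c in enumerate(seq):
        nxt = seq[(k + 1) % len(seq)]           # circular successor
        if nxt != c:
            adj[c].add(nxt)
            adj[nxt].add(c)
    return adj
```

For the toy labeling `['u', 'p', 'v', 'b', 'p', 'w', 'p', 'u']` with u, p, v, and w unoriented and b benign, the result is a star centered on p: p has degree 3 and each of u, v, and w has degree 1. The labels are illustrative and do not correspond to Figure 3.4.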

The hurdle graph has a number of useful properties. We will describe several of

them here without providing detailed proofs. For convenience of discussion, we will

use the same name to refer to an unoriented component and the vertex in the hurdle

graph that represents it; e.g., unoriented component u is represented by vertex u.



1. Two unoriented components u and v are separated by a third unoriented component w iff there exists no path in the hurdle graph between vertices u and v

that does not pass through vertex w.

2. A vertex in the hurdle graph cannot separate other vertices if it belongs to a

cycle and has degree 2, or if it has degree 1. Such a vertex u appears in the

hurdle graph iff u is a hurdle.

3. A node in the hurdle graph can belong to multiple cycles, but an edge cannot.

The reason is that the rules of separation are such that multiple cycles can

occur only if there exists at least one vertex v that separates all members

(other than v) of one cycle from all members (other than v) of the other. As a

result, a vertex in the hurdle graph must separate other vertices if it belongs

to a cycle and has degree greater than 2, or if it does not belong to a cycle and

has degree 2. Such a vertex v appears in the hurdle graph iff v is a protected

nonhurdle.

4. A hurdle u is a superhurdle iff u has degree 1 and the single neighbor v of u

either has degree 3 and belongs to a cycle, or has degree 2 and does not belong

to a cycle.

Definition 3.7 A hurdle chain is a chain of vertices in the hurdle graph that

consists of a hurdle and zero or more other vertices, such that every vertex v in the

chain is a hurdle, has degree 2 and does not belong to a cycle, or is the last vertex in

the chain and either belongs to a cycle or has degree greater than 2. A superhurdle

chain is a hurdle chain that contains a superhurdle.

If a hurdle chain has one end that is not a hurdle, we call the vertex at that end

the anchor of the chain. A hurdle chain that has hurdles at both ends is said to

be “unanchored”. If there exists an unanchored chain, it must encompass the entire



hurdle graph (that the hurdle graph must be connected is implicit in its definition);

therefore, if there exists at least one anchored chain, all chains must be anchored.

The anchors of anchored chains always belong to cycles. We say that the chain is

“anchored by” the cycle to which its anchor belongs.

Lemma 3.9 Two hurdles u and v form a double superhurdle iff u and v belong to

hurdle chains anchored by w and x, such that w and x belong to a 3-vertex cycle,

the third vertex y of that cycle (y ≠ w, y ≠ x) has degree 3, and each of w and x has

degree of at most 3.

Proof: If w, x, and y form a 3-vertex cycle as described, and if y has degree 3,

there must exist a vertex z such that y separates z from both u and v, yet y does

not separate u and v from each other. Furthermore, any other vertices must belong

to the hurdle chains of u and v (in which case they separate u and v) or must be

separated from u and v by y. Finally, because y has degree of 3 (and not more), it

cannot separate any two of the components that it separates from u and v. Thus, u

and v meet the definition of a double superhurdle. To prove the converse, construct

a hurdle graph according to the definition of a double superhurdle as follows. Begin

with four components, u, v, w, and p. Place r − 1 vertices and r edges (r ≥ 1)

between u and p, and place s − 1 vertices and s edges (s ≥ 1) between v and p.

Place the vertex w on the other side of p from u and v, and connect it to p, using

any number of intermediate nodes and edges, making sure that only a single edge

connects this sub-graph to p (otherwise p would separate vertices that it separates

from u and v). Recall that u and v must not be separated from each other by p, and

all vertices that separate them from p must separate them from one another. The

only way to meet these criteria is to connect with a new edge the closest to p of the

nodes that separate u and p, and the closest to p of the nodes that separate v and p.

Call these nodes y and z. Thus, p, y, and z form a 3-vertex cycle, y and z are the



anchors of the hurdle chains of u and v, y and z have degrees of at most three, and

p has degree greater than 2.

Definition 3.8 We say a superhurdle is a single protector if it belongs to an

anchored hurdle chain of length 2. We call the neighbor of a single protector a

pseudohurdle.

The important implication of this definition is that if a single protector were oriented,

then the corresponding pseudohurdle would become a simple hurdle; and if a pseudo-

hurdle were oriented, the corresponding superhurdle would become a simple hurdle.

The reason for the name pseudohurdle will become apparent below. Essentially, when

there is a fortress, such a component can be treated much like a hurdle.

Lemma 3.10 A reversal ρ that affects a single unoriented component eliminates a

fortress iff it affects a single protector or a pseudohurdle.

Proof: First we will show that if ρ eliminates a fortress, it must affect a single

protector or a pseudohurdle. Suppose to the contrary that ρ eliminates a fortress

but affects a component u that is (1) a superhurdle that is not a single protector, or (2)

a protected nonhurdle that is not a pseudohurdle (u cannot be a simple hurdle if there

is a fortress). If u is a superhurdle that is not a single protector, then it protects

at least two nonhurdles. Therefore, if ρ orients u, it will simply create another

superhurdle in its place, and the fortress will not be removed. Similarly, if u is a

protected nonhurdle that is not a pseudohurdle, then when ρ orients u, no superhurdle

will be converted into a simple hurdle, and the fortress will remain. In either case our

assumption is impossible; thus, if ρ eliminates a fortress, it must affect and orient a

single protector or a pseudohurdle. Now we will show the converse: that if ρ affects

a single protector or a pseudohurdle, it must eliminate a fortress. If ρ affects and

orients a single protector or a pseudohurdle, it will create a simple hurdle. Because

we have assumed that ρ does not affect or create any other unoriented component,

the simple hurdle must remain a simple hurdle. A fortress cannot exist if there is a

simple hurdle, so ρ must eliminate the fortress.

Lemma 3.11 A sorting reversal ρ that orients a set U of two or more unoriented

components eliminates a fortress iff one of the following is true:

1. U includes all members of exactly one superhurdle chain.

2. U includes the members of a double superhurdle.

Proof: Let us show first that if ρ meets one of the two criteria listed, it must elim-

inate a fortress. Suppose U includes all members of superhurdle chain k. Then ρ

must remove k from the hurdle graph. Furthermore, U includes all members of no

other superhurdle chain, so ρ removes no other hurdle chain, and the number of

superhurdles must decrease by one; thus, there can no longer be an odd number

of superhurdles, and ρ eliminates the fortress. Suppose instead that U includes a

double superhurdle. In this case, U must contain exactly the members of the double

superhurdle and all unoriented components that separate them. Thus, ρ will destroy

two superhurdles and cause a new hurdle to emerge (the protected nonhurdle of the

double superhurdle). Therefore, an even number of hurdles will remain, and the

fortress can no longer exist.

Now we must show that if ρ eliminates a fortress, it must meet one of the criteria

listed. Assume to the contrary that ρ eliminates a fortress and that U neither includes all

members of exactly one superhurdle chain nor includes a double superhurdle. The

first criterion is the fundamental one; we will focus on it and address the other in

passing. U must include all members of exactly two superhurdle chains. We know this

because U cannot include all members of no chain (because ρ must affect multiple

cycles, it must cause ∆c = −1; thus, to be a sorting reversal, with ∆f = −1,

ρ must achieve ∆h = −1; but decreasing the number of hurdles is only possible

by orienting all members of at least one hurdle chain), and no single reversal can

orient all members of more than two chains. Therefore, ρ must eliminate exactly two

superhurdles. However, ρ cannot remove a fortress by eliminating two superhurdles

unless the fortress is a 3-fortress (the smallest possible fortress) or the superhurdles

form a double superhurdle (the only way that eliminating the two superhurdles could

cause a new hurdle to emerge). In a 3-fortress, every two superhurdles form a double

superhurdle (if any two were merged, the anchor of the third’s chain would emerge

as a hurdle), so it is enough to say that U must contain a double superhurdle.

This, however, is prohibited by the second part of our assumption so we have a

contradiction. Thus, if ρ eliminates a fortress, it must meet one of the two criteria

of the Lemma.

Lemma 3.12 A reversal ρ eliminates all members of exactly one superhurdle chain

iff it acts on edges belonging to two components u and v such that:

1. One of u and v is either a superhurdle, or a benign component with a separating

hurdle. Without loss of generality, assume that u is this component. Let w be

either u (if u is a superhurdle) or the separating hurdle of u (if u is a benign

component).

2. One of the following is true for v:

(a) v is the anchor of w’s superhurdle chain.

(b) v is a protected nonhurdle that does not belong to w’s superhurdle chain.

(c) v is a benign component that has no separating hurdle, and v is not sepa-

rated from one unoriented component in w’s chain by another.

Proof: First we will show that if ρ meets the criteria of the Lemma, it eliminates all

members of exactly one superhurdle chain. Suppose that u and w are defined as in

part 1. Suppose further that k is the hurdle chain of w. Then, if v is the anchor of

k, all members of k, and no members of any other hurdle chain, will be affected by

ρ. If instead v is a protected nonhurdle not belonging to k, then either v belongs to

another hurdle chain, or v belongs to no hurdle chain. In both cases, all members

of k are affected by ρ, and all members of no other hurdle chain are affected (if v

belongs to another hurdle chain k′, then at least one member of k′ [its hurdle] is not

affected by ρ). Finally, if v is a benign component that has no separating hurdle, and

v is not separated from one unoriented component in k by another, then no hurdle

other than w can separate u and v, and again, ρ eliminates all members of exactly

one hurdle chain.

Now we shall show that if ρ eliminates all members of exactly one superhurdle chain,

then ρ meets the criteria of the Lemma. If ρ eliminates all members of superhurdle

chain k, and does not eliminate all members of any other superhurdle chain, then ρ

must act on the black edges of two components u and v such that (without loss of

generality) u is equal to, or separated from v by, the superhurdle w of k, and v is

equal to, or separated from u by, the anchor a of k. Thus, u must either equal w or

be a benign component separated by w, and we have proved the necessity of the first

criterion. If v equals the anchor a, then the second criterion is satisfied via option

(a). Otherwise, v must not be another superhurdle (i.e., a superhurdle besides w),

a benign component separated by another superhurdle, or a member of k. If v were

another superhurdle or a benign component separated by another superhurdle, then

ρ would destroy all members of two superhurdle chains; and if v were a member of

k other than a, then ρ would not eliminate all members of k. All that remains is

for v to be a protected nonhurdle that does not belong to k (option (b)) or a benign

component that has no separating hurdle. If the latter, that benign component also

cannot be separated from one unoriented component in w’s chain by another; otherwise,

ρ would not orient all components of k. Thus, we obtain option (c). Therefore, if

ρ eliminates all members of exactly one superhurdle chain, then ρ meets the first

criterion and one of the three options of the second criterion, and the Lemma is

proved.

Finally, we are prepared to enumerate the ways in which a fortress can be elimi-

nated.

Lemma 3.13 If there exists a fortress, then a sorting reversal ρ will eliminate the

fortress iff one of the following is true:

1. ρ splits a cycle and increases the number of hurdles.

2. ρ cuts a single protector or a pseudohurdle.

3. ρ acts on edges of two components u and v such that one of u and v is a

superhurdle or a benign component that has a separating superhurdle. Let u be

this component, and let the hurdle w be either u or u’s separating hurdle. Then

one of the following is true:

(a) v is the anchor of w’s chain.

(b) v is a protected nonhurdle not belonging to w’s chain.

(c) v is a benign component that is not separated by a hurdle, and v is not

separated by one component in w’s chain from another.

(d) v is a superhurdle or a benign component with a separating superhurdle,

and v or its separating hurdle forms a double superhurdle with w.

Proof: Any reversal that eliminates a fortress must make the number of hurdles even

or cause there to arise at least one simple hurdle. Such a change in the set of hurdles

can occur only by the elimination or creation of hurdles, or by the elimination of

pseudohurdles (it cannot occur through the creation of a protected nonhurdle). We

know that d = n + 1 − c + h + f and, because ρ is a sorting reversal, that ∆d = −1;

therefore, if ∆f = −1, then ∆h = ∆c. Furthermore, ∆h = ∆c ∈ {−1, 0, 1}. Let us

consider these three possibilities:

1. ∆h = ∆c = 1 (a cycle is split and the number of hurdles increases by one).

This case will occur only if at least one portion of the split cycle forms a new

unoriented component, and if this new unoriented component neither separates

existing unoriented components nor protects an existing superhurdle. The

increase in the number of hurdles will cause an odd number to become even,

and thus will eliminate the fortress.

2. ∆h = ∆c = 0 (neither the number of cycles nor the number of hurdles changes).

This case can occur only if ρ acts on convergent edges of the same cycle (otherwise

∆c ≠ 0). In addition, ∆f = −1 and ∆h = 0 will be true iff the affected

cycle belongs to a single protector or a pseudohurdle (Lemma 3.10).

3. ∆h = ∆c = −1 (two cycles are merged and the number of hurdles decreases

by one). Here ρ must act on edges of different components (if it acted on

different cycles of the same component, it could not decrease the number of

hurdles). Because all hurdles are superhurdles, the number of hurdles can-

not change unless multiple unoriented components are affected; thus Lemma

3.11 applies here. By combining Lemma 3.11 with Lemma 3.12 we obtain the

characterization presented in Lemma 3.13.
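The relation ∆h = ∆c on which the three cases above rest can be written out as a short worked derivation. It uses only the distance formula and the two facts quoted in the proof (∆d = −1 for a sorting reversal, ∆f = −1 when the fortress is eliminated); the observation that n is unchanged by a reversal is the only additional step.

```latex
\begin{align*}
d &= n + 1 - c + h + f \\
\Delta d &= -\Delta c + \Delta h + \Delta f
  && \text{($n$ is unchanged by a reversal)} \\
-1 &= -\Delta c + \Delta h - 1
  && \text{($\Delta d = -1$, $\Delta f = -1$)} \\
\Delta h &= \Delta c
\end{align*}
```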

Now we can synthesize everything developed so far by generalizing Lemmata 3.4,

3.5, 3.6, and 3.7 to accommodate fortresses.

Lemma 3.14 (Generalization of Lemma 3.4) A reversal ρ that acts on two

black edges belonging to the same oriented cycle is a sorting reversal iff the edges are

divergent and one of the following is true:

1. ρ does not introduce an unoriented component.

2. There exists no fortress (f = 0) and ρ introduces an unoriented component, but

that unoriented component does not cause there to be an odd number of hurdles

all of which are superhurdles.

3. There exists a fortress (f = 1) and ρ increases the number of hurdles.

Proof: In case 1, the set of hurdles cannot be changed, so a fortress cannot be

introduced. Case 2 follows from Lemma 3.8, and case 3 from Lemma 3.13.

Lemma 3.15 (Generalization of Lemma 3.5) A reversal ρ that acts on two

black edges belonging to the same unoriented cycle c is a sorting reversal iff one of

the following is true:

1. There exists no fortress (f = 0), c belongs to a simple hurdle u, and either u

is not the only simple hurdle or the number of superhurdles is even.

2. There exists a fortress (f = 1) and c belongs to a single protector or a pseudo-

hurdle.

Proof: Case 1 follows from Lemma 3.8 and case 2 from Lemma 3.13.

Lemma 3.16 (Generalization of Lemma 3.6) A reversal ρ cannot be a sorting

reversal if ρ acts on two black edges belonging to different cycles of the same compo-

nent.

Proof: We have shown in Lemma 3.13 that, even if a fortress exists, a reversal that

acts on different cycles of the same component cannot remove the fortress. Thus

Lemma 3.6 applies without change to the general case.

Lemma 3.17 (Generalization of Lemma 3.7) A reversal ρ that acts on black

edges belonging to different components u and v is a sorting reversal iff one of the

following is true:

1. There exists no fortress (f = 0) and all of the following are true:

(a) Each of u and v is a hurdle or a benign component that has a separating

hurdle.

(b) u and v are not benign components sharing the same separating hurdle.

(c) u and v or their separating hurdles do not form a double superhurdle.

(d) The elimination of the hurdles associated with u and v will not leave an

odd number of hurdles all of which are superhurdles.

2. There exists a fortress (f = 1) and either:

(a) Each of u and v is a superhurdle or a benign component that has a sep-

arating superhurdle, and u and v are not benign components sharing the

same separating hurdle; or

(b) Case 3 of Lemma 3.13 applies.

Proof: Case 1 comes directly from Lemma 3.7 and case 2 from Lemma 3.13.

3.4 The Algorithm

Lemmata 3.14, 3.15, 3.16, and 3.17 lead directly to an algorithm to address ASR

(Algorithm 3.2). For clarity of presentation, we have broken out as separate sub-

routines the steps that find sorting reversals that split cycles (Algorithm 3.3), that

cut hurdles (or pseudohurdles; Algorithm 3.4), and that merge separate components

(Algorithms 3.5 and 3.6). For the moment, we shall assume in Algorithm 3.3 the

existence of a routine to detect whether a candidate reversal introduces new unori-

ented components; the details of deriving such a routine will be the subject of the

next section.

Input: Two signed permutations of size n, π and φ; assume functions get_revs_split_cycles, get_revs_cut_hurdles, get_revs_merge_nofort, and get_revs_merge_fort that find all sorting reversals that respectively split cycles, cut hurdles (or pseudohurdles), merge components (when there is no fortress), and merge components (when there is a fortress).

Output: A list L of all sorting reversals of π with respect to φ.

begin
    Construct the breakpoint graph B of π with respect to φ;
    Identify all black edges, cycles, and connected components in B. Let e_{i,j,k} represent the ith black edge belonging to the jth cycle of the kth component;
    For each (i, j, k), let o_{i,j,k} be defined such that o_{i,j,k} ∈ {−1, +1} and o_{i1,j,k} · o_{i2,j,k} = −1 iff e_{i1,j,k} and e_{i2,j,k} are divergent;
    Label each component as oriented, trivial, or unoriented;
    Build hurdle graph H; use H to label each unoriented component as a simple hurdle, a single protector, a pseudohurdle, a superhurdle (i.e., a non-single-protector), or a protected nonhurdle (i.e., a non-pseudohurdle). Also label each member of a double superhurdle with its partner (a single hurdle can have at most two);
    Let c be the number of cycles in B, h be the number of hurdles, and s be the number of superhurdles; let f = 1 if h = s and s is odd; otherwise let f = 0;
    Initialize list L;
    append_all(L, get_revs_split_cycles());
    append_all(L, get_revs_cut_hurdles());
    if f = 0 then append_all(L, get_revs_merge_nofort());
    else append_all(L, get_revs_merge_fort());
    return L;
end

Algorithm 3.2: find_all_sorting_reversals
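The fortress flag f computed during the setup of Algorithm 3.2 is simple enough to sketch directly. The Python fragment below is an illustration only and not part of the original listing: the function names fortress_flag and reversal_distance are hypothetical, and the distance formula is the one quoted in the proof of Lemma 3.13.

```python
def fortress_flag(h: int, s: int) -> int:
    """Return f as defined in the setup of Algorithm 3.2.

    h is the number of hurdles and s the number of superhurdles.
    f = 1 if every hurdle is a superhurdle (h == s) and s is odd,
    otherwise f = 0.
    """
    if h < 0 or s < 0 or s > h:
        raise ValueError("need 0 <= s <= h")
    return 1 if (h == s and s % 2 == 1) else 0


def reversal_distance(n: int, c: int, h: int, s: int) -> int:
    """d = n + 1 - c + h + f, with f derived from h and s."""
    return n + 1 - c + h + fortress_flag(h, s)
```

For instance, with made-up counts n = 10, c = 4, and h = s = 3, the flag is 1 and the distance evaluates to 10 + 1 − 4 + 3 + 1 = 11.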

Input: All e_{i,j,k}, o_{i,j,k}, and the values of h, s, and f from find_all_sorting_reversals; assume the existence of a function detect_new_unoriented_components that returns a list of new unoriented components introduced by a reversal (or ∅ if none is introduced).

Output: A list M of all sorting reversals that split cycles.

begin
    Initialize list M;
    foreach e_{i1,j,k} and e_{i2,j,k} such that o_{i1,j,k} · o_{i2,j,k} = −1 do
        P ← detect_new_unoriented_components(ρ(e_{i1,j,k}, e_{i2,j,k}));
        if P = ∅ then
            /* no new unoriented component */
            append(M, ρ(e_{i1,j,k}, e_{i2,j,k}));
        end
1       else
            /* at least one new unoriented component */
            Add components represented by the elements of P to hurdle graph;
            Label types of new components, and relabel neighbors in hurdle graph as needed;
            Compute new number of hurdles (h′) and whether a fortress exists in new permutation (f′);
2           if h′ + f′ ≤ h + f then
                append(M, ρ(e_{i1,j,k}, e_{i2,j,k}));
            end
        end
    end
    return M;
end

Algorithm 3.3: get_revs_split_cycles
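The loop condition and the acceptance test of step 2 of Algorithm 3.3 can be sketched as follows. This is a minimal illustration under stated assumptions: the orientations of the black edges of a single cycle are supplied as a list of +1/−1 values, and the helper names divergent_pairs and is_sorting_split are hypothetical. The hurdle-graph update that produces h′ and f′ is not modeled here.

```python
from itertools import combinations


def divergent_pairs(orientations):
    """Return index pairs (i1, i2), i1 < i2, of divergent black edges.

    `orientations` holds o_{i,j,k} in {-1, +1} for the black edges of
    one cycle; two edges are divergent iff their product is -1.
    """
    return [
        (i1, i2)
        for i1, i2 in combinations(range(len(orientations)), 2)
        if orientations[i1] * orientations[i2] == -1
    ]


def is_sorting_split(h, f, h_new, f_new):
    """Step 2: accept the reversal iff h' + f' <= h + f."""
    return h_new + f_new <= h + f
```

With a fortress present (h = 3, f = 1), a split that raises the number of hurdles to 4 while removing the fortress is accepted, which matches case 3 of Lemma 3.14.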

reversals

Output: A list M of all sorting reversals that cut hurdles (or pseudohurdles).

begin
    Initialize list M;
    H ← {k | component k is a simple hurdle};
1   if f = 1 then H ← H ∪ {k | component k is a pseudohurdle or a single protector};
2   if f = 0 and s = 2a + 1, a ∈ Z⁺ and h = 2a + 2 then
        do not cut hurdles; /* avoid a fortress! */
    end
3   else foreach e_{i1,j,k}, e_{i2,j,k} such that k ∈ H and i1 ≠ i2 do
        append(M, ρ(e_{i1,j,k}, e_{i2,j,k}));
    end
    return M;
end

Algorithm 3.4: get_revs_cut_hurdles
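Steps 1 and 2 of Algorithm 3.4 amount to choosing a set of components and then applying a guard. The sketch below is illustrative only: the label strings and the function name components_to_cut are hypothetical, and Z⁺ is read as the positive integers (a ≥ 1, so s ≥ 3), which is an assumption consistent with the 3-fortress being the smallest fortress.

```python
def components_to_cut(labels, h, s, f):
    """Return the ids of components whose cycles may be cut.

    `labels` maps a component id to one of the (hypothetical) strings
    "simple_hurdle", "single_protector", "pseudohurdle",
    "superhurdle", or "protected_nonhurdle".
    """
    cuttable = {k for k, lab in labels.items() if lab == "simple_hurdle"}
    if f == 1:
        # Step 1: with a fortress, pseudohurdles and single protectors
        # are treated like simple hurdles.
        cuttable |= {
            k for k, lab in labels.items()
            if lab in ("pseudohurdle", "single_protector")
        }
    # Step 2: s = 2a + 1 (a >= 1 assumed) and h = 2a + 2 means exactly
    # one simple hurdle remains; cutting it would create a fortress.
    if f == 0 and s >= 3 and s % 2 == 1 and h == s + 1:
        return set()
    return cuttable
```

Step 3 would then enumerate every pair of distinct black edges lying on the same cycle of each returned component.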

reversals.

Output: A list M of sorting reversals that merge separate components, assuming there is not a fortress.

begin
    Initialize list M;
    Find all separating hurdles (Algorithm 3.1). Let S_k ∈ S be defined such that S_k = {j | component j is a benign component whose separating hurdle is component k};
    H ← {k | component k is a hurdle};
    foreach i ∈ H do
        foreach k ∈ H such that k ≠ i do
            /* avoid a fortress */
1           if (s = 2a + 1, a ∈ Z⁺ and h = 2a + 3 and hurdles i and k are both simple hurdles) or
               (s = 2a + 2, a ∈ Z⁺ and h = 2a + 3 and one of hurdles i and k is a superhurdle) then
                Do not merge i and k;
            end
            /* avoid a double superhurdle */
2           else if hurdles i and k form a double superhurdle then
                Do not merge i and k;
            end
            else
3               foreach j ∈ {i} ∪ S_i do
4                   foreach e_{x1,y1,z1}, e_{x2,y2,z2} such that z1 = j and z2 ∈ {k} ∪ S_k do
                        append(M, ρ(e_{x1,y1,z1}, e_{x2,y2,z2}));
                    end
                end
            end
        end
    end
    return M;
end

Algorithm 3.5: get_revs_merge_nofort
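The guard in step 1 of Algorithm 3.5 can be sketched as a small predicate. This is an illustration under two assumptions that are not stated in the listing: Z⁺ is read as a ≥ 1, and "one of hurdles i and k is a superhurdle" is read as exactly one (if both were superhurdles, a simple hurdle would remain and no fortress could arise). The function name merge_creates_fortress is hypothetical.

```python
def merge_creates_fortress(h, s, i_is_super, k_is_super):
    """Return True if merging hurdles i and k would create a fortress.

    h and s are the current numbers of hurdles and superhurdles, and
    no fortress is assumed to exist beforehand.
    """
    both_simple = not i_is_super and not k_is_super
    exactly_one_super = i_is_super != k_is_super
    # s = 2a + 1, h = 2a + 3: the only two simple hurdles disappear,
    # leaving an odd number of hurdles, all of them superhurdles.
    case_a = s >= 3 and s % 2 == 1 and h == s + 2 and both_simple
    # s = 2a + 2, h = 2a + 3: the single simple hurdle and one
    # superhurdle disappear, leaving 2a + 1 superhurdles.
    case_b = s >= 4 and s % 2 == 0 and h == s + 1 and exactly_one_super
    return case_a or case_b
```

The double-superhurdle test of step 2 depends on the hurdle graph and is not modeled here.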

reversals.

Output: A list M of sorting reversals that merge separate components, assuming there is a fortress.

begin
    Initialize list M;
    Find all separating hurdles (Algorithm 3.1). Let S_k ∈ S be defined such that S_k = {j | component j is a benign component whose separating hurdle is component k}, and let S_¬ ∈ S be defined such that S_¬ = {j | component j is a benign component that has no separating hurdle};
    H ← {k | component k is a hurdle};
    U ← {k | component k is an unoriented component};
    foreach k ∈ H do
1       foreach j ∈ U such that j belongs to hurdle chain k do
            γ_j ← k;
2           if j is the anchor of the chain then α_k ← j;
        end
    end
3   foreach j ∈ U such that j belongs to no hurdle chain do γ_j ← −1;
    foreach i ∈ H do
        V ← {k | k ∈ U and k ∉ H and γ_k ≠ i};
        P ← {k | k is a double superhurdle partner of i};
        W ← {k | (k ∈ H and k ≠ i) or (k ∈ S_j such that j ∈ H and j ≠ i)};
        S′_¬ ← {k | k ∈ S_¬ and k is not separated by one component of chain γ_i from another};
        foreach j ∈ {i} ∪ S_i do
4           foreach e_{x1,y1,z1}, e_{x2,y2,z2} such that z1 = j and z2 ∈ {α_i} ∪ V ∪ P ∪ W ∪ S′_¬ do
                append(M, ρ(e_{x1,y1,z1}, e_{x2,y2,z2}));
            end
        end
    end
    return M;
end

Algorithm 3.6: get_revs_merge_fort
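The set of merge partners built for each hurdle i in Algorithm 3.6 is a union of five sets, two of which (V and W) follow mechanically from the chain labels. The sketch below shows that construction with ordinary Python sets. It is an illustration only: the function name merge_targets is hypothetical, the sets P and S′_¬ are taken as inputs because they depend on hurdle-graph queries not modeled here, and all component ids in the usage example are made up.

```python
def merge_targets(i, H, U, gamma, alpha, S, P_i, S_not_prime_i):
    """Return the components that may be merged with hurdle i.

    H: set of hurdle ids; U: set of unoriented component ids;
    gamma[k]: hurdle chain of k, or -1 if k is in no chain;
    alpha[i]: anchor of i's chain;
    S[j]: benign components whose separating hurdle is j;
    P_i: double superhurdle partners of i;
    S_not_prime_i: the set S'_not computed for i.
    """
    # V: unoriented non-hurdles that are not in i's chain.
    V = {k for k in U if k not in H and gamma[k] != i}
    # W: every other hurdle, plus the benign components they separate.
    W = {k for k in H if k != i}
    for j in H:
        if j != i:
            W |= S.get(j, set())
    return {alpha[i]} | V | P_i | W | S_not_prime_i


# Made-up example: three hurdles 0, 1, 2 with anchors 3, 5, 6.
targets = merge_targets(
    0,
    H={0, 1, 2},
    U={0, 1, 2, 3, 4, 5, 6},
    gamma={0: 0, 3: 0, 1: 1, 5: 1, 2: 2, 6: 2, 4: -1},
    alpha={0: 3, 1: 5, 2: 6},
    S={0: {10}, 1: {11}, 2: set()},
    P_i=set(),
    S_not_prime_i={12},
)
```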

Let us briefly discuss a few subtleties in obtaining Algorithms 3.3, 3.4, 3.5, and

3.6 from Lemmata 3.14, 3.15, and 3.17. Note that the “else” clause of Algorithm

3.3 (step 1) covers cases 2 and 3 of Lemma 3.14. Because the introduction of a new

unoriented component can disrupt the separation relationships among unoriented

components in various ways, it turns out to be simplest to add the new unoriented

component to the hurdle graph, to relabel neighbors as necessary, and to recompute

the sum of the numbers of hurdles and fortresses. In Algorithm 3.4, note that, in the

case of a fortress, pseudohurdles and single protectors can be handled exactly as are

simple hurdles (step 1; case 2 of Lemma 3.15). In this algorithm, step 2 explicitly

avoids walking into a fortress (case 1). Once an unoriented component is identified

as able to be cut, a separate sorting reversal is defined by every pair of black edges

that belong to the same cycle in that component (step 3).

In Algorithm 3.5, benign components with separating hurdles are essentially han-

dled like hurdles themselves (step 3; case 1a of Lemma 3.17), except that checks to

avoid double superhurdles (step 2; case 1c) and to avoid walking into a fortress (step

1; case 1d) are executed with respect to the separating hurdles of such benign components.

The algorithm implicitly avoids merging two benign components that share the

same separating hurdle (case 1b) by the way its loops are nested. Note that, once

two components are found that can be merged, a sorting reversal is defined by every

pair of black edges such that one belongs to each component (step 4).

In Algorithm 3.6, it is necessary to label every unoriented component with its

hurdle chain (step 1), or to mark it appropriately if it belongs to no hurdle chain

(step 3). It is also necessary to associate each hurdle with the anchor of its chain

(step 2). Once these steps are accomplished, four sets are identified with respect to

every hurdle i: V contains all protected nonhurdles not belonging to i’s chain; P

contains every double superhurdle partner of i; W contains all hurdles besides i and

the benign components that they separate; and S ′¬ contains all members of S¬ (the

set of benign components without separating hurdles) except those that are separated

by one component in i’s chain from another. The union of the set containing the

anchor of i’s chain (αi), V , P , W , and S ′¬ represents the set of all components to

merge with i and with all of the benign components that i separates (step 4). As

before, a sorting reversal is defined by every pair of black edges drawn from two

mergeable components.

Theorem 3.1 Algorithm 3.2 will correctly find all sorting reversals of one permu-

tation π with respect to another permutation φ (π and φ both of size n).

Proof: We have shown that Lemmata 3.14, 3.15, 3.16, and 3.17 are correct; further-

more, because they are tied directly to the exhaustive classification of reversals of

Lemma 3.2, these Lemmata describe all possible sorting reversals. Algorithm 3.2

is constructed directly from these Lemmata, with the subroutine of Algorithm 3.3

representing Lemma 3.14, the subroutine of Algorithm 3.4 representing Lemma 3.15,

and the subroutines of Algorithms 3.5 and 3.6 representing Lemma 3.17. Lemma 3.16

is represented implicitly by the absence of any steps that consider edges belonging

to different cycles of the same component.

Let us comment briefly on the asymptotic running time of Algorithm 3.2. The initial setup (finding and labeling black edges, cycles, and components; building the hurdle graph⁵) can all be done in average time linear in the number of genes, n. The number of pairs of divergent black edges is O(n²) (consider the case of a breakpoint graph consisting of one oriented cycle with (n+1)/2 edges of each orientation), so Algorithm 3.3 requires O(n² · f(n)) time, if f(n) is the time required to run detect_new_unoriented_components. The number of ways to cut hurdles is O(n²) (consider the case of a breakpoint graph consisting of a single unoriented cycle); thus, Algorithm 3.4 takes O(n²) time. Algorithms 3.5 and 3.6 also take O(n²) time, because although there can be O(n²) pairs of components and O(n) black edges per component, there can only be O(n²) pairs of black edges in total. Thus, Algorithm 3.2 is dominated by Algorithm 3.3, and its running time is O(n² · f(n)). In the next section, we shall see that detect_new_unoriented_components can be performed in O(n) time, yielding a running time of O(n³) for Algorithm 3.2. We shall also see, however, that the implementation that is fastest in practice is one that takes O(n²) time (implying O(n⁴) for Algorithm 3.2).

⁵ Note that while the number of nodes in the hurdle graph is O(n), the number of edges is also limited to O(n), rather than the O(n²) that one might expect.
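The two counts that drive this analysis can be written out explicitly. In the block below, the symbol T_{3.2}(n) for the running time of Algorithm 3.2 is introduced only for this illustration; f(n) is the running time of the detection routine, as in the text, and is unrelated to the fortress indicator f.

```latex
\begin{align*}
\#\{\text{divergent pairs}\}
  &= \frac{n+1}{2}\cdot\frac{n+1}{2}
   = \frac{(n+1)^2}{4}
   = \Theta(n^2), \\
T_{3.2}(n)
  &= O\bigl(n^2 \cdot f(n)\bigr)
   = \begin{cases}
       O(n^3) & \text{if } f(n) = O(n), \\
       O(n^4) & \text{if } f(n) = O(n^2).
     \end{cases}
\end{align*}
```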

3.5 Detecting New Unoriented Components

The purpose of the routine we have called detect_new_unoriented_components (which we will now abbreviate “detect”) is to find whether a reversal that splits a cycle introduces unoriented components. If it does not, we know that the number of hurdles stays the same, and hence that the reversal is a sorting reversal; if it does, we must examine any new unoriented components and their effect on the hurdle graph to see whether the number of hurdles increases⁶. Because oriented cycles are far more common than unoriented ones, and because detect must be executed anew for every pair of divergent edges of every oriented cycle, the efficiency of this routine is critical for the efficiency of Algorithm 3.2. Note that detect can facilitate the additional examination required when new unoriented components are introduced by returning the vertices in the breakpoint graph that compose such components. Thus, we will define detect to return not simply true or false, but a list of sets of vertices; if this list is empty, then it is understood that no new unoriented components were created.

⁶ If the reader is in doubt of the possibility of introducing an unoriented component while splitting a cycle, consider a reversal that merges hurdles: performing the “same” reversal once more (that is, to “undo” the merge) will split an oriented component into two unoriented components (both hurdles).

It is worth mentioning why the problem solved by detect has not been studied before, since

one might imagine that solving it would be a prerequisite to any algorithm that sorts

signed permutations by reversals. Algorithms that seek only one sequence of sorting

reversals, however, are able to avoid checking for new unoriented components by

carefully selecting reversals that are guaranteed not to introduce them. Because we

seek all sorting reversals, we do not enjoy this luxury.

We will begin this section by discussing briefly a simple linear-time algorithm for

detect; then we will introduce a more complicated O(n²) algorithm that turns out

to be more efficient in practice.

3.5.1 A Simple Linear-Time Solution

The most straightforward way to implement detect is simply to have it rerun the

linear-time connected components routine of [1], and then test whether that routine

yields a greater number of unoriented components after the reversal than before.

This approach can be improved slightly by noting that a candidate reversal can

alter only the connected component that it affects. As a result, it is possible to run

connected components on only the portion of the breakpoint graph consisting of

vertices that belong to that component.
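The idea of recomputing components after a candidate reversal and looking for unoriented ones can be sketched as follows. This is not the linear-time routine of [1], which works on the breakpoint graph; it is a plain breadth-first search over an explicit overlap graph, given here only to illustrate what detect returns. The function name and the dictionary-based inputs are hypothetical, and isolated vertices are treated as trivial components and skipped, which is an assumption of this sketch.

```python
from collections import deque


def unoriented_components(adj, oriented):
    """Return the vertex sets of all unoriented components.

    adj maps each vertex (gray edge) to the set of vertices it
    overlaps; oriented maps each vertex to True or False. A component
    is reported if it has more than one vertex and none is oriented.
    """
    seen = set()
    result = []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        component = [start]
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    component.append(w)
                    queue.append(w)
        if len(component) > 1 and not any(oriented[v] for v in component):
            result.append(set(component))
    return result
```

An empty result corresponds to the case in which the reversal introduces no new unoriented component.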

It might seem unlikely that one could improve on this strategy. As it turns out,

however, many reversals alter only very slightly the component that they affect,

and as a result, re-examining that entire component is wasteful. We will see below

how, by studying the effect of a reversal on the overlap graph, we can derive an

algorithm that performs better in practice (although worse in asymptotic terms)

than the simple one described here.

3.5.2 The Effect of a Reversal on the Overlap Graph

Bergeron [2] and Bergeron and Strasbourg [3] have introduced an elegant and sim-

ple technique for sorting by reversals that sidesteps much of the complexity of

Hannenhalli-Pevzner theory. An important part of their method is to model sim-

ply and efficiently the effect on the overlap graph of a reversal corresponding to

an oriented gray edge in the breakpoint graph. Their technique uses the following

Lemma:

Lemma 3.18 ([20, 2]) If one performs the reversal corresponding to an oriented

vertex v, the effect on the overlap graph will be to complement the sub-graph of v and

its adjacent vertices.

Unfortunately, Lemma 3.18 is too restrictive for our purposes, because it considers

only reversals that correspond to oriented gray edges (which are equivalent to oriented

vertices in the overlap graph). Without too much trouble, however, we can generalize

it to characterize the effect on the overlap graph of any reversal.

Lemma 3.19 (Generalization of Lemma 3.18) The effect on the overlap graph

of any reversal will be to complement the sub-graph of all vertices corresponding to

affected gray edges.

Proof: Let u and v be two gray edges affected by a reversal ρ. Then exactly as

described in [2], if u and v overlap, ρ will cause them not to overlap, and if u and

v do not overlap, ρ will cause them to overlap (the effect is as if there existed an

oriented vertex corresponding to ρ, in Bergeron’s scenario; see Figure 3.7). Suppose

instead that one or both of u and v are unaffected by ρ. If both are unaffected,

their overlapping relationship cannot change. Suppose (without loss of generality)

that u is affected and v is unaffected. If R and R′ are the ranges of ρ, then one of

Figure 3.7: Some examples illustrating that whether two edges overlap changes if and only if both are affected by a reversal.

R and R′ must contain both vertices of v and one of u, and the other must contain

the other vertex of u. Without loss of generality, call the former R and the latter

R′. The reversal will simply reverse the order in the breakpoint graph of vertices

that belong to R with respect to those that belong to R′ (and vice versa), without

changing the relative order of the vertices within each set. If u and v overlap, then in

a circular traversal of the vertices of R in the breakpoint graph, one will encounter

first a vertex of v, then one of u, then the second vertex of v. For this reason, we say

that u’s vertex is “between” those of v. The reversal cannot alter this arrangement;

that is, after ρ occurs, u’s vertex will still be between those of v, and thus, u and

v will still overlap. If, on the other hand, u and v do not overlap, then the vertex

of u in R will not occur between the vertices of v. The reversal cannot alter this

arrangement either. Thus, the overlapping relationship between two gray edges is

negated if and only if both edges are affected by a reversal, and consequently, any

reversal will complement the sub-graph of the overlap graph that includes exactly

those vertices corresponding to affected edges.

Recall that we have already established (Lemma 3.1) that a reversal changes the

orientation of exactly those gray edges that it affects. This finding, together with

Lemma 3.19, allows us to characterize completely the effect on the overlap graph of

any given reversal. Figure 3.8 illustrates this effect using a simple example.

Figure 3.8: The effect of a reversal on the breakpoint and overlap graphs. Note that gray edges (4,5), (8,9), (12,13), and (14,15) are affected; as a result, these edges change orientation and their sub-graph is complemented in the overlap graph. This particular reversal splits a cycle and introduces a hurdle.
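Lemmata 3.1 and 3.19 together describe an update that is easy to state in code: toggle every edge of the overlap graph whose two endpoints are both affected, and flip the orientation of every affected vertex. The following sketch applies that update to an adjacency-set representation. The function name apply_reversal and the small example graph are made up for illustration; which vertices are affected by a given reversal is assumed to be known already.

```python
from itertools import combinations


def apply_reversal(adj, oriented, affected):
    """Return (new_adj, new_oriented) after a reversal.

    adj maps each vertex to the set of vertices it overlaps, oriented
    maps each vertex to True or False, and affected is the collection
    of vertices (gray edges) affected by the reversal. The inputs are
    left unmodified.
    """
    affected = set(affected)
    new_adj = {v: set(neighbours) for v, neighbours in adj.items()}
    # Lemma 3.19: complement the sub-graph induced by affected vertices.
    for u, v in combinations(affected, 2):
        if v in new_adj[u]:
            new_adj[u].discard(v)
            new_adj[v].discard(u)
        else:
            new_adj[u].add(v)
            new_adj[v].add(u)
    # Lemma 3.1: affected vertices change orientation.
    new_oriented = {
        v: (not o) if v in affected else o for v, o in oriented.items()
    }
    return new_adj, new_oriented


adj = {0: {1}, 1: {0, 2}, 2: {1}, 3: set()}
oriented = {0: True, 1: False, 2: True, 3: False}
new_adj, new_oriented = apply_reversal(adj, oriented, {0, 1, 3})
```

In this example the overlap between vertices 0 and 1 disappears, vertex 3 gains overlaps with both 0 and 1, and the overlap between 1 and the unaffected vertex 2 is preserved.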

3.5.3 A Bitwise Algorithm

Algorithm 3.7 is an implementation of detect based on Lemmata 3.1 and 3.19. It

makes use of several techniques introduced by Bergeron and Strasbourg, including

the following:

1. The overlap graph for k gray edges is represented as a “bit matrix” composed

of k bit vectors v_0, …, v_{k−1}, each of length k, such that the ith bit of the jth

vector and the jth bit of the ith vector are set to 1 if edges i and j overlap,

and set to 0 otherwise. An example of a bit matrix is shown in Figure 3.9.

2. An auxiliary bit vector p indicates the “polarity”, or orientation, of each edge.

That is, the ith bit of p is set to 1 if the ith edge is oriented, and set to 0

otherwise.

3. Bitwise operators are used to model efficiently the changes to the overlap graph

induced by a candidate reversal. The “exclusive or” operator is particularly

useful for “flipping” selected bits to reflect a change in edge orientation, or to

find the complement of a sub-graph of the overlap graph.
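The representation in the list above can be sketched in a few lines of Python. This is only an illustration, not the thesis implementation (which is in C): the helper names `build_bit_matrix` and `bits`, the use of Python integers as bit vectors, and the four-edge example are all assumptions made for the sketch.

```python
def build_bit_matrix(k, overlaps):
    """Return k bit vectors as Python ints; bit j of vector i is 1 iff edges i and j overlap."""
    vectors = [0] * k
    for i, j in overlaps:
        vectors[i] |= 1 << j
        vectors[j] |= 1 << i  # mirror the bit so the matrix stays symmetric
    return vectors


def bits(vector, k):
    """List the bits of a vector from position 0 to k - 1, for readability."""
    return [(vector >> i) & 1 for i in range(k)]


# Hypothetical component with four gray edges: 0 overlaps 1 and 2, and 2 overlaps 3.
k = 4
matrix = build_bit_matrix(k, [(0, 1), (0, 2), (2, 3)])

# Polarity vector p: bit i is 1 iff edge i is oriented (values chosen for illustration).
p = (1 << 0) | (1 << 2)

# Exclusive-or with a mask flips exactly the selected bits and leaves the rest alone.
mask = (1 << 0) | (1 << 1)
p_flipped = p ^ mask
```

Here `p_flipped` has edge 0 unoriented and edge 1 oriented, while edges 2 and 3 keep their previous polarity.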

Our algorithm introduces two additional bit vectors, a and l, whose purposes

will be seen below. Moreover, we take advantage of our previous observation that

connected components can be examined separately by representing each component

with a separate bit matrix (which allows considerable savings of time and space, since

the bit matrix takes $O(k^2)$ time to build). Algorithm 3.7 assumes an initialization

step in Algorithm 3.2 that constructs a bit matrix for each oriented component; bit

matrices need not be built for unoriented and trivial components.

The algorithm alters the overlap graph to reflect the candidate reversal, then

searches the graph for unoriented components (if there exist unoriented components,

they must be new, because the original overlap graph represented only a single



oriented component). It begins by constructing a bitwise representation of which

vertices are affected by the reversal (bit vector a), then uses this representation to

compute updated versions of the bit matrix (step 1; Lemma 3.19) and the orientation

bit vector, p (step 2; Lemma 3.1). In addition, the algorithm uses bitwise operators

to construct a bit vector (l) that indicates which affected vertices are rendered un-

oriented by the reversal (step 3). It uses this vector to limit the possible starting

points for its search of the graph (step 4), because any new unoriented component

must contain at least one such vertex. Any search starting from such a vertex can be

aborted as soon as an oriented vertex is found (step 5), because a single oriented ver-

tex is sufficient to ensure a component is oriented. Because only an oriented vertex

can cause a search to abort early, however, the presence of a mark from a previous

search indicates that an oriented vertex must also be present (step 8). Note that the

algorithm must distinguish trivial components from unoriented ones (steps 7 and 9),

since neither contains oriented edges.

Figure 3.9 shows the changes the algorithm will make in the bit matrix and

auxiliary bit vectors that correspond to the example of Figure 3.8.

Theorem 3.2 Given inputs as specified, Algorithm 3.7 will return a nonempty list

if and only if a candidate reversal ρ creates at least one new unoriented component.

Furthermore, if ρ introduces p new unoriented components, then the elements Ui of

L, 1 ≤ i ≤ p, will contain the vertices that compose these new components.

Proof: First we must show that, if ρ introduces no new unoriented component,

then Algorithm 3.7 will return an empty list, and if ρ introduces p new unoriented

components, Algorithm 3.7 will return a list of p sets each containing the vertices

of a new unoriented component. Steps 1 and 2 ensure that the overlap graph will

be altered exactly as described by Lemmata 3.19 and 3.1. Suppose ρ splits a cycle



without introducing an unoriented component. Then every vertex in the altered

overlap graph will be part of an oriented or trivial component, so every search at

step 5 will result in either the flag oriented being set, or the flag trivial being

left as true, and the algorithm will never append any set to the list L; thus, L will

be empty when the algorithm exits. Suppose instead that ρ introduces p unoriented

components. For each one of these components, ci, there must exist a vertex v in ci

that was affected and rendered unoriented by ρ. We know this is true because the only

changes ρ can cause to the overlap graph are in the orientation of affected vertices

and in the presence or absence of edges between affected vertices; and ci cannot

contain an affected, oriented vertex (if it did, it would be an oriented component).

Steps 3 and 4 ensure that we begin a search at every affected, unoriented vertex,

so eventually we will begin a search with v, or with some other affected, unoriented

vertex u that is also part of ci. Because ci is an unoriented component, a search from

v or u cannot abort until the entire component has been traversed. When the search

finishes, no oriented vertex will have been found, and no mark from another search

will have been encountered, so the oriented flag will not have been set. In addition,

the trivial flag will have been changed to false, so the algorithm will append V

to L at step 9. Furthermore, V will correctly contain all of the vertices in ci. Thus,

when the algorithm exits, the list L will contain a set of vertices representing each

new unoriented component.

Now we must show that, if Algorithm 3.7 returns an empty list, then ρ must have

introduced no new unoriented component, and if Algorithm 3.7 returns a list of p

sets, then each set must contain the vertices of a separate new unoriented component.

Suppose the algorithm returns an empty list. Then the test at step 9 must always

have failed, so that no set was ever appended to the list. If this test failed, then

each search (step 5) must have encountered an oriented vertex or a mark from

a previous search (step 8), or have been executed on a vertex with no neighbors. Because

a search was begun at each affected, unoriented vertex, a search was initiated of



every component that could possibly be unoriented. If in a search of a component

c, the algorithm encountered an oriented vertex, then clearly c must be oriented.

If instead the algorithm encountered a mark from a previous search, then c also

must be oriented, because the previous search could have aborted before marking

all vertices of the component only if it encountered an oriented vertex or a mark

from a previous search (the first in a sequence of such searches must have actually

encountered an oriented vertex). If instead c consisted of a vertex with no neighbors,

c must be trivial. In any case, c could not be unoriented. Therefore, in this case, all

components that could have been unoriented must not have been unoriented,

and the algorithm must have correctly returned the empty list. Suppose instead

that the algorithm returns a list of p sets. Each set Ui, 1 ≤ i ≤ p, must have

been appended to the list at step 9. Furthermore, each such set must include all

of the vertices in a nontrivial connected component, because if the algorithm had

encountered an oriented vertex or a previous mark, and aborted the search early, it

would have set the oriented flag, and if the component had consisted of a single

vertex, the trivial flag would have remained true; in either case, the set could

not have been appended to the list. Thus, each set Ui must consist of all vertices

of a component that has no oriented vertex, and properly represents an unoriented

component.

Algorithm 3.7 takes $\Omega(k^2)$ time, but runs very fast in practice. Reasons for

its speed include that k tends to be much smaller than n when distances between

permutations are modest, that the steps taking $\Omega(k^2)$ time are very efficient (copying

the bit matrix and executing the exclusive-or computations), and that if oriented

vertices are present (as they usually are), the graph search often can abort quickly.



Input: (1) A candidate reversal ρ; (2) the overlap graph for the k-vertex component affected by ρ, represented (as described in [2]) as a k × k bit-matrix with rows and columns corresponding to vertices v_0 … v_{k−1}; and (3) a bit-vector p indicating the “parity” or orientation of each vertex in the overlap graph (p_i = 1 iff v_i is oriented). Assume bit-vector operators for “exclusive or” (⊕), “and” (∧), and “negation” (¬).

Output: A list L of sets (U_1, U_2, …, U_p) such that each set contains those vertices composing a new unoriented component; L = ∅ if no such component is detected.

begin
    /* identify affected vertices */
    a ← (a_0 … a_{k−1} | ∀i, 0 ≤ i ≤ k − 1, a_i ∈ {0, 1}, a_i = 1 iff v_i is affected by ρ);

    /* complement sub-graph of affected vertices */
1   foreach i | a_i = 1 do v_ii ← 1; v_i ← v_i ⊕ a;

    /* negate parity of affected vertices */
2   p ← p ⊕ a;

    /* identify affected vertices that are now unoriented */
3   l ← ¬p ∧ a;

    /* search for unoriented components in the graph */
    Initialize mark, an array of k integers, with values of −1;
    Initialize list L;

4   foreach i | l_i = 1 do
        if mark[i] ≠ −1 then continue;

        /* starting with v_i, traverse the overlap graph until exhaustion
           or evidence of an oriented vertex */
        Initialize stack S; V ← ∅; oriented ← false; trivial ← true;
        push(S, i);

5       while oriented = false and S not empty do
            pop(S, j); V ← V ∪ {v_j}; mark[j] ← i;

6           foreach m | v_jm = 1 do
7               trivial ← false;
8               if p_m = 1 or −1 < mark[m] < i then oriented ← true; break;

                if mark[m] = −1 then push(S, m);
            end

        end
9       if oriented = false and trivial = false then append(L, V);

    end
    return L;

end

Algorithm 3.7: detect new unoriented components
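As a cross-check of the pseudocode, the following Python sketch follows the numbered steps of Algorithm 3.7 and runs them on the bit matrix and the vectors p and a given in Figure 3.9. It is a hedged illustration rather than the thesis code: the function name `detect_new_unoriented`, the helper `to_int`, and the choice of Python integers as bit vectors (bit i standing for vertex v_i) are assumptions of the sketch.

```python
def detect_new_unoriented(vectors, p, a):
    """Sketch of Algorithm 3.7.

    vectors: list of k ints, the rows of the bit matrix (bit m of vectors[j] is 1
             iff vertices j and m are adjacent in the overlap graph)
    p:       int, bit i is 1 iff vertex i is oriented
    a:       int, bit i is 1 iff vertex i is affected by the candidate reversal
    Returns a list of sets of vertex indices, one per new unoriented component.
    """
    k = len(vectors)
    v = list(vectors)                       # work on a copy of the bit matrix

    # Step 1: complement the sub-graph induced by the affected vertices.
    for i in range(k):
        if (a >> i) & 1:
            v[i] |= 1 << i                  # set the diagonal bit; XOR with a clears it again
            v[i] ^= a

    p ^= a                                  # step 2: flip orientation of affected vertices
    l_vec = ~p & a                          # step 3: affected vertices that are now unoriented

    mark = [-1] * k
    found = []
    for i in range(k):                      # step 4: candidate starting points
        if not (l_vec >> i) & 1 or mark[i] != -1:
            continue
        stack, component = [i], set()
        oriented, trivial = False, True
        while not oriented and stack:       # step 5: depth-first traversal
            j = stack.pop()
            component.add(j)
            mark[j] = i
            for m in range(k):              # step 6: neighbors of vertex j
                if not (v[j] >> m) & 1:
                    continue
                trivial = False             # step 7: j has at least one neighbor
                if (p >> m) & 1 or -1 < mark[m] < i:   # step 8
                    oriented = True
                    break
                if mark[m] == -1:
                    stack.append(m)
        if not oriented and not trivial:    # step 9
            found.append(component)
    return found


def to_int(row):
    """Pack a list of 0/1 values into an int, with row[i] stored in bit i."""
    return sum(bit << i for i, bit in enumerate(row))


# Bit matrix and vectors "before reversal", as read from Figure 3.9.
before = [
    [0, 1, 1, 0, 1, 1],
    [1, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 0, 0],
]
rows = [to_int(r) for r in before]
p_before = to_int([0, 1, 1, 0, 1, 1])
affected = to_int([1, 0, 1, 0, 1, 1])

components = detect_new_unoriented(rows, p_before, affected)
```

On this input the sketch reports a single new unoriented component, found by the traversal that starts at v2 and consisting of v2, v3, v4, and v5.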



Before reversal

                 (4,5)  (6,7)  (8,9)  (10,11)  (12,13)  (14,15)
                  v0     v1     v2      v3       v4       v5
(4,5)     v0      0      1      1       0        1        1
(6,7)     v1      1      0      0       0        0        0
(8,9)     v2      1      0      0       1        1        0
(10,11)   v3      0      0      1       0        1        0
(12,13)   v4      1      0      1       1        0        0
(14,15)   v5      1      0      0       0        0        0

          p       0      1      1       0        1        1
          a       1      0      1       0        1        1

After reversal

                 (4,5)  (6,7)  (8,9)  (10,11)  (12,13)  (14,15)
                  v0     v1     v2      v3       v4       v5
(4,5)     v0      0      1      0       0        0        0
(6,7)     v1      1      0      0       0        0        0
(8,9)     v2      0      0      0       1        0        1
(10,11)   v3      0      0      1       0        1        0
(12,13)   v4      0      0      0       1        0        1
(14,15)   v5      0      0      1       0        1        0

          p       1      1      0       0        0        0
          l       0      0      1       0        1        1

Figure 3.9: The bit matrix for the affected component of Figure 3.8, before and after the reversal. The bit vectors p, a, and l are also shown (only p is relevant both before and after the reversal). Algorithm 3.7 will detect the new unoriented component during a graph traversal beginning at v2.



3.6 Experimental Methods and Results

We implemented Algorithm 3.2 in C and tested it for correctness and speed. Our

implementation, program find-all-sr, includes code for both detect algorithms

discussed in the previous section, and allows either one to be selected at compile

time. Program find-all-sr also re-implements the connected components routine

of [1], instead of using their existing implementation. We chose this approach for two

reasons: so that we could adapt the routine to use the edge-overlap rather than the

cycle-overlap formulation of the overlap graph, the former being more compatible

with the rest of Algorithm 3.2; and so that we could parameterize the routine for the

option to consider only a single connected component, as is useful for the linear-time

implementation of detect. Program find-all-sr comprises about 1600 lines of C

code.

Test data fell into three classes. The largest class consisted of pairs of signed

permutations, such that one member of each pair had been “scrambled” with respect

to the other by a specified number of reversals. These pairs were generated by a

program that took three parameters: the permutation size n, the number of pairs

to generate p, and the number of random reversals r to execute on a member of

each pair. For each of the p pairs, this program would generate a random signed

permutation of size n, copy the permutation, execute r random reversals on the

copy, and output the original and the scrambled copy as a pair. A random reversal

was defined as a reversal from position i to position j, such that i and j are two

random integers, 0 ≤ i < j ≤ n + 1. The second class of test data was similar to

the first except that permutations were scrambled not with random reversals, but

with a procedure designed to introduce as many unoriented components as possible.

This class was intended to stress the many parts of the algorithm (and of the code)

that are exercised only when multiple unoriented components are present (see footnote 7). The third class of data consisted of hand-picked pairs of permutations representing special cases unlikely to appear in the other two classes (configurations involving fortresses, long hurdle chains, double superhurdles, and the like). A total of several thousand pairs of permutations were produced.

Footnote 7: Random reversals only very rarely create an unoriented component, and virtually never create multiple unoriented components.

Correctness testing was performed by comparing the output of program find-all-sr with that of a control called program find-all-bf. Program find-all-bf finds all sorting reversals of a permutation π with respect to a permutation φ by brute force, that is, by considering all $\binom{n+1}{2}$ reversals that can be executed on π, finding the neighbor of π corresponding to each reversal, calculating that neighbor’s distance from φ, and outputting descriptions of only those reversals that produce neighbors closer to φ than is π (this approach takes $\Omega(n^3)$ time). This program directly uses the well-tested code for inversion distance by Bader et al. [1], and because it is also very simple, is believed to be highly reliable. Program find-all-sr was considered correct only once it produced results identical to those of program find-all-bf on all test cases (aside from the order in which reversals appeared).

Performance testing focused on two types of comparisons: the performance of find-all-sr versus that of find-all-bf, and the performance of find-all-sr when using the “connected-components” version of the detect algorithm versus that when using the “bitwise” version. All testing was performed on a SONY laptop with a 700 MHz Pentium III processor and 128 MB of RAM, running the Linux operating system (RedHat 7.0). The most extensive testing was performed using test data of the first class (the one with permutations that had been scrambled by reversals), as the presence of multiple unoriented components did not seem to change performance significantly (if anything, the performance of both versions of find-all-sr improved [relative to reversal distance] for test data of the second class). We ran tests for values of n between 25 and 100 and values of r between 0% and 100% of n; for each value of n and r, we recorded the cumulative times required to find all sorting reversals for each of 500 pairs of permutations.

[Figure 3.10 plot not reproduced: total running time t (sec) against r for find-all-bf, find-all-sr1, and find-all-sr2.]

Figure 3.10: Running times of programs find-all-bf and find-all-sr for n = 100 and eleven values of r between 0 and 100. Plotted are total times required to process 500 pairs of permutations. Times are plotted for find-all-sr using both the connected-components (find-all-sr1) and bitwise (find-all-sr2) implementations of detect.
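The generator for the first class of test data (a random signed permutation of size n, a copy, and r random reversals applied to the copy) can be sketched as follows. The names `apply_reversal`, `random_signed_permutation`, and `generate_pairs` are hypothetical, the `seed` parameter is added only for reproducibility, and a reversal is assumed here to be given by two cut points 0 ≤ i < j ≤ n that delimit the reversed segment, which may differ from the indexing used in the actual test program.

```python
import random


def apply_reversal(perm, i, j):
    """Reverse the segment perm[i:j] (cut points 0 <= i < j <= n) and negate its signs."""
    return perm[:i] + [-x for x in reversed(perm[i:j])] + perm[j:]


def random_signed_permutation(n, rng):
    """Shuffle 1..n and give each element a random sign."""
    values = list(range(1, n + 1))
    rng.shuffle(values)
    return [x if rng.random() < 0.5 else -x for x in values]


def generate_pairs(n, p, r, seed=0):
    """Return p pairs (original, scrambled); each copy is scrambled by r random reversals."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(p):
        original = random_signed_permutation(n, rng)
        scrambled = list(original)
        for _ in range(r):
            # Two distinct cut points chosen uniformly among the n + 1 possible ones.
            i, j = sorted(rng.sample(range(n + 1), 2))
            scrambled = apply_reversal(scrambled, i, j)
        pairs.append((original, scrambled))
    return pairs


pairs = generate_pairs(n=10, p=5, r=3, seed=1)
```

Choosing two distinct cut points uniformly makes each of the n(n+1)/2 possible reversals equally likely at every step.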

Figure 3.10 shows results for n = 100 and 0 < r ≤ 100, which are typical of what we observed. Plots are shown for find-all-bf and both versions of find-all-sr. As would be expected, the brute force algorithm shows approximately constant performance (see footnote 8) for all values of r, and the implementations of Algorithm 3.2 perform significantly better (see footnote 9) at small r than at large r. What one might not have expected, however, is that Algorithm 3.2 still performs three to four times as fast as the brute force approach even when r = n, when one would expect Algorithm 3.2 to be closest to its worst-case performance of $O(n^3)$ or $O(n^4)$ time (depending on which implementation of detect is in use). What is also surprising is that the bitwise implementation of detect outperforms the connected-components implementation consistently, for all values of r. Indeed, for r ≥ 10, the bitwise implementation results in total execution times between 67% and 77% of those achieved with the other. One might have expected, to the contrary, that the bitwise implementation would become less efficient as r became large, when the average size of a connected component also becomes large.

Footnote 8: The slightly better performance seen at small r is probably the result of improvements in the speed of distance calculations due to the presence of numerous trivial components.

Footnote 9: As r increases, so does the average number of pairs of divergent black edges of the same cycle, the dominant factor in the running time of the algorithm.

[Figure 3.11 plot not reproduced: speedup t_bf / t_sr against r.]

Figure 3.11: Speedup of find-all-sr2 (the version with bitwise detect) with respect to find-all-bf for the same experiment shown in Figure 3.10.

Figure 3.11 plots the speedup of find-all-sr with respect to find-all-bf, so

that the advantage of the former at small r can be observed in more detail. In this

experiment, a speedup of over 400 times is achieved for r ≤ 10. These results are

particularly encouraging because it is small values of r that are most of interest with

respect to our larger goal of solving the reversal median problem.



[Figure 3.12 plot not reproduced: average number of sorting reversals against r.]

Figure 3.12: Average number of sorting reversals for the experiment shown in Figure 3.10. Error bars indicate one standard deviation.

Figure 3.12 shows the number of sorting reversals for 0 < r ≤ 100. Notice that

the number begins to rise steeply at about r = 0.4n. When r is close to n, as one

might expect, a significant fraction of all possible reversals are sorting reversals (here

the fraction is about 1/5); but even when r = 0.5n, hundreds of sorting reversals are

possible.
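For a sense of scale, the total number of reversals at n = 100 follows from the count used in the brute-force description. The reading of the fraction as roughly one fifth is an interpretation of the passage above, not an independently measured value:

```latex
\binom{n+1}{2} = \binom{101}{2} = \frac{101 \cdot 100}{2} = 5050,
\qquad \tfrac{1}{5} \cdot 5050 = 1010 .
```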

3.7 Summary

In this chapter, we have temporarily abandoned the reversal median problem, and

derived a solution to the problem of finding all sorting reversals of one permutation

with respect to another (which we call ASR). Our hope is that a solution to this

problem will allow us to explore more efficiently the search space of the median

problem. The ASR problem is of considerable interest independent of the median

problem—both theoretical interest, because it requires extending the Hannenhalli-



Pevzner theory in new ways, and practical interest, because an efficient solution

provides a general-purpose tool for exploring the space of genome rearrangements.

The problem is complex and requires a deliberate, step-by-step approach. We have

approached it by classifying all possible reversals, proving under a simplified version

of the problem (the “Fortress-Free Model”) which classes can be sorting reversals

and under what conditions they sort, then reintroducing fortresses and adapting

our results to accommodate them. In this way, we have derived an algorithm that,

while nontrivial, is reasonably understandable, and appears to perform well. We

have shown experimental results indicating a speedup compared to a brute-force

algorithm (the only reasonable alternative known to us) of between about 4 (when

pairwise distances are large) and about 400 (when distances are small). We have

also shown results indicating that the number of sorting reversals is very large—

much larger than one might think from a naive interpretation of Hannenhalli-Pevzner

theory, which focuses on particular subsets of the set of all sorting reversals. When

pairwise distances are large, a significant fraction of all reversals are sorting reversals,

and even when distances are modest, many sorting reversals exist (for n = 100 and

r = 50 the mean number is approximately 100).

As we have derived our solution to ASR, we have introduced several new ideas that

may be useful in solving related problems. For example, we have introduced a simple

graph representation of the separation relationships among unoriented components—

the “hurdle graph”—that allows complex relationships to be described easily and

precisely, in terms of the degrees of corresponding vertices, and whether those vertices

belong to cycles. In part using the hurdle graph, we have characterized several

new classes of unoriented component, including double superhurdles, pseudohurdles,

single protectors, and anchors of hurdle chains. We have also derived an efficient

solution to the problem of detecting whether a reversal introduces an unoriented

component, which turns out to be an important sub-problem of ASR. Our solution

involves extending the method previously described by Bergeron for modeling the



effects of reversals using bit-vector operators and a bitwise representation of the

overlap graph.


Chapter 4

An Improved Algorithm for

Finding an Exact Median

...the abysmal dark

of the unfathomed center...

–David Hartley Coleridge

Now that we have at hand an algorithm to enumerate all sorting reversals, we can

apply it to the median problem in the way described at the beginning of chapter 3:

by restricting our search to the neighbors of intermediate vertices that correspond to

sorting reversals. In this way, we hope to grope our way steadily toward the “abysmal

dark of the unfathomed center”, improving on the much blinder exploration of

the algorithm of chapter 2.

One problem remains, however: what if there is no perfect median? In such a

case, we might simply fall back on the algorithm of chapter 2, but we would prefer

not to do so, especially now that we have seen how inefficient its method is for

finding feasible neighbors of an intermediate vertex. To extend our new method to

the case of non-perfect medians, however, we must be able to enumerate other classes


of reversals besides sorting reversals. As we have noted, in all cases ∆d ∈ {−1, 0, 1};

thus, only three classes of reversals are possible: sorting reversals (∆d = −1), what

we call neutral reversals (∆d = 0), and what we call anti-sorting reversals (∆d = 1).

We will see below that, without too much effort, we can extend our framework from

the previous chapter to enumerate neutral reversals as well as sorting reversals. In

this way, we will enable ourselves to enumerate all three sets, because the set of

anti-sorting reversals is simply the difference between the set of all reversals (which

is easily enumerable) and the union of the sets of neutral and sorting reversals.
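The three-way partition described above can be illustrated by brute force on very small permutations. The sketch below is not the method developed in this thesis: it computes distances by breadth-first search over all signed permutations (exponential in n), and the names `apply_reversal`, `all_reversals`, `distances_from`, and `classify_reversals`, as well as the cut-point convention for reversals, are assumptions made for the example.

```python
from collections import deque
from itertools import combinations


def apply_reversal(perm, i, j):
    """Reverse the segment perm[i:j] (cut points 0 <= i < j <= n) and negate its signs."""
    return perm[:i] + tuple(-x for x in reversed(perm[i:j])) + perm[j:]


def all_reversals(n):
    """All n(n+1)/2 reversals of a permutation of size n, as pairs of cut points."""
    return list(combinations(range(n + 1), 2))


def distances_from(target):
    """Reversal distance from every signed permutation to target, by breadth-first search.

    Only practical for tiny n; it stands in for a real distance routine.
    """
    n = len(target)
    dist = {target: 0}
    queue = deque([target])
    while queue:
        perm = queue.popleft()
        for i, j in all_reversals(n):
            nxt = apply_reversal(perm, i, j)
            if nxt not in dist:
                dist[nxt] = dist[perm] + 1
                queue.append(nxt)
    return dist


def classify_reversals(pi, phi):
    """Split all reversals of pi into sorting, neutral, and anti-sorting sets w.r.t. phi."""
    dist = distances_from(phi)
    sorting, neutral = set(), set()
    for rho in all_reversals(len(pi)):
        delta = dist[apply_reversal(pi, *rho)] - dist[pi]
        if delta == -1:
            sorting.add(rho)
        elif delta == 0:
            neutral.add(rho)
    # Anti-sorting reversals are whatever remains once the other two sets are removed.
    anti_sorting = set(all_reversals(len(pi))) - (sorting | neutral)
    return sorting, neutral, anti_sorting


sorting, neutral, anti_sorting = classify_reversals((3, -1, 2), (1, 2, 3))
```

The set difference on the last line of `classify_reversals` mirrors the observation above: once sorting and neutral reversals are known, the anti-sorting reversals need no separate enumeration.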

In this chapter we will use these insights to develop a dramatically improved

algorithm to find an exact reversal median. We will call this algorithm the “sorting”

algorithm, and contrast it with the algorithm of chapter 2, which we will call the

“metric” algorithm. We will begin by precisely characterizing neutral reversals; then

we will overlay our new methods for enumerating reversals on the basic branch-and-

bound framework of the metric algorithm; and finally, we will show experimental

results demonstrating the performance gain of the sorting algorithm with respect to

the metric algorithm.

4.1 Enumerating Neutral Reversals

Let us begin with a precise definition of a neutral reversal.

Definition 4.1 A neutral reversal of a permutation π with respect to a permuta-

tion φ is a reversal ρ such that d(π, φ) = d(ρ(π), φ).

We can establish exactly which reversals are neutral reversals in the same way

that we established which reversals are sorting reversals. As before, we will begin

with the FFM, and we will follow the classification system of Lemma 3.2. We can



use the idea of conservation of distance in this case as well. Here, ∆d = 0, so it must

be true that ∆c = ∆h. As before, ∆c ∈ {−1, 0, 1}.
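The step from ∆d = 0 to ∆c = ∆h can be spelled out from the Hannenhalli-Pevzner distance formula. The derivation below assumes the usual form of that formula, with n + 1 black edges, c cycles, h hurdles, and a fortress indicator f that is always zero under the FFM:

```latex
\begin{align*}
d &= (n + 1) - c + h + f, \\
\Delta d &= -\Delta c + \Delta h \qquad (\text{FFM: } \Delta f = 0), \\
\Delta d = 0 &\iff \Delta c = \Delta h .
\end{align*}
```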

Lemma 4.1 (Case 1a) Under the FFM, a reversal ρ that acts on two black edges

belonging to the same oriented cycle is a neutral reversal iff either the black edges

are divergent and ρ increases the number of hurdles by one, or the black edges are

convergent and ρ does not change the number of hurdles.

Proof: Either the black edges are divergent and ρ splits the cycle (∆c = 1) or the

black edges are convergent and ρ is neutral with respect to cycle number (∆c = 0).

In the former case, ρ is neutral iff ∆h = 1, and in the latter case ρ is neutral iff

∆h = 0.

Lemma 4.2 (Case 1b) Under the FFM, a reversal ρ that acts on two black edges

belonging to the same unoriented cycle c is a neutral reversal iff c does not belong to

a simple hurdle.

Proof: Because c is unoriented, all of its black edges are convergent; therefore, ∆c =

0, and to be a neutral reversal, ρ must cause ∆h = 0. The component m to which

c belongs is oriented, a protected nonhurdle, or a hurdle; and if it is a hurdle, it

is a superhurdle or a simple hurdle. If m is a simple hurdle, then ∆h = −1, as

shown in Lemma 3.5. In all of the other cases, however, ∆h = 0, and ρ is neutral.

Let us consider each case. If m is oriented, ρ cannot remove a hurdle (because it

only affects m), and it also cannot create a hurdle (ρ cannot create new components

because it acts on convergent black edges; and ρ affects and orients c, so m must

remain oriented); thus, ∆h = 0. If m is a protected nonhurdle, then ρ orients m but

does not change the number of hurdles. Finally, if m is a superhurdle, ρ cuts m and

causes another hurdle to emerge in its place, so ∆h = 0.



Lemma 4.3 (Case 2) Under the FFM, a reversal ρ that acts on two black edges

belonging to different cycles of the same component m is a neutral reversal iff m is

a simple hurdle.

Proof: Because ρ acts on different cycles, ∆c = −1; therefore, to be a neutral reversal,

ρ must cause ∆h = −1. The only way that ρ can reduce the number of hurdles is if

it affects a hurdle; and because it affects only m, we know that m must be a hurdle.

However, m cannot be a superhurdle, or another hurdle would emerge when it was

oriented. Only when m is a simple hurdle will ρ be a neutral reversal.
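Lemmata 4.1, 4.2, and 4.3 together give a simple decision rule for reversals that act within a single component. A minimal sketch of that rule follows; the function name, the string labels for cases and component types, and the assumption that ∆h has already been computed (for example by detect) for Case 1a are all illustrative choices, not part of the thesis.

```python
def is_neutral_within_component(case, *, divergent=False, delta_h=0, component="oriented"):
    """Neutrality test under the FFM for a reversal acting on two black edges of one component.

    case:      "1a" (same oriented cycle), "1b" (same unoriented cycle),
               or "2" (different cycles of the same component)
    divergent: whether the two black edges are divergent (used in Case 1a only)
    delta_h:   change in the number of hurdles caused by the reversal (Case 1a only)
    component: kind of the component, e.g. "oriented", "protected_nonhurdle",
               "superhurdle", or "simple_hurdle" (used in Cases 1b and 2)
    """
    if case == "1a":
        # Lemma 4.1: divergent edges split the cycle (delta_c = 1), convergent ones
        # leave the cycle count unchanged (delta_c = 0); neutral iff delta_h = delta_c.
        delta_c = 1 if divergent else 0
        return delta_h == delta_c
    if case == "1b":
        # Lemma 4.2: neutral unless the cycle belongs to a simple hurdle.
        return component != "simple_hurdle"
    if case == "2":
        # Lemma 4.3: merging two cycles (delta_c = -1) is neutral only in a simple hurdle.
        return component == "simple_hurdle"
    raise ValueError(f"unknown case: {case!r}")


example = is_neutral_within_component("1a", divergent=True, delta_h=1)
```

The example call corresponds to a reversal that splits a cycle and creates one hurdle, which Lemma 4.1 classifies as neutral.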

Lemma 4.4 (Case 3) Under the FFM, a reversal ρ that acts on black edges belonging

to different components u and v is a neutral reversal iff the following are

true:

1. One of u and v is either a hurdle or a benign component with a separating

hurdle. Without loss of generality, assume that u is this component. Let w

be either u (if u is a hurdle) or the separating hurdle of u (if u is a benign

component).

2. One of the following is true of v:

(a) If w belongs to an anchored hurdle chain, then:

i. v is the anchor of w’s hurdle chain; or

ii. v is a protected nonhurdle that does not belong to w’s hurdle chain; or

iii. v is a benign component that has no separating hurdle, and v is sep-

arated from w by all components in w’s chain; or

iv. v is either a hurdle or a benign component separated by a hurdle, and

v or its separating hurdle forms a double superhurdle with w.



(b) If w does not belong to an anchored hurdle chain (in which case there

exists a single, unanchored hurdle chain), then:

i. w’s chain is of length one, v is a benign component separated by w,

and either u is w or u is a benign component separated from v by

w; or

ii. w’s chain is of length at least two, and v is a protected nonhurdle

adjacent to the other hurdle besides w; or

iii. w’s chain is of length at least two, v is a benign component that has

no separating hurdle, and v is separated from w by every unoriented

component but by no hurdle.

Proof: Because ρ acts on different cycles, ∆c = −1; therefore, to be a neutral reversal,

ρ must cause ∆h = −1. Let us show first that, if u and v meet the criteria of the

Lemma, then ρ results in ∆h = −1 and the reversal is neutral. The hurdle chain of w

must either be anchored or unanchored. If it is anchored, and v meets any of the first

three criteria of case (a)—that is, it is the anchor of w’s chain, a protected nonhurdle

that does not belong to w’s chain, or a benign component as described—then ρ will

eliminate w’s entire hurdle chain but will not completely eliminate any other hurdle

chain; thus, exactly one hurdle will be removed and none will emerge. If the chain is

anchored and v or its separating hurdle forms a double superhurdle with w, then as

shown previously, exactly two hurdles will be removed and exactly one will emerge.

On the other hand, if w’s chain is unanchored, then there is a single, unanchored

hurdle chain, and that chain is either of length one or has a hurdle at either end. If

the chain is of length one and v meets the first criterion under case (b), then ρ will

eliminate the only hurdle. If the chain is of length at least two, then there must exist

a hurdle x opposite w in the chain. If v meets either of the remaining criteria under

case (b) then ρ will eliminate all of the hurdle chain except for x. Thus, in all cases

∆h = −1.



Now let us show that, if ρ is neutral, then u and v must meet the criteria of the

Lemma. To cause ∆h = −1, ρ must orient one hurdle and avoid causing another

hurdle to emerge, or it must orient two hurdles and cause another hurdle to emerge

(it is impossible for ρ to orient more than two hurdles). Suppose that ρ orients

one hurdle and avoids causing another to emerge. Either there is or is not a single,

unanchored hurdle chain. If there is, then ρ must eliminate the whole chain if it is

of length one, or all but one component of the chain if it is of length greater than

one (if more than one unoriented component remained, there would still exist two

hurdles). If ρ eliminates an entire chain of length one, then w must be the only

unoriented component; therefore, v must be a benign component, and u must either

be w or another benign component separated from v by w. In this case, every benign

component will have w as its separating hurdle, so all aspects of the first criterion

under case (b) must be met. If ρ eliminates all members but one of an unanchored

hurdle chain, then v must be separated from w by all unoriented components besides

the hurdle opposite w. If v is unoriented, it must meet the second criterion under

case (b), and if it is benign, it must meet the third. If on the other hand, there is

not a single, unanchored chain, then ρ must eliminate all members of exactly one

hurdle chain (if it eliminated all members of no chain, it would either fail to orient a

hurdle or cause another to emerge; and if it eliminated all members of two chains, it

would orient two hurdles). If ρ eliminates all members of one chain, then, as shown in

Lemma 3.12 (see footnote 1), it must act on one component that meets the criterion for u, and another that meets one of the first three criteria under case (a) for v. Suppose instead that ρ orients two hurdles and causes another to emerge. We showed in Lemma 3.7 that this will occur iff u and v or their separating hurdles form a double superhurdle. Furthermore, if there is a double superhurdle, there cannot be a single, unanchored hurdle chain (because a double superhurdle always corresponds to a cycle). Thus we address the fourth criterion under case (a) for v.

Footnote 1: The arguments of Lemma 3.12 hold as well for a hurdle chain as for a superhurdle chain.

The re-introduction of fortresses also can be handled much the same way as in

section 3.3. We will not present results on fortresses here, however, as addressing

them is tedious, and for all practical purposes they are irrelevant to the median

problem (we have never seen a fortress in a real or synthetic data set, unless it was

manually inserted).

Using Lemmata 4.1, 4.2, 4.3, and 4.4, we can easily extend Algorithm 3.2 to

enumerate neutral reversals as well as sorting reversals (notice that we must use

the detect routine for Lemma 4.1). We will not present the details of this exercise

either, as it is straightforward and would only require several pages much like those

of the last chapter. Instead, for the remainder of this section, we will simply assume

a version of find all sorting reversals that returns a pair (S,N), where S is the

set of all sorting reversals and N is the set of all neutral reversals. We will call this

the “augmented version” of Algorithm 3.2.
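One way a caller might use the pair (S, N) is to update a known distance without recomputing it, since every reversal changes the distance by −1, 0, or +1. The sketch below is an assumption about usage, not the thesis code; the function name `neighbor_distance` and the sample sets are hypothetical.

```python
def neighbor_distance(d_v, rho, sorting, neutral):
    """Distance of the neighbor rho(v) from a reference permutation.

    d_v is d(v, reference); sorting and neutral are the sets S and N of
    sorting and neutral reversals of v with respect to that reference.
    """
    if rho in sorting:
        return d_v - 1
    if rho in neutral:
        return d_v
    return d_v + 1  # every remaining reversal is anti-sorting


# Made-up data: reversals are (i, j) pairs and the set contents are illustrative only.
S = {(0, 2), (1, 3)}
N = {(0, 1)}
updated = [neighbor_distance(5, rho, S, N) for rho in [(0, 2), (0, 1), (2, 3)]]
```

With set membership tests, each update takes constant expected time once S and N are available.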

4.2 The Algorithm

We are now prepared to present the sorting algorithm for the reversal median problem. Notice that this algorithm (see Algorithm 4.1) is very similar to the metric algorithm (cf. Algorithm 2.1). The essential differences are that the sorting algorithm finds appropriate neighbors of each vertex using the augmented version of Algorithm 3.2, and that it runs in two passes: on the first pass it tries quickly to find a perfect median, and on the second pass it thoroughly searches for an optimal median.

A few details deserve comment. As mentioned, the algorithm runs in two passes;

when pass = 1 it seeks only a perfect median, and when pass = 2, it visits exactly



the vertices visited by Algorithm 2.1 (but identifies them more efficiently). The set

R contains all reversals that are initial candidates for determining viable neighbors

of an intermediate vertex v. When pass = 1, R contains only reversals that sort

with respect to both ψ1 and ψ2 (step 1), and when pass = 2, R contains all possible

reversals except neutral or sorting reversals with respect to the origin, ψorig (step 2;

such reversals can be ignored in the same way that vertices not farther from the

origin were ignored in step 9 of Algorithm 2.1). A candidate neighbor w is generated

from each candidate reversal ρ; its distance from the origin is known to be one more

than that of v (step 3). The distances of w from ψ1 and ψ2 can be determined from

those of v by examining the membership of ρ in the sets S1, S2, N1, and N2—which

include all sorting and neutral reversals with respect to ψ1 and ψ2 (steps 4 and 5;

note that, during pass 1, ρ will always belong to S1 and S2). Unlike in Algorithm 2.1,

the distances from ψ1 and ψ2 must now be stored with each vertex v in the priority

stack. A flag is set when a perfect median is found, and that flag is checked at the

end of pass 1 (step 6), so that if a perfect median has been found, the algorithm can

avoid its second pass. Note that the algorithm can also exit if a median has been

found that has a score of one more than a perfect median (step 7). The reason is

that, in this case, there can exist no perfect median, so the candidate that has been

found must be minimal. If pass 1 does fail, all marks of vertices must be cleared

before pass 2 begins (step 8; note that, in this case, the priority stack must already

be empty).
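The way a candidate neighbor w inherits its distances from v (steps 3 to 5), together with the two bound computations of Algorithm 4.1, can be sketched as follows. This is a minimal illustration in Python rather than the C implementation; the Vertex class and the helper names step_delta and make_child are assumptions introduced here, and the sets S1, N1, S2, and N2 are taken as already computed by the augmented routine.

```python
from dataclasses import dataclass


@dataclass
class Vertex:
    label: tuple
    dist: int       # distance from psi_orig
    d1: int         # distance to psi_1
    d2: int         # distance to psi_2
    best: int = 0   # w_best in Algorithm 4.1
    worst: int = 0  # w_worst in Algorithm 4.1


def step_delta(rho, S, N):
    """Distance change caused by rho: -1 if sorting, 0 if neutral, +1 otherwise."""
    if rho in S:
        return -1
    if rho in N:
        return 0
    return 1


def make_child(v, rho, new_label, S1, N1, S2, N2, d_sep):
    """Build neighbor w of v reached by reversal rho (hypothetical helper)."""
    w = Vertex(label=new_label,
               dist=v.dist + 1,                    # step 3
               d1=v.d1 + step_delta(rho, S1, N1),  # step 4
               d2=v.d2 + step_delta(rho, S2, N2))  # step 5
    # Integer ceiling of (d1 + d2 + d_sep) / 2 for non-negative integers.
    w.best = w.dist + (w.d1 + w.d2 + d_sep + 1) // 2
    w.worst = w.dist + min(w.d1 + d_sep, w.d1 + w.d2, d_sep + w.d2)
    return w
```

With v at distance 3 from both ψ1 and ψ2, d_sep = 4, and a reversal that sorts with respect to both, the child has distances 2 and 2 and both bounds equal 5. No distance computation is needed for w; only set membership of ρ is consulted.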



Input: Three signed permutations of size n: π1, π2, and π3.

Output: An optimal reversal median M.

    begin
        Set up d_{1,2}, d_{1,3}, d_{2,3}, M_min, M_max, ψ_orig, ψ_1, ψ_2, and priority stack s as in Algorithm 2.1;
        Create vertex v with v_label = ψ_orig, v_dist = 0, v_d1 = d_{ψ_orig,ψ_1}, v_d2 = d_{ψ_orig,ψ_2}, v_best = M_min, v_worst = M_max;
        push(s, v); M ← ψ_orig; d_sep ← d_{ψ_1,ψ_2}; pmed ← false;
        for pass ← 1 to 2 do
            while s is not empty do
                pop(s, v);
                if v_best ≥ M_max then break;
                (S_1, N_1) ← find_all_sorting_reversals(v_label, ψ_1);
                (S_2, N_2) ← find_all_sorting_reversals(v_label, ψ_2);
    1           if pass = 1 then R ← S_1 ∩ S_2;
                else
                    (S_orig, N_orig) ← find_all_sorting_reversals(v_label, ψ_orig);
    2               R ← {ρ | ρ is a possible reversal on v_label and ρ ∉ (S_orig ∪ N_orig)};
                end
                foreach ρ ∈ R do
    3               create vertex w with w_label = ρ(v_label), w_dist = v_dist + 1;
                    if w is marked then continue else mark w;
    4               if ρ ∈ S_1 then w_d1 ← v_d1 − 1; else if ρ ∈ N_1 then w_d1 ← v_d1; else w_d1 ← v_d1 + 1;
    5               if ρ ∈ S_2 then w_d2 ← v_d2 − 1; else if ρ ∈ N_2 then w_d2 ← v_d2; else w_d2 ← v_d2 + 1;
                    w_best ← w_dist + ⌈(w_d1 + w_d2 + d_sep) / 2⌉;
                    w_worst ← w_dist + min{(w_d1 + d_sep), (w_d1 + w_d2), (d_sep + w_d2)};
                    if w_worst = M_min then M ← w_label; pmed ← true; break;
                    if w_best < M_max then push(s, w, w_best);
                    if w_worst < M_max then M ← w_label; M_max ← w_worst;
                end
            end
            if pass = 1 then
    6           if pmed = true then break;
    7           else if M_max = (d_{1,2} + d_{1,3} + d_{2,3}) / 2 + 1 then break;
    8           else clear all marks
            end
        end
    end

Algorithm 4.1: find_reversal_median_sorting



One subtlety is important about step 7. By definition, a perfect median must have score $\lceil (d_{1,2} + d_{1,3} + d_{2,3})/2 \rceil$. If the sum $d_{1,2} + d_{1,3} + d_{2,3}$ is even, then $\lceil (d_{1,2} + d_{1,3} + d_{2,3})/2 \rceil = (d_{1,2} + d_{1,3} + d_{2,3})/2$, and a perfect median indeed must fall (as we have assumed) on a shortest path between every two of the three permutations π1, π2, and π3. If that sum is odd, however, then a perfect median must fall on a shortest path between the members of two pairs of the starting permutations, but not on a shortest path between the members of the third pair (otherwise its median score would not be an integer, which is impossible). Therefore, we can apply the assumption inherent in step 7 (that there exists no perfect median) only if the sum is even; hence the use of the term $(d_{1,2} + d_{1,3} + d_{2,3})/2$ in step 7, rather than the term $\lceil (d_{1,2} + d_{1,3} + d_{2,3})/2 \rceil$. When the sum is odd, the unrounded term is not an integer, so the integer-valued $M_{\max}$ can never equal it and the test in step 7 cannot succeed.
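The exit tests of steps 6 and 7 can be written out as a small predicate. The sketch below is an illustration in Python under the definitions above, not the thesis code; the function names perfect_score and can_skip_second_pass are hypothetical, and the parity check is made explicit instead of relying on a non-integer comparison.

```python
def perfect_score(d12, d13, d23):
    """Score of a perfect median: ceiling of half the sum of pairwise distances."""
    return (d12 + d13 + d23 + 1) // 2


def can_skip_second_pass(pmed, m_max, d12, d13, d23):
    """True if pass 1 already proves optimality (steps 6 and 7 of Algorithm 4.1)."""
    if pmed:  # step 6: a perfect median was found
        return True
    total = d12 + d13 + d23
    # Step 7: the shortcut is valid only when the sum is even, because only
    # then must a perfect median lie on shortest paths between every pair.
    return total % 2 == 0 and m_max == total // 2 + 1
```

With pairwise distances 4, 4, 4 and a best score of 7 after pass 1, the predicate holds and pass 2 is skipped. With distances 4, 4, 3 the perfect score is 6, yet a best score of 7 does not allow the shortcut, because the sum is odd.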

Now we shall formally prove the correctness of the algorithm.

Theorem 4.1 Algorithm 4.1 will return a median M only if M is a median of the

input permutations π1, π2, and π3.

Proof: The algorithm returns either after the first pass or after the second pass. If it returns after the first pass, it has found a candidate median M that either has the lowest possible score (step 6), in which case M is clearly a median, or has a score one greater than that of a perfect median, while at the same time the sum of the pairwise distances among π1, π2, and π3 is even (step 7). In the latter case, a perfect median must have score $(d_{1,2} + d_{1,3} + d_{2,3})/2$, and hence must lie on a shortest path between every two of π1, π2, and π3 (otherwise the triangle inequality would be violated). The algorithm has examined every potential median (according to the bounds of Theorem 2.1) on shortest paths from ψorig to ψ1, and from ψorig to ψ2, and has not found a perfect median. Therefore, there can exist no perfect median, and the best possible median has a score of one greater than that of a perfect median. Thus, M has the lowest possible score and must be a median.
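The appeal to the triangle inequality can be spelled out in a few lines. In the derivation below, the symbols a, b, and c are shorthand introduced here (they do not appear in the thesis) for the distances from a candidate median M to π1, π2, and π3.

```latex
\begin{align*}
  a + b &\ge d_{1,2}, \qquad a + c \ge d_{1,3}, \qquad b + c \ge d_{2,3}
      && \text{(triangle inequality)} \\
  2\,(a + b + c) &\ge d_{1,2} + d_{1,3} + d_{2,3}
      && \text{(sum of the three)} \\
  a + b + c = \tfrac{1}{2}\,(d_{1,2} + d_{1,3} + d_{2,3})
      &\;\Longrightarrow\; a + b = d_{1,2},\; a + c = d_{1,3},\; b + c = d_{2,3}
\end{align*}
```

If the score of M attains the bound, none of the three inequalities can be strict, so M lies on a shortest path between every pair of the input permutations.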

On the other hand, if the algorithm returns after the second pass, then it has exactly emulated Algorithm 2.1, which we have shown to be correct. On pass 2, the algorithm differs from Algorithm 2.1 only in the way it computes the distance of each intermediate vertex w from π1, π2, and π3. It derives the distances of w from those of v, the neighbor of w. The distance of w with respect to each ψ ∈ {ψorig, ψ1, ψ2} is set to be one less than that of v iff w is obtained by a sorting reversal of v with respect to ψ, equal to that of v iff w is obtained by a neutral reversal of v with respect to ψ, and one more than that of v iff w is obtained by an anti-sorting reversal of v with respect to ψ. We have shown that the routine find_all_sorting_reversals correctly enumerates sorting and neutral reversals, and we know that all other reversals are anti-sorting reversals. Thus, the distances of w with respect to ψorig, ψ1, and ψ2 must be correct, and the sorting algorithm on pass 2 must be equivalent to the metric algorithm.

4.3 Experimental Methods and Results

We implemented Algorithm 4.1 in C, re-using much of the code of program find-reversal median from chapter 2, and calling program find-all-sr from chapter 3 as a subroutine (we augmented program find-all-sr to enumerate neutral reversals). The hashing mechanism and priority stack from program find-reversal median were re-used directly. One other implementation detail is worthy of comment: we used a fixed array, rather than a hash table, to record the set membership of each candidate reversal. This approach allows steps 4 and 5 of the algorithm to be executed in constant time (and is also useful in determining the membership of R during pass 2); it is possible because the number of reversals cannot exceed $(n + 1)^2$, and for our purposes n is rarely more than 100, and certainly never more than 1000. Because we need to track membership only in sets S1, S2, N1, N2, and (Sorig ∪ Norig), we can use a single array of 8-bit elements, and accomplish marking using bit masks.
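The fixed-array idea can be illustrated with a short sketch. The version below is written in Python for brevity (the actual implementation was in C), and the class name ReversalMarks, the mask constants, and the row-major indexing of a reversal by its endpoint pair (i, j) are all assumptions made for illustration.

```python
# One bit per set; five sets fit comfortably in an 8-bit element.
IN_S1, IN_N1, IN_S2, IN_N2, IN_ORIG = 0x01, 0x02, 0x04, 0x08, 0x10


class ReversalMarks:
    """Set membership of reversals in a flat array of (n + 1)^2 bytes."""

    def __init__(self, n):
        self.n = n
        self.marks = bytearray((n + 1) * (n + 1))

    def _index(self, i, j):
        # Row-major position of the reversal with endpoints (i, j).
        return i * (self.n + 1) + j

    def mark(self, rho, mask):
        self.marks[self._index(*rho)] |= mask

    def has(self, rho, mask):
        # Constant-time membership test, as needed in steps 4 and 5.
        return bool(self.marks[self._index(*rho)] & mask)

    def clear(self):
        self.marks = bytearray(len(self.marks))
```

For n = 100 the array occupies 10201 bytes, and membership of a reversal in S1 or N1 is a single index computation and bitwise AND, with no hashing involved.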

Using test data of the same type as described in section 2.4, we compared the

performance of the sorting and metric algorithms. Figure 4.1 shows results for n = 50

and n = 100, with the mean and median execution times plotted over 100 experiments

for various values of r. The sorting algorithm clearly is much faster for all values

of r. In addition, its running time appears to grow more slowly as r becomes larger.

Figure 4.2 shows a detailed view of the running time of the sorting algorithm. Notice

that the median time grows extremely slowly and approximately linearly; the mean

time, on the other hand, suddenly shoots up at r = 9 for n = 50 and at r = 12

for n = 100. At these values, the sorting algorithm appears to have to resort to its

second pass more frequently, and when executing that pass it becomes “mired” in

the same way as reported in chapter 2.

Figure 4.3 shows the speedup of the sorting algorithm with respect to the metric

algorithm. A speedup of more than two orders of magnitude is achieved over a range

of values. It is notable that the speedup becomes larger as r increases, although at

a much greater rate for n = 100 than for n = 50. The median and mean speedups

appear to be approximately equivalent.

4.4 Summary

In this chapter, we combine the metric median algorithm of chapter 2 with the

solution of chapter 3 for the “all sorting reversals” problem, and arrive at a much

faster algorithm to find a reversal median. Before addressing the median problem,

however, we must adapt our results of the previous chapter to enumerate neutral

reversals as well as sorting reversals; otherwise, we could not address the case where

a perfect median does not exist. Our final algorithm uses two passes, attempting

on the first to find a perfect median rapidly, and then if the first pass fails, falling



back on a strategy that effectively emulates the metric algorithm, but performs its

branching step much more efficiently. Experimental results indicate that the new

“sorting” algorithm performs dramatically better than the metric algorithm over a

range of parameters, enabling a speedup of between about one and more than two

orders of magnitude. The new algorithm still has difficulty, however, when distances

between permutations become moderately large (when r reaches 14-18% of n, or

pairwise distances reach about 2r, or 28-36%, of n). At these larger distances, the

algorithm appears to find most medians rapidly, but occasionally becomes “stuck”

(presumably by problems that do not have perfect medians). Interestingly, in these

problematic cases, the algorithm appears usually to find quite good medians on its

first pass; it simply cannot prove that they are optimal. This behavior suggests that

the first pass of the algorithm could serve as an excellent heuristic solution for the

median problem.

Finally, let us comment briefly on the reversal median algorithm of Caprara [8]. This algorithm also uses a branch-and-bound strategy², but one of a completely different nature: the algorithm performs edge contraction on a multi-breakpoint graph (a version of the breakpoint graph that captures relationships among more than two permutations), rather than directly exploring the space of all possible signed permutations. Unfortunately, at the time of submission of this thesis, a thorough experimental comparison of our sorting algorithm and Caprara's algorithm has not yet been possible. We have, however, run preliminary tests that indicate that the two algorithms behave quite differently. Caprara's algorithm appears to be much more sensitive to n than is ours, and much less sensitive to r. As a result, it appears to be preferable when n is small and r is large, and ours appears to be preferable when n is large and r is small (note that the latter is perhaps a more interesting case for phylogenetic analysis). We hope to compare the two comprehensively in the coming months, and to explore ways in which these different strategies might be combined.

² Caprara presents both an exact algorithm and a heuristic algorithm for the problem; the exact algorithm uses branch-and-bound.



[Figure 4.1 (plots not reproduced): two panels (n = 50 and n = 100), each plotting execution time t (msec) against r for four series: metric (mean), metric (median), sorting (mean), sorting (median).]

Figure 4.1: Comparison of metric and sorting median algorithms for n = 50 and n = 100 over various values of r. Plotted are mean and median execution time in milliseconds. Each point corresponds to 100 experiments. Metric algorithm did not complete for n = 50, r > 8 or n = 100, r > 6.



[Figure 4.2 (plot not reproduced): execution time t (msec) against r for four series: n = 50 (mean), n = 50 (median), n = 100 (mean), n = 100 (median).]

Figure 4.2: Detailed view of performance of sorting algorithm. Experiments are as described in Figure 4.1.

[Figure 4.3 (plot not reproduced): speedup $t_{metric} / t_{sorting}$ against r for four series: n = 50 (mean), n = 50 (median), n = 100 (mean), n = 100 (median).]

Figure 4.3: Speedup of sorting algorithm with respect to metric algorithm. Experiments are as described in Figure 4.1.


References

[1] D.A. Bader, B.M.E. Moret, and M. Yan, A linear-time algorithm for computing inversion distance between signed permutations with an experimental study, Algorithms and Data Structures: Seventh International Workshop, WADS 2001, Brown University, Providence, RI, August 8-10, 2001, Proceedings (F. Dehne, J.-R. Sack, and R. Tamassia, eds.), Lecture Notes in Computer Science, vol. 2125, Springer, 2001, pp. 365–376.

[2] A. Bergeron, A very elementary presentation of the Hannenhalli-Pevzner theory, Combinatorial Pattern Matching, 12th Annual Symposium, CPM 2001 Jerusalem, Israel, July 1-4, 2001 Proceedings (Amihood Amir and Gad M. Landau, eds.), Lecture Notes in Computer Science, vol. 2089, Springer, 2001, pp. 106–117.

[3] A. Bergeron and F. Strasbourg, Experiments in computing sequences of reversals, Algorithms in Bioinformatics, First International Workshop, WABI 2001, Aarhus, Denmark, August 28-31, 2001, Proceedings (Olivier Gascuel and Bernard M. E. Moret, eds.), Lecture Notes in Computer Science, vol. 2149, Springer, 2001, pp. 164–174.

[4] P. Berman and S. Hannenhalli, Fast sorting by reversal, Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching (D. Hirschberg and E. Myers, eds.), 1996, pp. 168–185.

[5] M. Blanchette, T. Kunisawa, and D. Sankoff, Parametric genome rearrangement, Gene 172 (1996), GC11–GC17.

[6] A. Caprara, Sorting by reversals is difficult, Proceedings of the First Annual International Conference on Computational Molecular Biology (RECOMB-97) (S. Istrail, P.A. Pevzner, and M.S. Waterman, eds.), 1997, pp. 75–83.

[7] A. Caprara, Formulations and complexity of multiple sorting by reversals, Proceedings of the Third Annual International Conference on Computational Molecular Biology (RECOMB-99), Lyon, France (S. Istrail, P.A. Pevzner, and M.S. Waterman, eds.), April 1999, pp. 84–93.

[8] A. Caprara, On the practical solution of the reversal median problem, Algorithms in Bioinformatics, First International Workshop, WABI 2001, Aarhus, Denmark, August 28-31, 2001, Proceedings (Olivier Gascuel and Bernard M. E. Moret, eds.), Lecture Notes in Computer Science, vol. 2149, Springer, 2001, pp. 238–251.

[9] L.L. Cavalli-Sforza and A.W.F. Edwards, Phylogenetic analysis: models and estimation procedures, Am. J. Hum. Genet. 19 (1967), 233–257.

[10] J. Dicks, Chromtree: Maximum likelihood estimation of chromosomal phylogenies, Comparative Genomics (D. Sankoff and J.H. Nadeau, eds.), Kluwer Academic Press, 2000.

[11] T. Dobzhansky and A.H. Sturtevant, Inversions in the chromosomes of Drosophila pseudoobscura, Genetics 23 (1938), 28–64.

[12] R.V. Eck and M.O. Dayhoff, Atlas of protein sequence and structure, Natl. Biomed. Res. Found., Washington, DC, 1966.

[13] J. Felsenstein, Maximum-likelihood and minimum-step methods for estimating evolutionary trees from data on discrete characters, Syst. Zool. 22 (1973), 240–249.

[14] W. Fitch, On the problem of discovering the most parsimonious tree, Am. Nat. 111 (1977), 223–257.

[15] W.H. Gates and C.H. Papadimitriou, Bounds for sorting by prefix reversals, Discrete Mathematics 27 (1979), 47–57.

[16] D.S. Goldberg, S. McCouch, and J. Kleinberg, Algorithms for constructing comparative maps, Comparative Genomics (D. Sankoff and J.H. Nadeau, eds.), Kluwer Academic Press, 2000.

[17] S. Hannenhalli, C. Chappey, E.V. Koonin, and P.A. Pevzner, Genome sequence comparison and scenarios for gene rearrangements: A test case, Genomics 30 (1995), 299–311.

[18] S. Hannenhalli and P.A. Pevzner, Transforming cabbage into turnip (polynomial algorithm for sorting signed permutations by reversals), Proceedings of the 27th Annual ACM Symposium on the Theory of Computing, 1995, pp. 178–189.

[19] S. Hannenhalli and P.A. Pevzner, Transforming men into mice (polynomial algorithm for genomic distance problem), Proceedings of the 36th Annual IEEE Symposium on Foundations of Computer Science, 1995, pp. 581–592.

[20] H. Kaplan, R. Shamir, and R.E. Tarjan, A faster and simpler algorithm for sorting signed permutations by reversals, SIAM Journal of Computing 29 (1999), no. 3, 880–892.

[21] A. McLysaght, C. Seoighe, and K.H. Wolfe, High frequency of inversions during eukaryote gene order evolution, Comparative Genomics (D. Sankoff and J.H. Nadeau, eds.), Kluwer Academic Press, 2000, pp. 47–58.

[22] B.M.E. Moret, L.-S. Wang, T. Warnow, and S.K. Wyman, New approaches for reconstructing phylogenies from gene-order data, Bioinformatics 17 (2001), S165–S173, Presented at the Ninth International Conference on Intelligent Systems for Molecular Biology, ISMB-2001.

[23] J.H. Nadeau and B.A. Taylor, Lengths of chromosomal segments conserved since divergence of man and mouse, Proceedings of the National Academy of Sciences 81 (1984), 814–818.

[24] Pavel A. Pevzner, Computational molecular biology: An algorithmic approach, MIT Press, 2000.

[25] N. Saitou and M. Nei, The neighbor-joining method: a new method for reconstructing phylogenetic trees, Journal of Molecular Biology 4 (1987), 406–425.

[26] D. Sankoff, Genome rearrangements with gene families, Bioinformatics 15 (1999), 909–917.

[27] D. Sankoff and M. Blanchette, Multiple genome rearrangement and breakpoint phylogeny, Journal of Computational Biology 5 (1998), no. 3, 555–570.

[28] D. Sankoff, D. Bryant, M. Deneault, B.F. Lang, and G. Burger, Early eukaryote evolution based on mitochondrial gene order breakpoints, Journal of Computational Biology 7 (2000), no. 3/4, 521–535.

[29] D. Sankoff and N. El-Mabrouk, Duplication, rearrangement and reconciliation, Comparative Genomics (D. Sankoff and J.H. Nadeau, eds.), Kluwer Academic Press, 2000, pp. 537–550.

[30] D. Sankoff and N. El-Mabrouk, Genome rearrangement, Topics in Computational Biology (T. Jiang, Y. Xu, and M. Zhang, eds.), MIT Press, 2001.

[31] D. Sankoff, V. Ferretti, and J.H. Nadeau, Conserved segment identification, Journal of Computational Biology 4 (1997), no. 4, 559–565.

[32] D. Sankoff, G. Leduc, N. Antoine, B. Paquin, and B.F. Lang, Gene order comparisons for phylogenetic inference: Evolution of the mitochondrial genome, Proceedings of the National Academy of Sciences 89 (1992), 6575–6579.

[33] D. Sankoff and J.H. Nadeau (eds.), Comparative genomics, Kluwer Academic Press, 2000.

[34] J. Setubal and J. Meidanis, Introduction to computational molecular biology, PWS Publishing, 1997.

[35] A.C. Siepel and B.M.E. Moret, Finding an optimal inversion median: Experimental results, Algorithms in Bioinformatics, First International Workshop, WABI 2001, Aarhus, Denmark, August 28-31, 2001, Proceedings (Olivier Gascuel and Bernard M. E. Moret, eds.), Lecture Notes in Computer Science, vol. 2149, Springer, 2001, pp. 189–203.

[36] D. Waddington, Estimating the number of conserved segments between species using a chromosome-based model, Comparative Genomics (D. Sankoff and J.H. Nadeau, eds.), Kluwer Academic Press, 2000.

[37] G.A. Watterson, W.J. Ewens, and T.E. Hall, The chromosome inversion problem, Journal of Theoretical Biology 99 (1982), 1–7.
