
BioSystems 72 (2003) 75–97

A memetic-aided approach to hierarchical clustering from distance matrices: application to gene expression clustering and phylogeny

Carlos Cotta a,∗, Pablo Moscato b

a Dept. Lenguajes y Ciencias de la Computación, Universidad de Málaga, ETSI Informática (3.2.49), Campus de Teatinos, 29071 Málaga, Spain

b Faculty of Engineering and Built Environment, School of Electrical Engineering and Computer Science, The University of Newcastle, Callaghan, 2308 NSW, Australia

Abstract

We propose a heuristic approach to hierarchical clustering from distance matrices based on the use of memetic algorithms (MAs). By using MAs to solve some variants of the Minimum Weight Hamiltonian Path Problem on the input matrix, a sequence of the individual elements to be clustered (referred to as patterns) is first obtained. While this problem is also NP-hard, a probably optimal sequence is easy to find with the current advances for this problem and helps to prune the space of possible solutions and/or to guide the search performed by an actual clustering algorithm. This technique has been successfully applied to both a Branch-and-Bound algorithm, and to evolutionary algorithms and MAs. Experimental results are given in the context of phylogenetic inference and in the hierarchical clustering of gene expression data.

© 2003 Elsevier Ireland Ltd. All rights reserved.

Keywords: Hierarchical clustering; Memetic algorithms; Phylogenetic inference; Gene expression; Data mining

1. Introduction

With the amount of data generated by the genomes already sequenced, and the many initiatives in the life sciences already planned and in execution, the task of dealing with large-scale combinatorial problems arising in genomics is undoubtedly one of the greatest challenges to be addressed by computer science researchers (Koonin, 1999; Wooley, 1999). New techniques, and new insights for algorithm design, are needed. In particular, this is a challenge that will reshape computer science, particularly if these new methods can be easily adapted to high-performance, distributed computing systems.

∗ Corresponding author. Tel.: +34-952-137158; fax: +34-952-131397.

E-mail address: [email protected] (C. Cotta).

These challenging tasks include the analysis of gene expression data, pairwise genome comparison, and the construction of evolutionary trees. Their difficulty stems from both their computational complexity and the sheer amount of data to process. For example, the Ribosomal Database Project (http://www.cme.msu.edu/RDP/) currently contains evolutionary trees that reflect the history of 6200 prokaryotic sequences, 2000 eukaryotic sequences and 1500 mitochondrial sequences. While phylogenetic trees can undoubtedly help to order the data, the existence of "horizontal gene transfer" cannot be dismissed in future analysis and poses a challenge (Brown, 2003).

This paper is written with the aim of providing novel insights for the development of complementary computational tools. They are complementary since, for instance, they can be used to display particular correlations in the data that are seldom taken into account to produce the final layout of genomic information which has been already classified using some hierarchical clustering algorithm. On the other hand, the method can be used as a preprocessing step when dealing with a large number of patterns to be clustered. This, in turn, may lead to practical increases in clustering algorithm performance.

0303-2647/$ – see front matter © 2003 Elsevier Ireland Ltd. All rights reserved. doi:10.1016/S0303-2647(03)00136-9

The last decade has witnessed extraordinary progress in solving one particular combinatorial problem which belongs to the NP-hard class, the Traveling Salesman Problem (TSP). The current code in the public domain is extremely fast and can handle large-scale ordering problems, thus providing useful complementary heuristics that can help to cluster data derived from distance measures. The utility of such heuristics will be exemplified here in the context of two important tasks in the biosciences: the inference of phylogenetic trees, and the hierarchical clustering of gene expression data obtained from microarray experimentation.

2. Hierarchical clustering techniques

For many different applications in the area of information extraction, pattern recognition and data mining, the use of methods that allow a multi-level description of the data is very useful. Some of these hierarchical methods are "supervised" while others are "unsupervised." The former class denotes methods where some problem-domain knowledge is available. In clustering applications, hierarchical algorithms tend to be more versatile than their partitional counterparts (e.g. the well-known k-means algorithm), since the latter class tends to work well only on data sets having isotropic clusters (Jain et al., 1999).

The approach in this paper can be viewed as a hybrid in the sense that, while we aim at reducing time to completion for an unsupervised hierarchical clustering algorithm, we do it by solving another problem and using the optimal (or high-quality) solution(s) of this other problem as "hints." It can be argued that the approach is unsupervised since it only uses information from the input distance matrix. On the other hand, it can be viewed as supervised since we use "knowledge" obtained by solving another associated problem to speed up the algorithm. The methodology could be considered one of "learning from hints" (Abu Mostafa, 1990), since hints can be viewed as prior information about the function to be modeled or, as in this case, the solution to be found.

Most hierarchical clustering techniques are variants of the single-link (Sneath and Sokal, 1973), complete-link (King, 1967) and minimum variance (Ward, 1963; Murtagh, 1984) algorithms. The first two algorithms have in common that two clusters are merged based on a minimum distance criterion, but they differ in an important way. In the single-link algorithm, the distance between two clusters is the minimum of the distances between all pairs of patterns drawn from the two clusters (one pattern from the first cluster, the other from the second). In the complete-link algorithm the distance between two clusters is the maximum of all pairwise distances between patterns in the two clusters. This is illustrated by the following pseudocode:

1. Initialize by assigning each pattern to its own cluster.

2. Construct a list of interpattern distances for all distinct unordered pairs of patterns. Sort the list in ascending order.

3. Step through the sorted list of distances, forming for each distinct dissimilarity value $d_k$ a graph on the patterns where pairs of patterns closer than $d_k$ are connected by a graph edge. Stop when:
   • (Single-link) all patterns are members of a connected component;
   • (Complete-link) all patterns are members of a completely connected component.
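The two linkage rules can be illustrated with a minimal Python sketch. It uses the equivalent agglomerative formulation (repeatedly merging the two closest clusters) rather than the graph-based stepping described above, and the function name `agglomerate` and the return format are choices made here for illustration only.

```python
from itertools import combinations

def agglomerate(dist, linkage="single"):
    """Naive agglomerative clustering on a symmetric distance matrix.

    Returns the merge history as a list of (cluster_a, cluster_b, distance)
    tuples, where clusters are frozensets of pattern indices.
    """
    # single-link: closest pair of patterns; complete-link: farthest pair
    pick = min if linkage == "single" else max
    clusters = [frozenset([i]) for i in range(len(dist))]
    history = []
    while len(clusters) > 1:
        best = None
        # Find the two clusters with the smallest inter-cluster distance.
        for a, b in combinations(clusters, 2):
            d = pick(dist[i][j] for i in a for j in b)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        history.append((a, b, d))
    return history
```

The sketch is cubic or worse in the number of patterns and is meant only to make the difference between the two merging criteria concrete.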

As a consequence of the criterion chosen for stopping the agglomeration of clusters, complete-link tends to produce tightly bound or compact clusters, while the single-link algorithm tends to produce elongated clusters (a phenomenon called the chaining effect; Nagy, 1968). While there may be cases where one is better than the other, it is generally accepted that complete-link algorithms tend to produce more meaningful hierarchies than single-link algorithms.

If each cluster is represented as a node, and every time two clusters $C_1$ and $C_2$ are joined into a cluster $C$, two directed arcs $C \to C_1$ and $C \to C_2$ are added, the execution of the procedure sketched above will ultimately produce a tree with a complete hierarchy of the patterns (the leaves of the tree). The quality of the hierarchical clustering represented by this tree can be measured in different ways. We have considered the ultrametric model for evaluation. In this model, each of the tree edges is assigned a weight, such that the distance matrix $\hat{M}$ it induces (by measuring the length of the paths in the tree between each pair of nodes) bounds the original distance matrix $M$ from above (therefore, $\hat{M}_{ij} \ge M_{ij}$) and $\hat{M}$ is ultrametric (Bandelt et al., 1990). In this case, it holds that

$$\hat{M}_{ij} \le \max\{\hat{M}_{ik}, \hat{M}_{jk}\}, \quad 1 \le i, j, k \le n. \tag{1}$$

If $\hat{M}$ is ultrametric, then the distance in $T$ between any interior node and any descendant leaf is the same. This condition provides a very good approximation to the optimal solution under more relaxed assumptions such as mere additivity ($\hat{M}$ is additive if for any $i, j, k, l$, the maximum of $\hat{M}_{ij} + \hat{M}_{kl}$, $\hat{M}_{ik} + \hat{M}_{jl}$, $\hat{M}_{il} + \hat{M}_{jk}$ is not unique). It is also easy to compute: for a given tree $T$, and distance matrix $M$, minimum edge weights can be determined in $O(n^2)$ time using the algorithm presented by Wu et al. (1999). Once these edge weights are computed, the total weight of the tree indicates the quality of the clustering (the lower the weight, the better the clustering).
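As an illustration of how a fixed topology can be scored under the ultrametric model, the following Python sketch places each internal node at half the largest original distance between the leaves below it, which is the lowest height for which the induced distances dominate $M$. This is a straightforward, non-optimized rendering and is not claimed to be the $O(n^2)$ procedure of Wu et al. (1999); the nested-tuple tree encoding and the function names are assumptions made here.

```python
def leaves(tree):
    """Leaf indices of a binary tree given as nested 2-tuples of ints."""
    if isinstance(tree, int):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])

def ultrametric_weight(tree, M):
    """Total edge weight of the minimum ultrametric tree for a fixed topology.

    Assumption: an internal node sits at height max(M[i][j]) / 2 taken over
    all leaves i, j below it, so that every induced distance is >= M[i][j].
    """
    def visit(node):
        # Returns (height of node, total edge weight of the subtree below it).
        if isinstance(node, int):
            return 0.0, 0.0
        below = leaves(node)
        height = max(M[i][j] for i in below for j in below) / 2.0
        total = 0.0
        for child in node:
            child_height, child_weight = visit(child)
            total += child_weight + (height - child_height)  # edge node -> child
        return height, total

    return visit(tree)[1]
```

For a three-pattern matrix with distances 2, 6 and 4, the topology that joins the two closest patterns first receives the lower weight, in line with the "lower is better" criterion above.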

3. Obtaining an optimal path in pattern space

Assume the input to be a distance matrix $\{d_{ij} = d(x_i, x_j)\}$ between a set $S$ of patterns (DNA sequences, gene expression values, etc.), $S = \{x_1, \ldots, x_{|S|}\}$. This distance matrix can then be used to classify the patterns by hierarchical means. If we would like to uncover additional correlations present in the data, one possibility is to find an optimal path (to be defined later) among all the patterns in the set, without repeating a visit to a pattern. One plausible objective then would be to find the path that minimizes the total length. Such a path is known as Hamiltonian, and the length is the sum of the distances between pairs of consecutive patterns on the path order. Given a set of $|S|$ patterns, there are $|S|!/2$ such paths. Note that there are $|S|$ choices for the first visit, $|S| - 1$ for the second one, and so forth. The denominator of 2 is due to the fact that the path can be written backwards and is still the same object. This problem is generally cited in the computer science literature as the Minimum Weight Hamiltonian Path Problem in a Complete Graph. As with many problems in computational biology (e.g. the problem of finding optimal phylogenetic trees under different figures of merit), it is NP-hard. The version in which we are also interested (selecting a particular starting pattern and a finishing pattern to be visited in the Hamiltonian path) is also NP-hard.
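The objective and the $|S|!/2$ count can be made concrete with a small Python sketch. The exhaustive search below is only usable for a handful of patterns and is in no way the memetic algorithm used in this work; all function names are illustrative.

```python
from itertools import permutations
from math import factorial

def path_length(order, d):
    """Sum of distances between consecutive patterns along the path."""
    return sum(d[a][b] for a, b in zip(order, order[1:]))

def brute_force_path(d):
    """Exhaustive search for a minimum weight Hamiltonian path (tiny inputs only)."""
    n = len(d)
    # A path and its reversal are the same object: keep the one with first < last.
    candidates = (p for p in permutations(range(n)) if p[0] < p[-1])
    return min(candidates, key=lambda p: path_length(p, d))

def count_paths(n):
    """Number of distinct Hamiltonian paths on n patterns, |S|!/2."""
    return factorial(n) // 2
```

Already for 20 patterns `count_paths` exceeds $10^{18}$, which is why heuristic search is needed in practice.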

While NP-hardness implies that an efficient, polynomial-time algorithm may be well out of reach, very powerful metaheuristic algorithms do exist that can give extremely good solutions, in acceptable running times, even for large instances. For instance, during the past decade we have applied memetic algorithms (MAs), a very powerful population-based metaheuristic (see Moscato and Cotta, 2003 for a recent survey), to a large variety of instances of the asymmetric TSP, another conspicuous member of the NP-hard class. Our computer code can generally find the optimum tour involving fewer than 100 cities in a matter of seconds, while it generally takes several minutes to produce the optimum tour for instances having hundreds of cities. More details on the particular MA we are using can be found in Buriol et al. (2003) (see Merz and Freisleben, 1997 for another state-of-the-art MA approach to this problem). The code can accept either an asymmetric or symmetric matrix.

This code can be used to address the problem of finding the path of minimum total length in pattern space. Given a set of patterns $S = \{x_1, \ldots, x_{|S|}\}$, a distance matrix among them can be computed (or taken as given). Two auxiliary patterns located "at infinity" in relation to the set $S$, $x_0$ and $x_{|S|+1}$, are defined; they are arbitrarily close to each other but arbitrarily far apart from any other pattern. This means that $d(x_0, x_0) = d(x_{|S|+1}, x_{|S|+1}) = d(x_0, x_{|S|+1}) = d(x_{|S|+1}, x_0) = 0$ and $d(x_0, x) = d(x, x_0) = d(x_{|S|+1}, x) = d(x, x_{|S|+1}) = M$, where $M$ is an arbitrarily large number which will be defined later. These two patterns can be thought of as two "outgroup patterns." The MA is then applied to the new augmented symmetric matrix (which now has $|S| + 2$ rows and columns). Any optimal tour $T$ that has an edge connecting patterns $x_0$ and $x_{|S|+1}$ can be interpreted as a Hamiltonian path $P$ that starts from the pattern next to $x_0$ in $T$ (which is not $x_{|S|+1}$) and ends in the pattern next to $x_{|S|+1}$ (which is not $x_0$). The length of $T$, $L(T)$, and the length of the Hamiltonian path $P$, $L(P)$, are thus related by $L(T) - 2M = L(P)$. Since the value of $2M$ is a constant, minimizing $L(T)$ is equivalent to minimizing $L(P)$. Moreover, note that any tour that does not include the edge $(x_0, x_{|S|+1})$ includes four edges of weight $M$. Let $T_{4M}$ denote such a tour. If $M \ge |S| \max_{ij}\{d_{ij}\}$, where $d_{ij}$ denotes an entry of the original distance matrix between the $|S|$ patterns, then any such tour is longer than any tour that contains the edge $(x_0, x_{|S|+1})$. The advantage of using this transformation is that any existing TSP code can be applied to find such a path through pattern space. The MA described by Buriol et al. (2003) has been used to solve the transformed problem instances.
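The transformation can be sketched in a few lines of Python. The index layout (dummy patterns stored first and last, original pattern $i$ shifted to index $i + 1$) and the helper names are assumptions of this sketch; any TSP solver could be run on the matrix it returns.

```python
def augment(d):
    """Add dummy patterns x_0 (index 0) and x_{|S|+1} (last index).

    Original pattern i is stored at index i + 1 of the returned matrix.
    Returns the augmented matrix and the large constant M that was used.
    """
    n = len(d)
    big = n * max(max(row) for row in d)      # M >= |S| * max d_ij
    aug = [[big] * (n + 2) for _ in range(n + 2)]
    for i in range(n):
        for j in range(n):
            aug[i + 1][j + 1] = d[i][j]
    for a in (0, n + 1):
        for b in (0, n + 1):
            aug[a][b] = 0                     # dummies coincide with each other
    return aug, big

def tour_length(tour, aug):
    """Length of a closed tour given as a list of indices."""
    return sum(aug[a][b] for a, b in zip(tour, tour[1:] + tour[:1]))

def tour_to_path(tour, n):
    """Cut a tour that uses the dummy edge into a path over original indices."""
    k = tour.index(0)
    rot = tour[k:] + tour[:k]                 # rotate so the tour starts at x_0
    if rot[1] == n + 1:                       # orient so x_{|S|+1} comes last
        rot = [rot[0]] + rot[:0:-1]
    assert rot[-1] == n + 1, "tour does not contain the dummy edge"
    return [v - 1 for v in rot[1:-1]]
```

On a three-pattern example, the tour through both dummies is exactly $2M$ longer than the path it encodes, and a tour that avoids the dummy edge is strictly longer.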

Returning to the issue of distance matrix construction, and how this transformation can be achieved, other situations of interest can be considered. For instance, it may be interesting to find a path starting from a given pattern $x_a$ and ending in another one $x_b$. In this case, the method above need only be modified by bringing $x_0$ arbitrarily close to $x_a$ (setting $d(x_0, x_a) = d(x_a, x_0) = 0$) and bringing $x_{|S|+1}$ arbitrarily close to $x_b$ (setting $d(x_{|S|+1}, x_b) = d(x_b, x_{|S|+1}) = 0$), while for all other patterns we continue enforcing that $x_0$ and $x_{|S|+1}$ are "at infinity" (i.e. a value of $M = |S| \max_{ij}\{d_{ij}\}$ is kept).
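A possible rendering of this modification in Python is given below. It assumes an augmented matrix in which the two dummy patterns occupy the first and last index and original pattern $i$ sits at index $i + 1$; this layout and the name `pin_endpoints` are illustrative choices, not part of the original code.

```python
def pin_endpoints(aug, a, b):
    """Return a copy of an augmented matrix that forces the path to run a -> b.

    Assumed layout: dummy x_0 at index 0, dummy x_{|S|+1} at the last index,
    original pattern i at index i + 1.
    """
    out = [row[:] for row in aug]
    last = len(aug) - 1
    out[0][a + 1] = out[a + 1][0] = 0          # x_0 collapses onto x_a
    out[last][b + 1] = out[b + 1][last] = 0    # x_{|S|+1} collapses onto x_b
    return out
```

All other entries involving the dummy patterns keep the large value $M$, so a tour leaving $x_0$ towards any pattern other than $x_a$ is still penalized.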

Another important situation takes place when we are interested in a sequence not simply optimizing the path length, but also capturing higher-level similarity relationships. More precisely, it could be interesting in some domains to obtain a sequence such that, rather than minimizing the sum of distances of each pattern to its predecessor and successor in the sequence, the sum of distances of each pattern to its $k$ immediate predecessors and successors is minimized. We have considered this approach for the gene-expression problem instances, since we found that the minimum Hamiltonian path was in this case somewhat "myopic", and did not provide a very accurate picture of the global data structure. Again, an MA can be used to obtain optimal or near-optimal solutions to this generalization of the Minimum Weight Hamiltonian Path Problem in a Complete Graph. The reader is referred to Cotta et al. (2003) for the details of this MA.
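The generalized objective can be written down directly, as in the following Python sketch. It is a literal reading of the description above (unweighted sum over the $k$ neighbours on each side); the exact objective used in Cotta et al. (2003) may differ in details such as weighting, and the function name is hypothetical.

```python
def k_neighbour_cost(order, d, k):
    """Sum, over all patterns, of the distances to the k nearest positions
    on each side of the sequence.

    Every pair of patterns within k positions of each other is counted twice,
    so k = 1 gives twice the ordinary path length.
    """
    n = len(order)
    total = 0
    for pos in range(n):
        for off in range(1, k + 1):
            if pos - off >= 0:
                total += d[order[pos]][order[pos - off]]
            if pos + off < n:
                total += d[order[pos]][order[pos + off]]
    return total
```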

4. Materials and methods

In this section, we describe the origin and structure of the problem instances used in the experimentation, as well as the algorithmic tools utilized. As mentioned in Section 1, the problem of hierarchical clustering has been considered in two contexts: phylogenetic inference and analysis of microarray data. Given the different nature of both contexts, a separate description of the materials used in either case is provided. We finally include a section depicting the algorithms used for clustering the data.

4.1. Distance matrices from whole mitochondrial genomes

Phylogenies have been inferred from the mitochondrial DNA of a number of species. As in any distance-based phylogenetic method, an appropriate definition of distance is required. In this case, a recently proposed definition of distance between pairs of DNA sequences has been used. This distance measure is based on the conditional Kolmogorov complexity, and was first proposed in Chen et al. (1999) (see also Li et al., 2001), in agreement with the successful application of Kolmogorov complexity in other areas (Li and Vitányi, 1997; Bennett et al., 2003).

This distance measure can be thought of as the ultimate lower bound of all measures of information. Roughly speaking, this measure tries to capture the amount of information about a sequence provided by another sequence (or alternatively the amount of randomness in the former that is eliminated when the latter is known). It can be shown that this distance measure is generic, in the sense that it can reflect the same properties of similarity/dissimilarity among sequences as those of any "reasonable" distance measure. See Appendix C for a more formal characterization and other interesting properties of the metric.

Although Kolmogorov complexity cannot be computed in the general case (Li and Vitányi, 1997), Chen et al. (2000) recently showed compelling evidence that, for DNA sequences, a compression program they developed called GenCompress (available from http://www.cs.ucsb.edu/~mli/Bioinf/software/) can give a good heuristic approximation of the proposed distance (a glimpse of this algorithm's functioning is provided in Appendix C). It has been reported that it is capable of compressing the complete, 34 megabasepair (Mbp), human chromosome 22 by 12% in one day. It is currently the best compression program known for DNA sequences, giving the best ratios in some benchmark DNA studies. Using this precise approach, Li et al. (2001) have computed and made available two problem instances in the public domain, and have made available a third one for the purpose of this research (see Appendix C for online access to these instances). They respectively comprise 20, 34 and 84 taxa.

The first database (20mammals) was computed on the basis of the coding regions from mtDNAs of 20 species of mammals taken from Cao et al. (1998). The second database (34mammals) is derived from Reyes et al. (2000), and extends the previous one by considering 19 taxa from the former as well as 15 new mammals. Finally, the third database (84animals) includes 84 animals, both vertebrate and invertebrate. The interested reader may check Appendix B for the detailed list of taxa included in each of these instances.

4.2. Distance matrices from gene expression data

DNA microarrays consist of large numbers of DNA molecules spotted in a systematic order on a solid substrate, each of these spots being very small (usually less than 250 μm). A microarray experiment consists of exposing these DNA spots to complementary DNA (cDNA) obtained from messenger RNA (mRNA). mRNA molecules are the transcription of DNA molecules as mentioned before, and hence are indicators of which genes are active (expressed). These mRNA molecules are obtained from a particular cell, and marked with fluorescent dye visible under a microscope (usually red and green for a target and a reference sample, respectively). Due to the complementary nature of DNA and cDNA molecules, they couple together by means of base-pairing. After this process, the microarray is scanned, measuring the red/green fluorescence of each spot. This fluorescence value indicates the level of expression of the corresponding gene with respect to each sample.

The result of a microarray experiment can thus be expressed as a matrix $G = \{g_{ij}\}_{i=1,\ldots,n;\ j=1,\ldots,m}$, where $n$ is the number of genes, and $m$ is the number of experiments per gene. These experiments correspond to the state of the gene under different conditions or at different time points. A notion of "distance" among genes is needed in order to construct the associated distance matrix. A number of distance measures have been proposed in the literature. For example, centered Pearson correlation has been used by Tsai et al. (2002). Other correlation measures such as Kendall's τ correlation or Spearman rank correlation (Lehmann and D'Abrera, 1998) can be used as well. In this work, an alternative and simpler distance metric has been considered: the Euclidean distance. As usual, this metric is defined as

$$d(g_i, g_j) = \sqrt{\sum_{k=1}^{m} (g_{ik} - g_{jk})^2} \tag{2}$$

where $d(g, g')$ is the distance between genes $g$ and $g'$.

Using this distance measure, three problem instances were constructed taking real data from the literature. They range from 106 up to 517 genes, thus constituting a good test bed that allowed studying the scalability of the algorithms while keeping the computational cost at an acceptable level given the extensive computational experimentation performed.
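For concreteness, the construction of the distance matrix from an expression matrix according to Eq. (2) can be sketched in Python as follows; the function name and the list-of-rows input format are assumptions made for illustration.

```python
from math import sqrt

def euclidean_matrix(G):
    """Pairwise Euclidean distances between the rows (genes) of G.

    G is a list of n rows, each holding the m expression values of one gene.
    """
    n = len(G)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist = sqrt(sum((a - b) ** 2 for a, b in zip(G[i], G[j])))
            D[i][j] = D[j][i] = dist   # the matrix is symmetric
    return D
```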

The first dataset (Herpes) is taken from Jenner et al. (2001). It comprises expression levels for 106 genes (21 experiments per gene). These data were used to describe Kaposi's sarcoma-associated herpes virus gene expression during latency and after the induction of lytic replication. The second dataset (Lymphoma) is taken from Alizadeh et al. (2001) and comprises gene expression levels for 380 genes (19 experiments per gene). These correspond to selectively expressed genes in diffuse large B-cell lymphoma. The third dataset (Fibroblast) is taken from Iyer et al. (1999). It comprises expression levels for 517 genes (18 experiments per gene), corresponding to the response of human fibroblasts to serum.

4.3. Description of the algorithms

Experimentation was conducted with both exact and heuristic algorithms. Both approaches rely on the use of a quality criterion inducing a total order in the space of all trees. In this case, this criterion is the ultrametric model described in Section 2.

First of all, a branch-and-bound (BnB) algorithm has been considered. This algorithm has been implemented following the description of Wu et al. (1999). More precisely, the BnB algorithm starts with an initial partial solution in which only two leaves are present: the maximally distant patterns. Subsequently, each open subproblem is branched by inserting a certain pattern in every possible position of the tree. The precise pattern that is inserted in all these positions is fixed and determined by a so-called maxmin sequence $S$. Such a sequence is calculated as follows: let $S_i$ be the $i$th element in the sequence; then the following properties hold:

$$d(S_1, S_2) = \max(M) \tag{3}$$

$$\min_{k<i}\{d(S_i, S_k)\} \ge \min_{k<i<j}\{d(S_j, S_k)\} \tag{4}$$

that is, the $i$th element of the sequence is the one that maximizes its minimum distance to any of its predecessors. The use of this particular sequence is related to the optimistic evaluation function used to prune non-promising branches (see Wu et al., 1999 for details), and can be shown to minimize the number of iterated subproblems for finding the optimal solution.
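A greedy construction satisfying Eqs. (3) and (4) can be sketched in Python as follows. Tie-breaking by smallest index is an arbitrary choice made here, and the function name is illustrative.

```python
def maxmin_sequence(d):
    """Order the patterns so that each one maximizes its minimum distance
    to the patterns already placed (farthest-first ordering)."""
    n = len(d)
    # Eq. (3): start with the two maximally distant patterns.
    first, second = max(
        ((a, b) for a in range(n) for b in range(a + 1, n)),
        key=lambda pair: d[pair[0]][pair[1]],
    )
    seq = [first, second]
    rest = set(range(n)) - {first, second}
    while rest:
        # Eq. (4): next comes the pattern whose nearest predecessor is farthest.
        nxt = max(sorted(rest), key=lambda x: min(d[x][s] for s in seq))
        seq.append(nxt)
        rest.remove(nxt)
    return seq
```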

It is important to note that the branching factor increases linearly with the depth in the search tree of the subproblem being evaluated. This contrasts with the behavior of BnB in other classical combinatorial problems in which the branching factor is bounded by a constant, and constitutes one of the limiting factors in the application of this algorithm. For example, this approach can hardly handle problem instances with a few tens of taxa.

Two heuristic algorithms were considered, an evolutionary algorithm (EA), and an MA. The EA contained a population of 100 individuals, with a new individual generated in each generation and inserted in the population by substituting for the worst one. Such an EA is said to use a steady-state evolution model (Whitley, 1989). Steady-state EAs are known to converge at a faster rate than generational EAs (i.e. EAs in which the whole population is replaced in each generation), since good individuals are promptly available for reproduction. By contrast, in a generational EA such individuals would not breed until a new population is constructed. See also Rogers and Prugel-Bennett (1999).

The parents of the new individual created in each evolutionary step are selected using tournament selection (Bickle and Thiele, 1995). The size of the tournament is an important parameter that allows exerting a tunable selective pressure on the population: the larger the tournament size, the stronger the selective pressure, and subsequently the faster the population converges (see Bäck, 1995). Of course, too large a tournament size may lead to extremely fast convergence of the population toward a suboptimal solution, while too small a tournament size results in slow convergence (and hence in poor solutions after the allotted number of evaluations, as some preliminary tests have indicated). For this reason, this parameter has to be carefully selected. The precise values considered in this work are shown in Table 1. As can be seen, larger problems have been assigned a larger tournament size. This allows faster convergence within the limited computational time allotted to each run of the algorithm. Additionally, the larger problem instances have been assigned a higher number of iterations. Table 1 shows the numerical values considered for the four larger problem instances.
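The steady-state scheme with tournament selection can be illustrated by the following Python sketch. Fitness is minimized here, in line with the tree-weight criterion. Sampling contestants without replacement and replacing the worst individual unconditionally are assumptions of this sketch, and `breed` stands for whatever recombination and mutation pipeline is used.

```python
import random

def tournament(population, fitness, size, rng=random):
    """Pick `size` individuals at random and return the best (lowest fitness)."""
    contestants = rng.sample(population, size)
    return min(contestants, key=fitness)

def steady_state_step(population, fitness, breed, size, rng=random):
    """One steady-state generation: a single offspring replaces the worst individual."""
    mother = tournament(population, fitness, size, rng)
    father = tournament(population, fitness, size, rng)
    child = breed(mother, father)
    worst = max(range(len(population)), key=lambda i: fitness(population[i]))
    population[worst] = child
    return child
```

With a tournament as large as the population, the best individual is always selected, which corresponds to the extreme selective pressure mentioned above.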

Each individual in the population represents a complete tree, internally encoded using a preorder traversal. This means that ad hoc reproductive operators have to be used to deal with this representation. Recombination is accomplished using the PDG (Prune–Delete–Graft) operator described in Cotta and Moscato (2002) in the context of phylogenetic inference. Let $T_1$ and $T_2$ denote the trees being recombined. PDG works as follows:

1. Prune a subtree $T'$ from $T_1$.

2. Delete from $T_2$ all leaves occurring in $T'$.

3. Graft $T'$ at a randomly selected point of $T_2$.
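The three steps can be sketched in Python on trees encoded as nested 2-tuples with hashable, non-tuple leaf labels. This encoding, the uniform random choices, and all function names are assumptions of the sketch; the operator of Cotta and Moscato (2002) works on a preorder encoding and may differ in how the subtree and the grafting point are drawn.

```python
import random

def leaves(t):
    """Set of leaf labels of a nested-tuple binary tree."""
    return {t} if not isinstance(t, tuple) else leaves(t[0]) | leaves(t[1])

def subtrees(t):
    """All subtrees of t, starting with t itself and including single leaves."""
    if not isinstance(t, tuple):
        return [t]
    return [t] + subtrees(t[0]) + subtrees(t[1])

def delete_leaves(t, doomed):
    """Remove the given leaves, collapsing internal nodes left with one child."""
    if not isinstance(t, tuple):
        return None if t in doomed else t
    left, right = delete_leaves(t[0], doomed), delete_leaves(t[1], doomed)
    if left is None:
        return right
    if right is None:
        return left
    return (left, right)

def graft(t, sub, target):
    """Replace the subtree `target` of t by a new node joining target and sub."""
    if t == target:
        return (t, sub)
    if not isinstance(t, tuple):
        return t
    return (graft(t[0], sub, target), graft(t[1], sub, target))

def pdg(t1, t2, rng=random):
    """Prune-Delete-Graft recombination of two trees over the same leaf set."""
    pruned = rng.choice(subtrees(t1)[1:])          # 1. prune a proper subtree of t1
    rest = delete_leaves(t2, leaves(pruned))       # 2. delete its leaves from t2
    return graft(rest, pruned, rng.choice(subtrees(rest)))  # 3. graft at random
```

Because leaf labels are unique, every subtree is identified by its content, so the grafting point is unambiguous; the offspring always contains each leaf exactly once.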

The mutation operator was a variant of the Scramble Subtree operator that selected a subtree $T'$ in the individual $T$, and rearranged its topology at random. The probability of selecting a particular subtree within the individual was proportional to $|L(T)| - |L(T')|$, where $L(T)$ was the set of leaves in $T$ (the probabilities were adjusted so as to never select a subtree with two leaves, which would remain unchanged after mutation). Thus, small changes are more frequently performed than large changes. The probability of applying each reproductive operator is 0.9 for recombination, and $1/n$ for mutation, where $n$ is the number of elements to be clustered. These are typical values for which no fine tuning has been attempted.

The MA differs from the EA described above in that a local improvement strategy is included. This strategy is based on performing rotations within the tree. For instance, a Rotation-Right-1 operation moves $T_{LR}$, the right subtree of the left subtree of $T'$, to the right so it becomes the left subtree of the right subtree of $T'$ (Fig. 1). A Rotation-Right-2 operation would have performed the same movement on $T_{LL}$ (and analogously Rotation-Left-1 and Rotation-Left-2). For each interior node of the tree, it is first checked whether a Rotation-Right movement is possible, and if so, whether Rotation-Right-1 or Rotation-Right-2 produces an improvement. If this were the case, the change would be retained, and the local improvement would stop on $T'$. Otherwise, Rotation-Left movements would be attempted. If no improvement is possible either, the procedure is recursively applied to the left and right subtrees of $T'$.

Table 1
Parameters of the EA/MA depending on the problem instance

Problem              84animals   Herpes     Lymphoma    Fibroblast
Tournament size      3           3          10          10
No. of iterations    250,000     250,000    1,000,000   1,000,000

Fig. 1. Performing a Rotation-Right-1 operation inside a tree.

In order to test whether a change is satisfactory, it is not necessary to evaluate the whole tree. For example, the Rotation-Right-1 operation shown in Fig. 1 would be acceptable when

$$\max_{x,y \in L(T_{LL}) \cup L(T_{LR})} \{d(x, y)\} > \max_{x,y \in L(T_{LR}) \cup L(T_R)} \{d(x, y)\} \tag{5}$$

Fig. 2. Examples of tree/sequence consistency. The tree (a, (b, c)) is consistent with the first four sequences, but not with the last two ones.

These quantities are computed during the evaluation of the tree, and hence are available for performing this check. Furthermore, the MA only applies this local improvement procedure to the current incumbent solution when it is first generated. Thus, the computational overhead is negligible.

A tree is defined as consistent with a certain sequence if, and only if, its leaves can be arranged in the order indicated in the sequence by applying an appropriate number of flipping operations (swapping the right and left branches of an internal node). Notice that any hierarchical clustering remains the same regardless of these operations, since the genealogy of no node is altered, i.e. these operations only affect the layout of the tree, and do not change ancestry/brotherhood relationships. As an example, the tree (a, (b, c)) is consistent with sequences abc, acb, bca, and cba, but not with bac or cab. For the latter two sequences, any drawing of the tree would have crossing lines: if $r$ is the root of the tree, and $h$ is the internal node parent of $b$ and $c$, line $ra$ would cross with either line $hb$ or line $hc$ (see Fig. 2).
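Consistency can be tested without enumerating flips: a binary tree is consistent with a sequence exactly when the leaves of every subtree occupy a contiguous block of the sequence. The following Python sketch implements this check for trees given as nested tuples; the encoding and the function name are illustrative, and the sequence is assumed to contain exactly the leaves of the tree.

```python
def is_consistent(tree, sequence):
    """True if the leaves of `tree` can be laid out in the order of `sequence`
    using only flips of internal nodes."""
    pos = {leaf: i for i, leaf in enumerate(sequence)}

    def span(node):
        # Returns (lowest position, highest position, leaf count), or None if
        # some subtree does not occupy a contiguous block of the sequence.
        if not isinstance(node, tuple):
            return pos[node], pos[node], 1
        parts = [span(child) for child in node]
        if any(part is None for part in parts):
            return None
        lo = min(part[0] for part in parts)
        hi = max(part[1] for part in parts)
        count = sum(part[2] for part in parts)
        return (lo, hi, count) if hi - lo + 1 == count else None

    return span(tree) is not None
```

Applied to the tree (a, (b, c)), the check accepts abc, acb, bca and cba and rejects bac and cab, matching the example above.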

5. Experimental results

We first considered the two smaller matrices corresponding to the mtDNA of mammals, since their sizes allowed using the BnB algorithm for finding the optimal tree. The results are reported in Section 5.1. Subsequently, the remaining problem instances were approached using EAs and MAs. The outcome of the experimentation is detailed in Section 5.2.

5.1. Results on matrices generated from mtDNA sequences of mammals

The 20mammals data focused on the debated question of which two of the three main groups of placental mammals are more closely related: Primates, Ferungulates, and Rodents. By the maximum likelihood method, some protein data supports the (Ferungulates, (Primates, Rodents)) grouping while other data supports the (Rodents, (Ferungulates, Primates)) grouping. Cao et al. (1998) used members of the Marsupials and Monotremes (Opossum, Wallaroo, and Platypus) as outgroup to confirm the (Rodents, (Primates, Ferungulates)) grouping hypothesis. The same result was obtained by Li et al. (2001) when applying the Hypercleaning program of Bryant et al. (2000) to the 20mammals distance matrix. The Hypercleaning program constructed an evolutionary tree using the edges best supported by all possible four-taxa subtrees (commonly called "quartets").

Fig. 3 shows this tree, using the leaf order suggested by the MA. The number at the left of each species indicates the distance to the species immediately above it. Thus, 0.880709 is the distance between the gibbon and the gorilla, the distance of the gorilla to the pygmy chimpanzee is 0.720617, and so on. Note that the optimal Hamiltonian path found by the MA is the, possibly unique, solution that minimizes the length of a path of distances between species. There are 3 species belonging to the Marsupials and Monotremes group, 2 Rodents, 8 Ferungulates, and 7 Primates. Therefore, there are 3! × 2! × 8! × 7! = 2,438,553,600 tree topologies that specify this group order (Primates first, followed by Ferungulates, Rodents, and finally Marsupials and Monotremes), but differ in the relative order positions of the intra-group taxa. Nevertheless, the solution tree could be drawn without crossing lines. Following the terminology introduced in Section 4.3, this solution is said to be consistent with the minimum Hamiltonian path.
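The count quoted above is simply the product of the numbers of internal orderings of the four groups, which can be checked with a couple of lines of Python (the dictionary of group sizes restates the figures given in the text).

```python
from math import factorial

# Group sizes as stated in the text.
group_sizes = {"Marsupials and Monotremes": 3, "Rodents": 2,
               "Ferungulates": 8, "Primates": 7}

orderings = 1
for size in group_sizes.values():
    orderings *= factorial(size)   # independent orderings within each group
```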

Note that in the layout presented in Fig. 1 of Li et al. (2001) (paper available online at http://www.cs.ucsb.edu/~mli/) humans are located between the pygmy chimpanzee and the gorilla. Also in their layout, the gibbon is located next to the cow, while in our layout there exist eight species in between. While the reader can continue exploring the similarities and differences using this fairly clear example of higher animals, we remark on the importance of the method to uncover close relationships between organisms on which we have no subjective hints (e.g. morphological differences), such as species of viruses or bacteria (Goldberg et al., 1998), as to how to produce a layout that respects their closeness.

If only a preliminary clustering of the data is needed, or if the topology of the "true evolutionary object" is unknown but not expected to be a phylogenetic tree, such as a reticulated tree (e.g. see Doolittle, 1999), then the method shows promise regarding the computer times involved. It is known that phylogenetic methods, like the Hypercleaning algorithm, generally require substantial computer time due to their high computational complexity. The optimal path of this set of 20 species was found in 0.18 s by the MA running on an Intel-based PC at 300 MHz, using the Java programming language and the JDK 1.3. To elucidate whether this linear ordering can also be used to speed up other phylogenetic algorithms, the BnB algorithm was tested introducing the constraint that the solution sought should be consistent with the minimum Hamiltonian path.

First of all, Fig. 4 (left) shows the fitness distribution in the subset of the search space composed of all trees consistent with the Hamiltonian path. As can be seen, the distribution is shifted to the left, thus indicating that this subset has a mean fitness value above the global average (i.e. a lower mean tree weight), so it is a promising region. Fig. 5 shows the common optimal solution provided by the BnB algorithm for both the unconstrained and the constrained problem instance. Two facts are worth mentioning here. First of all, the unconstrained BnB had to iterate over 4078 subproblems before finding



Fig. 3. Evolutionary tree for 20 species of mammals found with the Hypercleaning program of Bryant et al. (2000). It differs from Fig. 1 presented in Li et al. (2001) in that the leaves of the tree have been ordered following the minimum length Hamiltonian path algorithm, thus showing other strong correlations in the mtDNA distance matrix. The number associated with each clade indicates the percentage of quartets supporting the grouping according to the Hypercleaning algorithm, and can be interpreted similarly to bootstrap values.

the optimal solution; the constrained BnB only iterated over 565 subproblems, i.e. a speed-up of 7.22 was achieved. Secondly, and quite interestingly, the optimal solution provided by the BnB algorithm differs from the Hypercleaning solution, suggesting a (Primates, (Rodents, Ferungulates)) association.

A similar analysis was conducted on the 34mammals instance. This extended set contains the “phylogenetically controversial” guinea pig, whose proper position is the subject of debate in systematic biology (Graur et al., 1991; D’Erchia et al., 1996; Cao et al., 1997; Sullivan and Swofford, 1997; Reyes et al., 1998). Li et al. (2001) analyzed this instance with the Neighbor-Joining and Hypercleaning programs mentioned before. This gives as a result two different phylogenies, whose consensus agrees, to a certain extent, with the overall structure of the phylogeny presented by Reyes et al. (2000).

The optimal path finding algorithm was first run as before for the 20 species case. The resulting optimal path was found in 0.5 s. It starts (or ends, since it is undirected) with the baboon, a newly introduced species in this extended set, followed by the other Primates in exactly the same order as before, ending with the orangutan, and with the exception of the sumatran orangutan, which was not included in this set. The next species in the path is the controversial guinea pig, followed by the platypus, wallaroo, and opossum,


Fig. 4. Fitness distribution in the space of all phylogenetic trees (D1), and in the space of phylogenetic trees consistent with the minimum Hamiltonian path (D2), as measured by sampling 100,000 random trees. Left: 20mammals; right: 34mammals.


Fig. 5. Optimal ultrametric tree provided by the BnB for the 20mammals instance.

in that order. As a consequence, since the Marsupials and the Monotremes are expected to be an outgroup, a domain-knowledge-augmented version of the program was run, in which the initial and final species in the taxa sequence were specified.

The following heuristic was created to find the first and final elements of the sequence: the species that has the largest average distance to all other taxa in the set is selected as the initial point of the sequence; then the set of maximally distant taxa from this initial point is computed. This set can have cardinality higher than 1, so ties are broken by selecting from this set the one that has the largest average distance to all other sequences in the set.
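The endpoint heuristic just described can be sketched in a few lines. This is only an illustration under stated assumptions: the distance matrix is a symmetric list of lists, the tie-breaking average is taken over the whole taxa set (the text leaves this slightly open), and the function name `pick_endpoints` and the toy matrix are hypothetical.

```python
def pick_endpoints(dist):
    """Choose the initial and final taxa of the path from a symmetric
    distance matrix given as a list of lists (dist[i][i] == 0)."""
    n = len(dist)

    def mean_distance(i):
        # Average distance from taxon i to every other taxon.
        return sum(dist[i][j] for j in range(n) if j != i) / (n - 1)

    # Initial taxon: largest average distance to all other taxa.
    first = max(range(n), key=mean_distance)

    # Candidates for the final taxon: those maximally distant from `first`.
    others = [j for j in range(n) if j != first]
    farthest = max(dist[first][j] for j in others)
    tied = [j for j in others if dist[first][j] == farthest]

    # Break ties by the largest average distance (assumed here to be
    # measured against the whole taxa set).
    last = max(tied, key=mean_distance)
    return first, last


# Toy 4-taxon matrix (made-up values, not taken from the paper).
toy = [
    [0.0, 0.9, 0.9, 0.8],
    [0.9, 0.0, 0.3, 0.4],
    [0.9, 0.3, 0.0, 0.6],
    [0.8, 0.4, 0.6, 0.0],
]
print(pick_endpoints(toy))
```

In the toy matrix, taxon 0 has the largest average distance; taxa 1 and 2 are tied as the farthest from it, and taxon 2 wins the tie because its own average distance is larger.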

Based on the result of this heuristic, the initial taxon (x_a) was the platypus and the final one (x_b) was the baboon. We were also interested in the actual numerical value of the optimal path of the constrained problem in comparison with the previous one. If the values are numerically close (recall that the distance measure uses an approximation to the actual Kolmogorov complexity), this alternative solution might make much more sense from a biological point of view (after all, the Marsupials and Monotremes are still expected to be an outgroup).

The path obtained from our method and the result of the consensus of the Neighbor-Joining and Hypercleaning programs of Li et al. (2001) are both depicted in Fig. 6. This path had a length of 28.535927, while the unconstrained problem had an optimal path of length 28.535572. Since the relative gap between the solutions is approximately 1.227 × 10⁻⁵, this ordering seems to be a perfectly reasonable alternative, and it makes more sense from a biological perspective.

The relative order of the sequences from the Primates is maintained, and in this case the guinea pig is again next to the orangutan. The dog is next in the order, so the guinea pig groups rather far (regarding the linear relative order of the path) from the murid and the nonmurid rodents, as was also found by Li et al. (2001) using the Hypercleaning algorithm. The dog, the harbor seal, the grey seal, and the cat appear in


Fig. 6. Evolutionary consensus tree of the Neighbor-Joining and Hypercleaning algorithms for 34 mammalian species. Taxa are rearranged following an optimal Hamiltonian path starting from platypus and ending with baboon, of total length 28.535927. The tree topology is not consistent with this taxon ordering, and hence there are crossing lines (e.g. see the guinea pig branch). The number associated with each clade indicates the percentage of quartets supporting the grouping according to the Hypercleaning algorithm, and can be interpreted similarly to bootstrap values.


that order, providing a carnivore outgroup of the ferungulates and giving additional support to Graur et al. (1997). The pig acts as a kind of “stepping stone” after the two rhinos and the two whales, but note that the phylogenetic tree methods of Li et al. (2001) have managed to group it with the perissodactyls rather than with the cetartiodactyls. Its “central” position in this path order within the ferungulates appears to be more intuitive than the layout of Fig. 2 in Li et al. (2001), where it appears between the donkey and the fruit bat.

The BnB algorithm was also tested on this problem. The size of this instance allowed it to be solved to optimality, but precluded a Least-Cost policy for managing the subproblem queue. A LIFO policy (depth-first) was used instead. Fig. 7 (top) shows the optimal solution, with a value of 14.855763. This solution is not consistent with the minimum length Hamiltonian path though, despite the fact that consistent solutions are significantly better on average (see Fig. 4, right). The constrained BnB provided the solution of quality 14.857979 shown in Fig. 7 (bottom). The relative gap of this solution is just 1.492 × 10⁻⁴, and hence it is also biologically plausible. It must also be noted that this solution was obtained in a matter of seconds after iterating over 427,559 subproblems. Actually, it was found in the 11,321st iteration. To establish a comparison, the unrestricted BnB needed 28,093,118 iterations to reach a solution at least as good as this one.
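As a quick check of these figures, the relative gap can be recomputed from the two tree weights, assuming it is defined as the excess weight of the constrained solution divided by the optimal weight (this definition is our assumption; the text does not spell it out). The ratio of the two iteration counts is obtained in the same way:

```latex
\[
\frac{14.857979 - 14.855763}{14.855763}
  = \frac{0.002216}{14.855763}
  \approx 1.492 \times 10^{-4},
\qquad
\frac{28{,}093{,}118}{11{,}321} \approx 2.5 \times 10^{3}.
\]
```

The second ratio indicates that the constrained search first reached this solution in roughly 2500 times fewer iterations than the unrestricted search needed to match it.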

Some remarks can be made on the biological implications of these results. Firstly, the (Primates, (Rodents, Ferungulates)) association appears in both solutions, as it did in the 20mammals problem instance, suggesting that this association is likely to be correct.

Another interesting finding arises in connection with the results presented by Li et al. (2001). The reader is first referred to Fig. 6, which displays the consensus tree of the Neighbor-Joining and Hypercleaning algorithms for the 34mammals instance (Fig. 2 of Li et al., 2001). The tree here has its leaves ordered according to the minimum length Hamiltonian path starting with the baboon and ending with the platypus. Note that the ((hippo, (blue whale, finback whale)), (cow, sheep)) association in this tree differs from the ((blue whale, finback whale), (hippo, (cow, sheep))) association made by the minimum weight ultrametric tree. This touches on the delicate issue of how whales evolved during their 10-million-year transition from land to water and who their ancestors were. This has been the subject of debate during the previous decades (e.g. see Shimamura et al., 1997; Gingerich et al., 2001; Thewissen et al., 2001).

Table 2. Results (tree-weight) of the basic EA and MA (averaged over 10 runs) on the different problem instances considered

Problem      EA-RP                         MA-RP
             Best       Mean ± S.D.        Best       Mean ± S.D.
84animals    3801.391   3810.686 ± 6.902   3801.408   3801.763 ± 0.209
Herpes        351.474    356.462 ± 2.455    347.556    354.076 ± 4.316
Lymphoma     1211.373   1230.061 ± 9.005   1187.679   1199.339 ± 7.518
Fibroblast    673.761    684.207 ± 5.042    653.724    662.041 ± 5.337

Table 3. Results (tree-weight) of the complete-link agglomerative clustering algorithm on the different problem instances considered

Problem         84animals   Herpes    Lymphoma   Fibroblast
Complete-link   3803.225    364.547   1216.533   665.166

5.2. Results on larger data matrices

EA and MA were used on problems that could not be solved using BnB. First of all, canonical versions of both algorithms were applied to these problem instances, using the parameters described in Section 4.3. The results are shown in Table 2. As a reference, the results provided by the complete-link agglomerative algorithm are shown in Table 3.

The results in Table 2 were moderately good: the best solutions provided by either algorithm were better than those of the complete-link, and so were the mean solutions in most cases, especially in the case of the MA. However, there was still room for improvement. This improvement came from the utilization within the


Fig. 7. Top: Optimal ultrametric tree provided by the BnB for the 34mammals instance. Bottom: Optimal ultrametric tree consistent with the minimum length Hamiltonian path for the 34mammals instance.


Table 4. Results (tree-weight) of the heuristic-path-constrained EA and MA (averaged over 10 runs) on the different problem instances considered

Problem EA-HPC MA-HPC



EA/MA of the information contained in the heuristic paths for each instance.

Fig. 8 shows the fitness distribution in the unconstrained search space (D1), and in the space of trees consistent with the heuristic path (D2). Unlike the case of the phylogeny data discussed in the previous subsection, the shapes of the two distributions are here more similar (in terms of standard deviation). Nevertheless, D2 is again shifted to the left. Thus, it has a higher mean fitness (i.e. a lower mean tree weight), indicating the potential usefulness of exploring this region of the search space.

We considered two ways of carrying out this exploration. On the one hand, it is possible to have the EA/MA confined within this region of the search space (i.e. all solutions generated will have to be consistent with the heuristic path). This can be ensured by enforcing a particular leaf ordering after applying the reproductive operators. On the other hand, this region can be considered as a good starting point, but the algorithm can be given freedom to escape from it. In that case, only the initial population is made to belong to this region, and no constraint is enforced from that point on. We will refer to these approaches as heuristic-path constrained (HPC) and heuristic-path initialized (HPI), respectively. The results for both approaches are shown in Tables 4 and 5.

Table 5. Results (tree-weight) of the heuristic-path-initialized EA and MA (averaged over 10 runs) on the different problem instances considered

Problem EA-HPI MA-HPI

Notice that the results of the HPI approach were better than those of the HPC approach. This is coherent with the fact that the optimal solution is not necessarily contained in the HPC region, as shown in the previous section for the 34mammals instance. This region is certainly promising, but it is reasonable to think that good solutions can also be located just outside the boundaries of the region, exploiting part of the information contained in the heuristic path while also being enhanced by the use of other information. This fact can be clearly visualized in Fig. 9. The initial convergence of the HPC approaches was much better than that of the HPI algorithms. However, the search was exhausted after a number of iterations, and the algorithm stagnated: the HPC region could not provide any more useful information. By contrast, the HPI algorithms maintained a steady level of progress, outperforming HPC algorithms in the medium term. For this reason, even the basic approaches could outperform HPC algorithms in the long term.

Notice also that the absolute quality of the HPI solutions is notably better than that of both the complete-link solutions and those of the basic approaches. This can be appreciated not only numerically but also visually. Fig. 10 shows the clustering provided by the complete-link algorithm, and the best solution provided by the MA-HPI. As usual, the representation of the clustering was accomplished by having the branch whose leaves have the highest mean expression level on top of the figure at each level of the tree. It can be


Fig. 8. Fitness distribution in the space of all hierarchical clustering trees (D1), and in the space of trees consistent with the heuristic path (D2), as measured by sampling 100,000 random trees. Left: Herpes; right: Lymphoma.


Fig. 9. Evolution of fitness for the different algorithms considered. Left: Lymphoma; right: Fibroblast.


Fig. 10. Hierarchical clustering of the Herpes dataset as calculated by the complete-link algorithm (top), and the MA (bottom).

seen that the MA solution is visually smoother, indicating that genes with similar expression levels have been clustered together. The complete-link solution is not as smooth (notice, for example, the dark band in the middle of the figure).

6. Conclusions

The main objective of this work was to show that currently available, high-performance computer codes for a problem of high computational complexity, the Minimum Weight Hamiltonian Path, can be used to provide new methodologies for life sciences researchers. Two fields of application have served to illustrate some of the benefits, one in the area of phylogenetic inference, and another in the field of microarray data analysis.

Starting with the former field, the proposed approach has yielded in some cases an increase in speed when used as a preprocessing step for hierarchical clustering. More research on this issue is needed. The preliminary experiments we have conducted also show that this type of “clusterization along an optimal path” approach can be fruitful in other fields (Martin, 1999). In addition, the idea of using an optimal path as an aid for the visualization of data appears to be timely, since new methods are required to help with some old questions (Doolittle, 1999; Pennisi, 1998, 1999). It can also be used when there are no evolutionary models that can be trusted or that have a well-established consensus (e.g. inferring computer virus phylogenies; see, for instance, Goldberg et al., 1998) as well as when individual gene trees do not agree (e.g. Cao et al., 1998). In this sense, we would like to note, en passant, that it would be very interesting to formalize new multi-objective combinatorial optimization problems (and also to develop appropriate algorithms) that would take into account the variety of distance definitions (e.g. gene order: Boore and Brown, 1998; gene content: Fitz-Gibbon and House, 1999; Snel et al., 1999; reversal and rearrangement distances: Hannenhalli and Pevzner, 1995; Kececioglu and Sankoff, 1995; Nadeau and Sankoff, 1998, etc.). Finding the Pareto-optimal frontier would then make much sense for the life sciences researcher, instead of embracing several a priori debates about the benefits and drawbacks of alternative distance definitions.

Also, regarding phylogenetic inference, the Hamiltonian path may correlate with another type of information, this time epidemiological. It is very rare for some types of viruses to cross the species barrier (e.g. Meng, 1998). One of these is the European Bat Lyssavirus (EBL), a rabies-like virus, which infects insectivorous bats in Europe (mostly in Denmark, The Netherlands and Germany). This virus has been reported to infect terrestrial animal species, e.g. sheep (see, for instance, http://www.who-rabies-bulletin.org/q3_2001/4.1.html). In this sense, it can be seen in Fig. 7 (bottom) that the fruit bat is adjacent to the sheep and the rabbit in the Minimum Weight Hamiltonian Path. It will be an interesting future exercise in computer-aided discovery to investigate the possible correlation between these new definitions of distances between species, the associated proximity graphs thus generated, and the probability of viruses crossing the species barrier.

Moving to the second application field, it is paradoxical that while obtaining relevant results on gene expression data using microarray technologies is an expensive and time-demanding task, most of the hierarchical clustering algorithms used in the literature are very simple heuristics. We believe that this situation must change, and that increasingly complex algorithms must be developed to attack the basic NP-hard clustering problems arising in this area. Work is in progress in this area (see, for example, Merz, 2002 in the context of partitional clustering).

We have presented results on the distribution of values of the objective function for three real-world data sets of gene expression levels obtained from microarray experiments. It is clear from Fig. 8 that the heuristic path provides an informative hint about the structure of good hierarchical clusters. As such, it can be very useful to speed up an evolutionary algorithm as well as, possibly, an exact method.

A lesson learned from the combined application of this methodology to two distant problems in the life sciences is that the method seems to be robust at different scales. It provides a good hierarchical clustering for closely related patterns as well as reasonable hierarchies at higher levels. In the field of phylogenetic inference, this contrasts with other approaches, such as the Bayesian phylogenetic methods that use RNA substitution models. This approach, as recognized by Jow et al. (2002), fails for closely related species. We cite from this paper, in their Results section, page 1595: “For example, within the primates there is a posterior probability of 83% attached to the sister relationship of the Gorilla and Human, whereas the more usually accepted arrangement which places chimps as sister group to humans has only 12% support. We attach little significance to this, as our method is not ideal for very related species.” In contrast, the method proposed here does not depend on a consensus of aligned sites, and it is fully automatable, requiring only the complete DNA sequence, which makes it a valuable tool in other problem domains. Another practical advantage is that the method is highly parallelizable and can be implemented on high-performance distributed computing systems, making it possible to prove the optimality of larger hierarchical clustering problem instances.

We are currently analyzing the results provided by the hierarchical clustering of the three microarray data sets. It is clear thus far that a comprehensive new analysis needs to be carried out, since the clusters have differences and similarities with the ones already known. In particular, it would be interesting to elucidate whether there exists a relationship between the hierarchical structure and the dynamics, such as the temporal response of fibroblasts to serum.

Acknowledgements

We thank Prof. W. Cook for running his TSP program Concorde to verify the optimality of the solutions we presented in this paper, and L. Buriol for her cooperation at an earlier stage of this project. We also thank Prof. M. Li and J. Badger for making the 84animals instance available to us, and for answering several questions on their research. P. Moscato would like to thank here, although with some delay, M. Nei and N. Saitou for having sent him (back in 1989) a copy of their Neighbor-Joining paper (Saitou and Nei, 1987). Finally, we thank the reviewers for their useful comments and suggestions. C. Cotta is partially supported by Spanish MCyT and FEDER under contract TIC2002-04498-C05-02.

Appendix A. Online supplement

An online supplement containing color versions of the figures, as well as pointers to the data used in the experimentation, has been arranged. The URL is http://www.lcc.uma.es/~ccottap/biosystems03.

Appendix B. Taxa included in the phylogeny benchmark

The first database (20mammals) comprises: blue whale (Balaenoptera musculus), cat (Felis catus), chimpanzee (Pan troglodytes), cow (Bos taurus), finback whale (Balaenoptera physalus), gibbon (Hylobates lar), grey seal (Halichoerus grypus), gorilla (Gorilla gorilla), harbor seal (Phoca vitulina), house mouse (Mus musculus), horse (Equus caballus), human (Homo sapiens), opossum (Didelphis virginiana), orangutan (Pongo pygmaeus), platypus (Ornithorhynchus anatinus), pygmy chimpanzee (Pan paniscus), rat (Rattus norvegicus), sumatran orangutan (Pongo pygmaeus abelii), wallaroo (Macropus robustus), and white rhino (Ceratotherium simum).

The second database (34mammals) includes all species in 20mammals except the sumatran orangutan, plus the following ones: aardvark (Orycteropus afer), armadillo (Dasypus novemcinctus), baboon (Papio hamadryas), dog (Canis familiaris), donkey (Equus asinus), dormouse (Glis glis), elephant (Loxodonta africana), fruit bat (Artibeus jamaicensis), great rhino (Rhinoceros unicornis), guinea pig (Cavia porcellus), hippo (Hippopotamus amphibius), pig (Sus scrofa), rabbit (Oryctolagus cuniculus), sheep (Ovis aries), and squirrel (Sciurus vulgaris).

Finally, the third database (84animals) includes the following 84 animals: aardvark, alligator, armadillo, arctic trout, baboon, bichir, black chiton, black urchin, blue whale, brine shrimp, broadbill, brook trout, carp, cat, cat shark, chicken, chimpanzee, clam worm, cod, colubrid snake, cow, dog, dog tick, dogfish, donkey, door snail, dormouse, earthworm, elephant, featherstar, finback whale, fruit bat, two species of fruit fly, gibbon, goldfish, gorilla, great rhino, grey seal, guinea pig, gummy shark, harbor seal, hedgehog, hedgehog tick, helmet turtle, hippo, honey bee, horse, house mouse, human, lamp shell, loach, locust, medfly, two species of mosquito, opossum, orangutan, ostrich, peregrine falcon, pig, platypus, purple urchin, pygmy chimpanzee, rabbit, rainbow trout, rat, redhead duck, rhea, rook, salmon, sea turtle, sea urchin, sheep, skate, skink, starfish, sumatran orangutan, turtle, wallaroo, water flea, white rhino, widow finch, and wood snail.

Appendix C. A distance measure based on Kolmogorov complexity

For the sake of completeness, this appendix provides some additional details about the Kolmogorov-complexity-based distance measure mentioned in Section 4.1.



Given two sequences x, y ∈ Σ*, the conditional Kolmogorov complexity K(x|y) is defined to be the length of the shortest program causing a universal computer to output x given input y. We will denote by K(x), for any given sequence x, the conditional Kolmogorov complexity of x when the input is the empty string ε, i.e. K(x) = K(x|ε). The concatenation of two strings x and y is denoted as xy. The value of K(x) − K(x|y) is understood as the amount of information that sequence y “knows” about sequence x, and K(x|y) can be interpreted as the amount of randomness of sequence x given sequence y as input. For this reason, K(x|y) is also known as the algorithmic entropy of x given y.

It can be proved (see, for instance, Li and Vitányi, 1997) that K(x) − K(x|y) ≈ K(y) − K(y|x), meaning that K(x) − K(x|y) = K(y) − K(y|x) up to an additive logarithmic term. This result is due to a theorem by Kolmogorov and Levin (see Theorem 3.9.1 in Li and Vitányi, 1997). The expression K(x) − K(x|y) is also known as the mutual algorithmic information; thus, it is tempting to use it as a measure of distance, since it is a symmetric function of its arguments (to a certain approximation). However, it does not satisfy the triangle inequality: given any triplet of sequences {x, y, z}, a proper measure of distance d must also satisfy d(x, z) ≤ d(x, y) + d(y, z).

Chen et al. (2000) have proposed the use of the following distance measure between two sequences x and y

$$ d(x, y) = 1 - \frac{K(x) - K(x \mid y)}{K(xy)} \tag{C.1} $$

since this definition handles several problems well. It has a good normalization factor (i.e. the K(xy) in the denominator). This normalization is very useful since it takes into account the length of the sequences and avoids some pitfalls of other distance measures. This measure also handles well those cases in which one sequence is much shorter than the other, as well as those in which the sequences are very different.

It is not at all obvious that this distance measure satisfies the triangle inequality; a proof of this fact is, however, given by Li et al. (2001). It is clear that the definition of Eq. (C.1) also satisfies the other requirements needed for a distance. It can also be argued that this particular distance measure is only one among an infinite number of others that can be defined ad hoc.

However, it can also be proved that if D is a general distance measure such that x and y are “close” according to D, then they are also “close” according to d. In formal terms, let us require that, for all x, the neighborhood of radius r satisfies

$$ \left|\{\, y : |y| = n \ \text{and}\ D(x, y) \le r \,\}\right| \le 2^{rn}. \tag{C.2} $$

Under this assumption, Li et al. (2001) proved the following theorem.

Universality Theorem. For any computable distance D, there is a constant c < 2 such that, with probability 1, for all sequences x and y, d(x, y) ≤ cD(x, y).

This is a powerful result, since it shows that if any given distance measure D reflects some similarity between two sequences x and y, then this is also reflected in d, provided the quite general assumption above is satisfied. On the practical side, the conditional Kolmogorov-based measure does not require a previous alignment of the sequences, and is more robust than other measures that tend to cluster sequences with high G + C content.

As mentioned in Section 4.1, the GenCompress program (Chen et al., 1999, 2000) can provide an approximation of this distance measure. GenCompress is a compression algorithm based on approximate matching. Unlike the compression algorithms developed by Ziv and Lempel (1977), it does not rely on exact repeats. Instead, it is a one-pass algorithm that proceeds as follows: let w be the sequence to compress (in our case, an mtDNA sequence); let us assume that a part of it, say v, has already been compressed; let u be the remaining part of the sequence (i.e. w = vu). The algorithm tries to find an optimal prefix p of u such that it approximately matches some substring in v. “Approximately” means here that it is not necessary for this prefix to be an exact match of a substring in v; a number of edit operations (replace, insert, delete) are allowed. Chen et al. (1999) suggest limiting the allowed number of edit operations to 3, and considering prefixes of at most 12 characters. Subsequently, p is removed from u and appended to v, and the code for p (i.e. the initial position and the length of the substring in v, plus the code of the edit operations required) is output. This procedure is repeated until u = ε. Now, the compression ratio of a string u is used as an approximation of K(u), and the compression ratio of u given v for free is used as an approximation of K(u|v).
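The way a compressor can stand in for K in Eq. (C.1) may be easier to see in a small sketch. Note the assumptions: GenCompress is not used here; Python's general-purpose `zlib` (an exact-repeat, Lempel-Ziv-style compressor of the kind the text contrasts GenCompress with) is swapped in, compressed lengths in bytes are used instead of compression ratios, and K(x|y) is estimated as the extra bytes needed to compress x after y. The function names and the random test sequences are purely illustrative.

```python
import random
import zlib


def c(data: bytes) -> int:
    """Compressed size in bytes; a crude stand-in for K(.)."""
    return len(zlib.compress(data, 9))


def distance(x: bytes, y: bytes) -> float:
    """Approximate d(x, y) = 1 - (K(x) - K(x|y)) / K(xy) (Eq. C.1)."""
    k_x = c(x)
    # K(x|y) is approximated by the extra bytes needed for x once y is known.
    k_x_given_y = c(y + x) - c(y)
    k_xy = c(x + y)
    return 1.0 - (k_x - k_x_given_y) / k_xy


# Synthetic sequences (random, not real mtDNA) for a quick sanity check.
rng = random.Random(0)
a = bytes(rng.choice(b"ACGT") for _ in range(2000))
b = bytes(rng.choice(b"ACGT") for _ in range(2000))
near_a = a[:1000] + b"T" + a[1001:]  # copy of `a` with one position overwritten

print(round(distance(a, near_a), 3), round(distance(a, b), 3))
```

With this stand-in, the near-copy should come out much closer to `a` than the unrelated sequence does. The approximation is rough: it can leave the [0, 1] range slightly and is not exactly symmetric, which is one reason a DNA-specific compressor is preferable in practice.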

References

Abu Mostafa, Y., 1990. Learning from hints in neural networks. J. Complex. 6, 192–198.

Alizadeh, A.A., et al., 2001. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503–511.

Bäck, T., 1995. Generalized convergence models for tournament- and (µ, λ)-selection. In: Eshelman, L.J. (Ed.), Proceedings of the Sixth International Conference on Genetic Algorithms. Morgan Kaufmann, San Francisco, CA, pp. 2–8.

Bandelt, H.J., 1990. Recognition of tree metrics. SIAM J. Discrete Math. 3 (1), 1–6.

Bennett, C.H., Li, M., Ma, B., 2003. Chain letters and evolutionary histories. Sci. Am. March, 32–37.

Bickle, T., Thiele, L., 1995. A mathematical analysis of tournament selection. In: Eshelman, L.J. (Ed.), Proceedings of the Sixth International Conference on Genetic Algorithms. Morgan Kaufmann, San Francisco, CA, pp. 9–16.

Boore, J.L., Brown, W.M., 1998. Big trees from little genomes: mitochondrial gene order as a phylogenetic tool. Curr. Opin. Genet. Dev. 8, 668–674.

Brown, J.R., 2003. Ancient horizontal gene transfer. Nat. Rev. Genet. 4, 121–132.

Bryant, D., Berry, V., Kearney, P., Li, M., Jiang, T., Wareham, T., Zhang, H., 2000. A practical algorithm for recovering the best supported edges of an evolutionary tree. In: Proceedings of the Eleventh Annual ACM–SIAM Symposium on Discrete Algorithms. ACM Press, San Francisco, CA, pp. 287–296.

Buriol, L.S., França, P.M., Moscato, P., 2003. A new memetic algorithm for the asymmetric traveling salesman problem. J. Heurist., in press.

Cao, Y., Okada, N., Hasegawa, M., 1997. Phylogenetic position of guinea pigs revisited. J. Mol. Evol. 14, 461–464.

Cao, Y., Janke, A., Waddell, P.J., Westerman, M., Takenaka, O., Murata, S., Okada, N., Pääbo, S., Hasegawa, M., 1998. Conflict among individual mitochondrial proteins in resolving the phylogeny of eutherian orders. J. Mol. Evol. 47, 307–322.

Chen, X., Kwong, S., Li, M., 1999. A compression algorithm for DNA sequences and its applications in genome comparisons. Genome Inform. 10, 51–61.

Chen, X., Kwong, S., Li, M., 2000. A compression algorithm for DNA sequences based on approximate matching. In: Proceedings of the Fourth Annual International Conference on Computational Molecular Biology (RECOMB). ACM Press, Tokyo, Japan, p. 107.

Cotta, C., Moscato, P., 2002. Inferring phylogenetic trees using evolutionary algorithms. In: Merelo, J.J., et al. (Eds.), Parallel Problem Solving From Nature VII, volume 2439 of Lecture Notes in Computer Science. Springer-Verlag, Berlin, pp. 720–729.

Cotta, C., Mendes, A., Garcia, V., França, P., Moscato, P., 2003. Applying memetic algorithms to the analysis of microarray data. In: Raidl, G., et al. (Eds.), Applications of Evolutionary Computing, volume 2611 of Lecture Notes in Computer Science. Springer-Verlag, Berlin, pp. 22–32.

D’Erchia, A.M., Gissi, C., Pesole, G., Saccone, C., Arnason,U., 1996. The guinea pig is not a rodent. Nature 381, 597–599.

Doolittle, W.F., 1999. Phylogenetic classification and the universaltree. Science 284 (5423), 2124–2128.

Fitz-Gibbon, S.T., House, C.H., 1999. Whole genome-basedphylogenetic analysis of free-living microorganisms. NucleicAcids Res. 27, 4218–4222.

Gingerich, P.D., Haq, M., Zalmout, I.S., Khan, I.H., Malkani,M.S., 2001. Origin of whales from early artiodactyls: hands andfeet of Eocene Protocetidae from Pakistan. Science 293, 2239–2242.

Goldberg, L.A., Goldberg, P.W., Phillips, C.A., Sorkin, G.B., 1998.Constructing computer virus phylogenies. J. Algorithms 26 (1),188–208.

Graur, D., Gouy, M., Duret, L., 1997. Evolutionary afinities ofthe order Perissodactyla and the phylogenetic status of thesuperordinal taxa Ungulata and Altungulata. Mol. Phylogenet.Evol. 7, 195–200.

Graur, D., Hide, W.A., Li, W.-H., 1991. Is the guinea pig a rodent?Nature 351, 649–652.

Hannenhalli, S., Pevzner, P., 1995. Transforming cabbage intoturnip. In: Proceedings of the 27th ACM Symposium Theoryof Computing. ACM Press, San Francisco, CA, pp. 178–189.

Iyer, V., et al., 1999. The transcriptional program in the responseof human fibroblasts to serum. Science 283, 83–87.

Jain, A.K., Murty, N.M., Flynn, P.J., 1999. Data clustering: areview. ACM Comput. Surv. 31 (3), 264–323.

Jenner, R.G., Alba, M.M., Boshoff, C., Kellam, P., 2001. Kaposi’ssarcoma-associated herpesvirus latent and lytic gene expressionas revealed by DNA arrays. J. Virol. 75, 891–902.

Jow, H., Hudelot, C., Rattray, M., Higgs, P., 2002. Bayesian phylogenetics using an RNA substitution model applied to early mammalian evolution. Mol. Biol. Evol. 19 (9), 1591–1601.

Kececioglu, J., Sankoff, D., 1995. Exact and approximation algorithms for the inversion distance. Algorithmica 13, 180–210.

King, B., 1967. Step-wise clustering procedures. J. Am. Stat. Assoc. 62, 86–101.

Koonin, E.V., 1999. The emerging paradigm and open problems in comparative genomics. Bioinformatics 15, 265–266.

Lehmann, E.L., D’Abrera, H.J.M., 1998. Nonparametrics: Statistical Methods Based on Ranks. Prentice-Hall, Englewood Cliffs, NJ.

Li, M., Vitányi, P., 1997. An Introduction to Kolmogorov Complexity and its Applications. Springer-Verlag, New York.

Li, M., Badger, J.H., Chen, X., Kwong, S., Kearney, P., Zhang, H., 2001. An information based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics 17, 149–154.

Martin, W., 1999. Mosaic bacterial chromosomes: a challenge en route to a tree of genomes. BioEssays 21, 99–104.


Meng, X.J., 1998. Genetic and experimental evidence for cross-species infection by swine hepatitis E virus. J. Virol. 72, 9714–9721.

Merz, P., 2002. Clustering gene expression profiles with memetic algorithms. In: Merelo, J.J., et al. (Eds.), Parallel Problem Solving From Nature VII, volume 2439 of Lecture Notes in Computer Science. Springer-Verlag, Berlin, pp. 811–820.

Merz, P., Freisleben, B., 1997. Genetic local search for the TSP: new results. In: Proceedings of the 1997 IEEE International Conference on Evolutionary Computation. IEEE Press, USA, pp. 159–164.

Moscato, P., Cotta, C., 2003. A gentle introduction to memetic algorithms. In: Glover, F., Kochenberger, G. (Eds.), Handbook of Metaheuristics. Kluwer Academic Publishers, Boston, MA, pp. 105–144.

Murtagh, F., 1984. A survey of recent advances in hierarchical clustering algorithms which use cluster centers. Comput. J. 26, 354–359.

Nadeau, J.H., Sankoff, D., 1998. Counting on comparative maps. Trends Genet. 14, 495–501.

Nagy, G., 1968. State of the art in pattern recognition. Proc. IEEE 56, 836–862.

Pennisi, E., 1998. Genome data shake the tree of life. Science 280, 672–674.

Pennisi, E., 1999. Microbes immunity and disease: is it time to uproot the tree of life? Science 284, 1305–1307.

Reyes, A., Pesole, G., Saccone, C., 1998. Complete mitochondrial DNA sequence of the fat dormouse, Glis glis: further evidence of rodent paraphyly. Mol. Biol. Evol. 15, 499–505.

Reyes, A., Gissi, C., Pesole, G., Catzeflis, F.M., Saccone, C., 2000. Where do rodents fit? Evidence from the complete mitochondrial genome of Sciurus vulgaris. Mol. Biol. Evol. 17, 979–983.

Rogers, A., Prügel-Bennett, A., 1999. Modelling the dynamics of steady-state genetic algorithms. In: Banzhaf, W., Reeves, C. (Eds.), Foundations of Genetic Algorithms, vol. 5. Morgan Kaufmann, San Francisco, CA, pp. 57–68.

Saitou, N., Nei, M., 1987. The Neighbor-Joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4, 406–425.

Shimamura, M., Yasue, H., Ohshima, K., Abe, H., Kato, H., Kishiro, T., Goto, M., Munechika, I., Okada, N., 1997. Molecular evidence from retroposons that whales form a clade within even-toed ungulates. Nature 388, 666–670.

Sneath, P.H., Sokal, R.R., 1973. Numerical Taxonomy. Freeman, London, UK.

Snel, B., Bork, P., Huynen, M.A., 1999. Genome phylogeny based on gene content. Nat. Genet. 21, 108–110.

Sullivan, J., Swofford, D.L., 1997. Are guinea pigs rodents? The importance of adequate models in molecular phylogenetics. J. Mammal. Evol. 4, 77–86.

Thewissen, J.G.M., Williams, E.M., Roe, L.J., Hussain, S.T., 2001. Skeletons of terrestrial cetaceans and the relationship of whales to artiodactyls. Nature 413, 277–281.

Tsai, H.-K., Yang, J.-M., Kao, C.-Y., 2002. Applying genetic algorithms to finding the optimal gene order in displaying the microarray data. In: Langdon, W.B., et al. (Eds.), Proceedings of the 2002 Genetic and Evolutionary Computation Conference. Morgan Kaufmann, San Francisco, CA.

Ward, J.H., 1963. Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 58, 236–244.

Whitley, D., 1989. The GENITOR algorithm and selective pressure: why rank-based allocation of reproductive trials is best? In: Schaffer, D. (Ed.), Proceedings of the Third International Conference on Genetic Algorithms. Morgan Kaufmann, San Mateo, CA, pp. 116–121.

Wooley, J.C., 1999. Trends in computational biology: a summary based on a RECOMB plenary lecture. J. Comput. Biol. 6, 459–474.

Wu, B.Y., Chao, K.-M., Chuan, Y.T., 1999. Approximation and exact algorithms for constructing minimum ultrametric trees from distance matrices. J. Combin. Optim. 3 (2), 199–211.

Ziv, J., Lempel, A., 1977. A universal algorithm for sequential data compression. IEEE Trans. Inform. Theory 23 (3), 337–343.
