Lacroix v 2008

Embed Size (px)

Citation preview

  • 7/29/2019 Lacroix v 2008

    1/24

    An Introduction to Metabolic Networks andTheir Structural Analysis

    Vincent Lacroix, Ludovic Cottret, Patricia Thebault, and Marie-France Sagot

    AbstractThere has been a renewed interest for metabolism in the computational biology community, leading to an avalanche ofpapers coming from methodological network analysis as well as experimental and theoretical biology. This paper is meant to serve asan initial guide for both the biologists interested in formal approaches and the mathematicians or computer scientists wishing to injectmore realism into their models. This paper is focused on the structural aspects of metabolism only. The literature is vast enoughalready, and the thread through it is difficult to follow even for the more experienced worker in the field. We explain methods foracquiring data and reconstructing metabolic networks and review the various models that have been used for their structural analysis.Several concepts such as modularity are introduced, as well as the controversies that have beset the field these past few years, forinstance, on whether metabolic networks are small world or scale-free and on which model better explains the evolution of metabolism.

    Index TermsMetabolism, network, metabolic data, modeling, topological analysis, modularity, evolution.

    1 INTRODUCTION

    IN his 11 December 1907 lecture for the Nobel Prize inchemistry, EduardBuchner said: Weareseeing thecells ofplants and animals more and more clearly as chemicalfactories, where the various products are manufactured inseparate workshops. The enzymes act as the overseers. Ouracquaintance with these most important agents of livingthings is constantly increasing. An even perfunctory look atthe computational biology literature will indicate that theinterest of the community for such cellular chemicalfactories is also constantly increasing to the point whereanyone may feel overwhelmed by the amount and variety of

    papers pertaining to the topic. This survey is meant as aninitial guide into chemical factories that is into metabolism,which is defined as the sum of the physical and chemicalprocesses in an organism by which its material substance isproduced, maintained, and destroyed, and by which energyis made available [1]. The guide is initial becauseproviding a complete even if simplified roadmap throughthe whole of the computational biology literature onmetabolism would be beyond the scope of a single paper.We decided therefore to focus our attention on the simplestmathematical model one may draw of metabolism, that of agraph (or hypergraph), and on the questions that may be

    asked using such a model. These are essentially structuralquestions on what corresponds to a static representation ofmetabolism. We shall see that, even within such restrictions,the literature is already vast and controversial enough. Wealso chose to focus our attention on metabolism exclusivelyandto intentionally disregardother cellular processes such asgene regulation and signaling. The main reason, besides aproblem of space (talking about such processes even insimplifiedwayswouldtakeustoofar),isthat,atleastfornow,metabolism is far better documented (with more abundantandreliabledata)andthusprovidesagoodsteppingstonefor

    initiating the modeling of complex cell factories.Although this paper concerns only structural aspects of

    metabolism, the topic is of importance as illustrated by thenumerous in silico analyses that will be discussed at somelength in this paper, various of which represent alsoexperimentally fully checked success stories (see [17],[160], [170], and [177] for a few examples and references onboth in silico and in vitro or in vivo work), where the studyof structural features such as input-output relationships andmaximization of a product, modularity, pathway redun-dancy, and phenotypic behavior were able to bring sig-nificant insight in the understanding of a metabolic systemand in some cases even means to modify and monitor it. Asan additional motivation for this review, it is worth stressingthat in all these cases, the use of formal models was anessential requirement.

    The term metabolism is derived from the Greek wordmetabol"e for change. Its scientific study appears to havestarted some 400 years ago with experiments performed bySantorio Sanctorius on human, indeed on himself. Theexperiments involved observing weight fluctuations in hisbody over the course of a day and during various metabolicprocesses such as those that occur when digesting, sleeping,and eating. Through these experiments, published in Ars destatica medicina in 1614, Sanctorius introduced the quantita-

    tive aspect into medical research and, at the same time,founded the modern study of metabolism with, however, anexclusively vitalistic view to explain it. It was two centurieslater only, by studying the fermentation of sugar to alcohol

    IEEE TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 5, NO. 4, OCTOBER-DECEMBER 2008 1

    . V. Lacroix is with the Genome Bioinformatics Research Group, Centre deRegulacio Genomica (CRG), PRBB, Aiguader 88, 08003 Barcelona, Spain.E-mail: [email protected].

    . L. Cottret and M.-F. Sagot are with the Equipe BAOBAB, UniversitedeLyon, F-69000, Lyon; UniversiteLyon 1; CNRS, UMR5558, Laboratoirede Biometrie et Biologie Evolutive, F-69622, Villeurbanne, France, and theEquipe-Project BAMBOO, INRIA Rhone-Alpes, 655 avenue de lEurope,38330 Montbonnot Saint-Martin, France.E-mail: {cottret, sagot}@biomserv.univ-lyon1.fr.

    . P. Thebault is with the Equipe MABioVis, Laboratoire Bordelais deRecherche en Informatique (UMR5800), CNRS, University of Bordeaux 2,351, cours de la Libration, 33405 Talence Cedex, France.E-mail: [email protected].

    Manuscript received 13 Mar. 2008; revised 8 June 2008; accepted 21 June2008; published online 28 June 2008.For information on obtaining reprints of this article, please send e-mail to:[email protected], and reference IEEECS Log Number TCBB-2008-03-0055.Digital Object Identifier no. 10.1109/TCBB.2008.79.

    1545-5963/08/$25.00 2008 IEEE Published by the IEEE Computer Society

  • 7/29/2019 Lacroix v 2008

    2/24

    by yeast, that Louis Pasteur showed that the organiccompounds and chemical reactions found in cells are nodifferent in principle from chemistry. It was, however, reallythe discovery of enzymes by Eduard Buchner in the early20th century that separated the study of the chemicalreactions composing the metabolism of an organism fromthe biological study of its cells.

    Such reactions are traditionally grouped into the so-called

    metabolic pathways, which may in turn be classified asanabolic or catabolic. Anabolism is the synthesis of mole-cules through the use of energy and consumption ofreducing agents (a reducing agent is a substance thatchemically reduces other substances by donating one orseveral electrons), while catabolism corresponds to thedegradation of molecules yielding energy and the produc-tion of reducing agents. The pathways may either be studiedin isolation or, since they are overlapping, be combinedtogether to yield what is referred to as a metabolic network.The benefits of studying the whole network rather thanindividual pathways are numerous and include, for in-stance, the possibility to explore alternative pathways.

    Before getting to a network representation of metabolismhowever, the first task is to acquire the data. This is notrivial matter and will be described at some length inSection 2. The process can be time consuming, but it is timeworth spending in order to avoid the risk of fully orpartially invalidating the results obtained with any lateranalysis, above all their correct biological interpretation.Once data are acquired, modeling them into a network maythen start. We discuss in Section 2.3 the various graph-related models (the only ones considered in this survey forthe reasons given above) that have been or could be used.

    With a graph representation of metabolism in hand,initial analyses are possible. These concern mainly thenetwork topology. Although all such analyses are poten-tially interesting, one stands out perhaps more than others.This concerns the issue of modularity. It is generally agreedthat the notion of modularity is relevant in biology, indeedimportant for understanding function and evolution. Formetabolism in particular, this notion is old if one considersthat the metabolic pathways into which the chemicalreactions taking place in a cell are traditionally organizedand depicted correspond to modules. Glycolysis and theurea and tricarboxylic acid (TCA) cycles are probably themost ancient such pathways to have been discovered,the first by Otto Meyerhof around 1940 following the work

    of Louis Pasteur and others on fermentation, and the lasttwo by Hans Krebs in 1932 and 1937, respectively. The TCAcycle, which corresponds to the sequence of chemicalreactions that produces energy in cells, is also known asthe Krebs cycle and earned Hans Krebs a Nobel Prize in1953. The identification of metabolic pathways presents,however, a certain arbitrariness, particularly at the frontiersof the pathways that are often ill defined. Automatic andformal ways of identifying metabolic pathways, such asthose that have been familiar to biochemists since the lastcentury, have therefore been explored, as have otherdefinitions of modules, some of which are operationallyrather than biologically motivated. All are presented in

    Section 3 together with the other network measures thathave been applied to metabolism.

    Although doubtless useful, in particular for understand-ing biological processes and also for validating some

    computational methods developed for analyzing metabo-lism, the search for modules should not overshadow theneed sometimes to work with the full network. This willdepend on the biological question that is asked. Studying theevolution of metabolism may be one occasion where usingthe full network is in some cases required. Work on thisessential topic is discussed in Section 4. It may be argued thatstructural analyses should not be divorced from evolution-ary considerations, as the latter are important for keeping acheck on the soundness of the first. For the sake of expositionhowever, such separation appears to be helpful.

    The reader who wishes to skip information on how datais acquired and is knowledgeable on how metabolicnetworks may be represented using graph or constraint-based models may safely go directly to Section 3 or 4.

    2 ACQUISITION OF METABOLIC DATA

    2.1 Defining the Entities

    A metabolic network may be formally defined as a collectionof objects and the relations among them. The objectscorrespond to chemical compounds, biochemical reactions,enzymes, and genes (see Fig. 1).

    Chemical compounds, also called metabolites, are small

    molecules that are imported/exported and/or synthesized/degraded inside an organism. For most metabolites, theamount observed varies depending on the tissue and cellcompartment inside which the compound is present.

    2 IEEE TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 5, NO. 4, OCTOBER-DECEMBER 2008

    Fig. 1. Simplified UML schema of the various objects involved in ametabolic network. The indications on each side of an arrow represent-ing a relation between two objects define the cardinalities of each objectin the relation: 1 means exactly one, 0 . . . means zero or more,1 . . . means one or more. For instance, the indications on each sideof the relation codes for between Gene and Protein mean that onegene can produce several proteins (in the case of alternative splicing forinstance) or none if the gene is not a protein gene (this explains the 0 inthe cardinality for the proteins), and that a protein can either beproduced by more than one gene or supplied by the environment (thisexplains the 0 in the cardinality for the genes).

  • 7/29/2019 Lacroix v 2008

    3/24

    Tissues and cells indeed contain a number of liquidcompartments separated from one another by selectivelypermeable membranes.

    Biochemical reactions produce a set of one or morecompounds (called the products) from another set of one ormore compounds (called the substrates). In theory, achemical reaction can occur in both directions. However,under particular physiological conditions, some reactions

    occur in only one direction. In this case, they are defined asbeing irreversible if all other conditions remain constant.Inside a cell, some reactions are spontaneous, but most arecatalyzed by one or several enzymes, which stronglyaccelerate their speed. An enzyme is a protein or a proteincomplex coded by one or severalgenes. A single enzyme mayaccept distinct substrates and may catalyze several reactions,and conversely, a single reaction may be catalyzed byseveral enzymes. Elucidating the links between genes,proteins, and reactions (the so-called GPR relationship) isnot a trivial task and is a major concern in metabolicreconstruction [146] as discussed in Section 2.2.

    The presence of small molecules, called cofactors, is

    sometimes essential to allow the catalysis of a reaction by anenzyme. Such molecules can vary for a given reactionamong several organisms. By binding the enzyme, cofactorscan enhance or decrease the activity of the enzyme. Theyare called, respectively, allosteric activators or allostericinhibitors. The term allostery indicates that the regulatorysite of an allosteric protein is separate from its active site.

    Data on individual reactions have not always beenavailable and the concept of metabolic pathway has oftenbeen used to informally characterize the set of reactionsinvolved in the synthesis or degradation of a molecule ofinterest (glycolysis for the transformation of glucose intopyruvate for instance). The concept of metabolic pathwayremains very much employed for historical reasons butlacks formal definition. In particular, there is no consensuson the boundaries of a pathway. Efforts were made inrecent years to propose a more formal definition [51], [97],[153], [158], but no general agreement has been reached yet.

    2.2 Metabolic Reconstruction

    Reconstructing a metabolic network consists in inferring therelations between genes, proteins (enzymes), and reactionsin a given metabolic system. This is usually achieved usingcomparative genomics but also often as a refinement step,using metabolomic data (the latter referring to the type and

    quantity of metabolites present in the metabolism of anorganism).Other types of relations may be more difficult to assess.

    This is the case, for instance, of the allosteric effects of anenzyme, which are rarely known. The precision needed inthe definition of each relational link depends also on thequestion one wishes to answer. In some cases, it suffices toobtain a list of metabolites associated with a list of thechemical transformations to which they are linked. Onthe other hand, if the aim is to study, for instance, therelationship between genotype and phenotype, then itbecomes necessary to establish a precise correspondencebetween reactions and the enzymatic genes whose protein

    products catalyze them.Inference from comparative genomic data usually sug-

    gests a list of metabolic reactions present in an organism ofinterest, not the whole network. Although the process has

    been highly automated in recent years, it still requires inmost cases an expert manual intervention sustained by datapainstakingly collected from the literature. Once a model forthe whole network, such as a graph, has been obtained, astudy of its general mathematical properties can start(see Section 3).

    The quality of such a reconstruction obviously stronglydepends on the quality of the genome annotation for that

    organism and, to a lesser degree, on the taxonomic positionof the input organism. Indeed, there exist few organisms forwhich the set of genes composing their genomes and theirassociated functions (metabolic or other) are well known.This is the case, for instance, ofEscherichia coli. EcoCyc [90],the part of the metabolic database BioCyc [84] dedicated toEscherichia coli, thus offers an excellent level of accuracy duein no small part to its numerous links to experimentalevidence. This kind of database is, however, an exceptionamong genome-scale metabolic databases. Besides beingoften much less accurate, most of the pathways in BioCyc orin KEGG, another widely used database for metabolism [6],are further biased toward the microbial and plant king-doms. The quality of the metabolic reconstruction of anyanimal is thus expected to be worse than the one of abacterium evolutionarily close to Escherichia coli.

    Metabolic reconstruction from comparative genomics istraditionally divided in two parts. The first provides afunctional annotation of the metabolic genes, i.e., deter-mines the catalytic activity of the enzymes the genes codefor. The second defines the relation between functionalannotations and biochemical reactions, i.e., establishes thelist of reactions enabled by the annotated catalytic activities.

    2.2.1 Functional Annotation of the Metabolic Genes

    Recent high-throughput genome sequencing techniqueshave given access to many complete genome sequences.For instance, the EBI genomic database (http://www.ebi.ac.uk/genomes/) contained at the time of writing this paperthe complete genomes of 575 bacteria or viruses, 51 archea,and 74 eukaryotes. Most genomes have been annotatedusing fully automated methods. The first step consists indetecting the boundaries of the genes and in assigning afunction to the protein(s) such genes code for. Severalcomplementary methods are currently used to identify thelimits of a gene, which proceed by putting togetherinformation from, among others, the detection of openreading frames and of conserved motifs around the junctionsbetween introns and exons (in the case of eukaryotes), as wellas alignments with already known genes. The latter shouldenable also to determine the function(s) of the genes. Indeed,this is most often predicted by sequence comparison of therelated protein with proteins of known function(s) in alreadysequenced genomes.

    In the case of a metabolic network reconstruction, it isparticularly important to be able to establish the catalyticfunction(s), if any, of a protein. Functions are specified eitherbased on experimental clues or by using automatic methods.Because of the frequency of new completely sequencedgenomes, it is nowadays impossible to provide annotations

    based on experiments for each protein, and automaticmethods are therefore the most common means of assigningfunction(s) to a protein. As pointed out in [57], the firstchallenge is however in arriving at an unambiguous and

    LACROIX ET AL.: AN INTRODUCTION TO METABOLIC NETWORKS AND THEIR STRUCTURAL ANALYSIS 3

  • 7/29/2019 Lacroix v 2008

    4/24

    operational (for the purpose of an automated prediction)definition of function.

    Protein function. The ambiguity comes from the fact thatprotein function may depend on which aspect, biochemicalor physiological for instance, is considered. It can also behighly context-dependent as exemplified by the so-calledmoonlighting proteins, which have different functionsinside the same organism depending on their cellular

    localization, cell-related level of expression, oligomericstate, and cellular concentration of a ligand, substrate,cofactor, or product [77]. Such characteristic can gounsuspected for years.

    Ontologies such as Gene Ontology (GO) [9], byestablishing a common and controlled vocabulary for thefunctional description of homologous genes in multipleorganisms, have made gene annotation easier and ingeneral more consistent, and have improved data exchangeand curation. A so-called Enzyme Commission (EC)numbering system is used to classify enzymes by type andfunction by means of a hierarchical code with four digitsseparated by dots attributed by the International Union of

    Biochemistry and Molecular Biology (IUBMB). Each digitrepresents a progressively finer class [178]. The first onedefines the type of reaction catalyzed

    1. oxidoreductases,2. transferases,3. hydrolases,4. lyases,5. isomerases, and6. ligases,

    the next two further indicate, respectively, the reactionssubclass and sub-subclass, and the fourth is the serialnumber of the enzyme in its sub-subclass (see http://www.chem.qmul.ac.uk/iubmb/enzyme/rules.html formore details). Unfortunately, the assignment of an ECnumber, even when correct, can be imprecise. For instance,the EC number 2.7.1.69 corresponds to 60 genes inLactobacillus plantarum [176]. Moreover, at current time,some 30 percent to 40 percent of the metabolic activitiesbiochemically characterized, which is experimentall yknown to exist in nature and to which an EC number hasbeen assigned, remain orphan in the sense that no generesponsible for the corresponding reactions has ever beenidentified in any organism [22], [105], [137].

    Orthology approaches. The classical way to annotate a

    whole-genome set of proteins is to perform sequencecomparison between them and the proteins of other specieswhose genomes have already been annotated in order todetect homology, more precisely orthology. The corre-sponding function assignment is indeed based on theassumption that orthologous enzymes have the sameactivity. Orthologous proteins are homologs in differentspecies that originated from a single ancestral gene after aspeciation event [52]. One major difficulty with suchmethods is to distinguish orthologs from paralogs. Thelatter result from the intragenomic duplication of anancestral gene and are assumed to have different func-tion(s). Two main approaches exist to address this problem.

    The first is based on an all-against-all reciprocal sequencecomparison of the proteins encoded in complete genomesusing, for instance, Blast [5]. Two proteins are classified intothe same group if they are more similar between them than

    to any other proteins from the same genomes, that is, if theyare each others Best Reciprocal Hits (BRH). BRH methodsare well adapted for large-scale data and are at current timecommonly used to annotate new sequenced genomes.Clusters of Orthologous Groups (COG) [174] is the firstdatabase built in this way and is certainly the most famousone. The Kegg Orthology (KO) system [82] is based on thesame concept and is also widely used, particularly for

    metabolic reconstruction. Neither COG nor KO are, how-ever, able to take into account gene duplications, which mayoccur after a speciation event and lead to the presence ofcoorthologs in the same genome. The latter have morerecently been considered by Remm et al. to improve theclustering of orthologous groups [147].

    The second main approach to distinguish speciationevents (evidence of an orthology link) from duplicationevents (evidence of a paralogy link) is based on acomparison of gene trees G with a reference speciestree S. The result of the comparison is a reconciledtree R, which is a variation of the species tree in whichduplication nodes have been inserted in order to explainincongruences with a gene tree [40] (see Fig. 2).

    Automatic tree reconciliation methods were developedfor genome-scale studies but require completely resolvedgene and species trees. To circumvent this problem,Dufayard et al. proposed a method allowing for thepresence of a number of unresolved nodes both in the geneand the species trees [39]. Several databases are built usinga tree reconciliation method, among which are HOVERGEN[40] for vertebrates, INVHOGEN [132] for invertebrates,and TreeFam [106] for animals. However, tree reconciliationmethods present various drawbacks [101]: they are basedon still poorly characterized models of gene duplication and

    loss, the species phylogeny used (in general, this follows thetaxonomy provided by the NCBI) contains a large numberof unresolved nodes, and last, computing a reconciled treeis time consuming.

    4 IEEE TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 5, NO. 4, OCTOBER-DECEMBER 2008

    Fig. 2. Reconciliation of the gene tree G and the species tree S into the

    tree R.

  • 7/29/2019 Lacroix v 2008

    5/24

    Whatever specific method is chosen, all orthology-basedapproaches assume that orthologs have the same func-tion(s) while paralogs have different function(s). However,closely related paralogs may have more similar functionsthan distantly related orthologs. Moreover, it is not alwaysthe case that orthologs have the same function(s). They mayinstead present rather different functionalities, even whenthe percentage of conserved bases among aligned orthologs

    is very high [57], [119].Sequence signature approaches. Since the function of a

    protein is essentially determined by one or several activedomains, identifying sequence signatures may appear to bemore appropriate for predicting protein function(s). Severaldatabases, such as Interpro [117] and ProDom [161],propose an automated clustering of homologous domainsthat can then be used to establish the function(s) of aprotein. More specifically, in the context of metabolic genes,Claudel-Renard et al. introduced a method, called PRIAM[25], that assigns enzymatic activities based on a classifica-tion into modules of the enzymes in the ENZYME database[10]. An enzyme module is defined as the longest homo-

    logous segment shared among an enzyme collection.The increase in the number and diversity of the

    sequences available in databases make all homology-basedmethods error-prone, whether they are based on sequencecomparison or on common sequence signature detection.As more proteins are automatically annotated, the propaga-tion of such annotation errors may quickly escalate [57].Indeed, orphan genes may be associated to a sequence inanother genome where no experimental evidence exists thatwould permit to link the orphans to that or to any otherhomologous sequence [137]. Moreover, some proteins withundetectable sequence similarity may have, in fact, thesame function (they are called analogs) [59].

    Genomic context approaches. Information provided byhomology-based methods can be complemented by inves-tigating protein coevolution. The hypothesis is that func-tionally linked proteins evolve in a correlated fashion. IfN genomes are considered, a phylogenetic profile isestablished for each protein that corresponds to a vectorof bits of length N, each bit indicating the presence orabsence of a homolog in a given genome. Proteins are thenclustered and function(s) determined according to thesimilarity of their phylogenetic profiles [133].

    In prokaryotes where functionally related genes tend tobe coregulated and colocalized on the chromosome, groups

    of contiguous genes are in general preserved across differentgenomes, with conserved local order. This information canbe used to infer the function(s) of the protein a gene encodes,even in the absence of any similarity with other sequences[150], [182]. It has been shown also that genes linked byfusion events generally code for functionally related proteins[202], [43].

    Other approaches. Expression microarrays, protein-protein interaction (PPI) networks, and information oncellular localization may provide further clues for theassignment of function to a protein. In this case, only ageneral function at the cellular level may be inferred, not aprecise molecular function (such as a complete EC number).

    We refer the interested reader to [57] for an overview ofthese methods.

    To avoid the drawbacks inherent to each of the aboveapproaches, attempts have been made to combine them in

    an efficient way. The method of Chua et al. thus models intothe same weighted graph information about sequencehomology, PPIs, protein domains, gene expression data,and literature [23]. Some missing gene approaches (seefurther details in Section 2.2.3) attempt also to integrateseveral kinds of information to identify genes for missingenzymes [126], [201].

    Manual refinement. After any automatic gene annota-

    tion, doing a manual refinement is essential. Severalannotation platforms, such as GenDB [114], MaGe [182],or Iogma (http://www.genostar.org), provide powerfulgraphical interfaces to help experts curate automaticallygenerated annotations. Since it is very difficult to have anexpert group to annotate all the genes in a single genome,Overbeek et al. propose an approach where all genesoccurring in a subsystem (such as a metabolic pathwayfor instance) are analyzed over a large collection of genomesby an expert in that subsystem [125].

    Various currently available projects try to provideimproved genome annotations by using curated and up-to-date genomic sequences. This is the case, for instance, of

    RefSeq [138], GenomeReviews [89], Ensembl [76], andHigh-quality Automated and Manual Annotation ofmicrobial Proteomes (Hamap). The latter uses bothmanual and semiautomatic curation to obtain the completeproteomes of sequenced bacteria [61]. The project Integr8gathers data from both GenomeReviews and Hamap [89].

    Despite the rapid growth rate of new methods and datato determine protein function(s), the fractions of genes forwhich no function can be predicted is still high (around40 percent), especially in eukaryotes, and remains a majorproblem [63], [137]. It is important to keep in mind thisobservation since, whatever the efficiency of the metabolicreconstruction methods, the metabolic capabilities of anorganism will remain incomplete as long as the assignmentof protein function(s) is incomplete or imprecise.

    2.2.2 Defining the Set of Metabolic Reactions and

    Pathways

    Once enzyme functions have been tentatively defined, linksmust be established between them and the correspondingbiochemical reactions. The ENZYME database describeseach type of characterized enzyme for which an EC numbercould be provided and contains links to various metabolicdatabases and to Uniprot [10], which is the worlds mostcomprehensive catalog of information on proteins. It is a

    central repository of protein sequences and functionscreated by joining the information contained in Swiss-Prot,TrEMBL, and PIR. The BRENDA database provides addi-tional detail, when available, concerning the substratespecificity of the reactions across several organisms [155].The BIOCYC [84] and KEGG [83] databases, currently themost widely used, store information on genomic data,proteins, reactions, compounds, and metabolic pathways.Perhaps the two most currently used tools to automaticallyinfer a set of biochemical reactions from genomic data arethe KEGG Automatic Annotation Server (KAAS) [115] andPathoLogic, which is the prediction tool in the PathwayTools used to build, query, and visualize BioCyc-like

    databases [86]. The two adopt a somewhat distinct strategy.In KEGG, the genes present in the databases are identified bytheir KO number, which groups orthologs coding for thesame enzymatic activity. A biochemical reaction is consid-

    LACROIX ET AL.: AN INTRODUCTION TO METABOLIC NETWORKS AND THEIR STRUCTURAL ANALYSIS 5

  • 7/29/2019 Lacroix v 2008

    6/24

    ered as possible in an organism if its genome contains a genethat is classified by sequence comparison into the KO groupcorresponding to the EC number of the reaction. PathoLogicdoes not do sequence comparison but assumes instead thishas already been performed and the information stored, forinstance, in GenBank, which is the comprehensive NIH-maintained genetic sequence database [12]. As a conse-quence, if an EC number has been assigned to a gene product

    (the EC number can be retrieved from the EC_NUMBERqualifier in GenBank), the corresponding reaction in adatabase called MetaCyc [21] used by PathoLogic is addedto the list of reactions feasible in the organism. MetaCycstores information on nonredundant, experimentally eluci-dated metabolic pathways for more than 900 organisms. If agene has no EC number assigned, its assigned functions arematched against the list of names of functions present inMetaCyc. To a great extent, the strategy of KAAS is thus toredo the assignment of a function for each gene while thestrategy of PathoLogic is to use the preexisting annotations.This means that the quality of the predictions in PathoLogichighly depends on the quality of the annotations. On the

    other hand, PathoLogic can also take into account informa-tion that does not come from sequence comparison. Themain advantage of the KAAS metabolic reconstruction isthat it can start its assignment from draft genomes.

    Historically, metabolic pathways are sets of reactionscorresponding to a metabolic function. For instance,glycolysis is a sequence of reactions involved in thedegradation of glucose (a molecule with six carbon atoms)into pyruvate (a molecule with three carbon atoms) andother molecules needed for the biosynthesis of macromo-lecules constitutive of a cell. Glycolysis is one of themetabolic pathways (such as the TCA cycle and the pentosephosphate pathway for instance) that occurs in almostevery organism. Reference metabolic pathways have thusbeen built to represent common and alternative organism-dependent reactions. By comparing the set of reactionsexpected to occur in an organism with the set of reactions inreference metabolic pathways, it is possible to infer themain metabolic functions of an organism. In KEGG, thereference metabolic pathways are organized into metabolicmaps where all known variants are drawn together. From alist of reactions or even gene identifiers defined by KAAS, itis possible to highlight the corresponding reactions in eachreference metabolic map. PathoLogic additionally attemptsto predict which pathways are susceptible to occur in the

    input organism. The prediction is based on the proportionof metabolic reactions in the pathway for which there is anevidence [126] and on the presence of unique reactions.Indeed, reactions occurring in only one pathway provide astronger evidence for the presence of a metabolic pathwaythan reactions occurring in several other pathways.

    Neither KAAS nor PathoLogic can predict novel path-ways. For this, the Pathway Hunter Tool (PHT) [143] can beused. Starting from a set of annotated EC numbers and asource/destination metabolite pair selected by the user, theshortest metabolic pathways k-shortest pathways arecomputed by PHT. These provide alternative routes thatmay then be evaluated for biological significance.

    A genome-based metabolic reconstruction enables tovery quickly obtain a draft of the relations between genes,enzymes, and reactions and, for instance, to study theimpact of some genomic events (such as duplications,

    transfers, or deletions of genes) on a metabolic network.However, this draft reconstruction often contains errors orimprecisions that must, in general, be further painstakinglyand expertly curated with help from the literature or fromother more refined methods, which are described next.

    2.2.3 Refinements of a Metabolic Reconstruction

    Missing genes and fragments in a metabolic reconstruc-

    tion. After any metabolic reconstruction, some reactionsappear as missing. They correspond to gaps in thebiochemical pathways that were nevertheless declared asbeing present in the reconstruction. We observe that thereseems to be no established quantitative criterion in theliterature to decide whether a pathway is present in ametabolic reconstruction. Any such missing reaction can beexplained [28], [124] by

    1. a low sequence similarity of the corresponding genewith the reference ones coding for the missingenzyme in other organisms;

    2. the products of the reaction can be obtained from

    alternative pathways or are provided by theenvironment;

    3. the presence of multienzyme proteins; and4. the corresponding gene is really absent in the

    genome.

    Various approaches exist to complete these gaps bytrying to identify the genes encoding for a specific metabolicfunction. Like most methods used in this area, they arebased on ad hoc heuristics. Osterman and Overbeekpropose a very good review on the use of comparativegenomics to search for missing genes [124]. Indeed, varioustypes of genomic evidence can be combined to propose

    candidate genes for a missing reaction: genes coding forenzymes involved in the same pathway are often chromo-somal neighbors (in prokaryotes, the case of eukaryotes isless clear) [65], [62], [92]; protein fusion events providesome evidence of potential functional coupling [92]; twoproteins that appear in the same metabolic pathways tendto occur together or not in any specific organism [92]; genesfunctionally coupled are expected to be coregulated [92].Inference of the missing reactions may be performed usinga supervised approach (such as Support Vector Machine),which requires partial knowledge of the network and goodtraining sets [65], [201]. Candidate genes have theirsimilarity to known genes measured according to the

    relative importance of each of the previously cited evidence.These methods are interesting for both metabolic recon-struction and gene annotation.

    When the missing reactions represent a pathway frag-ment, another approach [20] may be used to infer it bycomparison with a set of known pathways. The latter arepreviously modeled as enzyme graphs labeled with GOannotations. Data mining techniques are used to detectfrequent pathway functionality patterns in the graphs andthese are then matched onto the metabolic network ofinterest. The underlying idea is to focus on the biologicalprocesses carried by the various steps along a pathwayrather than on the specific genomic or chemical features of

    each individual enzyme [20].Reversibility of the reactions. The direction of a reaction

    in certain physiological conditions is determined by thethermodynamic properties of the reaction, the kinetic

    6 IEEE TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 5, NO. 4, OCTOBER-DECEMBER 2008

  • 7/29/2019 Lacroix v 2008

    7/24

    properties of the enzyme, and the concentration of itssubstrates and products. In a metabolic reconstruction, thedirection of a reaction is often added a posteriori andmanually, or the reaction is left as reversible. Methods toautomatically assign a direction to the reactions in a genome-scale metabolic network reconstruction remain rare. Yanget al. [203] show how the direction of a reaction may bedetermined by analyzing the stoichiometric matrix of the

    metabolic network. The stoichiometric matrix of a networkgives the coefficients for all reactions in a metabolic network.The stoichiometric coefficient for a given metabolite in relationto a given reaction represents the degree to which themetabolite participates in the reaction. For instance, if thereaction R is A 2B! C 3D (reactant metabolites A andB are transformed into product metabolites C and D byreaction R), then the stoichiometric coefficients of A, B, C,and D are, respectively, 1, 2, 1, and 3. In general,stoichiometric coefficients are integers since elementaryreactions always involve whole molecules. Obviously, thecoefficient is zero for metabolites that do not participate inthe reaction. Yang et al. [203] showed how, using the

    stoichiometric matrix, feasible reaction directions in a givensystem may be computed from those imposed by thereactions at the boundary of the system that transportmetabolites in or out. However, the algorithm was tested ona reaction network of only 44 reactions. The algorithmproposed by Kummel et al. [102] exploits instead allavailable experimental thermodynamic data, given by thederived Gibbs energies of formation, and metabolite con-centrations to identify irreversible reactions. The standardGibbs free energy of formation of a metabolite is the changeof Gibbs free energy that accompanies the formation of 1mole (the mole is a counting unit) of that substance from itscomponent elements, at their standard states (the most stableform of the element at 25 C and 100 kilopascals). The Gibbsenergy of a reaction can intuitively be thought of as themaximum amount of work obtainable from a reaction. Next,Kummel et al.s algorithm aims at identifying, on the basis ofsome heuristic rules, sets of reactions (subnetworks) whosesimultaneous operation is thermodynamically infeasible. Athermodynamically infeasible subnetwork may, for in-stance, be a cyclic operation of a reaction set such as theexample given in [102] of three reactions A $ B, B$ C, andA $ C. IfA is converted to Band Bto C, then Cmust have alower Gibbs energy of formation than A meaning that thereaction C ! A is not feasible. The algorithm was tested on a

    genome-scale model of Escherichia coli (920 reactions) andassigned 130 of the 920 reactions as irreversible. Feist et al.also propose a genome-scale metabolic reconstruction forEscherichia coli K12, which includes thermodynamic infor-mation [46]. The directions are inferred from the values ofstandard Gibbs free energy change of formation for mostmetabolites and reactions, estimated from the structure ofthe metabolites [113].

    Inferring the presence of reactions from metabolitedata. Large-scale techniques have recently been developedto determine the metabolome of an organism, i.e., the set ofits detected metabolites. The techniques are grouped underthe term metabolomics. A good description of those

    currently used, with their advantages and limitations, hasappeared already in several reviews [13], [68], [122]. Thecatalog of metabolites thus produced provides additionalinformation about which compounds are really present

    inside an organism. Indeed, because of incomplete knowl-edge on substrate specificity, reconstruction from sequenceannotation gives only partial and approximate knowledgeon the metabolites participating in a metabolic network.

    Some methods suggest feasible biochemical reactionsfrom sets of metabolites. For instance, Arita uses hypothe-tical links (16 basic types) between compounds to inferbiochemical reactions even if they do not correspond to a

    known enzyme [7]. In the same way, Kotera et al. propose amethod to assign partial EC numbers from a set of substratesand a set of products [100]. Breitling et al.s approach, on theother hand, relies on the development of ultrahigh resolutionmass spectrometers and on the fact that only a limitedrepertoire of chemical transformations account for themajority of biochemical reactions within cells [16]. Thisenables to infer feasible biochemical reactions by computingaccurate mass differences between compounds and thenreferring to a table that provides correspondences betweenmass difference and chemical reaction. The main advantageof such a method is that no genomic information is required.

    The disadvantage is that it does not provide any informationabout the relations between genes, enzymes, and reactions.Refining metabolic reconstruction using biological

    experimental evidence. Mass spectrometry for the large-scale identification of the proteins in an organism enables toconfirm the presence of enzymes predicted from genomicannotations [183], [192]. In the same way, several other high-throughput techniques currently exist to define the metabo-lome, that is, the set of metabolites present in an organism[13], [50], [56], [122]. High-throughput phenotyping andgene expression data may also be used together with thepredictions of a computational model in order to refine ametabolic network [30]. At a smaller scale, physiological

    information may provide important additional clues tocomplete the set of reactions in a metabolic network. Forinstance, in the metabolic model of Streptomyces coelicolorbuilt by Borodina et al. [15], 89 percent of the reactions havean associated annotated gene while the remaining reactionswere included based on physiological evidence. Thepurification of an enzyme and study of its catalytic activitiesfollowing functional annotation or mass spectrometry mayalso further enable to precise its substrate specificity.

    The analysis of admissible fluxes in a network, togetherwith incorporated constraints that limit the networkbehavior with respect to feasible steady-state flux distribu-

    tions, has also been used in order to predict the phenotypicbehavior of the network as a function of such constraints. Ateach iteration, the phenotypic predictions computed bysuch an analysis are compared with experimental observa-tions. If the model predictions do not correlate with theexperiments, hypotheses about the functions of an organismare generated to expand the model [15], [38].

    2.3 Modeling Metabolism

    Metabolism can be studied from a structural or from adynamic point of view. Distinct models are used in eachcase [167], the most common being graphs, the so-calledconstraint-based models and differential equations. The

    latter is necessary for dynamic studies. Since this review isfocused on the structural analysis of metabolism, we mostlycover the first two types, with an emphasis on graphmodels. The reader interested in dynamic aspects of

    LACROIX ET AL.: AN INTRODUCTION TO METABOLIC NETWORKS AND THEIR STRUCTURAL ANALYSIS 7

  • 7/29/2019 Lacroix v 2008

    8/24

    networks and a representation by differential equations ofthe latter may refer to [171].

    2.3.1 Graphs

    Formally, a graph G is defined as a couple V ; E, where Vis a finite set of nodes, and Eis a set of edges that is a subsetof V2. Modeling a metabolic network with a graph meanschoosing which biological entities will be associated tonodes and edges. In the context of metabolism, biologicalentities may be compounds, reactions, or enzymes. We startby describing the graph models more frequently found inthe literature and discuss in what context they may

    introduce ambiguities.The question of which is the right graph model to choosedepends mainly on the type of question asked, whether, forinstance, the focus is on connectedness only, or if knowingthe exact pattern of connection is necessary.

    More commonly used graph models. A compoundgraph is a model of the metabolic network where the nodescorrespond to compounds and there is an edge betweencompounds A and B if there exists a reaction where one issubstrate and the other is product. In a reaction graph, nodescorrespond to reactions and there is an edge between tworeactions if there exists a compound that is produced by oneand consumed by the other. Enzyme graphs are also

    sometimes used. In this case, nodes correspond to enzymesand there is an edge between two enzymes if they catalyzereactions that share a compound. Using the enzyme graphmay lead to confusion in the structure of the network, i.e.,reactions catalyzed by the same enzyme are merged,thereby creating shortcuts between distant compounds(see Fig. 3).

    Nevertheless, this model may still be adopted if the focusis really on enzymes and on the relations between them, asin [74]. One should call attention to the fact that enzyme andreaction graphs are often mistaken for one another althoughthey are really different. Using a labeled graph (that is, a

    graph where nodes and edges may be labeled), enzymescan be introduced as reaction labels, either in the compoundgraph or in the reaction graph. This ensures that thestructure of the network is preserved.

    Compound and reactions graphs may sometimes beambiguous [36]. For instance, the two following networkslead to the same compound graphs, as indicated in Fig. 4a.

    One may resolve this ambiguity by adding reactions as

    edge labels. Another method consists in using a moreexpressive graph model: either a bipartite graph or ahypergraph.

    Most expressive models. Formally, a bipartite graph is agraph whose set of nodes V can be divided into two disjointsubsets V1 and V2 such thateach edge has a nodein V1 and theother in V2. In the context of a metabolic network, one of thesets corresponds to compounds and the other to reactions. Inthe case of the example of Networks 1 and 2 above, theambiguity would be avoided as shown in Fig. 4b.

    Another way of solving the ambiguity of simple graphsis to use a hypergraph. Intuitively, a hypergraph is a graphwhere the edges may link more than two nodes. Formally, ahypergraph H is a pair V ; E, where V fv1; v2; . . . ; vng isthe set of nodes, and E fe1; e2; . . . ; emg, with Ei V, fori 1; . . . ; m, as the set of hyperedges. Clearly, ifjEij 2, forall i 1; . . . ; m, then the hypergraph is a simple graph. Inorder to model a metabolic network with a hypergraph,nodes are usually associated to compounds and hyperedgesto reactions (see Fig. 4c).

    The models mentioned previously were undirected. Adirected graph is a graph where each directed edge, i.e., arcin mathematical notation, is an ordered pair of nodes, one ofwhich is the source node and the other the target node. Inthe case of a hypergraph, a directed hyperedge, or hyperarc,

    may have several sources and several targets (see Fig. 4d).Orientation of edges generally enables to model thedirection of a reaction. In the case where all reactions areirreversible (respectively reversible), the network will bemodeled as a directed (respectively undirected) graph. Inthe case where only some reactions are reversible, one mayuse a mixed graph (some edges are directed and some arenot). In several papers [49], the assumption is made that allreactions are reversible. Indeed, we can argue that in thepresence of an excess of product, even reactions that have afavored direction may occur in the opposite direction.

    Finally, one should notice that modeling a reversiblereaction with undirected edges may still be ambiguous

    (several networks may have the same representation), evenwhen using bipartite graphs. If we go back to the examplegiven in Fig. 4b, the same graph could have been obtainedusing the set of reactions: A $ B C, C$ D. Indeed, there

    8 IEEE TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 5, NO. 4, OCTOBER-DECEMBER 2008

    Fig. 3. Reaction graph and enzyme graph for the sets of reactions:

    R1: A B! C D, R2: D E! F G, R3: F G ! H I, andR4: I ! J K. E1, E2, and E3 are enzymes: E1 catalyzes R1 and R4,E2 catalyzes R2, and E3 catalyzes R3. (a) Reaction graph. (b) Enzyme

    graph.

    Fig. 4. (a) Compound graph for the sets of reactions. Set 1: A $ C,B$ C, C $ D. Set 2: A B$ C, C $ D. (b) Bipartite graph for theset of reactions: A B$ C, C $ D. (c) Undirected hypergraphcorresponding to the network: A B$ C, C $ D. (d) Directedhypergraph corresponding to the network: A B! C, C ! D.

  • 7/29/2019 Lacroix v 2008

    9/24

    is no indication in the graph on which compounds are onwhich side of the reaction; no difference is made betweenright compounds and left compounds. One way of solvingthis problem is to use edge labels. Such edge labels may alsohelp in avoiding to compute biologically meaningless paths,which link compounds that are on the same side of areaction (a substrate to another substrate for example). Wemay observe that labeled bipartite graphs and hypergraphs

    are strictly equivalent in terms of the quantity of modeledinformation. One may easily transform one into the other.

    Possible model simplifications. In all models presentedabove, it is commonly assumed that all reactions and allcompounds are equivalent. In practice, for a given reaction,some compounds may be considered as primary and othersas auxiliary. For instance, ATP and NADH are involved inmany reactions as, respectively, energy source and reducingagent. Considering them as regular compounds to build agraph model would lead to artefactually link a very largenumber of reactions.

    A classical way of dealing with this problem is to removefrom the network all compounds that participate in a large

    number of reactions [79], [109], [108]. This method hasseveral caveats: 1) one needs to heuristically define athreshold to determine how many metabolites to remove,2) some highly connected metabolites like pyruvate orfructose will be removed whereas it is commonly agreedthat they belong to the backbone of metabolism, and 3) evencompounds like ATP that could be safely removed frommost reactions should not be eliminated from those thatparticipate in their own synthesis.

    Another option is to remove compounds only in thereactions where they are involved as side compounds, i.e.,remove edges and not nodes of the graph [103]. The idea isto maintain the backbone of metabolism and to get rid of

    shortcuts. The distinction between side compounds andmain compounds for each reaction is, however, not alwaysavailable. It has been automatically generated in BioCyc,initially for drawing purposes [85]. KEGG maps do notrepresent all compounds either.

    Applying no treatment to such highly connected com-pounds may lead to misleading conclusions when comput-ing network measures as we shall see in Section 3.1.1.

    2.3.2 Constraint-Based Modeling

    The term constraint-based modeling was coined and,subsequently, adopted by the bioinformatics community,

    following two papers by Palsson [31], [127]. In a constraint-based framework, the network is modeled by its stoichio-metric matrix (see Section 2.2.3). This actually correspondsto a directed edge-labeled bipartite graph (or equivalently,hypergraph). In this case, the labels are the stoichiometriccoefficients of the compounds in the reactions, with signs toindicate if they are produced or consumed. Reversibility ofthe reactions is not handled within the matrix representa-tion but provided as a separate information.

    In constraint-based modeling, the focus is on thedistribution of mass fluxes through the reactions undersome constraints. The principal motivation is to analyze themetabolic capabilities of an organism. The main constraints

    that have been considered are: 1) steady state (everymetabolite that is produced has to be consumed) and2) thermodynamic constraints (irreversible reactions canonly be taken in the appropriate direction).

    In the following, we introduce the concept of a flux vector,denoted by v. A flux vector (or flux distribution) is anm-vector of the space of reactions 0; 8i 2 Irrev.

    They define a portion of the flux space represented by aconvex polyhedral cone containing all admissible fluxvectors. A flux vector is also called a mode.

    Given this framework, two main issues have beenaddressed. The first, which is called Flux Balance Analysis(FBA), is concerned with finding an admissible flux vector,

    which optimizes an objective function. The objectivefunctions that have been considered in the literature includebiomass or ATP production. FBA has many applicationsand has been shown to have good phenotypic predictivepower [42]. FBA may also be used to measure thephenotypic effects of complete or partial metabolic genedeletions and other types of perturbations on a system.Gene deletion studies are in general performed by con-straining the reaction flux(es) corresponding to the gene(s)(and associated proteins(s)) to zero and applying FBA onthe network. This approach assumes that the correspondingmutant system will display an optimal metabolic state, butas indicated in [159], knockouts probably do not possess a

    mechanism for immediate regulation of fluxes toward theoptimal growth configuration. Another flux-based analysismethod, called MoMA [159], was therefore developed thatis similar to FBA and based on the same stoichiometricconstraints but where the optimal growth flux for mutantsis relaxed. Instead, MoMA provides an approximatesolution for a suboptimal growth flux state, which is nearestin flux distribution to the unperturbed state. It thereforeinvolves a different optimization problem than FBA,namely distance minimization in flux space.

    When all admissible vectors are of interest, one may wishto find a set of vectors that can generate all of them. Thisproblem has been addressed several times and in different

    areas in the past [27]. Several similar concepts now coexist.The most widely used in the context of metabolic networksare the concepts of elementary modes [157], extremepathways [153], and minimal T-invariants [186]. The threeenable to investigate the space of all physiological statesthat are meaningful [168].

    Intuitively, an elementary mode is a special mode thathas the property of not containing any other mode. Moreformally, an elementary mode is a flux vector v that satisfiesconditions 1 and 2 above plus the following condition.

    3. There is no nontrivial admissible flux vector v0 suchthat Rv0 & Rv, where Rv fjjvj 6 0g, i.e., Rv

    is the set of reactions participating (with nonzeroflux) in v; Rv is called the support of v.

    Elementary modes have been said to represent a for-malized definition of a biological pathway. Indeed, a

    LACROIX ET AL.: AN INTRODUCTION TO METABOLIC NETWORKS AND THEIR STRUCTURAL ANALYSIS 9

  • 7/29/2019 Lacroix v 2008

    10/24

    biological interpretation can be given to such flux vectors:a mode is a set of enzymes that operate together at steadystate [156] and a mode is elementary when the removal ofone enzyme causes it to fail. Extreme pathways wereintroduced in the field by Schilling et al. [152] and areactually a subset of elementary modes. Both notions coincidein the case where all exchange reactions (reactions connect-ing a metabolite with the outside of a metabolic system) are

    irreversible. For a detailed comparison of the two, we referthe reader to [97]. As outlined in [158], the concept ofminimal T-invariants used in Petri Nets is also closely relatedto elementary modes, as is the notion of extreme currentsdefined by Clarke [24]. Petri nets may be seen as directedbipartite graphs with two types of nodes called places andtransitions. Places may contain tokens that are passed to otherplaces through transitions according to some local rules.Unlike extreme pathways and elementary modes, minimalT-invariants and extreme currents have only been defined inthe case of a network of irreversible reactions.

    Clearly, there are links between the algorithms forenumerating elementary modes and the ones for obtaining

    minimal T-invariants. Interested readers may refer to [27] asconcerns minimal T-invariants and to [58], [162], and [181]for elementary modes. More generally, the usefulness ofPetri-Net approaches to the study of metabolic pathways ispresented in [186]. As discussed in [118], counting andenumerating such objects are hard algorithmic problems. Inthis case again, applications are numerous and range frommodel validation [72] to prediction of gene expressionpatterns [168].

    An elementary mode may be seen as a set of reactionsthat, when used together, performs a given task. One maybe interested instead in determining a set of reactions oneneeds to inhibit to prevent a given task, usually called targetreaction, from being performed. This leads to the concept ofa minimal (reaction) cut set that was recently introduced byKlamt and Gilles [95]. As mentioned in a latter paper [94],the task to be silenced can also be a combination ofreactions. There is a strong link between reaction cut setsand elementary modes since the first have been operation-ally defined as corresponding to a set of reactions whosedeletion from the network stops each elementary mode thatcontains the target reaction(s).

    Readers interested in constraint-based modeling andwishing to have more details and examples on these conceptsare invited to read [97] and the references therein indicated.

    3 TOPOLOGICAL ANALYSES

    3.1 Synthetic Graph Measures

    Once a metabolic network is modeled as a graph, classicalmeasures from graph theory may be used to check whetherthey provide any further insight into the structure andgeneral characteristics of the network. Several such mea-sures have been applied to metabolic as well as to othertypes of biological networks. In the network analysistoolkit, one may thus find a number of measures such as:degree distribution, average internode distance (averageshortest path length), closeness centrality, diameter, cluster-

    ing and assortativity coefficients, node- and edge-between-ness centrality, and synthetic accessibility.

    The degree of a node i, denoted by ki, is the number ofedges linking it to other nodes of the graph. The distance

    between two nodes i and j is the length of a shortest pathbetween these two nodes (the path may be not unique).Average internode distance of a given network, i.e., averageshortest path length, is the mean of the distance for all pairs ofnodes. Closeness centrality is the mean distance between agiven node i and all other nodes reachable from it. Thediameter of a network is the maximum distance between anypair of nodes in the network. The clustering coefficient of a

    node i is the proportion of existing edges over all those thatare possible between the neighbors of this node. Itquantifies how close from a clique is the subgraph inducedby i and its adjacent nodes, where a clique is a maximalcomplete subgraph (each pair of nodes is connected by anedge). The node-betweenness centrality, or betweennesscentrality for short, of a node i, is the proportion, over allshortest paths between every pair of nodes in the network,of those that contain i. Edge-betweenness centrality issimilarly defined by replacing node i by edge e. Both are ameasure of the centrality of a node in a graph. Theassortativity coefficient, denoted by ac, is defined by

    ac

    Pkekk akbl

    1 P

    k akbl;

    where ekl is the fraction of edges from a node of degree k toa node of degree l, ak

    Pl ekl, and bl

    Pk ekl. It tells in a

    concise fashion how nodes of different degrees arepreferentially connected among themselves.

    All the above measures may be applied to either areaction or compound graph representation of a metabolicnetwork. A further measure, called synthetic accessibility[199], is more specific of metabolites. The syntheticaccessibility Sm of an output metabolite m is the minimal

    number of metabolic reactions needed to produce m from aset of compounds defined as the inputs of a network.Synthetic accessibility defined this way is a generalizationof graph diameter for directed branching chemical reactionsin an input-output transport network [199]. It has been usedto predict the viability of knockout strains with an accuracythat appears to be comparable to FBA on large, unbiasedmutant data sets [199].

    We now comment on two main works where some of theabove synthetic structural measures, namely degree dis-tribution and diameter, were applied to metabolic networksin an attempt to infer biological meaning. The correspond-ing papers have met with an open success in the

    bioinformatics community mainly because they propose torelate complex processes to simple measures. We try todiscuss the limitations of these approaches. One generalword of caution is called for already as concerns the use ofsynthetic measures: as stated in [67], such measures aretruly informative only when the network lacks a modularstructure, or when it is modular but all modules arehomogeneous in terms of the mechanisms that originatedthem and their properties. If neither of these two conditionsis fulfilled, then any theory proposed may need to take intoaccount the modular structure of the network. Modularity isdiscussed in Section 3.2.

    3.1.1 Small-World NetworksThe first work on the use of synthetic measures to analyzebiological networks we comment on is based on the conceptof small worlds. Historically, the small-world phenomenon

    10 IEEE TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 5, NO. 4, OCTOBER-DECEMBER 2008

  • 7/29/2019 Lacroix v 2008

    11/24

    was first described by Stanley Milgram in the 1960s in aseries of experiments where he chose individuals in citiesacross the US and asked them to send a letter to anotherunknown person on the opposite coast through a chain ofpeople they knew on a first name basis [180]. The results ofthese experiments enabled him to estimate the average pathlength of the considered social network. Surprisingly, thisvalue was very small (around 6), which later led to the

    famous expression 6 degrees of separation.The concept of small-world network was further for-

    malized by Watts and Strogatz [196] who gave a construc-tion method for such networks and provided evidence thatmany types of networks indeed fulfill the properties of asmall world, that is, their diameter increases logarithmicallywith the number of nodes. The construction proceeds asfollows: The starting point is a regular network withn nodes and k edges per node. Each edge is then randomlyrewired with probability p. The construction method allowsto adjust between regularity p 0 and disorder p 1.The authors showed that it is the addition of edges betweendistant nodes (shortcuts) that causes a disruption in the

    diameter of the graph and therefore yields the small-worldproperty (low diameter, high clustering coefficient). It isworth observing in passing that this construction explainshow to obtain a network where there exists, as in Milgramsexperiment, short chains of acquaintances linking togetherarbitrary pairs of strangers, but it does not explain whyarbitrary pairs of strangers should be able to find shortchains of acquaintances that link them together. A moreoperational definition that provides a better explanation forthis was proposed by Kleinberg [98], who showed that, inaddition to having short paths, a network should containlatent structural cues that can be used to guide a messagetoward a target. Not all small-world networks possess suchcues and Kleinberg therefore captured another componentof Milgrams experiment that was not evidenced by Wattsand Strogatzs work.

    In the case of metabolism, Fell and Wagner [49] andWagner and Fell [189] proposed soon after Watts andStrogatzs paper appeared that the metabolic network of thebacterium Escherichia coli satisfies the properties of a small-world network. The authors further argued that this type ofarchitecture enables to minimize the transition timesbetween metabolic states and thus contains clues as to theevolutionary history of metabolism. In 2004, Arita [8]suggested however that the graph model used by Fell and

    Wagner was not realistic enough. In Aritas model, thecompounds themselves are represented as graphs wherenodes are atoms and edges are chemical bonds. The pathsare then calculated as atom trajectories between carbonatoms instead of complete metabolites (which enables tomore accurately represent mass transfer). With this model,Arita showed that the network diameter is much larger thanestimated previously, thereby questioning the biologicalinterpretation of this network measure. The same argumentwas advanced by Alm and Arkin [3] who pointed to thefact that the concept of small-world networks tends tooverlook the stoichiometry inherent to biochemical reac-tions. Indeed, the edges that enable to reduce the path

    length in metabolic networks can often be explained bycofactors that connect seemingly unrelated reactions. Yet,another way of dealing with paths in metabolic networks isto assign weights to the nodes according to their degree (see

    Section 3.1) and to look for lightest paths [32]. This heuristicwas shown to bring better results (at least in terms ofprediction of known metabolic pathways) compared to thesituation where frequent compounds are removed.

    Clearly, however, the notion of path in not very wellsuited to metabolic networks and one should prefer thenotion of balanced hyperpath as defined in constraint-basedmodeling (Section 2.3.2).

    3.1.2 Scale-Free Networks

    Another notion has become even more popular in thebiological network literature than the small-world property.This is the notion of scale-freeness. The term wasintroduced in 1999 by Barabasi and Albert [11] to qualifya large diversity of networks whose degree distribution isbelieved to deviate from the classical Poisson distributionthat is expected in random graphs and is instead betterapproximated by a power-law distribution. The types ofnetworks studied by Barabasi and Albert in their 1999science paper included genetic, nervous, and social net-works as well as the Web. It was further proposed that,

    because of this property, such networks would be errortolerant and robust to random attacks [142]. In the followingyears, other biological networks, including metabolic [79],were suggested by Barabasi et al. to belong to this verygeneral class of networks. This same idea was then taken upand the property repeatedly exhibited by a number of otherauthors. The reader may consult Khanin and Wit for furtherreferences [91].

    In order to explain more formally what a scale-freenetwork is, let us first consider the function pk, whichgives the probability for a randomly chosen node to havek edges connected to it. A network is said to be scale-freeif for all k1, k2, the ratio pk1=pk2 is invariant by

    multiplication of k1 and k2:

    pk1

    pk2

    pk1

    pk2 F

    k1k2

    ;

    where is a positive constant, and F is the rescalingfunction. One can demonstrate that this property is true ifand only if the probability functionpk follows a power law,i.e., pk / k, where is the power-law exponent [139].

    One of the characteristics of scale-free networks (whichdifferentiates them from Erdos-Renyi graphs for which thedegrees of the nodes follow a Poisson distribution) is that afew nodes have many connections while a large number of

    nodes have very few connections. Highly connected nodesare sometimes called hubs and are thought to play aparticular role in the network. For instance, in [80], theauthors argued that, in the context of protein interactionnetworks, hubs correspond to essential proteins in the sensethat mutants lacking this protein have lethal phenotype.Scale-free networks satisfy also the small-world property.

    Recently, several works started to question the results ofBarabasi et al. This questioning may be classified into fourmain categories.

    The first concerns the quality of the data modeled intothe networks. It has been argued that the high rate oferrors in the data used to build the networks create

    artefactual links that invalidate or at least weaken some ofthe results obtained. This concerns either the scale-freeproperty or the existence of hubs and their enrichmentwith essential nodes [4], [29]. This observation has been so

    LACROIX ET AL.: AN INTRODUCTION TO METABOLIC NETWORKS AND THEIR STRUCTURAL ANALYSIS 11

  • 7/29/2019 Lacroix v 2008

    12/24

    far applied more to PPI networks than to metabolicnetworks. Incompleteness of data is, however, a problemthat concerns metabolic as much as other types ofbiological networks. Stumpf et al. [169] thus noticed thatthe networks considered in the literature, including formodeling metabolism, usually correspond to partial datareflecting our incomplete knowledge of the studiedprocesses. They therefore asked the legitimate question: if

    the networks we observe are scale-free, does it imply thatthis is also true for the larger networks from which theyare extracted? Their answer to this question is no.

    The second category of questioning is methodological.Two main works [91], [169] have thus contested theproposition that the biological networks available andcommonly used in the literature are scale-free. Indeed, thetest that is classically applied to show that the degrees of agraph are drawn from a power-law distribution consists infitting a straight line on a log-log plot. However, as stated in[91], fitting a straight line does not necessarily make thepoints follow it. More formal methods are required toascertain that the scale-free distribution is indeed the best

    model for the data observed. Khanin and Wit [91] showedthat several networks considered as scale-free using theelementary fit-to-a-line test were no longer considered assuch when applying a maximum likelihood method andchi-squared goodness-of-fit tests. They suggested otherdistributions have similar qualitative properties as observedin biological networks, namely a few nodes with a highnumber of connections and many with few connections.These distributions include truncated power-law (whichhowever they indicated exhibits also a poor fit to the data),generalized Pareto law, stretched exponential distribution,geometric random graph, geometric distribution, or combi-nations of the above [91]. By applying statistical modelselection methods using maximum likelihood inference,composite likelihood methods, the Akaike informationcriterion, and goodness-of-fit tests, Stumpf et al. [169]obtained that the degree distribution of present-day meta-bolic networks appear to be better described by a lognormalmodel (ifY is a random variable with a normal distribution,then X expY has a lognormal distribution). Like thescale-free distribution, the lognormal, as well as othersmentioned by Khanin and Wit or tested by Stumpf et al., isfat-tailed, but they are not scale-free.

    The third category of questioning the scale-freeness or ofother general properties of metabolic networks is more

    deeply rooted in biology. Indeed, as observed by Alm andArkin [3], one important characteristic of metabolism isforgotten by all purely topological studies such as the nodedegree distribution: this is that each node has a specificidentity, which furthermore is usually distinct from theidentity of other nodes. The node degree distributioncaptures therefore only a very small part of the realcharacteristics of biological networks. As advanced by Almand Arkin, whereas the Internet might function similarly ifindividual nodes were rewired while keeping the sameoverall topology, metabolic reactions are highly specific andedges cannot, in general, be swapped because of additionalconstraints such as conservation of mass. They further

    mention a comment by J. Doyle that biological networksmight perhaps more accurately be considered as scale richbecause they are composed of many nodes of different typesorganized into modular and hierarchical structures.

    Finally, at a more conceptual level, it can be argued, asKeller did [87], that even if a network is indeed proven to bescale-free, knowing this would not be very informativesince this class is possibly too general. Therefore, it seemsthat there is a need for finer measures to reach a morerelevant biological interpretation.

    The works on small-world and scale-free networksillustrate the initial enthusiasm in the field for general

    results on the structure of biological networks. Bothhave now been shown to have limited impact on ourunderstanding of metabolism. It seems that either themeasures or the interpretation we make of these measureshave to be changed or further elaborated to be useful whentrying to make sense out of the structure of metabolism. Thecomments of Alm and Arkin [3] point out also to crucialproblems for testing any of the hypotheses that have beenadvanced concerning the global structural properties con-sidered in this section but also any of the issues that weshall discuss next: what is a good null graph model, andtherefore, what are the properties that we expect by chance?Despite an abundant literature, this is a question that seems

    to remain widely open.

    3.2 Modularity

    Wagner et al. informally define modularity as an abstractconcept that seeks to capture the various levels and types ofheterogeneity found in organisms [191]. The authors arguethat different kinds of modules may be distinguished. Allhave, however, in common the fact that they are integratedwith respect to a certain kind of process (such as, forinstance, natural variation, function, development, and soon) and should be relatively autonomous from otherprocesses or parts of the organisms to which they belong.All ideas of modularity appear therefore to refer to a patternof connectedness in which elements are grouped into highlyconnected subsets, the modules, which are more looselyconnected to other such groups. The notion of connectionshould not be understood here in the strict sense of physicalinteraction but rather for the reason that elements areconsidered as connected when they belong to a commonprocess. The independence of modules may be spatial ortemporal, chemical or genetic, structural or dynamic [198].

    Modularity has been perceived both as a fundamentalnotion (deciding if a network is modular is informativeper se) and as an operational notion (decoupling a systeminto independent modules may be computationally and

    formally very efficient). No really clear formal definition ofthis concept has been reached.Most biologists do, however, agree on the fact that

    modularity is present in biology. This is obvious already ata high level [171]. Organ transplants are an example of this.In the same way as organs can be considered as themodules of an organism, an organism may also be seen asthe unit of a population. At a lower scale, organs may bedecomposed into cells composed of molecules, which inturn may be further decomposed into modules such asprotein domains. At the intermediary level between cellsand molecules, the picture becomes, however, less clear.Nevertheless, several authors have argued in favor of

    modularity at all levels, including the cellular level [198].Several examples of cellular modules can be given[70]DNA replication, glycolysis, protein synthesissomeof which have been successfully reconstructed in vitro,

    12 IEEE TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 5, NO. 4, OCTOBER-DECEMBER 2008

  • 7/29/2019 Lacroix v 2008

    13/24

    which in itself represents an excellent validating criterionfor modularity. From a different perspective, the very factthat phenotypic evolution can be studied on a character bycharacter basis lead Wagner to plead in favor of theexistence of genetically independent modules [190], forwhich the term evolutionary module was later coined. Howmodules originated however, for instance did they arisethrough the action of natural selection or because of biased

    mutational mechanisms, remain largely open questions.In the context of metabolism, several methods have been

    applied to identify modules, using different operationaldefinitions.

    3.2.1 Top-Down Identification

    Pathways. A natural way to define modules in the contextof metabolic networks is to build on preexisting conceptslike the one of metabolic pathways. Indeed, glycolysis hasbeen shown to be reconstituted in vitro, thereby outliningits independence with respect to the rest of metabolism.One limitation of this approach is that the concept ofmetabolic pathway has itself never been defined formally.

    Instead, the partition of a network into pathways is partlydue to historical reasons, related to the way metabolism wasdiscovered and therefore reflects one subjective view ofmetabolism.

    Attempts at formalizing the notion of a metabolicpathway have however been made. Yamada et al. [200],for instance, define a pathway module to be comprised ofenzymes that have the same phylogenetic profiles and arealso close to each other in the metabolic network. Thephylogenetic profile of an enzymatic gene is represented bya string that encodes its presence or absence in everyavailable fully sequenced genome.

    The concept of elementary mode has also been proposedas a formalized definition of a metabolic pathway [156] andis therefore another good candidate for defining the conceptof module. Indeed, an elementary mode can be interpretedas a minimal set of enzymes that operate together at steadystate. Unfortunately, the exploding number of elementarymodes (more than half a million for a network of around100 reactions [96]) strongly limits their usage as anoperational definition. This framework may still be usefulif employed, for instance, together with the notion ofcoupled reactions [18], also named cosets [128], i.e., reactionsthat participate in the same elementary modes. Moreformally, coupled reactions are defined as reaction pairs

    such that the fluxes passing through them, respectively v1and v2, share any of the following types of coupling [18]:

    1. Directional coupling v1 ! v2, if a nonzero flux forv1 implies a nonzero flux for v2 but not necessarilythe reverse.

    2. Partial coupling v1 $ v2, if a nonzero flux for v1implies a nonzero, though variable, flux for v2 andvice versa.

    3. Full coupling v1 , v2, if a nonzero flux for v1implies not only a nonzero but also a fixed flux for v2and vice versa.

    Connectivity-based definitions. Besides the notion of a

    pathway, various other methods were developed to identifystructures in networks that would be densely connectedwithin the module and loosely connected to the rest of thenetwork. Many are inspired by longstanding mathematical

    work on modular graphs and graph decomposition. Greatcare must be taken however as the terms module anddecomposition in particular are often employed in adifferent and sometimes looser sense in the bioinformaticscommunity. For ease of reference, we chose to follow theterms as used in the bioinformatics community but do callattention to the potential confusion this may create. It is aninteresting issue whether more attention should not be paid

    to even seemingly unrelated mathematical theory onmodularity and decomposition when trying to define andidentify modules in biological networks.

    In the bioinformatics effort to formalize the notion ofmodules in graphs, one may distinguish between two mainsituations: all nodes are classified in at least one module;some nodes may remain unclassified. The first case leads tothe so-called network-decomposition approaches, the sec-ond to methods for module detection.

    Concerning module detection, Spirin and Mirny [166]proposed a method to detect modules in protein interac-tion networks (nodes represent proteins and edgesrepresent interactions between proteins) that could be

    applied also to a metabolic reaction or compound graph.A module is defined by the authors as a dense subgraph.The density of a subgraph is given by the functionQm; n 2m=nn 1, where m is the number ofinteractions between the n nodes of the subgraph. Astatistical criterion then enables to decide if the valuetaken by Q is exceptional. The null model considered is arandom graph model where the degree sequence is thesame as in the studied graph. One may observe that inorder to be able to deal with large subgraphs, the authorsuse heuristics (local search, simulated annealing) for thecounting as well as approximations and simulations for

    the statistical part. Indeed, the problem of searchingfor the heaviest induced subgraph has been shown to beNP-hard [145], [99]. A problem is said to be NP (fornondeterministic polynomial time) if given a solution, thiscan be checked for correctedness in polynomial time.Informally speaking, a problem is said to be NP-hard ifan algorithm for solving it can be translated (in poly-nomial time) into one for solving any other NP-problem.In other words, a problem is NP-hard if it is as at least ashard as any NP-problem. A problem that is both NP andNP-hard is called an NP-complete problem.

    Most network decomposition methods yield nonoverlap-ping modules, i.e., modules constitute a partition of the

    network. Elementary modes are one notable exception tothis. Graph partitioning is a method widely used, forinstance in parallel computing (for a review, see [53]). In itssimplest formulation, graph partitioning consists in separat-ing the nodes of a graph intop disjoint subsets of similar size,while minimizing the number of edges between subsets.This problem is NP-complete even ifp 2 [60]. A limitationfor the applicability of graph partitioning to other fields isthat the number of modules has to be known in advance,which is not the case when considering metabolism.

    Other methods initially developed for sociology arecommonly used when the number of modules is not known[195]. The problem still consists in separating the nodes of a

    graph into disjoint subsets but the criterion to minimize isnot anymore the number of intergroup edges. Indeed, inthis case, the optimal solution would be a unique modulecontaining the whole network. Instead, the criterion to

    LACROIX ET AL.: AN INTRODUCTION TO METABOLIC NETWORKS AND THEIR STRUCTURAL ANALYSIS 13

  • 7/29/2019 Lacroix v 2008

    14/24

    optimize, termed modularity function, is the sum for allmodules of the difference between the number of intramo-dule edges and the number of intramodule edges expectedunder a null model [120]. The random graph modelgenerally used is again one that preserves the degreesequence of the nodes. In practice, the number of observededges is given by the adjacency matrix of the graph, and thenumber of expected edges is given by

    kikj2m , where ki

    (respectively kj) is the degree of node i (respectively j)and m is the number of edges in the graph. Finding thepartition that optimizes this criterion is a difficult problem.Several heuristics have been proposed. Guimera andNunes Amaral [66] use a simulated annealing approach,which they then apply to the metabolic network ofEscherichia coli. The modules identified (the method finds19) are then compared to the metabolic pathways as definedin KEGG. In some cases, modules can be given a generalfunction. More recently, the same type of approach wasused by Parter et al. [130] to analyze the relationshipbetween modularity in the metabolic network of bacteriaand variability in their environment. On the methodologicalside, Newman [121] introduced a matricial formulation ofthe graph partition problem, which enables to use moreefficient techniques from spectral algorithmics, whileDaudin et al. [35] cluster nodes using a method based onmixture degree distributions to obtain modules differentlydefined from those of Newman and Guimera.

    Two modular decomposition methods based on informa-tion provided by the stoichiometric matrix have been morerecently proposed [136], [205]. In [205], Yoon et al.attempted to identify related modules by applying analgorithm for top-down partitioning of directed graphs withnonuniform edge weights [205]. The weights are deter-mined by the metabolic flux distribution and require toperform FBA on the network. In [136], the idea was toestablish a distance between any two pairs of reactions iand j. The distance chosen is a Pearsons correlationcoefficient between the fluxes carried by i and j for allpossible steady states of the system. Obtaining suchcorrelation coefficients is done by computing the null spaceof the stoichiometric matrix of the metabolic network. Usingsuch distances, a hierarchy is then constructed where themetabolic system is represented by a tree whose root nodecorresponds to the whole system, leaf nodes to thereactions, and intermediate nodes to unique subsystems ofreactions. Cutting the tree at any given level producesmodules of a certain granularity.

    Finally, more ad hoc methods with no clearly specified

    methodology have suggested that metabolic networks areorganized in a hierarchical modular [144] or bow-tie nested[206] way. The perception in this case is that metabolicnetworks present a highly modular core-periphery struc-ture, in which the core modules are tightly linked togetherand perform basic metabolic functions, whereas theperiphery modules interact with only a few other modulesand accomplish relatively independent and specializedfunctions. This reflects the giant strong component ideaearlier described by Ma and Zeng [109], [110].

    It seems clear that further work in the field of moduleidentification should include a more thorough discussion ofthe modularity function and especially of the null model

    considered, suggestions for improving the optimizationprocedure to avoid local optima, and propositions ofalternate ways for module validation. More generally, allmethods presented here have been defined for simple

    graphs, new methods that would apply to bipartite graphsor hypergraphs probably need to be developed.

    Motifs. Close to the concept of module is the one ofmotif. Both are thought to be building blocks of a network.Mostly for computational reasons, only small motifs havebeen studied, but the difference between the concepts ofmodule and motif is not so much a question of size as ofdefinition. A motif is thus generally not required to be

    autonomous but rather repeated (reused).Motifs have first been introduced in the context of generegulatory networks where they have been defined aspatterns of interconnections, more formally isomorphicsubgraphs that appear unexpectedly often in a network[163]. Once ag