Annual Plant Reviews Volume 43 (Biology of Plant Metabolomics) || Data Integration, Metabolic Networks and Systems Biology


Annual Plant Reviews (2011) 43, 261–316 http://onlinelibrary.wiley.com doi: 10.1002/9781444339956.ch9

Chapter 9

DATA INTEGRATION, METABOLIC NETWORKS AND SYSTEMS BIOLOGY

Henning Redestig1, Jedrzej Szymanski2, Masami Y. Hirai1, Joachim Selbig2, Lothar Willmitzer2, Zoran Nikoloski2 and Kazuki Saito1

1RIKEN Plant Science Center, Yokohama-shi, 17-2-2 Tsurumi-ku, Suehiro-cho, 230-0045, Japan
2Max Planck Institute for Molecular Plant Physiology, Am Muhlenberg 1, 14476 Golm, Germany

ABSTRACT As analytical techniques and data pre-processing methods continue to improve, the bottleneck of metabolomics is shifting towards later stages of data analysis and biological interpretation. High-coverage metabolomics is only possible when combining data from multiple platforms, necessitating efficient methods for data integration. Metabolomic data sets with high coverage provide a unique opportunity to estimate and study metabolic networks. Once established, these networks can provide a backbone for systems biology approaches where the aim is to construct fundamental models of metabolic regulation. In this chapter, we provide an overview of the status of these topics and describe current methods and tools, with their drawbacks and advantages, for integrative plant metabolomics.

Keywords: xc-ms; multi-platform metabolomics; combined profiling; network analysis; metabolic regulation

9.1 Introduction

One of the characteristic features of metabolomics, common to all ‘omics’ technologies, is that data integration and subsequent network analysis are major issues of the research. For data integration, there are two layers in terms of data components: intra-metabolomics integration (how we can integrate metabolome data from different analytical platforms) and inter-omics integration (how we can integrate transcriptome, proteome and metabolome). These topics will be discussed in Sections 9.2 and 9.3, respectively. A major issue of metabolic networks relates to how they can be estimated from experimental data; this topic will be dealt with in Section 9.4. All these studies aim towards systems biology, where the issue is how we can establish mathematical models based on metabolome data and how to use them to better understand metabolic regulation; this is discussed in Section 9.5. These studies rely on mathematics and bioinformatics as the main technologies, which may not be familiar to most plant biologists. This chapter schematically describes these issues of data integration and subsequent network analysis leading to systems biology.

Annual Plant Reviews Volume 43, Biology of Plant Metabolomics, First Edition. Edited by Robert Hall. © 2011 Blackwell Publishing Ltd. Published 2011 by Blackwell Publishing Ltd.

9.2 Combining multiple metabolomics platforms

The central idea behind omics approaches is that a biological system can only be understood when it is observed in a holistic manner. Therefore, the ambition of genomics, transcriptomics and metabolomics is to profile every available gene, transcript and metabolite in order to build a comprehensive molecular picture of the studied system. With the advent of digital expression profiling (Brenner et al., 2000), transcriptomics now joins genomics in the group of technologies that can give near-complete profiles, that is, true omics data. Metabolomics is unfortunately still far from this stage.

Metabolite profiling depends on the separation of a wide range of chemical compounds and is, therefore, technically more challenging than transcriptomics and genomics, which only measure a single type of molecule. The chemical complexity and very wide concentration range of the metabolome make it impossible for any currently available platform to give unbiased coverage (Lenz & Wilson, 2007). However, the toolbox available for chemical profiling is very large, with each technology having its own advantages and disadvantages with respect to experiment throughput, sampling and coverage. Therefore, a recent trend within the field of metabolomics is to combine multiple platforms in order to obtain more complete data sets.

Parallel profiling not only sets high demands on optimized experimental procedures (t’Kindt et al., 2009) but also raises critical questions of how to analyze the generated data sets. In the following sections, we will look at how data sets from different platforms can be stitched together to allow for intra-integrative metabolomics. This discussion will then be extended towards inter-integrative omics in Section 9.3. We will exclusively deal with data analysis where pre-processing steps have already been accomplished and a single-platform approach would have arrived at a finalized data set. We focus on applications where one not only attempts to perform classification and biomarker discovery but also aims at providing a biological interpretation of identified patterns.

Data analysis strategies are often easier to understand when considering an actual application, and therefore, most methods will be discussed by referring to an example data set. The data we will use come from a study of different cultivars of tomato, Solanum lycopersicum, which was performed using a combination of gas chromatography (GC) mass spectrometry (MS), liquid chromatography (LC) MS and capillary electrophoresis (CE) MS, and therefore, the discussion will be geared towards these platforms. However, the questions of data integration and interpretation are general, and therefore, we aim to be largely platform independent.

Table 9.1 The main classes of platforms that are currently used for performing metabolomics

Platform | Advantages | Disadvantages | Classes of compounds
H-NMR | Rapid, potentially non-destructive | Low sensitivity, convoluted | Unbiased
GC-MS | Sensitive, robust | Requires derivatization steps | Mainly volatile
CE-MS | Rapid, high resolution | Immature, low reproducibility, lack of comprehensive libraries | Charged
LC-MS | Wide coverage | Matrix effects, lack of comprehensive libraries | Non-volatile
FTICR-MS | High resolution | Low coverage of small molecules, expensive | Largely unbiased

9.2.1 Current applications of multi-platform-based metabolomics

There exists a plethora of different analytical techniques that can be used to perform large-scale measurements of metabolite abundances. The performance of different platforms may also be optimized towards a particular task by tuning the experimental protocols, leading to an impressive array of different techniques; see Table 9.1 for an overview of the main technologies. What is common to all technologies is that they are unable to give both chemical unbiasedness and high sensitivity. Colour Plate 9.1 shows a schematic map of MS-based platforms and their coverage of different types of molecules.

Nuclear magnetic resonance (NMR), and particularly H-NMR, was the first technique to be used for large-scale metabolic measurements (Lenz & Wilson, 2007). H-NMR has several advantages such as requiring little or no sample preparation, high reproducibility and unbiasedness towards classes of chemical compounds. However, crucial drawbacks of NMR-based systems are their low sensitivity and the difficulty of metabolite identification. These problems are less pronounced in MS-based systems (Dettmer et al., 2007). The most widely used inlet for MS is GC, which results in systems with high sensitivity and throughput but depends on derivatization techniques to measure non-volatile compounds such as sugars, nucleosides and amino acids. The combined advantages of H-NMR and GC-MS have led several authors to use these two platforms in parallel and thereby obtain an improved coverage of the metabolome of rat plasma (Williams et al., 2006), zebrafish livers (Ong et al., 2009) and melon (Biais et al., 2009).

LC-MS is rapidly gaining popularity because of its high sensitivity and detection capabilities and is able to profile many plant secondary metabolites as well as lipids and phospholipids. The potential and complementarity of H-NMR and LC-MS were shown by Moco et al. (2008) when profiling tomato and in Ong et al.’s study of zebrafish livers. However, LC-MS is not chemically unbiased and typically cannot resolve important compounds such as sugar phosphates. Therefore, combining LC-MS with GC-MS is a viable option for increased coverage, as was shown by Ni et al. (2007) and Buscher et al. (2009).

It is important to note that a platform is defined not only by its separation and detection system but also by the experimental protocols used. The choice of extraction procedure can have a strong impact on final coverage. Notably, t’Kindt et al. (2009) found that the number of detected LC peaks increased twofold when using a MeOH-based extraction procedure compared with using chloroform. Optimized usage of GC-MS and LC-MS can indeed give very impressive coverage of the metabolome. Van der Werf et al. (2007) collated a metabolite list for Escherichia coli, Saccharomyces cerevisiae and Bacillus subtilis comprising 905 different compounds. Then, by examining the physiochemical properties, they developed a platform comprising a total of six configurations of LC-MS and GC-MS. The resulting multi-platform could measure 96% of the metabolites that could be obtained as commercial standards (399 in total).

CE is a formidable technique for separating charged molecules, and when used as an MS inlet, it can cover many of the biologically very important ionic metabolites such as NAD+/NADH. CE-MS has been used together with GC-MS to profile drought stress in Arabidopsis thaliana (Urano et al., 2008), but its full potential in multi-platform metabolomics has yet to be realized. Especially for plants, with their rich content of secondary metabolites, a multi-platform approach employing CE-MS will also benefit from using LC-MS and GC-MS.

9.2.2 Our example data set

The data set we use for demonstration comes from a metabolomics study comparing two miraculin-overexpressing lines (cultivar Moneymaker) and six other tomato cultivars in their red ripening stage. All samples were measured in six biological replicates. The current data set consists of metabolite profiles from GC-MS, CE-MS and LC-MS; see Table 9.2 for an overview and Figure 9.1 for principal component analysis (PCA) score scatter plots of the individual data sets. The focus in this part of the chapter is on the methodological aspects of the data analysis, and therefore, we will treat the experiment as a generic comparison, refraining from any biological discussion.



Table 9.2 The number of features in the example data set

Platform | Metabolites | Peaks
CE | 52 | 857
LC | 58 | 412
GC | 105 | 263
Combined | 169 | 1478

Note: The number of metabolites refers to the unique number of identified metabolites, and peaks refers to all peaks.

9.2.3 Analysis of multi-platform data sets

Our discussion of multi-platform metabolomics data analysis starts after data acquisition and initial pre-processing have taken place. Here, we are faced with a data matrix from each analytical platform with an estimated abundance level for each metabolite (columns) in each sample (rows). The goal of the data analysis obviously depends on the experimental design but in general involves a combination of one or more of the following tasks:

• Identification of metabolite responses to applied treatments.
• Classification of unknown samples to different biologically relevant groups.
• Extraction of correlation structures between metabolites and unknown biological factors in order to learn more about the nature of the biological samples.
• Identification of metabolite–metabolite correlations to extract information about metabolite regulation.

[Figure 9.1: three PCA score scatter plots, one panel each for CE, GC and LC, with axes PC1 and PC2 and samples labelled by cultivar or line (MM, 56B, 7C, A1, MT).]

Figure 9.1 PCA score plots of the example data sets from GC-MS, CE-MS and LC-MS. The main patterns related to the different cultivars can be seen in all data sets even though they profile different metabolites.

There exists a plethora of different tools and algorithms for accomplishing these tasks, but very few of them can handle more than one data matrix at the same time (not counting the response or experiment design matrix). One could analyze each data set on its own and then summarize the findings by comparing the results either manually or by using an appropriate statistical framework. For pure classification problems, where we are only interested in predicting class membership, decision techniques such as ensemble classifiers or voting schemes can be used to combine the results in a process called high-level data fusion (Steinmetz et al., 1999; Roussel et al., 2003). However, for applications where the goal is to interpret the metabolite levels from a biological perspective, there are several shortcomings with analyzing each data block independently:

(i) Dependency patterns between metabolites measured on different platforms cannot be detected unless they are analyzed together.

(ii) Certain metabolites can be measured on multiple platforms and may, therefore, be present in more than one data set. Redundancy will bias the analysis towards finding changes related to the multiply measured metabolites.

(iii) Results from different platforms may be contradictory, making it difficult to draw a consensus conclusion.

These issues make it preferable to integrate, or fuse, the data sets, and there are two main strategies for how to do this: mid- and low-level data fusion; see Figure 9.2 for an overview. Both have been used previously for metabolomics studies that also aim to interpret the data from a biological point of view (Smilde et al., 2005; Ni et al., 2007); the following sections will treat these methods further.
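The voting scheme mentioned above for high-level data fusion can be illustrated with a minimal sketch. It assumes that a separate classifier has already produced one class label per sample on each platform; the function name and the tie-breaking rule are illustrative choices, not taken from the cited studies.

```python
from collections import Counter

def majority_vote(predictions_per_platform):
    """Fuse per-platform class predictions by simple majority voting.

    predictions_per_platform: dict mapping a platform name to a list of
    predicted labels, one per sample, all in the same sample order.
    Ties go to the label encountered first (an arbitrary choice).
    """
    platforms = list(predictions_per_platform)
    n_samples = len(predictions_per_platform[platforms[0]])
    fused = []
    for i in range(n_samples):
        votes = Counter(predictions_per_platform[p][i] for p in platforms)
        fused.append(votes.most_common(1)[0][0])
    return fused

# Three samples, each classified independently on three platforms.
predictions = {
    "GC": ["MM", "A1", "MT"],
    "LC": ["MM", "MT", "MT"],
    "CE": ["A1", "A1", "MT"],
}
print(majority_vote(predictions))
```

Such a scheme only combines decisions; it says nothing about which metabolites drive them, which is why the fusion strategies below are preferred when biological interpretation is the goal.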

9.2.4 Mid-level data fusion

The main strategy behind mid-level data fusion is to first summarize each data set independently using either feature selection or dimensionality reduction. The extracted features are then concatenated to form a new data set that is used to build a top-level model representing all data sets together; see Figure 9.3.

Figure 9.2 The main flow-scheme of data analysis for multi-platform data. Pre-processing of raw data is performed using platform-native algorithms. The pre-processed data are then subjected to data integration techniques and possibly further specialized analysis.

One can imagine several different implementations of mid-level fusion by using different summarization and top-level modelling techniques, and which one to use obviously depends on the question at hand. Hierarchical PCA (HPCA) is a technique that represents the concept of mid-level data fusion well and has been used in several metabolomics studies (Smilde et al., 2005; Forshed et al., 2007; Biais et al., 2009).

HPCA can be used to provide an unsupervised model of the variance present in all data sets and thereby gives an overview of the major trends and differences between the blocks. The output is the same as for classical PCA, but the score vectors, the matrix $T$, can be seen as meta-metabolites that describe as much variation as possible on all platforms. The model is obtained by calculating a PCA model that uses the same score vectors for all data sets and, for two blocks, is given by:

$$X_1 = T P_1^{\mathsf{T}} + E_1 \tag{9.1}$$

$$X_2 = T P_2^{\mathsf{T}} + E_2. \tag{9.2}$$

$T$ can be obtained by fitting local PCA models for the different blocks and then a top-level model of the obtained scores, as indicated in Figure 9.3. Parameter estimation is done very similarly to ordinary PCA. Here, we use the algorithm given in the appendix of Westerhuis et al. (1998).
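The two-step idea behind (9.1) and (9.2) can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the algorithm of Westerhuis et al. (1998): local scores come from an SVD-based PCA, each block of local scores is scaled to equal total weight (an assumption of this sketch), and the block loadings are obtained by least squares. All function and parameter names are hypothetical.

```python
import numpy as np

def pca_scores(X, n_components):
    """Return the first PCA score vectors of X (columns are centred first)."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * s[:n_components]

def hierarchical_pca(blocks, n_local=3, n_top=2):
    """Two-step HPCA sketch: local PCA per block, then PCA of the local scores.

    blocks: list of (samples x features) arrays sharing the same sample order.
    Returns the common scores T, one loading matrix P_b per block, and the
    R2 of each top-level component within each block.
    """
    centred = [X - X.mean(axis=0) for X in blocks]
    local = [pca_scores(X, n_local) for X in centred]
    # Assumption: give every block the same total weight in the top-level model.
    local = [S / np.linalg.norm(S) for S in local]
    T = pca_scores(np.hstack(local), n_top)
    loadings, r2 = [], []
    for X in centred:
        # Least-squares block loadings so that X is approximated by T @ P.T
        P = np.linalg.lstsq(T, X, rcond=None)[0].T
        loadings.append(P)
        # Columns of T are orthogonal, so explained variance splits per component.
        r2.append((T ** 2).sum(axis=0) * (P ** 2).sum(axis=0) / (X ** 2).sum())
    return T, loadings, r2
```

Called as `T, P, R2 = hierarchical_pca([X_gc, X_lc, X_ce])` on three hypothetical platform matrices, `R2[b][k]` gives the fraction of variance in block `b` explained by top-level component `k`.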

Because each platform is given its own loading (weight) matrix, it is easy to assess how well the model approximates the variation on the individual platforms. Statistics such as the ratio of explained variance, R2, can be used to assess whether a certain platform deviates strongly from the others or whether there are components that are only present on a certain platform.

Figure 9.3 Mid-level data fusion. Each data block is modelled using a feature selection or dimensionality reduction technique such as PCA. The extracted features are then concatenated to form the input for a top-level model. The top-level model may be interpreted directly by PCA (forming hierarchical PCA) or by a supervised approach, in which case we also use a response matrix containing, for example, phenotypical traits.

Figure 9.4 shows the first three top-level PCs for our example data set along with the classical R2 statistic (ratio of explained variance) for the three different platforms. The first two PCs clearly capture most of the class-discriminating variance, and this is more pronounced in the LC data set, as indicated by the higher R2 for PC1. The CE platform has the highest R2 value for PC3, which is unrelated to the experimental design, a tendency that could also be seen in Figure 9.1.

The mid-level fusion approach is very useful for detecting differences between the different platforms. Analytical bias can be expected to be fairly independent between platforms, and methods like HPCA may be used to detect such patterns. Components that are unique to a single platform, but unrelated to the studied biological factors, are indications of analytical bias, which may warrant the application of normalization strategies (see Section 9.2.5).



Figure 9.4 The first three top-level HPCA components for the example data set. The R2 values indicate how much the top-level component explains within the individual blocks. The first two components contain the class-separating variance. These components explain the LC platform data better than the other platforms, indicating that the LC data capture the biologically interesting variance slightly better.

The variable blocking strategy used in mid-level data fusion is motivated from a technological point of view, grouping metabolites according to which platform they were measured on. However, from the biological aspect, this blocking is not relevant and may even complicate interpretation. The problem with redundancy from multiply represented metabolites, as discussed in the introduction of this section, is not solved by mid-level data fusion, nor is it easy to backtrack which metabolites are responsible for observed patterns in the top-level model. Therefore, in the next section, we will look at a complementary approach, low-level data fusion, that attempts to address these issues.



Figure 9.5 Analytical bias blurs the biological information. (a) The main components in a pure data set show a clear separation of the two types of biological samples. (b) When analytical bias from run order and batch effects is present, the biological information is no longer clearly visible. This irrelevant variance must be removed by normalization before the data can be interpreted correctly.

9.2.5 Low-level data fusion

From a biological perspective, a multi-platform metabolomics data set is the same as a single-platform data set. A multitude of data analysis approaches have been developed for such data sets, and therefore, the optimal way to prepare multi-platform data can be argued to be a single matrix with abundance estimates for all measured metabolites. The construction of such a data matrix is called low-level data fusion (Roussel et al., 2003).

The simplest way to construct a summarized matrix is to just concatenate the different matrices horizontally. However, this is generally not a good idea for two main reasons. Firstly, on a single platform, analytical error, or bias, may complicate interpretation by obscuring the biological variance; see Figure 9.5. Multivariate analysis methods such as PCA, partial least-squares regression (PLS) and especially orthogonal signal correction (OSC) based methods (Trygg & Wold, 2002) are often able to deal with such bias by correcting the data during model estimation. However, this is not necessarily the case when the bias is high dimensional, as will be the case when different platforms, and biases, are directly combined. Therefore, regression models based on concatenated data sets may have low predictive power, and the association with experimental design may seem lower than it should be.

Secondly, as previously mentioned, certain metabolites may be measured multiply, both within each platform and across them. Direct matrix concatenation will inflate this redundancy, as all multiply measured metabolites become present more than once in the top-level matrix.

A solution to these two problems may be achieved by applying data normalization to each data block prior to merging in order to first suppress any analytical error. When this has been done, identified metabolites may be grouped and summarized to a single representative feature. In the following subsections, we will look closer at these two tasks.



9.2.5.1 Normalization

On chromatography-based metabolomics platforms, it is quite common to have an analytical bias present in the data coming from variations in separation efficiency, ionization, dilution effects and derivatization (Gullberg et al., 2004; Styczynski et al., 2007).

If the bias mainly comes from variations in the total chromatogram area, that is, a dilution effect, it is often adequate to scale each chromatogram so that the median equals 1 for all samples. However, in experiments where the total analyte concentration may have changed, as may happen during fruit ripening (Carrari & Fernie, 2006) and carbohydrate accumulation in the plant cold stress response (Cook et al., 2004), this approach may severely distort the data (Sysi-Aho et al., 2007).
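As a minimal sketch of the median scaling just described, assuming each sample is given as a plain list of peak areas (the function name is illustrative):

```python
from statistics import median

def median_scale(samples):
    """Scale each sample (a list of peak areas) so that its median equals 1."""
    scaled = []
    for areas in samples:
        m = median(areas)
        if m == 0:
            raise ValueError("median peak area is zero; cannot scale")
        scaled.append([a / m for a in areas])
    return scaled

# The second sample is a twofold dilution of the first.
print(median_scale([[2.0, 4.0, 6.0], [1.0, 2.0, 3.0]]))
```

After scaling, the two samples coincide, which is the desired behaviour for a pure dilution effect and the source of distortion when the total analyte concentration has genuinely changed.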

Instead, it is often preferable to monitor the bias analytically using isotopically labelled internal standards (ISs). This is commonly done by adding one or several ISs in equal amounts to each sample and then using these as a representation of a known quantity. Once the abundance estimates of the ISs have been obtained, the variance they exhibit can be used to correct the remaining data. A common way to do this correction is to scale each peak area, $x_{\mathrm{Analyte}}$, by the estimated area of the IS:

$$x_{\mathrm{Normalized}} = \frac{x_{\mathrm{Analyte}}}{x_{\mathrm{IS}}}. \tag{9.3}$$
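Equation (9.3) applied sample by sample can be sketched as follows, assuming one IS area per sample and a list of analyte peak areas per sample (names are illustrative):

```python
def normalize_by_is(peak_areas, is_areas):
    """Divide every analyte peak area by the IS area of the same sample (Eq. 9.3).

    peak_areas: list of per-sample lists of analyte peak areas.
    is_areas:   list with the IS peak area of each sample, in the same order.
    """
    if len(peak_areas) != len(is_areas):
        raise ValueError("need exactly one IS area per sample")
    return [[x / x_is for x in sample] for sample, x_is in zip(peak_areas, is_areas)]

# Sample 2 was measured with threefold higher overall response than sample 1.
print(normalize_by_is([[10.0, 20.0], [30.0, 60.0]], [2.0, 6.0]))
```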

Alternatively, one may also use regression-based methods to normalize by removing the variance that can be attributed to a correlation with the ISs. This approach has the strong advantage that it becomes straightforward to use multiple ISs, which may provide a better approximation of the effect of the bias on different chemical classes of metabolites. If the analytes and ISs are given as matrices, then such a normalization can be illustrated by:

$$X_{\mathrm{Analytes}} = f(X_{\mathrm{IS}}) + X_{\mathrm{Normalized}} \tag{9.4}$$

where $f()$ may be estimated by, for example, multiple linear regression (MLR) (Sysi-Aho et al., 2007).
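A minimal sketch of (9.4) with $f$ estimated by ordinary least squares is given below. It assumes samples in rows, data on a scale where a linear model is reasonable (for example log-transformed areas), and that the column means of the analytes should be retained; these choices and the function name are assumptions of the sketch rather than details of the cited method.

```python
import numpy as np

def mlr_normalize(X_analytes, X_is):
    """Remove the part of each analyte that is linearly predicted by the ISs.

    X_analytes: (samples x analytes) array.
    X_is:       (samples x internal standards) array, same sample order.
    Returns X_Normalized = X_Analytes - f(X_IS), keeping analyte column means.
    """
    A = X_analytes - X_analytes.mean(axis=0)
    Z = X_is - X_is.mean(axis=0)
    # Least-squares coefficients of the centred analytes on the centred ISs.
    B = np.linalg.lstsq(Z, A, rcond=None)[0]
    return X_analytes - Z @ B
```

By construction, the normalized analytes are uncorrelated with every IS, which is exactly what makes the approach vulnerable when the ISs themselves carry biological signal.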

The model given in (9.4) assumes that the structured variance of the IS only comes from the analytical error; however, this assumption may not always be true. If the peaks of IS and analytes are not perfectly separated, concentration changes in one compound may cause variance in another. Such effects are generally called matrix effects (Birkemeyer et al., 2005). When analytes affect the measurements of ISs, the matrix effect is called cross-contribution (Liu et al., 2002), which is a serious problem for IS-based normalization. The severity of the problem is easily seen by considering an analyte that is affected by the experimental design and in turn, via cross-contribution, causes the same signal to be visible in the IS. When the covariance between the analyte and the IS is removed during normalization, only measurement noise will remain in the analyte data. A normalization method that can cope with these problems is the cross-contribution compensating multiple standard normalization (CCMN) algorithm (Redestig et al., 2009). CCMN adds a correction step to the normalization that removes the covariance between the ISs and the experimental design under the assumption that any such covariance can be attributed to cross-contribution effects. Denoting the experimental design with $G$, the correction is done by fitting the model:

$$X_{\mathrm{IS}} = g(G) + E \tag{9.5}$$

The variance that is used for normalization, $T_Z$, is isolated from $E$ via cross-validated PCA:

$$E = T_Z P^{\mathsf{T}} + E' \tag{9.6}$$

and the normalization is done after fitting the function $h$:

$$X_{\mathrm{Normalized}} = X_{\mathrm{Analytes}} - h(T_Z). \tag{9.7}$$
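The three steps (9.5) to (9.7) can be sketched as below. This is a simplified illustration and not the published CCMN implementation: $g$ and $h$ are taken to be linear least-squares fits, and the number of principal components is fixed by the caller instead of being chosen by cross-validation. The function name and arguments are hypothetical.

```python
import numpy as np

def ccmn_sketch(X_analytes, X_is, G, n_components=1):
    """Simplified CCMN-style normalization (linear g and h, fixed PCA rank).

    X_analytes: (samples x analytes) array.
    X_is:       (samples x internal standards) array.
    G:          design matrix (samples x factors), e.g. dummy-coded classes.
    """
    X_analytes = np.asarray(X_analytes, dtype=float)
    X_is = np.asarray(X_is, dtype=float)
    G = np.asarray(G, dtype=float).reshape(len(X_is), -1)
    # (9.5): remove design-related variance from the ISs; E is the residual.
    D = np.column_stack([np.ones(len(G)), G])
    E = X_is - D @ np.linalg.lstsq(D, X_is, rcond=None)[0]
    # (9.6): systematic variation T_Z of the residual via PCA (SVD).
    U, s, _ = np.linalg.svd(E, full_matrices=False)
    T_z = U[:, :n_components] * s[:n_components]
    # (9.7): subtract the part of the analytes explained by T_Z.
    A = X_analytes - X_analytes.mean(axis=0)
    B = np.linalg.lstsq(T_z, A, rcond=None)[0]
    return X_analytes - T_z @ B
```

Because $T_Z$ is orthogonal to the design by construction, the subtraction in the last step cannot remove design-related signal from the analytes, even when that signal has leaked into the ISs through cross-contribution.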

The GC-MS-based data in our example data set were measured using 11 different ISs, and these can be used for normalization. Figure 9.6 shows the effect of normalization using either the median scaling, single IS as in (9.4), or CCMN. The plot shows the percentage of the total sum of squares that can be attributed to the relevant factor, the cultivars, and to the order in which the samples were injected in the GC-MS instrument (the run order). The CCMN technique clearly gives an improved reduction of the dependency on the run order and thereby a higher importance to the cultivar effect.
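The percentage of the total sum of squares attributable to a factor can be computed per peak as in a one-way analysis of variance. The sketch below assumes a categorical factor (for example cultivar, or run order binned into batches); how the run order was encoded for Figure 9.6 is not stated here, so that part is an assumption.

```python
import numpy as np

def relative_ss(x, groups):
    """Percentage of the total sum of squares of x explained by a grouping factor."""
    x = np.asarray(x, dtype=float)
    groups = np.asarray(groups)
    total = np.sum((x - x.mean()) ** 2)
    between = sum(
        np.sum(groups == g) * (x[groups == g].mean() - x.mean()) ** 2
        for g in np.unique(groups)
    )
    return 100.0 * between / total

# A peak that differs only between the two groups, and one that does not.
print(relative_ss([1.0, 1.0, 3.0, 3.0], ["a", "a", "b", "b"]))
print(relative_ss([1.0, 3.0, 1.0, 3.0], ["a", "a", "b", "b"]))
```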

Using proper normalization, the analytical bias can be minimized, and this strongly facilitates data fusion. In comparative studies, multiple IS-based normalizations have been found to be preferable to single IS or strictly statistical normalization approaches for both LC-MS (Sysi-Aho et al., 2007) and GC-MS (Redestig et al., 2009) data. In some situations, ISs cannot be used because of, for example, increased costs, and in those cases, median scaling or multivariate correction methods such as orthogonal projection to latent structures (Bylesjo et al., 2007a) may instead be applied.

Figure 9.6 The relative sum of squares (SS) for each peak explained by the cultivar and run-order factors. The CCMN normalized data have a stronger dependence on the cultivar than the raw data and are uncorrelated to the run order. Median and single IS based normalization suppress the run-order effect but fail to remove it completely.

9.2.5.2 Concatenation and summarisation

After each platform has been normalized to suppress analytical error, the scales of the data blocks need to be adjusted before they can be merged. The most straightforward way to do this is to scale each variable to remove the dependency between the platform and variance across the different peaks. Several possible scaling techniques are available, each with slightly different scope, and it is important to be aware of their inherent problems (van den Berg et al., 2006). Unit variance (UV) scaling, that is, dividing each peak by its standard deviation, is perhaps the most commonly applied scaling. UV scaling discards the importance of the magnitude of peaks and only looks at how they vary across the data set. A central problem with this technique is that peaks that are invariant, due to very low or stable abundances, are put on an equal footing with truly changing metabolites. This increases the noise, and therefore, it is useful to filter away invariant peaks prior to applying UV scaling.
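A minimal sketch of UV scaling with a preceding filter for invariant peaks follows. The threshold `min_sd` is a hypothetical parameter; a sensible value depends on the noise level of the platform.

```python
import numpy as np

def uv_scale(X, min_sd=1e-8):
    """Drop (near-)invariant peaks, then divide each remaining peak by its SD.

    X: (samples x peaks) array.
    Returns the scaled matrix and a boolean mask of the peaks that were kept.
    """
    X = np.asarray(X, dtype=float)
    sd = X.std(axis=0, ddof=1)
    keep = sd > min_sd
    return X[:, keep] / sd[keep], keep

X = np.array([[1.0, 5.0, 2.0],
              [3.0, 5.0, 4.0],
              [5.0, 5.0, 9.0]])
scaled, keep = uv_scale(X)
print(keep)                        # the constant middle peak is removed
print(scaled.std(axis=0, ddof=1))  # remaining peaks have unit variance
```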

The next step after scaling is to reduce the redundancy in the data coming from metabolites that are multiply measured both across and within the different platforms. A direct way to do this is to gather all peaks that are annotated to the same metabolite and replace them with a representative feature. In theory, this may seem a fairly uncomplicated task, but as always, there are practical concerns that have to be addressed. A considerable obstacle is related to how metabolomic data sets are annotated, in particular, how metabolites are named. Compound naming in chemistry is a complicated topic, and an impressive number of different naming schemes have been developed to describe different chemical structures. The reason for the wide diversity is that the optimal way to name compounds depends on the scope. A biologist may prefer to use references to an online resource such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) compounds database1 to keep track of metabolites, but an analytical chemist will also need to pay attention to metabolite derivatives and may use exact names such as InChI (international chemical identifier) codes or links to the PubChem database2. Hence, the same metabolite may be annotated with different identifiers across different data sets, and these must be consolidated before summarization can take place.

1 http://www.kegg.jp/kegg/compound.
2 http://pubchem.ncbi.nlm.nih.gov.



[Figure 9.7: two three-set Venn diagrams (CE, LC, GC), titled 'Before identifier unification' and 'After identifier unification'; the assignment of the region counts is not recoverable from the extracted text.]

Figure 9.7 The overlap between the three platforms in the example data set before and after identifier consolidation.

Recently, a software solution, MetMask (Redestig et al., 2010), was developed that organizes and can keep track of metabolite identifiers in an automated manner by creating a local database from a diverse set of resources. MetMask considerably facilitates working with multi-platform data sets, as conversion from one type of identifier to another can be done in seconds without the need to query multiple online resources. Figure 9.7 shows the overlap between the three platforms in the example data set when calculated using the identifiers originally used on the different platforms and after identifier integration using MetMask.
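The consolidation step itself can be illustrated with a small sketch that groups peaks by a canonical identifier. It does not use or imitate the MetMask interface; the synonym table is a hypothetical, hand-made dictionary, and the peak names are invented for illustration.

```python
def group_peaks_by_metabolite(peak_annotations, synonym_to_canonical):
    """Group peaks that refer to the same metabolite under one identifier.

    peak_annotations:     dict mapping a peak name to the identifier used on
                          its platform (any naming scheme).
    synonym_to_canonical: dict mapping known synonyms to one canonical
                          identifier; unknown identifiers are kept as-is.
    """
    groups = {}
    for peak, identifier in peak_annotations.items():
        canonical = synonym_to_canonical.get(identifier, identifier)
        groups.setdefault(canonical, []).append(peak)
    return groups

# Hypothetical annotations mixing KEGG compound IDs and trivial names.
annotations = {
    "GC_peak_12": "C00031",
    "LC_peak_7": "D-glucose",
    "CE_peak_3": "citrate",
    "GC_peak_40": "C00158",
}
synonyms = {"D-glucose": "C00031", "citrate": "C00158"}
print(group_peaks_by_metabolite(annotations, synonyms))
```

Without the synonym table, the four peaks would be counted as four different metabolites, which is the kind of artificial lack of overlap described for Figure 9.7.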

Once metabolite identifiers have been unified, PCA can be used to extract a summary feature that describes as much variance as possible in all features in a least-squares sense. Colour Plate 9.2 shows the concept of feature summarization using PCA.
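A sketch of this summarization for the peaks annotated to one metabolite is shown below. The peaks are UV scaled before the SVD, and the sign of the first component, which is arbitrary in PCA, is fixed so that the summary follows the average profile; both choices are assumptions of this sketch.

```python
import numpy as np

def summarize_peaks(X_peaks):
    """Summarize redundant peaks of one metabolite by their first PC scores.

    X_peaks: (samples x peaks) array of peaks annotated to the same metabolite.
    Returns one summary value per sample.
    """
    Xc = X_peaks - X_peaks.mean(axis=0)
    Xc = Xc / Xc.std(axis=0, ddof=1)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    score = U[:, 0] * s[0]
    # PCA signs are arbitrary: orient the summary along the mean scaled profile.
    if np.dot(score, Xc.mean(axis=1)) < 0:
        score = -score
    return score
```

Given, for instance, a GC peak and an LC peak of the same metabolite, `summarize_peaks(np.column_stack([gc_peak, lc_peak]))` would return a single profile that replaces both columns in the fused matrix.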

After summarisation, our example data set has 169 identified annotated metabolites. The new data set is scaled to unit variance to make metabolite profiles comparable across the different platforms. Figure 9.8 shows a scores and loadings scatter plot for the final data. The cultivars are separated clearly, and in the loadings plot the contribution of the different platforms is shown.

9.2.6 Conclusion

No single analytical platform can provide high resolution and chemically un-biased coverage of the metabolome. A recent trend to deal with this problemis to profile the same samples on multiple platforms and then combine the ob-tained data sets. In this part of the chapter, we looked at different applicationsand strategies for integrative analysis of such data sets.

[Figure 9.8: scores panel (PC 1 against PC 2; axis annotation: 26.57% of the variance explained) and loadings panel (PC 1 against PC 2).]

Figure 9.8 PCA score and loading plots of the summarized data considering the 169 metabolites. Total of NR samples including Moneymaker (MM), Aichi First (A1), 7C (Miraculin overexpressor line 7C-14), 56B (Miraculin overexpressor line 56B-6) and Micro Tom (MT). In the loadings plot, the source of the metabolite feature is indicated by CE, LC, GC or MIX for the metabolite features coming from multiple platforms.

Two main strategies were discussed: mid-level and low-level data fusion. In mid-level data fusion, each data set is first summarized into a set of representative metabolite features. These features are then combined and analyzed together to find differences and common trends among the platforms. In low-level data fusion, data sets are first made comparable by normalization and scaling. Redundancy is then removed and the data sets are finally concatenated to obtain a summarized large data matrix that is used as input for further data analysis.
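The low-level fusion steps summarised above (scale, remove redundancy, concatenate) can be sketched as follows. This is an illustration under stated assumptions, not the procedure used for the example data: the redundancy rule (averaging the scaled profiles of a metabolite seen on several platforms) is an assumption, the MIX label is borrowed from the Figure 9.8 caption, and all names and values are hypothetical.

```python
from statistics import mean, stdev

def autoscale(profile):
    """Centre a metabolite profile and scale it to unit variance."""
    mu, sd = mean(profile), stdev(profile)
    return [(x - mu) / sd for x in profile]

def low_level_fusion(platforms):
    """Concatenate scaled platform data, merging redundant metabolites.

    `platforms` maps a platform label (e.g. "GC") to a dict of
    metabolite name -> list of intensities over the same samples.
    Returns a dict: metabolite -> (source label, fused profile).
    """
    collected = {}
    for label, data in platforms.items():
        for metabolite, profile in data.items():
            collected.setdefault(metabolite, []).append((label, autoscale(profile)))
    fused = {}
    for metabolite, entries in collected.items():
        if len(entries) == 1:
            fused[metabolite] = entries[0]
        else:
            # Assumed redundancy rule: average the scaled profiles.
            profiles = [profile for _, profile in entries]
            fused[metabolite] = ("MIX", [mean(values) for values in zip(*profiles)])
    return fused

# Hypothetical example: alanine is measured on both GC and CE.
platforms = {
    "GC": {"alanine": [1.0, 2.0, 3.0], "sucrose": [10.0, 30.0, 20.0]},
    "CE": {"alanine": [11.0, 19.0, 30.0]},
    "LC": {"rutin": [5.0, 4.0, 9.0]},
}
fused = low_level_fusion(platforms)
```

Other redundancy rules, such as keeping the profile from the platform with the best precision for that metabolite, fit the same structure by changing only the branch that handles multiple entries.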

The use of multiple platforms is a promising development that allows for both high-resolution measurements and chemical unbiasedness, thereby potentially enabling truly system-wide metabolomics.

9.3 Integrating transcriptome and metabolome data

9.3.1 Emergence of omics in plant physiology

To understand physiological phenomena of plants such as development, responses to environmental stimuli and metabolism, physiological and/or biochemical studies using various plant species had long been conducted. Around the end of the 1980s, molecular genetics using A. thaliana as a model plant emerged with the success of the ABC model (Bowman et al., 1991), which concisely modelled how flower organs were determined. At the same time, the emergence of molecular biology, in which all physiological phenomena were understood as actions of genes and proteins, enabled researchers in different study fields to share their knowledge by using DNA base sequences as a common language. Thus, genome sequencing of Arabidopsis was started in order to understand a whole plant mechanism as an integration of gene functions. During the last decade, after the completion of Arabidopsis genome sequencing (Arabidopsis Genome Initiative, 2000), functional genomics, that is, functional elucidation of Arabidopsis genes that had been identified just by DNA sequences, was the major concern of plant science.

Concurrently with the acceleration of genome sequencing projects based on the improvement of DNA sequencing, the technologies for DNA microarray and soft ionization of biological macromolecules were developed to enable transcriptome, proteome and metabolome analyses. The first paper describing the result of microarray analysis in Arabidopsis appeared in 2000 (Wang et al., 2000). Also, in the same year, the first metabolite profiling of Arabidopsis using GC-MS was reported (Fiehn et al., 2000). In that study, Fiehn et al. analyzed metabolite profiles of mutant plants and their parental ecotypes (accessions) and showed the usefulness of metabolite profiling as a tool for functional genomics.

Before the emergence of molecular genetics and molecular biology, plant physiologists had been trying to understand physiological phenomena as interactions of the several factors involved. Now that it is possible to obtain voluminous amounts of information simultaneously on tens of thousands of biomolecules by means of omics, a novel strategy handling more interactions than ever before is required and expected from systems biology.

9.3.2 Integration of omics for systematic understanding of a whole plant

Transcripts, which are handled in transcriptomics, are directly related to the genome, as they are products of gene transcription. Proteins, the targets of proteomics, are also directly related to the genome, because amino acid sequences of proteins are encoded by the genome. On the other hand, metabolites do not have a direct relationship to the genome, as they are produced as a consequence of sequential chemical reactions catalyzed by enzymes. This is one reason why the integration of metabolomics with the other omics approaches is important in terms of functional genomics. Besides, in terms of systems biology, comprehensive data on the accumulation patterns of transcripts, proteins and metabolites, obtained by multi-omics, enable us to have a bird’s-eye view on physiological phenomena of the plants. Thus, integration of omics paves the way for understanding of the plant as a complex system.

In the following sections, some examples of the studies based on integration of metabolomics with genomics and transcriptomics are introduced.

9.3.3 Integration of transcriptome and metabolome data into a single matrix

In the narrow sense, integration of transcriptome and metabolome data means that both data sets are integrated into a single matrix. When transcriptome data (originally obtained, for example, as signal intensity of fluorescence emitted from fluorescent-labelled targets hybridized to microarray) and metabolome data (originally obtained, for example, as peak area or peak height of ion intensity by GC/LC/CE-MS) are appropriately normalized, both data sets can be integrated into a single matrix. Thus, the integrated data set can be subjected to multivariate analysis for a global understanding of the metabolic network, which varies depending on genetic background and/or environment.

Urbanczyk-Wochniak et al. (2003) conducted analyses of transcript and metabolic profiles of transgenic potato tubers using microarray and GC-MS, respectively. In that study, co-occurrence of transcripts and metabolites was evaluated by calculating Spearman’s rank-correlation coefficient between every pair of transcript/metabolite accumulation levels. Of the 26,616 possible pairs, 571 showed significant correlation, most of which were novel and included several strong correlations to nutritionally important metabolites. It was also shown that metabolic profiling has a higher resolution than expression profiling in terms of the discriminatory power to distinguish between different potato tuber systems.
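To make the pairwise calculation concrete, the following sketch computes Spearman's rank-correlation coefficient for every transcript/metabolite pair. It is not the code used in the cited study: the data are hypothetical, ties are handled by average ranks, and the significance testing that the study applied is omitted.

```python
from math import sqrt

def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def correlate_all(transcripts, metabolites):
    """Correlate every transcript profile with every metabolite profile."""
    return {
        (t, m): spearman(t_profile, m_profile)
        for t, t_profile in transcripts.items()
        for m, m_profile in metabolites.items()
    }

# Hypothetical accumulation levels over five samples.
transcripts = {"geneA": [1.0, 2.0, 3.0, 4.0, 5.0], "geneB": [5.0, 3.0, 4.0, 1.0, 2.0]}
metabolites = {"sucrose": [2.0, 4.0, 8.0, 16.0, 32.0], "citrate": [9.0, 7.0, 8.0, 3.0, 1.0]}
pairs = correlate_all(transcripts, metabolites)
```

Because only ranks enter the calculation, the exponentially increasing sucrose profile correlates perfectly with the linearly increasing geneA profile, which is the property that makes the rank-based coefficient suitable for comparing data measured on very different scales.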

Time-series transcriptome and metabolome (obtained by Fourier-transform ion cyclotron resonance mass spectrometry; FT-ICR-MS) data of sulphur-starved Arabidopsis were integrated into a single matrix, and applied to batch-learning self-organizing mapping (BL-SOM) (Abe et al., 2003; Kanaya et al., 2001) to classify the sulphur-deficiency-responsive genes/metabolites according to their accumulation patterns after the shift from sulphur-sufficient to sulphur-starved condition (Hirai et al., 2005). Similarity of accumulation patterns was calculated as Euclidean distance in the multi-dimensional space in BL-SOM. Genes/metabolites showing similar accumulation patterns were classified into a cluster on the resulting feature map. In this analysis, glucosinolates (GSLs) with different side chains were clustered, suggesting that GSL metabolism is coordinately regulated. This idea was supported by the fact that the genes encoding known GSL biosynthetic enzymes were also clustered. This suggested that unknown genes that clustered along with the known GSL biosynthesis genes might also be involved in GSL biosynthesis. On the basis of this assumption, novel genes involved in GSL biosynthesis, such as those encoding transcription factors (Hirai et al., 2007) and enzymes (Hirai et al., 2005; Sawada et al., 2009a, 2009b), were identified.

Recently, Mounet et al. (2009) analyzed the transcriptional and metabolic changes in expanding tomato fruit tissues with tomato microarrays and analytical methods including proton NMR and LC-MS, respectively. Pairwise comparisons of metabolite contents and gene expression profiles detected up to 37 direct gene–metabolite correlations involving regulatory genes. Correlation network analyses revealed the existence of major hub genes correlated with ten or more regulatory transcripts and embedded in a large regulatory network, suggesting that a strategy based on the combined analysis of different developing fruit tissues can be very helpful to pinpoint candidate regulatory genes linked to compositional changes and fruit development in tomato.



9.3.4 Global understanding of physiological phenomena and gene functional identification by relating metabolome to transcriptome

Even if not integrated into a single matrix, parallel analyses of transcriptome and metabolome data can lead to a global understanding of physiological phenomena. Besides, relating metabolome to transcriptome in a certain genetic background or under certain environmental conditions is a powerful way to identify gene functions. Two earlier studies using Arabidopsis (Hirai et al., 2004; Tohge et al., 2005) had an impact on the field of plant biotechnology (Lawrence, 2006; Taroncher-Oldenburg & Marshall, 2007).

Hirai et al. (2004) analyzed the transcriptome and metabolome (obtained by FT-ICR-MS) in leaves and roots of Arabidopsis under nutritional stresses. Respective data sets were subjected to PCA to show the effects of treatments on transcriptome or metabolome. The results revealed that: (1) long-term sulphur deficiency, nitrogen deficiency, and sulphur and nitrogen deficiency had similar effects on the metabolome and transcriptome; (2) the metabolite and transcript profiles differed considerably between long- and short-term sulphur deficiency; (3) the profiles also differed considerably between leaves and roots; and (4) the effects of O-acetylserine treatment were similar to those of short-term sulphur deficiency, suggesting that O-acetylserine regulates the global metabolite and transcript profiles in short-term sulphur deficiency. The fact that similar clustering patterns in PCA were obtained by using the transcriptome and metabolome data indicated that the global transcript and metabolite profiles were strongly related to each other (Hirai et al., 2004).

Tohge et al. (2005) related metabolome data to transcriptome data of a T-DNA activation-tagged line pap1-D (Borevitz et al., 2000), in which the expression of the gene coding for a MYB transcription factor PAP1 was enhanced. In this line, anthocyanins (cyanidin glycosides) and some flavonoids (quercetin glycosides) were specifically accumulated, concomitantly with the induction of the expression of a limited number of genes including those involved in the core structure formation of anthocyanins. These results revealed that PAP1 specifically induces the expression of the genes involved in anthocyanin production or accumulation, leading to an increase in anthocyanin levels. In addition, intensive analysis using LC-MS/MS of glycosylation patterns of anthocyanins detected in pap1-D led to the identification of novel glycosyltransferase genes among the genes induced in pap1-D (Tohge et al., 2005).

Thus, metabolic profiling corresponding to transcript profiling under specific conditions or in a specific genotype is a powerful way to discover novel genes and to reveal metabolic pathways, especially for secondary metabolism. This is also the case with plant species other than Arabidopsis. The transcriptome of jasmonate-elicited tobacco bright yellow 2 cell cultures was analyzed by means of cDNA-amplified fragment length polymorphism (cDNA-AFLP) (Goossens et al., 2003). The changes in the transcriptome were well correlated with the observed shifts in the biosynthesis of the metabolites, i.e. accumulation of nicotine and various nicotinic acid-derived alkaloids, investigated by targeted metabolite analysis. This result led to the creation of novel tools for metabolic engineering of medicinal plant systems in general (Goossens et al., 2003).

9.3.5 Application of public transcriptome data sets

In the case of Arabidopsis, transcriptome data were systematically obtained and released to the public by the efforts of international consortia, AtGenExpress and NASCArrays (Kilian et al., 2007; Goda et al., 2008; Schmid et al., 2005; Craigon et al., 2004). These data sets of global gene expression profiles have promoted the development of web-based tools for in silico gene expression analyses and accelerated functional elucidation of Arabidopsis genes. Co-expression analysis has become a well-recognized strategy for identification of candidates responsible for the gene function of interest (Saito et al., 2008).

Global gene expression during development of Arabidopsis, in samples covering many stages from embryogenesis to senescence and diverse organs, was analyzed (Schmid et al., 2005) as a part of the AtGenExpress expression atlas. This data set, together with the other data sets of AtGenExpress (Kilian et al., 2007; Goda et al., 2008), is used by various web applications such as ASIDB (Rawat et al., 2008), ATTED-II (Obayashi et al., 2009), BAR (Toufighi et al., 2005), CoexProcess, GeneCAT, PED (Horan et al., 2008) and PRIMe (Akiyama et al., 2008). Recently, Matsuda et al. (2009) obtained and released an LC-MS/MS-based metabolome data set (AtMetExpress development), which is compatible with the transcriptome data set reported by Schmid et al. (2005). This data set has paved the way to in silico integrated analyses of developmental transcriptome and metabolome data for gene function elucidation (Matsuda et al., 2009).

9.3.6 Visualisation of transcriptome and metabolome data on metabolic map

To understand the characteristics of transcript and metabolite profiles obtained as a huge volume of numerical data, visualisation by projection of transcriptome and metabolome data on a metabolic map is a useful approach. Several web-based tools have been developed to assist data visualization. KaPPA-View has been developed for displaying quantitative data for individual transcripts and metabolites on the same set of plant metabolic pathway maps (Tokimatsu et al., 2005). By default, about 150 maps covering about 1400 compounds and about 1400 reactions have been prepared corresponding to Arabidopsis, rice, Lotus japonicus and tomato (as of January 2010). MapMan is a user-driven tool that displays large data sets onto diagrams of metabolic pathways or other processes (Thimm et al., 2004; Usadel et al., 2005). The Pathway Tools Omics Viewer paints data values from the user’s high-throughput and other experiments onto the cellular overview diagram for an organism (Paley & Karp, 2006). MetGenMap also provides a platform for systems biology analysis (Joung et al., 2009).

9.3.7 Multivariate analysis and classification of genes and metabolites

To catch the global characteristics of transcriptome and metabolome visually, PCA is often conducted. In many cases, transcriptome and metabolome data are obtained as time-series data. Transcripts and metabolites can be classified on the basis of their time-dependent accumulation patterns by hierarchical cluster analysis (HCA), SOM, k-means algorithm, etc. A multivariate regression method called O2PLS has also been applied for the integration of multiple data sets from, for example, transcriptome, proteome and metabolome (Bylesjo et al., 2007b; Suzuki et al., 2010; Bylesjo et al., 2009). Gene ontology is often used to classify transcripts on the basis of their function.
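As an illustration of classifying transcripts and metabolites by their time-dependent accumulation patterns, the sketch below applies a plain k-means procedure to a few hypothetical time-series profiles. The deterministic farthest-point initialisation is an illustrative choice made to keep the example reproducible; practical analyses usually rely on repeated random starts and on profiles that have been normalised beforehand.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(profiles, k, max_iter=100):
    """Cluster named time-series profiles; returns name -> cluster index."""
    names = list(profiles)
    # Deterministic farthest-point initialisation (an illustrative choice).
    centroids = [list(profiles[names[0]])]
    while len(centroids) < k:
        farthest = max(
            names, key=lambda n: min(sq_dist(profiles[n], c) for c in centroids)
        )
        centroids.append(list(profiles[farthest]))
    assignment = {}
    for _ in range(max_iter):
        # Assign every profile to its nearest centroid.
        new_assignment = {
            n: min(range(k), key=lambda i: sq_dist(profiles[n], centroids[i]))
            for n in names
        }
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for i in range(k):
            members = [profiles[n] for n in names if assignment[n] == i]
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Hypothetical profiles over four time points: two rising, two falling.
profiles = {
    "transcript_1": [0.0, 1.0, 2.0, 3.0],
    "metabolite_1": [3.0, 2.0, 1.0, 0.0],
    "transcript_2": [0.0, 1.1, 2.1, 2.9],
    "metabolite_2": [3.1, 1.9, 1.0, 0.1],
}
clusters = kmeans(profiles, k=2)
```

Transcripts and metabolites are treated identically here, which is what allows both kinds of molecule to end up in the same cluster when their profiles over time are similar.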

9.3.8 A wide range of applications of integrated transcript and metabolite profiling

In the last couple of years, in Arabidopsis as well as other species, parallel analyses of transcript and metabolite profiles have been brought into action for gene identification, the global understanding of physiological phenomena and metabolic engineering. In these studies, transcript profiles have been analyzed by means of commercial or custom-made microarrays, cDNA-AFLP and quantitative real-time polymerase chain reaction (RT-PCR), depending on the plant species of interest. Metabolic profiles are often analyzed by GC-MS to reveal metabolic changes in primary metabolism. NMR and CE-MS can also be used for this purpose. On the other hand, LC-MS is utilized for the profiling of secondary metabolites. Targeted analyses, for example by ultra performance liquid chromatography, can also be informative.

9.3.8.1 Gene identification

In the study reported by Andersson-Gunneras et al. (2006), transcriptome (microarray) and metabolome (GC-MS) data were obtained during tension wood (TW) formation in Populus in response to a gravitational stimulus. TW is characterized by the formation of fibres with a thick inner gelatinous cell wall layer mainly composed of crystalline cellulose. They identified key steps for the divergence of the carbon flow from lignin and hemicellulose to cellulose biosynthesis, and the genes encoding components of hormone signalling pathways and transcription factors differentially expressed between TW and normal wood.

To clarify the flavonoid biosynthetic pathway in Arabidopsis, Yonekura-Sakakibara et al. (2008) conducted LC-MS analysis for identification of the flavonoids produced by wild-type and flavonoid biosynthetic mutant lines. The structures of newly identified and known flavonols were deduced by LC-MS profiling of these mutants. Transcriptome co-expression analysis based on public microarray data in ATTED-II (Obayashi et al., 2009) led to identification of the candidate genes encoding glycosyltransferases and UDP-sugar synthase presumed to be involved in flavonoid biosynthesis. Reverse genetic studies confirmed their predicted functions (Yonekura-Sakakibara et al., 2007).

Luo et al. (2009a) chemically identified the polyamine conjugates, which accumulate in Arabidopsis seeds, as disinapoyl spermidine derivatives. To identify enzymes responsible for the formation of the disinapoyl spermidine derivatives in Arabidopsis seed, they searched the publicly available microarray data in Genevestigator (Zimmermann et al., 2004) for the genes encoding the BAHD family of acyl transferases, which are expressed strongly in seeds. Metabolic profiling of the knockout line of the candidate gene and a biochemical assay revealed that the enzyme encoded by the candidate gene actually catalyzed the formation of disinapoyl spermidine derivatives (Luo et al., 2009a).

9.3.8.2 Elucidation of biological roles of genes

To reveal the role of nodule-enhanced sucrose synthase (MtSucS1) during nodulation of Medicago truncatula, the expression profiles of M. truncatula and Sinorhizobium meliloti genes, coding for proteins associated with nodule metabolism and maintenance of symbiotic N2 fixation, were examined in root nodules of MtSucS1-reduced transgenic plants (quantitative RT-PCR). Metabolic alterations, as well as phenotypic changes, were measured by GC-MS. The results supported the model that MtSucS1 was required for the establishment and maintenance of an efficient N-fixing symbiosis (Baier et al., 2007).

AtMyb41, which is transcriptionally regulated in response to salinity, desiccation, cold and abscisic acid (ABA), is a transcription factor suggested to control stress responses linked to cell wall modifications. To further characterize AtMyb41, the transcriptome and metabolome of AtMyb41-overexpressing lines were analyzed. The data indicated that AtMyb41 is involved in distinct cellular processes, including control of primary metabolism and negative regulation of short-term transcriptional responses to osmotic stress (Lippold et al., 2009).

9.3.8.3 Elucidation of stress responses

Plant responses to biotic and abiotic stresses are elucidated by integrated analyses of transcript and metabolite profiles. Transcript and metabolic profiles of Vitis vinifera cultivars resistant and susceptible to fungi were analyzed by cDNA microarray and quantitative real-time PCR, and by NMR, respectively. The integration of data sets revealed differences in transcripts and metabolites between both cultivars, which are probably associated with the innate resistance of the resistant cultivar to the mildews (Figueiredo et al., 2008).

Urano et al. (2008) integrated transcriptome and metabolome (GC-MS and CE-MS) analysis of Arabidopsis under dehydration stress to show transcriptional regulation of the biosynthesis of the branched-chain amino acids, saccharopine, proline and polyamine. Analysis of a knockout mutant of the 9-cis-epoxycarotenoid dioxygenase (NCED) 3 gene revealed that this transcriptional regulation was ABA dependent.

A bioinformatics approach was taken to reveal the temporal characteristics of the response to nutritional stress. In the study by Morioka et al. (2007), the time-series transcriptome and metabolome data sets of sulphur-starved Arabidopsis (Hirai et al., 2005) were used to establish a novel algorithm to predict the transition time point of the transcriptome and metabolome during the process of adaptation. The transition time point, at which the transcriptome and/or the metabolome change drastically in response to sulphur starvation, was determined by using the novel method based on a linear dynamical system model. The results revealed that both the metabolome and transcriptome transitioned between 12 and 24 hours after the plants were transferred to sulphur starvation.

9.3.8.4 Quantitative trait locus (QTL) analyses

To describe the genetic regulation of variation in the Arabidopsis metabolome, metabolome analysis by GC-MS was conducted on 210 recombinant inbred lines (Bayreuth-0 x Shahdara) of Arabidopsis, which were previously used for targeted metabolite QTL and global expression QTL (eQTL) analysis (Wentzell et al., 2007). Metabolic traits were less heritable than the average transcript trait, suggesting that there are differences in the power to detect QTLs between transcript and metabolite traits. A large number of metabolite QTLs with moderate phenotypic effects were identified. Frequent epistatic interactions controlling a majority of the variation were also found (Rowe et al., 2008).

9.3.8.5 Global understanding of physiological phenomena

In the paper reported by Kolbe et al. (2006), changes in transcriptome, metabolome (GC-MS) and metabolic fluxes (14C-Glc labelling) were analyzed in Arabidopsis leaves in response to manipulation of the thiol-disulfide status by feeding dithiothreitol. The results provided a global picture of the effect of redox and revealed the utility of transcript and metabolite profiling as systemic strategies to uncover the occurrence of redox modulation in vivo.

To understand the environmental and hormonal regulation of the activity–dormancy cycle in perennial plants, transcript and metabolite profiles (GC-MS) of isolated cambial cells of aspen were analyzed (Druart et al., 2007). The dynamics of transcriptional and metabolic networks were revealed and potential targets of environmental and hormonal signals in the regulation of the activity–dormancy cycle in the cambial meristem were identified.



Kusnierczyk et al. (2008) analyzed the early defence response against aphid attack in Arabidopsis by transcriptome and targeted metabolite (GSL and camalexin) analyses. A model of plant–aphid interactions at the early phase of infestation was proposed.

Comparative transcriptome and metabolome (LC-MS and GC-MS) analyses were carried out on peel and flesh tissues during tomato fruit development to broaden knowledge related to the fruit surface (Mintz-Oron et al., 2008).

Brautigam et al. (2009) observed dynamic changes in both transcriptome (nuclear genes encoding chloroplast proteins) and metabolome (GC-MS) during photosynthetic acclimation in response to light quality-induced redox signalling in Arabidopsis.

To analyze the effects of reduced carbon flow into starch on carbon–nitrogen metabolism and related pathways, Weigelt et al. (2009) conducted transcriptome (microarray) and metabolome (HPLC and GC-MS) analyses of ADP-glucose pyrophosphorylase-deficient pea embryos.

Howell et al. (2009) analyzed transcriptome and metabolite profiles of rice embryo tissue during germination. This revealed that during rice germination an immediate change in some metabolite levels was followed by a two-step, large-scale rearrangement of the transcriptome that was mediated by RNA synthesis and degradation and was accompanied by later changes in metabolite levels. A variety of common sequence motifs, potential binding sites for transcription factors, were identified by in silico analysis of the 1-kb upstream regions of transcripts displaying similar changes in abundance, using three main cis-element resources, namely the Rice Cis-Element Search database (Doi et al., 2008), the MEME Web server (Bailey et al., 2006) and the Regulatory Sequence Analysis Tools (Thomas-Chollier et al., 2008). Additionally, newly synthesized transcripts peaking at 3 hours after imbibition displayed a significant enrichment of sequence elements in the 3′ untranslated region that had been previously associated with RNA instability (Howell et al., 2009).

In the study by Wang et al. (2009), comparative transcriptome (microarray) and targeted metabolome analysis (GC-MS) uncovered important features of the molecular events underlying pollination-induced (in wild-type tomato) and pollination-independent (in IAA9 down-regulated tomato) fruit set.

Hernandez et al. (2009) analyzed global gene expression (macroarray) and metabolome (GC-MS) to investigate the responses of nodules from common bean (Phaseolus vulgaris L.) plants inoculated with Rhizobium tropici grown under P-deficient and P-sufficient conditions.

9.3.8.6 Towards metabolic engineering

Transgenic Arabidopsis plants expressing the entire biosynthetic pathway for the tyrosine-derived cyanogenic glucoside dhurrin, as accomplished by insertion of CYP79A1, CYP71E1 and UGT85B1 from Sorghum bicolor, were shown to accumulate 4% dry-weight dhurrin with marginal inadvertent effects on plant morphology, free amino acid pools, transcriptome and targeted metabolome (LC-MS) (Kristensen et al., 2005). Interestingly, when incomplete pathways (CYP79A1 and CYP71E1) were inserted, metabolic crosstalk or detoxification reactions were found to induce significant changes in plant morphology, the transcriptome and the metabolome.

In the experiment presented by Dauwe et al. (2007), lignin biosynthesis genes (cinnamoyl CoA reductase and/or cinnamyl alcohol dehydrogenase) were down-regulated in transgenic tobacco. cDNA-AFLP-based transcript profiling, combined with HPLC- and GC-MS-based metabolite profiling, revealed differential transcripts and metabolites within monolignol biosynthesis, as well as a substantial network of interactions between monolignol and other metabolic pathways.

9.3.9 Future perspective

In the last couple of years, functional elucidation of genes in Arabidopsis has been a major concern in plant physiology. Accumulation of knowledge on a single plant species has led to a better understanding of the interactions among different biological events, such as development, stress response and metabolism, which could not have been elucidated by independent studies using different plant species. Now that rapid advances have been made in omics technologies such as DNA sequencing, microarrays and MS for biological macromolecules, a huge volume of information can be accumulated on each plant species of interest, presumably enabling the understanding of each plant species as a system that is composed of a number of interactions.

When comparing different plant species as different plant systems in terms of genes, transcripts and proteins, we depend on the concept of ‘homologue’. For example, we treat a novel gene in rice as a homologue of an Arabidopsis gene. In this sense, we cannot directly compare different plant systems. On the other hand, as metabolites are not direct products of gene expression, there is no ‘homologous metabolite’ in metabolomics. For example, glucose in Arabidopsis is just glucose in rice, not a glucose homologue. This means that metabolomics technology is easier to apply to all plant species than the other omics technologies. In this chapter, the authors describe the usefulness of integrated analyses of metabolomics and other omics. On the other hand, the authors believe that metabolomics can also provide a novel philosophy to understand metabolic systems by direct comparison of different plant systems, rather than by reducing this to gene function.

9.4 Network inference in metabolomics

With recent developments in analytical technology, metabolomics offers the means to address one of the main challenges in systems biology: reverse engineering of metabolic networks. Here, we concentrate on the constraint-free approaches relying on quantitative metabolomic data for reverse engineering of the metabolic system. Although a number of such approaches exist, none of them can be treated as an optimal and unique method to fully address the task; each gives rather specific hints about the structure of the underlying metabolic system. However, these hints have been shown to be sufficient for creating biologically relevant and verifiable hypotheses. In this section, we review the commonly used network inference approaches. In Section 9.4.3, we describe the concept of relevance networks and explain the relationship between stoichiometry of metabolic pathways and coordinated fluctuations of its constituents. A range of approaches for refining relevance networks, which improve the accuracy of the network reconstruction, are outlined. Finally, in Section 9.4.5 we present the concept of Bayesian networks and its use in metabolomics in the context of structure learning.

9.4.1 Coverage of metabolic pathways by MS data

High-throughput metabolic profiling is capable of covering only a small part of the total plant metabolome. Even integrative analysis of data coming from different platforms allows us to cover just a small percentage of the 200,000 compounds estimated to be present in the plant kingdom (Weckwerth, 2003). Consequently, metabolomics is far behind transcriptomics, with its near-complete coverage provided by microarrays, and also some recent proteomic approaches (de Godoy et al., 2006).

The low coverage of the metabolome by metabolomics is a result of not only the limited capability to measure low-abundance or unstable intermediates but also a high number of peaks without determined chemical structure. For instance, in a recent study by Muller-Linow et al. (2007), GC-MS analysis could only map 39 of the measured metabolites on the metabolic pathways. Another important issue concerning the coverage of metabolic pathways is the potential bias introduced by the extraction procedure and the equipment set-up. Analysis of the polar phase of plant extract using GC provides results biased towards central metabolism including mainly the metabolites from glycolysis, the tricarboxylic acid (TCA) cycle and amino acid synthesis. Analogously, the non-polar phase of the extract usually contains lipids and semi-polar compounds of the secondary metabolism. This fact must be taken into account if the measurements are related to any topological parameters of the metabolic networks (Szymanski et al., 2009).

On the other hand, metabolomic approaches have a big advantage over other ‘omic’ techniques, which is the extensive knowledge about metabolic pathways. Whereas transcriptomic analysis deals with only a few regulatory pathways resolved in detail (Davuluri et al., 2003), databases such as KEGG and AraCyc (Kanehisa & Goto, 2000; Karp et al., 2005) currently cover 1438 metabolites of A. thaliana, connected by 1320 reactions (de Oliveira Dal’Molin et al., 2009). The availability of such information significantly limits the number of possible solutions explaining metabolic phenotypes (Edwards & Palsson, 2002), allows prediction of regulatory events (Stelling et al., 2002) and accelerates discovery of new metabolic pathways (May et al., 2008; Gavai et al., 2009).

Despite apparent problems, many approaches have been developed for unravelling the molecular relationships on the basis of the analysis of metabolic profile data. These might be classified as follows: (a) knowledge based, where the starting point for data interpretation is knowledge about the metabolites’ chemistry and the structure of the metabolic pathways; (b) pathway independent, where relationships between metabolites are investigated based just on the similarity of their profiles, and then compared using knowledge-based networks. In the following sections, we discuss this second class of approaches.

9.4.2 Goals of de novo metabolic network reconstruction

The main rationale for de novo metabolic network reconstruction is inference of the underlying interactions between constituents of the cellular machinery. In the case of metabolic systems, there are many possible interaction types that might be reflected in the data, including biochemical reactions, enzymatic activity regulation, gene regulatory circuits, and also cell compartmentalization or protein interactions such as the organization of enzymes into metabolons. The main issues in network inference are experiment design, computational approach and the interpretation of results. When choosing a particular network inference approach, one has to take into account the requirements of the method with respect to data quality and experimental design, and its suitability for the size of the investigated metabolic system. Unfortunately, there is no optimal approach, and the methods proposed in the literature differ significantly with respect to all of these points.

Common network inference algorithms are based on measuring and scoring the similarity of analyte profiles. This similarity might be computed using various statistical methods and generally indicates how strongly the changes of two variables depend on each other in the analyzed data set. If computed for all pairs of system entities, the output is a square matrix. This is subsequently discretized by a thresholding procedure to obtain an adjacency matrix, which is transformed into an undirected graph, a network abstraction of the initial similarity matrix. All networks created in this way, commonly called relevance networks, share the feature of being fully based on the experimental data. As such they are a product of the complete metabolic system dynamics, including stoichiometry, regulation and all other levels of system complexity. Therefore, although not easy to interpret, relevance networks are regarded as an alternative to gleaning the structure of the system using classical biochemical methods.

A direct advantage of relevance network reconstruction is the creation of an intuitively interpretable abstraction of the data structure. More importantly, it allows applying a broad range of network analysis tools, which introduce a system-wide context of reasoning, such as scoring centrality of the network elements and detection of the community structure. In relevance networks, each measured compound is characterized by a set of its network properties, such as degree, betweenness or closeness. These properties reflect much more than belonging to a particular data cluster. Being dependent on the surrounding network structure, they allow the scoring of the ‘importance’ of a particular metabolite for the integrity of the graph. The concept assumes that the ‘key players’ in metabolic pathways, such as pleiotropic substrates, flux switches or intermediates connecting biochemically distant pathways, possess certain unique properties in relevance networks. This translates directly into the possibility of discriminating ‘candidate metabolites’ (compounds having high rank in the context of their local network parameters) and of speculating about possible roles for such metabolites in the system reorganization under the conditions analyzed. Another benefit is the identification of the community structure, giving information about tightly connected regions of the network and giving a hint about the real number of clusters in the data (Klie et al., 2010). Consequently, relevance networks are regarded as a valuable hypothesis creation tool in metabolomics (Morgenthal et al., 2006; Kusano et al., 2007; Szymanski et al., 2009). In the following sections, we present the most popular approaches used in relevance network reconstruction, compare inference quality and discuss their main advantages and drawbacks.

9.4.3 Relevance networks

9.4.3.1 Pearson correlation

One of the simplest methods for network inference is correlation analysis, highlighting linear interdependencies between system variables. The most commonly used correlation measure is the Pearson correlation (PC) coefficient:

$$ r_{xy} = \frac{\operatorname{cov}(x, y)}{\sqrt{\operatorname{var}(x)\,\operatorname{var}(y)}} \tag{9.8} $$

In order to construct a correlation network, the symmetric matrix containing correlation coefficients for each pair of variables in the system is transformed into a graph. In this graph, nodes represent metabolites, whereas statistically significant correlations constitute edges between them. Despite its simplicity, the method has proven to be very useful in plant metabolomic studies, for example, for phenotyping of metabolic states of different plant organs and genotypes (Morgenthal et al., 2006). It has been shown that a remarkable number of metabolite pairs are robustly correlated, also across different genotypes of Solanum tuberosum (Roessner et al., 2001), and in different organisms (Martins et al., 2004; Fiehn, 2003; Broeckling et al., 2005). This phenomenon was also shown to be independent of the quantification technique (Weckwerth et al., 2004). Taking into account the variable conditions and genotype backgrounds for which these correlations were detected, it seems clear that they must have a very common biological background, such as a similar structure of underlying stoichiometric and regulatory networks.
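The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the cited studies: the metabolite names and profile values are invented, and a fixed cut-off on |r| (the hypothetical parameter `threshold`) stands in for a proper significance test of each correlation.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation r_xy = cov(x, y) / sqrt(var(x) * var(y))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def correlation_network(profiles, threshold=0.8):
    """Return {(m1, m2): r} for metabolite pairs with |r| >= threshold.

    Assumption: a fixed |r| cut-off replaces a significance test.
    """
    edges = {}
    for a, b in combinations(sorted(profiles), 2):
        r = pearson(profiles[a], profiles[b])
        if abs(r) >= threshold:
            edges[(a, b)] = r
    return edges

# Hypothetical toy profiles (five samples per metabolite), for illustration only.
profiles = {
    "glucose":  [1.0, 2.0, 3.0, 4.0, 5.0],
    "fructose": [2.1, 3.9, 6.2, 8.0, 9.9],
    "citrate":  [5.0, 1.0, 4.0, 2.0, 3.0],
}
edges = correlation_network(profiles)
print(edges)  # only the glucose-fructose pair passes the cut-off
```

In practice the cut-off would be replaced by a test on each coefficient with correction for multiple testing, since the number of pairs grows quadratically with the number of metabolites.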

Being a global property, metabolic correlations depend upon all biochemical reactions and regulatory interactions within the system. Therefore, high correlations between metabolites might be a result of various metabolic mechanisms not necessarily related to direct connection in metabolic pathways. These mechanisms include, for example: (a) chemical equilibrium, (b) mass conservation, (c) asymmetric control and (d) high variance in expression of a single gene, having a high control over the concentration of two metabolites (Camacho et al., 2005). In consequence, comparing the correlation networks directly to the underlying metabolic pathways gives a high number of false positives. The other reason for the high false positive rate is that unrefined correlation networks possess a high number of fully connected triplets, meaning that two highly correlated neighbours of a particular node are usually also highly correlated with each other. Therefore, highly connected nodes, called network hubs, have a high clustering coefficient, which reduces their value as ‘most important’ nodes in the graph and translates into a low potential of correlation analysis for ‘candidate selection’ in general. This has been shown in in silico studies (Cakir et al., 2009) and also experimentally using GC-MS profiles (Muller-Linow et al., 2007; Weckwerth et al., 2004) and is one of the most significant drawbacks of the approach.

On the other hand, it has been shown that correlation analysis provides a relatively small number of false negatives (Soranzo et al., 2007) and, thus, gives very reliable information about independence of the system variables. This fact makes simple correlation analysis a useful tool to investigate other features of the network topology, such as community structure. In one of our studies, PC network reconstruction from the metabolomic profiles of E. coli exposed to different environmental conditions revealed a remarkably stable community structure (Szymanski et al., 2009). Importantly, this robust community structure separated metabolites constituting different metabolic pathways, such as metabolites of the TCA cycle and amino acid metabolism, sugar phosphates and lipid-related compounds (see Colour Plate 9.3). This observation highlights an apparent relationship between metabolic correlations and underlying metabolic pathways, which has been investigated in detail in theoretical studies (Steuer et al., 2003; Camacho et al., 2005).

9.4.3.2 Evaluation of the relationship between covariance matrix and pathway stoichiometry

The most direct interpretation of metabolic correlations could be that the most strongly correlated metabolites also exhibit the highest proximity in metabolic networks. However, this is not the case, and the relationship between pathway stoichiometry and the correlation matrix is much less trivial.

An in silico study by Steuer et al. (2003) based on a model of S. cerevisiae glycolysis (Hynne et al., 2001) showed that metabolic correlations originate in a combination of stoichiometric and kinetic factors, which, if known, might be used to predict the co-variance matrix. The co-variance matrix, and thus the correlation matrix, might be deduced from the model reaction rate laws using theorems of Metabolic Control Analysis (MCA) (Heinrich & Schuster, 1996). Here, the metabolic system is represented by the Jacobian matrix, which is a linear approximation of the metabolic system dynamics around its current steady state and contains both stoichiometry and elasticity coefficients of the system (Hofmeyr, 2001):

$$ J = N_R \, \varepsilon_{S^0} \, L \tag{9.9} $$

where $J$ is the Jacobian, $N_R$ is a reduced stoichiometric matrix determined by Gaussian elimination to row echelon form and containing only independent rows (metabolites), $\varepsilon_{S^0}$ is the matrix of elasticity coefficients evaluated at the steady state $S^0$, and $L$ is the link matrix satisfying the relation $N = L N_R$ (see Hofmeyr (2001) for a detailed description). Importantly, the Jacobian matrix can be precisely related to the metabolite–metabolite co-variance matrix under the assumption of small fluctuations of the metabolic system:

$$ J \Gamma + \Gamma J^T = -2D \tag{9.10} $$

where $J$ is the Jacobian, $J^T$ is its transpose, $\Gamma$ is the co-variance matrix and $D$ is a fluctuation matrix (van Kampen, 1992). To determine the fluctuation matrix $D$, we have to define the source of the fluctuations. In the simulation study, this was done by defining the fluctuations in one element as the source of the variation, and the variance of all other elements as responses to the chosen reaction (Steuer et al., 2003); that is, all elements in $D$ are zero except the chosen element, where $D_{i,j} \neq 0$.
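To make the forward direction of equation (9.10) concrete, the following sketch solves J Γ + Γ Jᵀ = −2D for the co-variance matrix of a two-metabolite toy system. The Jacobian and fluctuation matrix are invented for illustration (a linear chain in which only the first metabolite is perturbed) and do not come from the glycolysis model of the cited study. The helper names `solve_linear` and `covariance_from_jacobian` are likewise hypothetical, and only the Python standard library is used.

```python
import math

def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def covariance_from_jacobian(J, D):
    """Solve J G + G J^T = -2 D for the co-variance matrix G."""
    m = len(J)
    idx = lambda i, j: i * m + j          # position of G[i][j] in the unknown vector
    A = [[0.0] * (m * m) for _ in range(m * m)]
    b = [0.0] * (m * m)
    for i in range(m):
        for j in range(m):
            row = idx(i, j)
            for k in range(m):
                A[row][idx(k, j)] += J[i][k]   # contribution of (J G)_ij
                A[row][idx(i, k)] += J[j][k]   # contribution of (G J^T)_ij
            b[row] = -2.0 * D[i][j]
    g = solve_linear(A, b)
    return [[g[idx(i, j)] for j in range(m)] for i in range(m)]

# Hypothetical Jacobian of a chain -> S1 -> S2 -> and fluctuations entering at S1 only.
J = [[-1.0, 0.0],
     [1.0, -2.0]]
D = [[1.0, 0.0],
     [0.0, 0.0]]
G = covariance_from_jacobian(J, D)
r12 = G[0][1] / math.sqrt(G[0][0] * G[1][1])   # predicted correlation of S1 and S2
print(G, r12)
```

For this toy system the equations can also be solved by hand, giving Γ11 = 1, Γ12 = 1/3 and Γ22 = 1/6, which is a convenient check on the code.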

Unfortunately, the reverse approach towards reconstruction of the Jacobian from the co-variance matrix is not straightforward. For a theoretical experiment, where $M$ metabolites are measured, one obtains the fluctuation matrix $D$ and the co-variance matrix $\Gamma$, which, if substituted into (9.10), give a linear system of equations for the entries of the Jacobian. However, because $\Gamma$ is symmetric, there are only $M(M + 1)/2$ independent equations (the number of independent entries of the co-variance matrix) for $M^2$ unknowns (the number of entries of the Jacobian). Thus, knowing the co-variance matrix is not sufficient for the complete reconstruction of the system, giving a whole range of possible solutions. In order to solve this problem, Steuer et al. (2003) proposed several different approaches. The most direct is solving the system by parametrization with respect to certain entries of the Jacobian that are already known or possible to estimate. Another option is to exploit the redundancy of the Jacobian elements (Diaz-Sierra et al., 1999; Klamt et al., 2002) or to reduce the solution range using known topological parameters of metabolic networks such as their general sparseness (Jeong et al., 2000). In this case, an optimization approach would score solutions with the maximal number of ‘zero’ elements higher, something that does not have to be true in specific cases, but was shown to approximate efficiently the correct solution in most cases (Yeung et al., 2002).
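As a rough worked illustration of this counting argument, consider M = 10 measured metabolites (an arbitrary value chosen here, not taken from the cited studies):

```latex
\begin{aligned}
\text{independent equations:}\quad & \frac{M(M+1)}{2} = \frac{10 \cdot 11}{2} = 55 \\
\text{unknown entries of } J:\quad & M^2 = 100 \\
\text{undetermined degrees of freedom:}\quad & M^2 - \frac{M(M+1)}{2} = \frac{M(M-1)}{2} = 45
\end{aligned}
```

The gap grows quadratically with M, which is why additional constraints such as known entries or sparseness are needed.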



The presented relationship between the Jacobian and correlations in the metabolic system has important consequences. It shows not only that, knowing the stoichiometry and rate laws of the metabolic system, one can precisely predict the correlations but also, more importantly, that each factor affecting the Jacobian of the system, such as a change in enzymatic activity or knockout of the enzyme gene, will also change the correlation pattern. This situation is indeed observed in biological experiments, where mutants are compared with wild-type plants (Weckwerth et al., 2001; Kusano et al., 2007). Moreover, we can conclude that correlation patterns observed for the same metabolic system exposed to different perturbations could be used to estimate the level of common and specific parts of system regulation even if the specific regulatory events have not been identified.

9.4.3.3 Mutual information

Investigation of reaction kinetics shows that the interactions between metabolites are often nonlinear in nature and the use of linear similarity measures such as correlation coefficients might lead to spurious results (Husmeier et al., 2005). Therefore, nonlinear similarity measures, such as mutual information, are attracting increasing attention in the field. Mutual information (MI) is a nonlinear similarity measure based on the entropy of the variables (Butte & Kohane, 2000). Formally, for a system with possible states described by measurements $a_1, \ldots, a_{M_A}$, each with corresponding probability $p(a_i)$, the Shannon entropy is defined as:

$$ H(A) = -\sum_{i=1}^{M_A} p(a_i) \log p(a_i) \tag{9.11} $$

The joint entropy of two variables is then defined as:

$$ H(A, B) = -\sum_{i=1}^{M_A} \sum_{j=1}^{M_B} p(a_i, b_j) \log p(a_i, b_j) \tag{9.12} $$

Then, if variables A and B are statistically independent, the joint entropy is simply:

H(A, B) = H(A) + H(B) (9.13)

However, when statistical independence cannot be assumed, the joint entropy should be expressed in terms of the conditional entropy:

H(A, B) = H(A|B) + H(B) (9.14)

where

$$ H(A|B) = -\sum_{i=1}^{M_A} \sum_{j=1}^{M_B} p(a_i, b_j) \log p(a_i | b_j) \tag{9.15} $$



The mutual information then might be defined as:

$$ I(A, B) = H(A) + H(B) - H(A, B) \tag{9.16} $$

MI thus describes the relationship between two variables as the difference between the entropy they would have jointly as independent variables and their actual joint entropy.
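A minimal sketch of a plug-in MI estimate may help here. It discretizes each variable into equal-width bins and evaluates I(A, B) = H(A) + H(B) − H(A, B) from the observed frequencies. The binning scheme, the bin count `bins` and the toy data are assumptions made for illustration; this naive estimator is not one of the estimators cited in this section.

```python
import math
from collections import Counter

def entropy(states):
    """Shannon entropy (natural log) of a sequence of discrete states."""
    n = len(states)
    return -sum((c / n) * math.log(c / n) for c in Counter(states).values())

def discretize(values, bins):
    """Assign each value to one of `bins` equal-width bins (an assumed scheme)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0      # guard against constant variables
    return [min(int((v - lo) / width), bins - 1) for v in values]

def mutual_information(x, y, bins=3):
    """Plug-in estimate of I(A, B) = H(A) + H(B) - H(A, B)."""
    a, b = discretize(x, bins), discretize(y, bins)
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

# Toy nonlinear dependence: y = x^2 has zero Pearson correlation with x here,
# yet the two variables are clearly not independent.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v * v for v in x]
mi_nonlinear = mutual_information(x, y)

# Two independent binary variables share no information.
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1], bins=2)
print(mi_nonlinear, mi_independent)
```

With only seven samples the estimate is heavily dependent on the binning, which illustrates the sensitivity to sample size discussed in the text.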

Despite its conceptual elegance and simplicity, mutual information has been shown to have significant drawbacks if used for real experimental data. Firstly, it has been shown to be sensitive to the sample size (Steuer et al., 2002); therefore, only big data sets are suitable to be analyzed in this way. Depending on the algorithm used to estimate the Shannon entropy, mutual information could also be very sensitive to outliers. To avoid these drawbacks, computationally demanding algorithms are used, such as B-spline functions (Daub et al., 2004) or kernel density estimation (Lake, 2009), which becomes problematic if high-order conditioning is performed.

Several advanced MI-based network inference algorithms were developed, each combining diverse entropy estimation techniques and network refining techniques with different trade-offs between performance and computational feasibility (Butte & Kohane, 2000; Faith et al., 2007; Margolin et al., 2006; Meyer et al., 2007). Up to now, some of these have been successfully used for efficient gene network inference (Meyer et al., 2008). For reverse engineering of transcription factor networks of E. coli, MI-based approaches were shown to perform similarly to partial Pearson correlation (PPC) (Soranzo et al., 2007). Analogous results were obtained for artificial metabolic networks (Cakir et al., 2009). They indicate that MI might be advantageous only for specific, highly nonlinear systems, but is less beneficial in broader metabolomic studies.

9.4.4 Refining relevance networks

Relevance networks are direct visual representations of the underlying similarity/distance matrix and in most cases do not approximate the underlying metabolic system with satisfactory precision. However, there are methods to refine relevance networks by introducing additional statistical criteria for the interdependence of variables, which result in significant improvement of their network inference performance. The two most common are: (a) conditioning, a method based on path analysis (Shipley, 2002) and (b) network pruning (Margolin et al., 2006).

9.4.4.1 Conditioning

Conditioning of similarity measures, such as the correlation coefficient, originates in path analysis and attempts to identify and remove indirect relationships from the network of dependencies. As an example, the correlation between variables x and y conditioned on variable z is the correlation between the residuals of x and y after each has been regressed on z. In other words, $r_{xy,z}$ quantifies to what extent the parts of x and y that do not correlate with z are correlated with each other. Such conditioning might be performed on 1 to n variables, depending on the chosen ‘order’. For PC, the first-order partial coefficients might be computed as follows:

$$ \text{1st order:}\quad r_{xy,z} = \frac{r_{xy} - r_{xz}\, r_{yz}}{\sqrt{(1 - r_{xz}^2)(1 - r_{yz}^2)}} \tag{9.17} $$
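The first-order formula can be evaluated directly, as in the short sketch below. The correlation values are invented to mimic two situations: an indirect relationship x → z → y, in which the x–y correlation is fully explained by z, and a case in which part of the x–y correlation remains after conditioning. The function name `partial_corr` is hypothetical.

```python
import math

def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation r_xy,z computed from pairwise correlations."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Indirect relationship: r_xy equals r_xz * r_yz, so conditioning on z removes it.
indirect = partial_corr(r_xy=0.64, r_xz=0.8, r_yz=0.8)

# Partly direct relationship: some x-y correlation survives conditioning on z.
direct = partial_corr(r_xy=0.90, r_xz=0.8, r_yz=0.8)
print(indirect, direct)
```

An edge x–y would be dropped from the refined network in the first case and retained in the second, subject to a significance test on the partial coefficient.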

Second and higher order partial correlation coefficients might be computed using similar equations (de la Fuente et al., 2004). However, each additional conditioning costs one degree of freedom in the test, and therefore, for conditioning on n variables one needs more observations than variables in the data set. For data sets containing fewer observations, an alternative approach is graphical Gaussian modelling (GGM) (for more information, see the original papers of Schafer and Strimmer (2005)). Conditioning of the initial correlation matrix using partial correlation or GGM was shown in theoretical studies to reduce substantially the number of indirect, and thus false positive, interactions and to extract chains possessing Markov properties (Edwards, 2000). Studies on E. coli confirmed the applicability of PPC coefficients for the investigation and efficient recovery of known gene regulatory networks from microarray data (Soranzo et al., 2007). As shown by Cakir et al. (2009), PPC also gives relatively precise results in the case of simple metabolic pathways. For example, a model of the E. coli threonine synthesis pathway (Chassagnole et al., 2002) shows that for small networks (a linear pathway of 4 metabolites), conditioning on one variable is sufficient, whereas for more complex networks, such as the glycolysis pathway of S. cerevisiae (13 metabolites and 18 reactions) (Teusink et al., 2000), or the E. coli central carbon metabolism (18 metabolites and 30 reactions) (Chassagnole et al., 2002), GGM performs much better.

Analogous to the PC, one can use conditional mutual information (CMI). In this case, the mutual information between variables x and y conditioned on z is computed for the part of x and y whose MI with z equals 0. The first-order CMI is then defined as:

I (A; B|C) = H(A, C) + H(B, C) − H(C) − H(A, B, C) (9.18)

CMI was shown to allow very efficient inference of gene regulatory networks for small-scale synthetic models (Liang & Wang, 2008). However, Soranzo et al. (2007) have shown that CMI performs similarly to the less computationally demanding PPC for bigger scale synthetic and biological data sets, which is in agreement with findings that most gene dependencies found in transcriptomic studies are of linear nature (Steuer et al., 2002). Analogous studies on metabolic networks show very similar results (Cakir et al., 2009), suggesting that CMI-based algorithms are a very good choice for small-scale systems of expected nonlinear nature, but are not feasible for large-scale omic data sets, where alternative, less sophisticated methods infer networks with similar accuracy.
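The first-order CMI of equation (9.18) can be computed from joint entropies of already discretized variables, as in the sketch below. The data are invented: A and B are both exact copies of C, so they share information marginally but none once C is known. The function names `H`, `mi` and `cmi` are hypothetical, and the plug-in frequency estimate is a simplification of the estimators used in the cited work.

```python
import math
from collections import Counter

def H(*columns):
    """Joint Shannon entropy (natural log) of one or more discrete variables."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((c / n) * math.log(c / n) for c in Counter(rows).values())

def mi(a, b):
    """I(A, B) = H(A) + H(B) - H(A, B)."""
    return H(a) + H(b) - H(a, b)

def cmi(a, b, c):
    """I(A; B | C) = H(A, C) + H(B, C) - H(C) - H(A, B, C)."""
    return H(a, c) + H(b, c) - H(c) - H(a, b, c)

# Hypothetical discretized profiles: A and B are both driven entirely by C.
c = [0, 0, 1, 1, 0, 1, 0, 1]
a = list(c)
b = list(c)
print(mi(a, b))      # ln 2: A and B look strongly related
print(cmi(a, b, c))  # zero: the relationship disappears once C is known
```

This mirrors the partial correlation case: the A–B edge would be judged indirect and removed after conditioning on C.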



9.4.4.2 Pruning

An alternative and simple method to remove indirect interactions is pruning, which is a very general term concerning removal of irrelevant elements of the belief network before invoking inference. Such a removal might be knowledge based, as for instance pruning of metabolic pathways for ‘currency metabolites’, a common routine in metabolic network topology studies (Gerlee et al., 2009; Ma & Zeng, 2003). For networks derived from quantitative data, several statistical pruning methods were developed (Costa et al., 2002). A simple, but very successful method is, for instance, the data processing inequality (DPI) algorithm (Margolin et al., 2006). This method attempts to remove indirect relationships by reducing the number of fully connected triplets in the network using an information theory property, the DPI. The DPI theorem states that one cannot get more information out of a set of data than was there to begin with. Thus, if we deal with a chain of interactions x ↔ y ↔ z with no alternative path between x and z, DPI states that:

$$ I(x, z) \le \min[I(x, y), I(y, z)] \tag{9.19} $$

where I is the mutual information between data sets collected for particular network nodes. Here, MI might be replaced with other similarity scores, such as the correlation coefficient. Applying this theorem to the whole network requires the following steps: (1) the algorithm starts from a relevance network as a weighted graph, where each edge has a weight corresponding to the similarity score in the initial matrix; (2) then all fully connected triplets of nodes are examined and for each fully connected triplet x, y, z an edge $S_{xz}$ is marked to be removed if:

$$ |S_{xz}| \le \min(|S_{xy}|, |S_{yz}|) \times (1 - \varepsilon) \tag{9.20} $$

where $S$ is a similarity score and $\varepsilon$ is a tolerance parameter; (3) finally, marked edges are removed from the graph.

Despite its relative simplicity, the method was shown to remove a significant number of false positive interactions if used for PC and MI relevance networks reconstructed from transcript and metabolic data (Cakir et al., 2009; Soranzo et al., 2007).
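The three steps of DPI pruning can be sketched as follows. In every fully connected triplet, an edge is marked when its absolute weight is no larger than the smaller of the other two absolute weights multiplied by (1 − tolerance), and all marked edges are removed at the end. The edge weights are invented, and the function name `dpi_prune` and its `tolerance` argument are hypothetical; this is an illustrative sketch, not the implementation of Margolin et al. (2006).

```python
from itertools import combinations

def dpi_prune(weights, tolerance=0.0):
    """Remove the weakest edge of each fully connected triplet.

    weights: {frozenset({u, v}): similarity score} for an undirected graph.
    Returns a pruned copy; marking uses the original weights throughout.
    """
    nodes = sorted({n for edge in weights for n in edge})
    marked = set()
    for triplet in combinations(nodes, 3):
        edges = [frozenset(pair) for pair in combinations(triplet, 2)]
        if not all(e in weights for e in edges):
            continue                      # step 2 only examines fully connected triplets
        for e in edges:
            others = [abs(weights[o]) for o in edges if o != e]
            if abs(weights[e]) <= min(others) * (1 - tolerance):
                marked.add(e)             # mark now, remove later (step 3)
    return {e: w for e, w in weights.items() if e not in marked}

# Hypothetical triplet: x-z is the weakest link and is treated as indirect.
weights = {
    frozenset({"x", "y"}): 0.9,
    frozenset({"y", "z"}): 0.8,
    frozenset({"x", "z"}): 0.5,
}
pruned = dpi_prune(weights)
lenient = dpi_prune(weights, tolerance=0.5)   # 0.5 <= 0.8 * 0.5 is false: nothing removed
print(sorted(sorted(e) for e in pruned))
```

Because the criterion uses ≤, a triplet with three exactly equal weights would lose all of its edges when the tolerance is zero; a strict inequality avoids this edge case.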

9.4.4.3 Mutual ranking

Another approach for refining relevance networks is mutual ranking. This method was used for refining gene co-expression networks, because of its computational feasibility and its efficient identification of network communities even in huge and tightly connected graphs (Obayashi et al., 2008, 2007). A mutual ranking algorithm takes as input the original similarity matrix and states that nodes x and y are connected in the network only if y belongs to the top n nodes with the highest similarity score with respect to x and, vice versa, x belongs to the top n nodes with the highest similarity score with respect to y. They are then considered to be mutually top ranked. Parameter n has to be chosen arbitrarily and becomes the maximum degree that a node in the network can have. The result is significantly different from the relevance network obtained by thresholding the significance of interactions. Connectivity of the mutual ranking network is reduced proportionally to the local density of the relevance graph. This makes the method advantageous in the deconvolution of the community structure in densely connected graphs. Two main drawbacks of the method are: (1) the original network topology is lost, which may have consequences for the interpretation and (2) some inferred edges may not be statistically significant in the original similarity matrix. Therefore, the method is not suitable for exploring topological parameters such as the degree distribution of the network, and it is suggested to filter the network edges additionally using statistical significance tests.
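A compact sketch of the mutual ranking rule is given below. An edge x–y is kept only if y is among the n most similar nodes of x and x is among the n most similar nodes of y. The similarity values are invented and the function name `mutual_rank_network` is hypothetical; the cited co-expression studies may differ in details such as tie handling.

```python
def mutual_rank_network(similarity, n):
    """Keep edge x-y only if each node is in the other's top-n list.

    similarity: {node: {other_node: score}}, assumed symmetric.
    """
    top = {}
    for x, scores in similarity.items():
        ranked = sorted(scores, key=lambda y: scores[y], reverse=True)
        top[x] = set(ranked[:n])
    return {frozenset((x, y)) for x in top for y in top[x] if x in top.get(y, ())}

# Hypothetical symmetric similarity matrix for four metabolites.
pairs = {("A", "B"): 0.9, ("A", "C"): 0.8, ("A", "D"): 0.7,
         ("B", "C"): 0.3, ("B", "D"): 0.2, ("C", "D"): 0.6}
similarity = {}
for (u, v), s in pairs.items():
    similarity.setdefault(u, {})[v] = s
    similarity.setdefault(v, {})[u] = s

net1 = mutual_rank_network(similarity, n=1)
net2 = mutual_rank_network(similarity, n=2)
print(sorted(sorted(e) for e in net2))
```

Note that the A–D pair is dropped for n = 2 although its score (0.7) exceeds that of the retained C–D pair (0.6), which illustrates how the result differs from simple thresholding and why no node exceeds degree n.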

9.4.5 Bayesian networks

Another class of data-based graphs are Bayesian networks. Bayesian networks combine a multivariate probability distribution with a graphic representation, introducing high prediction power and the possibility of incorporating prior knowledge, such as the direction of metabolic fluxes, to constrain the space of possible solutions. Methods for learning Bayesian network models were shown to be powerful tools in biology and proved to recover regulatory events in the yeast cell cycle (Murphy & Mian, 1999), to reconstruct gene regulatory pathways from microarray data (Friedman et al., 2000), and were also successfully applied in metabolomics to uncover new metabolic pathways (Gavai et al., 2009). Here, we introduce the concept of the Bayesian network and its application as a tool for network inference.

9.4.5.1 Definition

In principle, a Bayesian network is a multivariate probability distribution with built-in information on the independence of variables, represented as a graph. It consists of the following two parts:

(i) Quantitative, which is a factorized joint probability distribution on a set of random variables. This part allows prediction of the likelihood or the concentration levels of a set of metabolites, if the concentration of one or several metabolites is given. It allows some initial assumptions, but usually is learned directly from the experimental data. Learning the quantitative part from the data is called parameter learning.

(ii) Qualitative, which is an acyclic directed graph (ADG), a graphical representation of interdependencies between system variables. This part constrains the model and allows implementing existing knowledge, such as the structure of metabolic pathways or directionality of biochemical reactions, to reduce the space of possible model solutions. On the other hand, if the underlying network is not known, a space of possible ADGs is explored in a process called structure learning.



When using a metabolic pathway as a scaffold for a Bayesian network, one must assume that it has a hierarchical structure and defined fluxes in order to represent it as an ADG. However, an ADG is not a suitable representation for many metabolic pathways. This limitation concerns mostly central metabolism with multiple pleiotropic compounds and abundant cycles. On the other hand, it is applicable to many pathways with defined inputs or end products, such as glycolysis, amino acid synthesis and many pathways of secondary metabolism.

The ADG structure in Bayesian network analysis is required to constrain the associated joint probability distribution by introducing independence information. For small networks, the independence of system variables might be extracted manually, just by reading the graph. To illustrate this, we introduce some basic definitions concerning ADGs. Formally, in an ADG each pair of connected nodes describes a parent–child relationship. For a pair of nodes that are connected with an arc x1 → x2, the node x1 is called a parent of x2 and x2 is called a child of x1. Analogously, if there is a directed path (x1, . . ., xn) from node x1 to node xn, then xn is called a descendant of x1, whereas x1 is an ancestor of xn. In a metabolic system, these relationships might be regarded in terms of product–substrate interactions. Knowing the structure of these relationships, the conditional independence between the graph nodes, and thus between variables of the system, is identified using d-separation criteria (Pearl, 1998). To express these, first, all three-element subgraphs are classified as: (1) serial connection for x1 → x2 → x3 and x1 ← x2 ← x3; (2) diverging connection for x1 ← x2 → x3; or (3) converging connection for x1 → x2 ← x3. Formally then, two nodes of the network G(V,E) are defined as d-separated by a set of nodes S ⊆ V if every path connecting the two nodes contains a node xk such that either: (1) xk ∈ S and the connection at xk on that path is not converging; or (2) the connection at xk is converging and neither xk nor any of its descendants is in S.

Here, d-separation analysis allows using the ADG G as an independence map for the joint probability distribution P. Using G, one can now factorize the joint probability distribution P as follows:

$$ P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{pa}(X_i)) \tag{9.21} $$

where $X_i$ is a set of variables associated with node $x_i$ and $\mathrm{pa}(X_i)$ is a set of random variables associated with the parents of node $x_i$.
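The factorization can be illustrated with a three-node serial connection x1 → x2 → x3 of binary variables (for instance ‘low’ = 0 and ‘high’ = 1 metabolite levels). All probabilities below are invented, and the names `parents`, `cpt`, `joint` and `conditional` are hypothetical. The sketch builds the joint distribution as a product of per-node conditional probabilities and then checks the conditional independence that d-separation predicts for a serial connection.

```python
from itertools import product

# Qualitative part: the parents of each node in the ADG x1 -> x2 -> x3.
parents = {"x1": (), "x2": ("x1",), "x3": ("x2",)}

# Quantitative part (hypothetical numbers): P(node = 1 | parent states).
cpt = {
    "x1": {(): 0.3},
    "x2": {(0,): 0.2, (1,): 0.9},
    "x3": {(0,): 0.1, (1,): 0.7},
}

def joint(assignment):
    """P(x1, x2, x3) as the product of P(node | parents) over all nodes."""
    p = 1.0
    for node, pa in parents.items():
        p_one = cpt[node][tuple(assignment[q] for q in pa)]
        p *= p_one if assignment[node] == 1 else 1.0 - p_one
    return p

states = [dict(zip(parents, values)) for values in product((0, 1), repeat=3)]

def conditional(target, given):
    """P(target | given), obtained by summing the joint over consistent states."""
    both = {**given, **target}
    num = sum(joint(s) for s in states if all(s[k] == v for k, v in both.items()))
    den = sum(joint(s) for s in states if all(s[k] == v for k, v in given.items()))
    return num / den

total = sum(joint(s) for s in states)
p_low = conditional({"x3": 1}, {"x2": 1, "x1": 0})
p_high = conditional({"x3": 1}, {"x2": 1, "x1": 1})
print(total, p_low, p_high)
```

Once x2 is known, the state of x1 carries no further information about x3: both conditional probabilities equal the table entry P(x3 = 1 | x2 = 1), as expected when x2 d-separates x1 and x3.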

9.4.5.2 Learning Bayesian networks

There is an abundance of approaches to learn both the quantitative and qualitative parts of a Bayesian network. Because this section focuses on data-driven network inference approaches, here we concentrate on fitting the qualitative part. There are two basic types of structure learning: score based and constraint based.



Score-based learning methods are used if the ADG of the system is unknown. Therefore, the whole space of possible ADGs is explored and scored with respect to their potential to explain the data. Whereas there are multiple scoring methods for score-based learning, with most of them one can fit the best model using Bayes’ theorem:

$$ P(M|D) = \frac{P(D|M)\,P(M)}{P(D)} \tag{9.22} $$

where M is a Bayesian model containing both structure and probabilities, D is the data set, P(M) is the prior probability of the model and P(D) is the probability of the data. Usually, the unknown model prior might be replaced with a uniform probability distribution. P(D), on the other hand, might also be unknown, but as we see, it is a normalizing factor in equation (9.22), and therefore, knowing it is not crucial. Thus, to compare two models, M and M′, a Bayes factor might be used:

$$ BF(D) = \frac{P(D|M)}{P(D|M')} \tag{9.23} $$

if the assumption of P(M) = P(M′) as the uniform probability distribution is kept. In such a way, a space of possible ADGs might be explored and the fitness of the models could be compared. However, the Bayes factor does not introduce any assumption about the network complexity. Therefore, more complex graphs will always fit at least as well, with the best fit obtained for a complete graph, where all the nodes are connected. This, of course, makes it useless with respect to structure learning, and therefore, a penalty factor for network complexity should be introduced into the scoring equation (9.22). According to Heckerman (1995), one obtains:

score (M, D) = log P(D|M) + log P(M) − log P(D) − penalty (M) (9.24)

An important issue at this point is how to limit the network complexity. This might be achieved using information about the known topology of biological networks, for example the sparseness of metabolic networks (Jeong et al., 2000).
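The effect of the penalty term can be sketched for binary variables. In the example below, the maximized log-likelihood plays the role of log P(D|M), the model prior is taken as uniform and the constant log P(D) is dropped, so both cancel when two structures are compared. The penalty is a BIC-style term of 0.5 log N per free parameter. This particular penalty, the data counts and the function names are assumptions made for illustration and are not necessarily the choice of Heckerman (1995).

```python
import math
from collections import Counter

def log_likelihood(data, parents):
    """Maximized log P(D | M) for discrete data under the structure `parents`."""
    ll = 0.0
    for node, pa in parents.items():
        joint = Counter((tuple(row[p] for p in pa), row[node]) for row in data)
        marginal = Counter(tuple(row[p] for p in pa) for row in data)
        ll += sum(c * math.log(c / marginal[cfg]) for (cfg, _), c in joint.items())
    return ll

def penalty(data, parents):
    """Assumed BIC-style penalty: 0.5 * log(N) per free parameter (binary variables)."""
    n_params = sum(2 ** len(pa) for pa in parents.values())
    return 0.5 * math.log(len(data)) * n_params

def score(data, parents):
    """Penalized score; the uniform prior and log P(D) are constants and omitted."""
    return log_likelihood(data, parents) - penalty(data, parents)

# Hypothetical counts of (x, y, z) states: y follows x, z is essentially unrelated.
counts = {(0, 0, 0): 3, (0, 0, 1): 3, (0, 1, 0): 1, (0, 1, 1): 1,
          (1, 0, 0): 1, (1, 0, 1): 1, (1, 1, 0): 4, (1, 1, 1): 2}
data = [dict(zip("xyz", state)) for state, k in counts.items() for _ in range(k)]

sparse = {"x": (), "y": ("x",), "z": ()}            # x -> y, z isolated
dense = {"x": (), "y": ("x",), "z": ("x", "y")}     # complete graph
print(log_likelihood(data, sparse), log_likelihood(data, dense))
print(score(data, sparse), score(data, dense))
```

The complete graph attains the higher likelihood, as the text predicts, but the sparse structure obtains the higher penalized score because the small gain in fit does not justify three additional parameters.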

Constraint-based learning methods, in contrast, do not explore the whole search space. Instead, they are based on statistical independence tests such as partial correlations or CMI. In this sense, they might be regarded as an improvement of the already described network inference techniques by the introduction of the Bayesian network concept. Constraint-based learning takes advantage of the possibility of incorporating prior knowledge about dependencies between variables, such as the known structure of biochemical reactions or the directionality of the network edges, which is an important issue in metabolomic studies. One of the algorithms successfully used in the analysis of metabolic networks (Gavai et al., 2009) is the PC algorithm (Peter and Clark) (Spirtes et al., 2000). The algorithm assumes that the ADG perfectly represents the independence of the variables and that the network is sparse. It searches for independence between variables and reconstructs a hypothetical graph in the following steps:

1. Reconstruction of an undirected graph.

2. Identification of possible locations of converging connections by testing independence of the network elements.

3. Introduction of directionality to the edges using information about converging connections and avoiding the production of cycles.

The first two steps lead to generation of the network scaffold on the basis of statistical testing of the independence of variables. The third step is based on rules to reconstruct ADGs (Meek, 1995).
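The steps above can be sketched in a strongly simplified form. The code below starts from a complete undirected graph, removes an edge when the correlation or a first-order partial correlation falls below a fixed cut-off, and then orients converging connections. It is not the full PC algorithm: conditioning is limited to a single variable, the hypothetical `cutoff` replaces a formal independence test, and the orientation rules of Meek (1995) are omitted. The correlation values are invented.

```python
import math
from itertools import combinations

def corr_matrix(pairs):
    """Build a symmetric lookup r[a][b] from {(a, b): correlation}."""
    r = {}
    for (a, b), value in pairs.items():
        r.setdefault(a, {})[b] = value
        r.setdefault(b, {})[a] = value
    return r

def partial_corr(r, x, y, z):
    """First-order partial correlation of x and y given z."""
    num = r[x][y] - r[x][z] * r[y][z]
    return num / math.sqrt((1 - r[x][z] ** 2) * (1 - r[y][z] ** 2))

def pc_sketch(r, cutoff=0.1):
    nodes = sorted(r)
    adj = {n: set(nodes) - {n} for n in nodes}   # start from a complete undirected graph
    sepset = {}
    for x, y in combinations(nodes, 2):
        if abs(r[x][y]) < cutoff:                # marginally independent
            sepset[(x, y)] = set()
        else:
            for z in nodes:                      # independent given one other variable?
                if z not in (x, y) and abs(partial_corr(r, x, y, z)) < cutoff:
                    sepset[(x, y)] = {z}
                    break
        if (x, y) in sepset:
            adj[x].discard(y)
            adj[y].discard(x)
    arcs = set()
    for (x, y), separators in sepset.items():
        for z in adj[x] & adj[y]:
            if z not in separators:              # converging connection x -> z <- y
                arcs.update({(x, z), (y, z)})
    return adj, arcs

collider = corr_matrix({("x", "y"): 0.0, ("x", "z"): 0.6, ("y", "z"): 0.6})
chain = corr_matrix({("x", "y"): 0.64, ("x", "z"): 0.8, ("y", "z"): 0.8})
adj_c, arcs_c = pc_sketch(collider)
adj_l, arcs_l = pc_sketch(chain)
print(arcs_c, arcs_l)
```

In the first toy case x and y are marginally independent but both correlate with z, so the edges are oriented towards z. In the second case z separates x and y, the x–y edge is removed and no direction can be assigned from the data alone.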

9.4.6 Summary

Observed coordination of metabolic fluctuations is difficult to relate directly to the structure of metabolic pathways. Whereas available approaches offer a range of possibilities to address the task, none of them gives the complete solution: accurate reverse engineering of metabolic networks. This is not surprising, taking into account how difficult the task is even in in silico studies and how many factors other than pathway stoichiometry affect fluctuations of metabolites in vivo. Nevertheless, the methods presented should be regarded as valuable tools for hypothesis creation. They allow selection of ‘candidate’ metabolites, which could be investigated in independent experiments with respect to their role in particular experimental conditions, and reveal the community structure of the metabolome, which could be related to the large-scale organization of the metabolic network. These might give valuable hints to a biologist, if integrated with current knowledge and interpreted to answer defined biological questions (Angelovici et al., 2009; Nikiforova et al., 2005; Sanchez et al., 2009; Kusano et al., 2007). In addition to the reconstruction of an approximated network structure by relevance networks, we have introduced here the concept of Bayesian networks, which adds an important feature to the field: the prediction of metabolic states in response to novel stimuli. Before relevance networks become a reliable source of information that does not need laborious verification by independent experiments, analytical and computational techniques will need to be further developed. However, this is a central problem in systems biology and also concerns other omic technologies.



9.5 Metabolomics: the bridge between constraint-based and kinetic modelling

Plant systems biology approaches attempt to provide a holistic view of plant systems by analyzing their response to changes in environmental conditions (e.g. light, temperature, supply of nutrients), which may affect more than one of the system’s constituents (e.g. transcripts, proteins, metabolites). Unlike unicellular model organisms (e.g. E. coli), plants are composed of different tissues consisting of heterogeneous cell populations and multiple cell compartments (Weckwerth, 2003). Because of the high level of subcellular localization, a plant-specific metabolic network, describing all biochemical reactions leading to a particular biological function, contains multiple copies of a biochemical pathway in different membrane-bound compartments (Mintz-Oron et al., 2009). The promise of plant systems biology is that by studying the properties of plant-specific, highly compartmentalized metabolic networks, coupled to gene-regulatory and protein–protein interaction networks, we could better understand the functions of the system and bioengineer a desired systemic behaviour.

Although metabolic networks represent but one facet of biological systems, they describe the interplay between two key constituents, enzymes and metabolites, which can be effectively employed to alter the system’s dynamics. A metabolic network can be succinctly described by the stoichiometry of its biochemical reactions and may also include the known allosteric regulations. The analysis of metabolic models with allosteric regulations requires detailed kinetic modelling, which is currently hindered by the lack of (1) accurate kinetic parameters and (2) knowledge of the underlying kinetic mechanism. Consequently, we focus on a classical constraint-based modelling approach, called flux balance analysis (FBA), which is based on the premise of mass conservation in a network of biochemical reactions described only by their stoichiometry.

In the following sections, we review the classical constraint-based approaches and recent extensions, giving particular focus to integrating metabolomics data to improve the resulting predictions. Since FBA-based approaches operate on a given mass-balanced metabolic network, in Section 9.5.1 we review the characteristics of the existing genome-scale metabolic networks for plants and green algae. Moreover, we show which analytical techniques should be employed in order to incorporate metabolomics data with the reviewed metabolic networks. The details of the classical FBA and its relation to metabolomics are given in Section 9.5.2; in particular, the review is focused on the specification of the optimization functions, its limitations and the issue of imposing further constraints. Finally, the implications of time-series metabolomics data for the recent developments of dynamic FBA (dFBA) are presented in Section 9.5.3. This chapter then closes with a brief discussion of the challenges and opportunities of coupling FBA with metabolomics data.



9.5.1 Plant-specific genome-scale metabolic networks

The sequencing of the entire genomes of green algae (e.g. Chlamydomonas reinhardtii (Merchant et al., 2007)) and plant model species (e.g. A. thaliana (Arabidopsis Genome Initiative, 2000)) provides the framework on the basis of which their genome-scale metabolic networks could be reconstructed. However, in A. thaliana, the function of only half of its ≈27,000 genes has been determined on the basis of sequence similarity, while the function of merely 11% has been experimentally confirmed (MASC, 2007; Saito et al., 2008). As a result, the existing plant-specific genome-scale metabolic networks offer only a small, condensed view of the intertwined biochemical pathways.

Pathway databases, such as ChlamyCyc (May et al., 2009) and AraCyc (Zhang et al., 2005), are an important tool in creating genome-scale metabolic networks. These databases can serve as a starting point for establishing well-defined, mass-preserving network models in accord with data from high-throughput experiments.

Currently, there is one algae-specific and two plant-specific genome-scale metabolic networks, consisting of the following: (1) 484 unique biochemical reactions and 458 metabolites, organized in three compartments (cytosol, chloroplast and mitochondria), for C. reinhardtii (Boyle & Morgan, 2009); (2) 1320 unique reactions, 1438 metabolites and 130 transporters, compartmentalized between the cytosol, mitochondria, vacuole, plastid and peroxisome, for A. thaliana (de Oliveira Dal’Molin et al., 2009); (3) 232 reactions and 255 metabolites for A. thaliana without compartments (Poolman et al., 2009). In addition, several studies have considered compartment-, tissue- and organ-specific metabolic networks (e.g. for barley seed (Grafahrend-Belau et al., 2009) or mitochondria (Ramakrishna et al., 2001)). Regardless of their coverage of the ‘real’ metabolism, these models can serve as a starting point towards more rigorous analyses to identify effects of gene deletions and changing availability of nutrients.

9.5.1.1 Metabolomics data and metabolic networks

The existing mass-balanced genome-scale metabolic networks contain only a limited number of metabolites, e.g. E. coli (1677), Buchnera aphidicola (443), S. cerevisiae (583), C. reinhardtii (1164), A. thaliana (2493) and Homo sapiens (994). With respect to the distribution of molecular weight of the metabolites across these species-specific networks, we observe that most fall in the range of [0, 600] (see Colour Plate 9.4). Therefore, confronting the existing metabolic models with metabolomics data would require application of various analytical techniques (see Colour Plate 9.1).

9.5.2 Classical flux balance analysis

FBA is a modelling framework developed to characterize the capabilities and properties of metabolic networks. A metabolic network consists of the metabolites together with the biochemical reactions in which they are involved, including their formation, degradation, transport and cellular utilization. For every metabolite $X_i$, the mass balance is derived as follows:

$$\frac{dX_i}{dt} = \sum_j s_{ij} v_j - b_i \tag{9.25}$$

where $s_{ij}$ is the stoichiometric coefficient associated with each flux $v_j$ through reaction $j$, and $b_i$ is the net transport flux of $X_i$. These concepts are illustrated in Colour Plate 9.5a, where there are four internal fluxes and three transport fluxes through three metabolites $X_1$, $X_2$ and $X_3$. The mass-conservation relation under steady-state conditions ($dX_i/dt = 0$) reduces to the expression:

$$\sum_j s_{ij} v_j - b_i = 0$$

or equivalently, over all intermediates:

$$S \cdot v - b = 0$$

where $S$ is the stoichiometric matrix ($m$ rows and $n$ columns), $v$ is the vector of $n$ metabolic fluxes and $b$ is the vector representing $m$ transport fluxes (i.e. known consumption rates, by-product production rates and uptake rates). For an illustration, see Colour Plate 9.5b. As the system described in the previous equation is underdetermined ($n > m$, i.e. there are more fluxes than metabolites), there exist multiple solutions corresponding to feasible flux distributions, each representing a particular metabolic state that satisfies these stoichiometric constraints. As the null space of the stoichiometric matrix $S$ is composed of all vectors $v$ such that $S \cdot v = 0$, it contains all feasible flux distributions representing the capabilities of the metabolic genotype. The transport fluxes represent environmental conditions that, along with the genotype, define the metabolic state. The question addressed by FBA is then: which of these feasible metabolic states is manifested in the studied metabolic model (network)? For the example system, the constraints are given in Colour Plate 9.5c, together with their schematic view in Colour Plate 9.5d.
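As a minimal sketch of the steady-state relation and the null space, consider a hypothetical three-metabolite network. It is not the network of Colour Plate 9.5; the reactions, the flux values and all variable names below are assumptions made purely for illustration. Transport fluxes are stored as extra columns of a full matrix, so that $b$ can be recovered from them and the null space can be computed with SciPy.

```python
import numpy as np
from scipy.linalg import null_space

# Hypothetical toy network (an assumption, not Colour Plate 9.5).
# Rows: metabolites X1, X2, X3.
# Columns: uptake (-> X1), r1 (X1 -> X2), r2 (X1 -> X3),
#          r3 (2 X3 -> X2), growth drain (X2 ->), secretion (X3 ->).
S_full = np.array([
    [1, -1, -1,  0,  0,  0],
    [0,  1,  0,  1, -1,  0],
    [0,  0,  1, -2,  0, -1],
], dtype=float)

internal = [1, 2, 3]     # columns of the internal reactions
transport = [0, 4, 5]    # columns of the transport reactions

S = S_full[:, internal]              # internal stoichiometric matrix
v = np.array([6.0, 4.0, 2.0])        # assumed internal fluxes
v_tr = np.array([10.0, 8.0, 0.0])    # assumed uptake, growth, secretion

# b_i is the net transport flux out of metabolite X_i.
b = -S_full[:, transport] @ v_tr

# Steady state requires S v - b = 0 for every metabolite.
residual = S @ v - b

# Orthonormal basis of the null space of the full matrix: every
# steady-state flux vector is a linear combination of its columns.
N = null_space(S_full)
print(residual, N.shape)
```

With six fluxes and three independent mass balances, the null space of this toy matrix has three dimensions, which is the sense in which the stoichiometry alone leaves the metabolic state undetermined.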

FBA relies on the assumption that the metabolic system exhibits a metabolic state that is optimal under certain criteria. Usually, this objective is expressed as a linear combination of the fluxes contained in $v$, which leads to a linear programming (LP) problem:

$$\min\,(\max)\quad z = \sum_i c_i v_i \quad \text{s.t.} \tag{9.26}$$

$$S v - b = 0 \tag{9.27}$$

$$0 \le \alpha_i \le v_i \le \beta_i \tag{9.28}$$

with $z$ representing the phenotypic property and $c$ a vector of coefficients that are either costs or benefits derived from fluxes. The bounds $\alpha_i$ and $\beta_i$ represent known constraints on the minimum and maximum values that fluxes can assume. Note that the mass-conservation relation forms part of the constraints, as illustrated in Colour Plate 9.5d, where the blue point represents the solution to the LP formulation.
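The LP formulation can be sketched with a generic solver. The example below uses a hypothetical three-metabolite network, not the one in Colour Plate 9.5. The stoichiometry, the flux bounds and the choice of the growth drain as objective are assumptions, and transport fluxes are modelled as bounded columns of $S$, so that $b = 0$.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy network (an assumption, not Colour Plate 9.5).
# Columns: uptake (-> X1), r1 (X1 -> X2), r2 (X1 -> X3),
#          r3 (2 X3 -> X2), growth drain (X2 ->), secretion (X3 ->).
S = np.array([
    [1, -1, -1,  0,  0,  0],
    [0,  1,  0,  1, -1,  0],
    [0,  0,  1, -2,  0, -1],
], dtype=float)
b = np.zeros(3)  # transport is carried by columns of S here

alpha = np.zeros(6)                                   # lower bounds
beta = np.array([10.0, 6.0, 20.0, 20.0, 20.0, 20.0])  # upper bounds

c = np.zeros(6)
c[4] = 1.0  # objective: flux through the growth drain

# linprog minimises, so the maximisation is posed as min(-c . v).
res = linprog(-c, A_eq=S, b_eq=b,
              bounds=list(zip(alpha, beta)), method="highs")

v_opt = res.x      # optimal flux distribution
z_opt = -res.fun   # optimal value of the objective
print(v_opt, z_opt)
```

In this toy case the direct route r1 is capped, so the optimum also uses the lossy bypass r2 and r3; the bounds, rather than the stoichiometry alone, select the predicted metabolic state.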

The importance of a flux $v_i$ with respect to a chosen objective $z$ can be investigated through the difference between the optimal values obtained for $z$ with and without $v_i$ in the basic solution. However, such an approach requires that the chosen objective function be capable of describing the optimality principle of the system under different environments. Recent studies have revealed that no single objective function describes the metabolic state under all conditions. Moreover, if the accuracy of prediction is taken into consideration, then the metabolic operating principle may be best described by a nonlinear objective function, which causes computational challenges (Schuetz et al., 2007).

The most common choice for objective functions is maximization of yield, biomass or growth, which allows for a wide range of predictions consistent with experimental observations for simple model organisms (Edwards & Palsson, 2002; Schuster et al., 2008). Other optimization functions include minimization of adenosine triphosphate (ATP) production to determine the conditions for energy efficiency and minimization of nutrient uptake (Schuetz et al., 2007).

The biomass is usually represented as a stoichiometrically balanced reaction, describing the formation of biomass from various cellular components, as well as various co-factors, which are required for driving the process forward. The yield is then given as biomass per nutrient.

A simple way to arrive at the biomass reaction is to assess the percentage contribution of each macromolecule (e.g. RNA or DNA, lipid, protein) per gram of dry weight. Each macromolecule is then broken down into individual representatives (e.g. amino acids for proteins) that should be present in the network. In some cases, only a limited number of biomass precursors are used in the balanced reaction.

Metabolomics data can readily be employed to refine the biomass reaction by determining the contribution of each biomass precursor. However, we point out that the formulation of the biomass reaction in virtually all existing studies neglects the variability of measurements of the metabolome. Moreover, in such a setting, the biomass reaction needs to be determined for each particular environmental condition prior to carrying out FBA on the entire network. For instance, in C. reinhardtii under autotrophic conditions, 1 biomass = 0.002 DNA + 0.051 RNA + 2.005 protein + 2.008 carbohydrate + 0.203 lipid + 0.010 chlorophyll a + 0.016 chlorophyll b + 9.350 ATP (polymerization) + 29.890 ATP (maintenance), while under heterotrophic conditions, 1 biomass = 0.002 DNA + 0.051 RNA + 1.706 protein + 1.752 carbohydrate + 0.307 lipid + 0.020 chlorophyll a + 0.009 chlorophyll b + 8.890 ATP (polymerization) + 29.890 ATP (maintenance) (Boyle & Morgan, 2009).
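The condition dependence of the biomass reaction can be made explicit by holding the two sets of coefficients quoted above as simple maps. The sketch below only restates those numbers; the dictionary layout and the helper `changed_components` are hypothetical and serve illustration only.

```python
# Biomass coefficients for C. reinhardtii as quoted in the text
# (Boyle & Morgan, 2009); the data layout is an assumption.
AUTOTROPHIC = {
    "DNA": 0.002, "RNA": 0.051, "protein": 2.005,
    "carbohydrate": 2.008, "lipid": 0.203,
    "chlorophyll a": 0.010, "chlorophyll b": 0.016,
    "ATP (polymerization)": 9.350, "ATP (maintenance)": 29.890,
}
HETEROTROPHIC = {
    "DNA": 0.002, "RNA": 0.051, "protein": 1.706,
    "carbohydrate": 1.752, "lipid": 0.307,
    "chlorophyll a": 0.020, "chlorophyll b": 0.009,
    "ATP (polymerization)": 8.890, "ATP (maintenance)": 29.890,
}


def changed_components(first, second, tol=1e-9):
    """Return the components whose coefficients differ between conditions."""
    return sorted(k for k in first if abs(first[k] - second[k]) > tol)


changed = changed_components(AUTOTROPHIC, HETEROTROPHIC)
print(changed)
```

Only the DNA, RNA and maintenance ATP coefficients are shared between the two conditions, which illustrates why a single biomass reaction cannot simply be reused when the environment changes.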



9.5.2.1 Refinement of the objective function

Several approaches have been proposed to overcome the issue with a priori specification of the most likely objective function. Given an experimentally determined metabolic state (i.e. a flux distribution), ObjFind attempts to identify weightings $c_i$, called coefficients of importance, on reaction fluxes $v_i$ in a network while minimizing the difference between the resultant flux distribution and the experimentally determined one (Burgard & Maranas, 2003). More formally, ObjFind solves the following:

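$$\min_{c_i} \sum_i \left(v_i - v_i^{\mathrm{exp}}\right)^2 \quad \text{s.t.} \tag{9.29}$$

$$\max_{v_i} \sum_i c_i v_i \quad \text{s.t.} \tag{9.30}$$

$$\sum_j s_{ij} v_j - b_i = 0 \tag{9.31}$$

$$0 \le \alpha_i \le v_i \le \beta_i \tag{9.32}$$

$$\sum_i c_i = 1 \tag{9.33}$$

$$c_i \ge 0 \tag{9.34}$$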

A high $c_i$ then indicates a reaction that is more likely a component of the cellular objective function. However, ObjFind is unable to define a biomass reaction a priori, since in FBA the objective function is usually defined in terms of a biomass reaction (included in the set of constraints) and not as a weighting on multiple reactions. As a result, if the biomass reaction is not included in the FBA constraints, ObjFind may return a suboptimal combination of reactions and, consequently, wrong predictions. The approach has been extended to detecting gene knockouts for applications in biotechnology (Burgard et al., 2003).

In the extension of the ObjFind approach, termed biological objective solution search (BOSS), the biomass reaction is a de novo reaction added to the stoichiometric matrix $S$ through a bi-level optimization procedure (Gianchandani et al., 2008):

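$$\min_{c_i} \sum_i \left(v_i - v_i^{\mathrm{exp}}\right)^2 \quad \text{s.t.} \tag{9.35}$$

$$\max\ v_{\mathrm{biomass}} \quad \text{s.t.} \tag{9.36}$$

$$\sum_j s_{ij} v_j - b_i = 0 \tag{9.37}$$

$$0 \le \alpha_i \le v_i \le \beta_i \tag{9.38}$$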

Here, the biomass function is allowed to take any form, as long as it is confined to be one linear stoichiometric reaction whose coefficients are determined by the framework. BOSS operates by guessing values for the stoichiometric coefficients over many runs, using a single-level optimization algorithm to generate one biomass reaction prediction per run. Finally, the reported biomass reaction is the one whose cluster contains most of the candidate biomass reactions that minimize the sum-squared error between the flux distributions computed by the framework and the experimentally determined flux data.

9.5.2.2 Thermodynamic constraints and metabolomics data

The successive addition of constraints further confines the solution space (see Colour Plate 9.5d, rightmost panel) and alters the final optimal solution. Predictions made without considering, for instance, energy balance may not in general be physically realistic. The reason for this is that in classical FBA, thermodynamic constraints are accounted for only naively, by providing upper and lower bounds for the fluxes (thus fixing the reaction reversibility). Therefore, the dependence of reversibility on intracellular conditions, which may change in response to environmental changes, is not accounted for. In Beard et al. (2002), additional constraints are imposed by considering energy balance in the network: by analysis of the null space of the stoichiometric matrix, which includes only the internal reactions, a vector $\Delta\mu$ of chemical potential differences associated with reaction fluxes can be obtained, such that $v_i \Delta\mu_i < 0$. The latter ensures that the entropy production is positive for each reaction. To estimate the chemical potentials, a quadratic programming approach is used that minimizes the norm $\|\Delta\mu\|^2$. This method was recently extended to a multi-objective optimization approach, suitable for higher level organisms (Nagrath et al., 2007).

To increase the predictive power, FBA has been extended by including additional constraints to ensure thermodynamic plausibility (Hoppe et al., 2007). This constraint implies that the flux directions are consistent with the corresponding changes in Gibbs free energies of reactions, which in turn depend on the metabolite concentrations. The optimization problem corresponds to a mixed integer linear program (MILP) with a quadratic scoring function that penalizes thermodynamic discrepancies and deviations from measured metabolite concentrations.

Metabolomics data can be effectively employed to estimate each $\Delta\mu_i$ (e.g. for the reaction A → B, $\Delta\mu = -k_B T \ln(K_{eq} X_A / X_B)$, where $X_A$ and $X_B$ are the concentrations of A and B, respectively, and $K_{eq}$ is the equilibrium constant of the reaction). However, it must be pointed out that any uncertainties in the metabolomics data, together with other factors, such as pH and pMg, may have a significant influence on the obtained results (Vojinovic & von Stockar, 2009).
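The expression for a single reaction A → B can be evaluated directly from measured concentrations. The sketch below assumes a default temperature of 298.15 K and uses illustrative concentrations and an illustrative equilibrium constant; the function names are hypothetical, and treating a zero flux as feasible is an additional assumption, since the text only states the strict condition $v_i \Delta\mu_i < 0$.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K


def delta_mu(k_eq, x_a, x_b, temperature=298.15):
    """Chemical potential difference of A -> B per molecule, in joules."""
    return -K_B * temperature * math.log(k_eq * x_a / x_b)


def direction_is_feasible(flux, dmu):
    """Check the sign condition flux * delta_mu < 0.

    A zero flux is accepted as feasible (an assumption of this sketch).
    """
    return flux == 0 or flux * dmu < 0


# Illustrative values only: K_eq = 10, equal concentrations of A and B.
dmu_forward = delta_mu(10.0, 1.0, 1.0)
# Product accumulated a hundredfold: the forward direction is disfavoured.
dmu_reverse = delta_mu(10.0, 1.0, 100.0)

print(direction_is_feasible(1.0, dmu_forward))   # forward flux
print(direction_is_feasible(1.0, dmu_reverse))   # forward flux
```

Because the sign of $\Delta\mu$ depends on the ratio $X_A/X_B$, measurement uncertainty in either concentration can flip the inferred direction when $K_{eq} X_A / X_B$ is close to one.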

9.5.2.3 FBA-based approaches in the study of mutants

The reviewed FBA-based approaches have been essential in predicting the metabolic state of unicellular organisms exposed to a long-term evolutionary pressure under the assumption that they operate towards optimizing a suitably defined function of reaction fluxes. Since the same argument may not hold for genetically engineered organisms, two constraint-based approaches have been proposed: minimization of metabolic adjustments (MOMA) and regulatory on/off minimization (ROOM).

Given a metabolic state of the wild type, $v_w$, MOMA is based on the hypothesis that the fluxes $v$ in the genetically engineered organism (e.g. via gene knockout) undergo a minimal redistribution with respect to $v_w$. The minimal redistribution is assessed via the Euclidean distance between the two metabolic states $v$ and $v_w$, resulting in the following quadratic program:

$$\min\ (v - v_w)^{T} (v - v_w) \quad \text{s.t.} \tag{9.39}$$

$$S \cdot v - b = 0 \tag{9.40}$$

$$0 \le \alpha_i \le v_i \le \beta_i \tag{9.41}$$

$$v_i = 0 \tag{9.42}$$

where (9.42) sets to zero the flux $v_i$ of each knocked-out reaction.

In ROOM, the aim is to minimize the number of significant flux changes with respect to the wild-type metabolic state $v_w$. The formalization of the approach results in the following MILP:

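$$\min \sum_i y_i \quad \text{s.t.} \tag{9.43}$$

$$S \cdot v - b = 0 \tag{9.44}$$

$$0 \le \alpha_i \le v_i \le \beta_i \tag{9.45}$$

$$v_i = 0 \ \text{(knocked-out reactions)}, \quad y_i \in \{0, 1\} \tag{9.46}$$

$$v_i - y_i \left(\beta_i - v_{w,i}^{u}\right) \le v_{w,i}^{u} \tag{9.47}$$

$$v_i - y_i \left(\alpha_i - v_{w,i}^{l}\right) \ge v_{w,i}^{l} \tag{9.48}$$

$$v_{w,i}^{u} = v_{w,i} + \delta\,|v_{w,i}| + \varepsilon \tag{9.49}$$

$$v_{w,i}^{l} = v_{w,i} - \delta\,|v_{w,i}| - \varepsilon \tag{9.50}$$

where $v_w^{l}$ and $v_w^{u}$ determine the thresholds for the significance of flux changes (with $\delta$ and $\varepsilon$ denoting the relative and absolute ranges of tolerance, respectively).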
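The ROOM formulation can be sketched as a small MILP. The network below is hypothetical, and the wild-type state, the knocked-out reaction, the tolerances and the finite upper bounds (needed so that the binary switches can relax the threshold constraints) are all assumptions chosen for illustration. The solver call relies on `scipy.optimize.milp`, available in SciPy 1.9 and later.

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

# Hypothetical toy network (an assumption of this sketch).
# Columns: uptake, r1 (X1 -> X2), r2 (X1 -> X3), r3 (2 X3 -> X2),
#          growth drain, secretion.
S = np.array([
    [1, -1, -1,  0,  0,  0],
    [0,  1,  0,  1, -1,  0],
    [0,  0,  1, -2,  0, -1],
], dtype=float)
m, n = S.shape

alpha = np.zeros(n)
beta = np.array([10.0, 6.0, 20.0, 20.0, 20.0, 20.0])
v_w = np.array([10.0, 6.0, 4.0, 2.0, 8.0, 0.0])  # assumed wild-type state
knockout = 1                                     # reaction r1 is removed
delta, eps = 0.05, 0.001                         # assumed tolerances

w_u = v_w + delta * np.abs(v_w) + eps   # upper significance threshold
w_l = v_w - delta * np.abs(v_w) - eps   # lower significance threshold

# Decision vector x = [v_1..v_n, y_1..y_n]; y_i = 1 flags a changed flux.
cost = np.concatenate([np.zeros(n), np.ones(n)])
integrality = np.concatenate([np.zeros(n), np.ones(n)])
lb = np.concatenate([alpha, np.zeros(n)])
ub = np.concatenate([beta, np.ones(n)])
lb[knockout] = ub[knockout] = 0.0       # knockout constraint v_i = 0

eye = np.eye(n)
constraints = [
    # Mass balance S v = 0 (transport is part of S, so b = 0).
    LinearConstraint(np.hstack([S, np.zeros((m, n))]),
                     np.zeros(m), np.zeros(m)),
    # v_i - y_i (beta_i - w_u_i) <= w_u_i
    LinearConstraint(np.hstack([eye, -np.diag(beta - w_u)]),
                     np.full(n, -np.inf), w_u),
    # v_i - y_i (alpha_i - w_l_i) >= w_l_i
    LinearConstraint(np.hstack([eye, -np.diag(alpha - w_l)]),
                     w_l, np.full(n, np.inf)),
]

res = milp(cost, integrality=integrality,
           bounds=Bounds(lb, ub), constraints=constraints)
v_room = res.x[:n]
y = np.round(res.x[n:]).astype(int)
print(v_room, y)
```

For this toy case three fluxes change significantly: the knocked-out reaction, the growth drain and the uptake. ROOM keeps the bypass fluxes near their wild-type values, even though doing so lowers growth.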

Although MOMA and ROOM have been employed to study the redistribution of fluxes in altered bacterial strains, their major drawback lies in the choice of a reference metabolic state (i.e. that of the wild type). As classical FBA often results in multiple solutions, it may not be instrumental in determining the reference state. Even when fluxes are experimentally measured, these approaches rely on determining a hypothetical wild-type state $v_w$ on the basis of classical FBA with the additional constraint that it minimizes the distance to the experimental data while providing an optimal growth rate. It is worth pointing out that, thus far, there exist no mutant studies that combine metabolomics data with constraint-based approaches. Colour Plate 9.6 illustrates the concept of MOMA with respect to the solution of FBA.
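The contrast between MOMA and FBA for a mutant can also be sketched numerically. The network, bounds, wild-type state and knocked-out reaction below are hypothetical. The quadratic program is solved here with SciPy's general-purpose SLSQP routine rather than a dedicated QP solver, and the knocked-out column is simply dropped from the problem; both are choices of this sketch, not of the original method.

```python
import numpy as np
from scipy.optimize import linprog, minimize

# Hypothetical toy network (an assumption of this sketch).
# Columns: uptake, r1 (X1 -> X2), r2 (X1 -> X3), r3 (2 X3 -> X2),
#          growth drain, secretion.
S = np.array([
    [1, -1, -1,  0,  0,  0],
    [0,  1,  0,  1, -1,  0],
    [0,  0,  1, -2,  0, -1],
], dtype=float)
n = S.shape[1]
alpha = np.zeros(n)
beta = np.array([10.0, 6.0, 20.0, 20.0, 20.0, 20.0])
v_w = np.array([10.0, 6.0, 4.0, 2.0, 8.0, 0.0])  # assumed wild-type state
knockout = 1                                     # reaction r1 is removed

# Remove the knocked-out column; its flux is fixed at zero.
keep = [i for i in range(n) if i != knockout]
S_k, w_k = S[:, keep], v_w[keep]
bounds_k = list(zip(alpha[keep], beta[keep]))

# MOMA: minimise the squared Euclidean distance to the wild type.
res = minimize(
    fun=lambda x: float(np.sum((x - w_k) ** 2)),
    x0=np.zeros(len(keep)),
    jac=lambda x: 2.0 * (x - w_k),
    bounds=bounds_k,
    constraints=[{"type": "eq",
                  "fun": lambda x: S_k @ x,
                  "jac": lambda x: S_k}],
    method="SLSQP",
    options={"ftol": 1e-10, "maxiter": 500},
)
v_moma = np.zeros(n)
v_moma[keep] = res.x

# FBA on the same mutant: maximise the growth drain (index 3 in S_k).
c_k = np.zeros(len(keep))
c_k[3] = 1.0
fba = linprog(-c_k, A_eq=S_k, b_eq=np.zeros(3),
              bounds=bounds_k, method="highs")
growth_fba = -fba.fun
growth_moma = v_moma[4]
print(growth_fba, growth_moma)
```

In this toy network the mutant FBA optimum reaches a growth flux of 5, whereas MOMA settles at roughly 3.8, because staying close to the wild-type fluxes is preferred over re-optimizing growth.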



9.5.3 Dynamic flux balance analysis

With the decreasing cost of metabolomics technologies, it is possible to gather time-resolved data that may be used to obtain a better understanding of the investigated system. To allow integration of (metabolomics) time-series data, the classical static FBA approaches have been extended to dFBA. There are two dFBA approaches that are particularly interesting in the era of metabolomics: (1) static-optimization dFBA and (2) dynamic-optimization dFBA. The technical details for the formulation of these approaches are rather involved; we therefore offer only a brief description and direct the reader to the relevant papers.

In the static-optimization dFBA approaches (Mahadevan et al., 2002; Covert & Palsson, 2002), the time course of interest is discretized and classical FBA is solved at the beginning of each time interval, followed by integration of yield (biomass or growth). This approach is based on the quasi-steady-state assumption that fluxes are constant over the small time interval. Recently, the static-optimization dFBA has been combined with MOMA, with the objective to minimize the Euclidean distance between concentrations of metabolites, and applied to the study of cooperative regulation of photosynthesis in C3 plants (Luo et al., 2009b).
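The static-optimization scheme can be sketched as a loop that solves one LP per time interval and then integrates biomass and substrate with an explicit Euler step. The network, the saturable uptake limit, the yield factor and the initial conditions below are hypothetical and are chosen only to keep the example self-contained.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy network (an assumption of this sketch).
# Columns: uptake, r1 (X1 -> X2), r2 (X1 -> X3), r3 (2 X3 -> X2),
#          growth drain, secretion.
S = np.array([
    [1, -1, -1,  0,  0,  0],
    [0,  1,  0,  1, -1,  0],
    [0,  0,  1, -2,  0, -1],
], dtype=float)
c = np.zeros(6)
c[4] = 1.0  # objective: flux through the growth drain


def fba(uptake_max):
    """Classical FBA for one interval with the given uptake limit."""
    bounds = [(0.0, uptake_max), (0.0, 6.0)] + [(0.0, 20.0)] * 4
    res = linprog(-c, A_eq=S, b_eq=np.zeros(3),
                  bounds=bounds, method="highs")
    return res.x


dt, steps = 0.1, 50        # interval length and number of intervals
YIELD = 0.1                # hypothetical biomass yield per growth flux
biomass, substrate = [0.1], [5.0]

for _ in range(steps):
    x, g = biomass[-1], substrate[-1]
    # Hypothetical saturable uptake limit, capped so that no more
    # substrate is consumed than is available in this interval.
    v_max = 10.0 * g / (0.5 + g)
    uptake_max = min(v_max, g / (x * dt))
    v = fba(uptake_max)    # fluxes assumed constant over the interval
    biomass.append(x + YIELD * v[4] * x * dt)
    substrate.append(max(g - v[0] * x * dt, 0.0))

print(biomass[-1], substrate[-1])
```

Measured time series of extracellular metabolites could replace the assumed uptake limit at each interval, which is one route by which time-resolved metabolomics data enter such a scheme.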

In the dynamic-optimization dFBA approaches, along with the system of dynamic equations, several additional constraints must be imposed for a realistic prediction of the metabolite concentrations and fluxes. These include non-negative metabolite and flux levels, limits on the rate of change of fluxes and any additional nonlinear constraints on the transport fluxes. The resulting formulation is given by a nonlinear programming problem. This dynamic-optimization approach may be suitable for the analysis of transient behaviours in response to system perturbations or fluctuations, for example diauxic growth of E. coli (Mahadevan et al., 2002). However, dynamic-optimization dFBA is restricted to systems of small size due to the computational complexity of the employed formulation.

Finally, one could envision that the dFBA approaches can be used to integrate several types of networks (e.g. metabolic, protein–protein interaction and gene-regulatory networks) to simulate the dynamics of the system. The idFBA framework (Lee et al., 2008) is the first step towards integrating stoichiometric reconstructions of signalling, metabolic and regulatory processes. A major challenge for such an integration is the fact that the various processes operate on different time-scales. idFBA attempts to address this issue by including slow reactions in a time-delayed fashion with discretization of time as in the static-optimization dFBA. Moreover, specifying the objectives of the integrated networks is more involved than considering metabolic networks alone. To this end, the idFBA framework relies on BOSS and has been employed for analyzing a portion of the high-osmolarity glycerol response pathway in S. cerevisiae with reasonable quantitative predictions.



9.5.4 Challenges and opportunities

As a constraint-based modelling approach, FBA is based on the assumption that the system operates at a steady state, which may be a far-fetched assumption in plant systems, which are controlled by circadian rhythms (Dodd et al., 2005). Moreover, while the optimization principle seems to produce valuable results in the study of simpler model organisms, such as E. coli and S. cerevisiae, there exists no validation that the same holds for model plant systems.

Although metabolomics data have already played a major role in metabolic network reconstruction (Feist et al., 2009), yielding a static snapshot of the system, time-resolved concentrations of metabolites may significantly contribute to understanding the system’s dynamics (Kell, 2004). To this end, the recently devised approaches, such as the reviewed dynamic FBA, may be employed to generate time-course predictions comparable to those stemming from more intricate kinetic models. Moreover, the simplifying assumptions of the two dFBA approaches, when coupled with time-resolved metabolomics data, may lead to devising methods for data-driven model discrimination.

Finally, we must stress that the solution of any constraint-based approach is only as good as the constraints placed in the network model capturing the biochemical details. Therefore, the first step in carrying out an FBA-based study entails careful curation of the model, together with verification of the thermodynamic feasibility of the included biochemical reactions.

Acknowledgements

We thank Miyako Kusano, Fumio Matsuda and Akira Oikawa for kindly sharing their GC-MS, LC-MS and CE-MS data used in Section 9.2.

References

Abe, T., Kanaya, S., Kinouchi, M. et al. (2003) Informatics for unveiling hidden genome signatures. Genome Research 13, 693–702.

Akiyama, K., Chikayama, E., Yuasa, H. et al. (2008) PRIMe: a web site that assembles tools for metabolomics and transcriptomics. In Silico Biology 8, 339–345.

Andersson-Gunneras, S., Mellerowicz, E.J., Love, J. et al. (2006) Biosynthesis of cellulose-enriched tension wood in Populus: global analysis of transcripts and metabolites identifies biochemical and developmental regulators in secondary wall biosynthesis. The Plant Journal 45, 144–165.

Angelovici, R., Fait, A., Zhu, X. et al. (2009) Novel genes and metabolic networks regulating seed maturation and germination, revealed by seed-specific alteration of lysine metabolism. Plant Physiology 151, 2058–2072.

Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796–815.

Baier, M.C., Barsch, A., Kuster, H. et al. (2007) Antisense repression of the Medicago truncatula nodule-enhanced sucrose synthase leads to a handicapped nitrogen fixation mirrored by specific alterations in the symbiotic transcriptome and metabolome. Plant Physiology 145, 1600–1618.

Bailey, T.L., Williams, N., Misleh, C. et al. (2006) MEME: discovering and analyzing DNA and protein sequence motifs. Nucleic Acids Research 34, W369–W373.

Beard, D.A., Liang, S. and Qian, H. (2002) Energy balance for analysis of complex metabolic networks. Biophysical Journal 83, 79–86.

Biais, B., Allwood, J.W., Deborde, C. et al. (2009) 1H NMR, GC-EI-TOFMS, and data set correlation for fruit metabolomics: application to spatial metabolite analysis in melon. Analytical Chemistry 81, 2884–2894.

Birkemeyer, C., Luedemann, A., Wagner, C. et al. (2005) Metabolome analysis: the potential of in vivo labeling with stable isotopes for metabolite profiling. Trends in Biotechnology 23, 28–33.

Borevitz, J.O., Xia, Y., Blount, J. et al. (2000) Activation tagging identifies a conserved MYB regulator of phenylpropanoid biosynthesis. Plant Cell 12, 2383–2394.

Bowman, J.L., Smyth, D.R. and Meyerowitz, E.M. (1991) Genetic interactions among floral homeotic genes of Arabidopsis. Development 112, 1–20.

Boyle, N. and Morgan, J. (2009) Flux balance analysis of primary metabolism in Chlamydomonas reinhardtii. BMC Systems Biology 3, 4.

Brautigam, K., Dietzel, L., Kleine, T. et al. (2009) Dynamic plastid redox signals integrate gene expression and metabolism to induce distinct metabolic states in photosynthetic acclimation in Arabidopsis. Plant Cell 21, 2715–2732.

Brenner, S., Johnson, M., Bridgham, J. et al. (2000) Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nature Biotechnology 18, 630–634.

Broeckling, C.D., Huhman, D.V., Farag, M.A. et al. (2005) Metabolic profiling of Medicago truncatula cell cultures reveals the effects of biotic and abiotic elicitors on metabolism. Journal of Experimental Botany 56, 323–336.

Burgard, A.P. and Maranas, C.D. (2003) Optimization-based framework for inferring and testing hypothesized metabolic objective functions. Biotechnology and Bioengineering 82, 670–677.

Burgard, A.P., Pharkya, P. and Maranas, C.D. (2003) OptKnock: a bilevel programming framework for identifying gene knockout strategies for microbial strain optimization. Biotechnology and Bioengineering 84, 647–657.

Butte, A. and Kohane, I. (2000) Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. In: Pac Symp Biocomput, vol. 5. Citeseer, pp. 418–429.

Bylesjo, M., Eriksson, D., Sjodin, A. et al. (2007a) Orthogonal projections to latent structures as a strategy for microarray data normalization. BMC Bioinformatics 8, 207.

Bylesjo, M., Eriksson, D., Kusano, M. et al. (2007b) Data integration in plant biology: the O2PLS method for combined modeling of transcript and metabolite data. Plant Journal 52, 1181–1191.

Bylesjo, M., Nilsson, R., Srivastava, V. et al. (2009) Integrated analysis of transcript, protein and metabolite data to study lignin biosynthesis in hybrid aspen. Journal of Proteome Research 8, 199–210.

Buscher, J.M., Czernik, D., Ewald, J.C. et al. (2009) Cross-platform comparison of methods for quantitative metabolomics of primary metabolism. Analytical Chemistry 81, 2135–2143.



Cakir, T., Hendriks, M.M., Westerhuis, J.A. et al. (2009) Metabolic network discovery through reverse engineering of metabolome data. Metabolomics 5, 318–329.

Camacho, D., de la Fuente, A. and Mendes, P. (2005) The origin of correlations in metabolomics data. Metabolomics 1, 53–63.

Carrari, F. and Fernie, A.R. (2006) Metabolic regulation underlying tomato fruit development. Journal of Experimental Botany 57, 1883–1897.

Chassagnole, C., Noisommit-Rizzi, N., Schmid, J.W. et al. (2002) Dynamic modeling of the central carbon metabolism of Escherichia coli. Biotechnology and Bioengineering 79, 53–73.

Cook, D., Fowler, S., Fiehn, O. et al. (2004) A prominent role for the CBF cold response pathway in configuring the low-temperature metabolome of Arabidopsis. Proceedings of the National Academy of Sciences of the United States of America 101, 15243–15248.

Costa, M.A., Braga, A.P. and de Menezes, B.R. (2002) Constructive and pruning methods for neural network design. In: Brazilian Symposium on Neural Networks, vol. 0. pp. 49–49.

Covert, M.W. and Palsson, B.O. (2002) Transcriptional regulation in constraints-based metabolic models of Escherichia coli. Journal of Biological Chemistry 277, 28058–28064.

Craigon, D.J., James, N., Okyere, J. et al. (2004) NASCArrays: a repository for microarray data generated by NASC’s transcriptomics service. Nucleic Acids Research 32, D575–577.

Daub, C., Steuer, R., Selbig, J. et al. (2004) Estimating mutual information using B-spline functions – an improved similarity measure for analysing gene expression data. BMC Bioinformatics 5, 118.

Dauwe, R., Morreel, K., Goeminne, G. et al. (2007) Molecular phenotyping of lignin-modified tobacco reveals associated changes in cell-wall metabolism, primary metabolism, stress metabolism and photorespiration. The Plant Journal 52, 263–285.

Davuluri, R.V., Sun, H., Palaniswamy, S.K. et al. (2003) AGRIS: Arabidopsis gene regulatory information server, an information resource of Arabidopsis cis-regulatory elements and transcription factors. BMC Bioinformatics 4, 25.

de Godoy, L.M., Olsen, J.V., de Souza, G.A. et al. (2006) Status of complete proteome analysis by mass spectrometry: SILAC labeled yeast as a model system. Genome Biology 7, R50.

de la Fuente, A., Bing, N., Hoeschele, I. et al. (2004) Discovery of meaningful associations in genomic data using partial correlation coefficients. Bioinformatics 20, 3565–3574.

de Oliveira Dal’Molin, C.G., Quek, L.E., Palfreyman, R.W. et al. (2009) AraGEM, a genome-scale reconstruction of primary metabolic network in Arabidopsis thaliana. Plant Physiology 152, 579–589.

Dettmer, K., Aronov, P.A. and Hammock, B.D. (2007) Mass spectrometry-based metabolomics. Mass Spectrometry Review 26, 51–78.

Diaz-Sierra, R., Lozano, J.B. and Fairen, V. (1999) Deduction of chemical mechanisms from the linear response around steady state. The Journal of Physical Chemistry A 103, 337–343.

Dodd, A.N., Salathia, N., Hall, A. et al. (2005) Plant circadian clocks increase photosynthesis, growth, survival, and competitive advantage. Science 309, 630–633.

Doi, K., Hosaka, A., Nagata, T. et al. (2008) Development of a novel data mining tool to find cis-elements in rice gene promoter regions. BMC Plant Biology 8, 20.



Druart, N., Johansson, A., Baba, K. et al. (2007) Environmental and hormonal regula-tion of the activity dormancy cycle in the cambial meristem involves stage-specificmodulation of transcriptional and metabolic networks. The Plant Journal 50, 557–573.

Edwards, D. (2000) Introduction to Graphical Modelling. Springer, New York.Edwards, J.S. and Palsson, B.O. (2002) The Escherichia coli MG1655 in silico metabolic

genotype: its definition, characteristics, and capabilities. Proceedings of the NationalAcademy of Sciences of the United States of America 9, 5528–5533.

Faith, J.J., Hayete, B., Thaden, J.T. et al. (2007) Large-scale mapping and validation ofEscherichia coli transcriptional regulation from a compendium of expression profiles.PLoS Biology 5, e8.

Feist, A.M., Herrgard, M.J., Thiele, I. et al. (2009) Reconstruction of biochemical net-works in microorganisms. Nature Reviews Microbiology 7, 129–143.

Fiehn, O. (2003) Metabolic networks of Cucurbita maxima phloem. Phytochemistry 62,875–886.

Fiehn, O., Kopka, J., Dormann, P. et al. (2000) Metabolite profiling for plant functionalgenomics. Nature Biotechnology 18, 1157–1161.

Figueiredo, A., Fortes, A.M., Ferreira, S. et al. (2008) Transcriptional and metabolicprofiling of grape (Vitis vinifera L.) leaves unravel possible innate resistance againstpathogenic fungi. Journal of Experimental Botany 59, 3371–3381.

Forshed, J., Idborg, H. and Jacobsson, S.P. (2007) Evaluation of different techniquesfor data fusion of LC/MS and H-1-NMR. Chemometrics and Intelligent LaboratorySystems 85, 102–109.

Friedman, N., Linial, M., Nachman, I. et al. (2000) Using Bayesian networks to analyze expression data. Journal of Computational Biology 7, 601.

Gavai, A.K., Tikunov, Y., Ursem, R. et al. (2009) Constraint-based probabilistic learning of metabolic pathways from tomato volatiles. Metabolomics 5, 419–428.

Gerlee, P., Lizana, L. and Sneppen, K. (2009) Pathway identification by network pruning in the metabolic network of Escherichia coli. Bioinformatics 25, 3282–3288.

Gianchandani, E.P., Oberhardt, M.A., Burgard, A.P. et al. (2008) Predicting biological system objectives de novo from internal state measurements. BMC Bioinformatics 9, 43.

Goda, H., Sasaki, E., Akiyama, K. et al. (2008) The AtGenExpress hormone and chemical treatment data set: experimental design, data evaluation, model data analysis and data access. Plant Journal 55, 526–542.

Goossens, A., Hakkinen, S.T., Laakso, I. et al. (2003) A functional genomics approach toward the understanding of secondary metabolism in plant cells. Proceedings of the National Academy of Sciences of the United States of America 100, 8595–8600.

Grafahrend-Belau, E., Schreiber, F., Koschutzki, D. et al. (2009) Flux balance analysis of barley seeds: a computational approach to study systemic properties of central metabolism. Plant Physiology 149, 585–598.

Gullberg, J., Jonsson, P., Nordstrom, A. et al. (2004) Design of experiments: an efficient strategy to identify factors influencing extraction and derivatization of Arabidopsis thaliana samples in metabolomic studies with gas chromatography/mass spectrometry. Analytical Biochemistry 331, 283–295.

Heckerman, D. (1995) A tutorial on learning with Bayesian networks. Technical Report, Microsoft Research.

Heinrich, R. and Schuster, S. (1996) The Regulation of Cellular Systems. Chapman & Hall, New York.



Hernandez, G., Valdes-Lopez, O., Ramírez, M. et al. (2009) Global changes in the transcript and metabolic profiles during symbiotic nitrogen fixation in phosphorus-stressed common bean plants. Plant Physiology 151, 1221–1238.

Hirai, M.Y., Klein, M., Fujikawa, Y. et al. (2005) Elucidation of gene-to-gene and metabolite-to-gene networks in Arabidopsis by integration of metabolomics and transcriptomics. Journal of Biological Chemistry 280, 25590–25595.

Hirai, M.Y., Sugiyama, K., Sawada, Y. et al. (2007) Omics-based identification of Arabidopsis Myb transcription factors regulating aliphatic glucosinolate biosynthesis. Proceedings of the National Academy of Sciences of the United States of America 104, 6478–6483.

Hirai, M.Y., Yano, M., Goodenowe, D.B. et al. (2004) Integration of transcriptomics and metabolomics for understanding of global responses to nutritional stresses in Arabidopsis thaliana. Proceedings of the National Academy of Sciences of the United States of America 101, 10205–10210.

Hofmeyr, J.H.S. (2001) Metabolic control analysis in a nutshell. In: Yi, T.M., Hucka, M., Morohashi, M. and Kitano, H. (eds.) 2nd International Conference on Systems Biology, Omnipress, Madison, WI. pp. 291–300.

Hoppe, A., Hoffmann, S. and Holzhutter, H.G. (2007) Including metabolite concentrations into flux balance analysis: thermodynamic realizability as a constraint on flux distributions in metabolic networks. BMC Systems Biology 1, 23.

Horan, K., Jang, C., Bailey-Serres, J. et al. (2008) Annotating genes of known and unknown function by large-scale coexpression analysis. Plant Physiology 147, 41–57.

Howell, K.A., Narsai, R., Carroll, A. et al. (2009) Mapping metabolic and transcript temporal switches during germination in rice highlights specific transcription factors and the role of RNA instability in the germination process. Plant Physiology 149, 961–980.

Husmeier, D., Dybowski, R. and Roberts, S. (2005) Probabilistic Modeling in Bioinformatics and Medical Informatics. Springer, New York.

Hynne, F., Dano, S. and Sorensen, P.G. (2001) Full-scale model of glycolysis in Saccharomyces cerevisiae. Biophysical Chemistry 94, 121–163.

Jeong, H., Tombor, B., Albert, R. et al. (2000) The large-scale organization of metabolic networks. Nature 407, 651–654.

Joung, J.G., Corbett, A.M., Fellman, S.M. et al. (2009) Plant MetGenMAP: an integrative analysis system for plant systems biology. Plant Physiology 151, 1758–1768.

Kanaya, S., Kinouchi, M., Abe, T. et al. (2001) Analysis of codon usage diversity of bacterial genes with a self-organizing map (SOM): characterization of horizontally transferred genes with emphasis on the E. coli O157 genome. Gene 276, 89–99.

Kanehisa, M. and Goto, S. (2000) KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research 28, 27–30.

Karp, P.D., Ouzounis, C.A., Moore-Kochlacs, C. et al. (2005) Expansion of the BioCyc collection of pathway/genome databases to 160 genomes. Nucleic Acids Research 33, 6083–6089.

Kell, D.B. (2004) Metabolomics and systems biology: making sense of the soup. Current Opinion in Microbiology 7, 296–307.

Kilian, J., Whitehead, D., Horak, J. et al. (2007) The AtGenExpress global stress expression data set: protocols, evaluation and model data analysis of UV-B light, drought and cold stress responses. Plant Journal 50, 347–363.



Klamt, S., Schuster, S. and Gilles, E.D. (2002) Calculability analysis in underdetermined metabolic networks illustrated by a model of the central metabolism in purple nonsulfur bacteria. Biotechnology and Bioengineering 77, 734–751.

Klie, S., Nikoloski, Z. and Selbig, J. (2010) Biological cluster evaluation for gene function prediction. Journal of Computational Biology 17, 1–8.

Kolbe, A., Oliver, S.N., Fernie, A.R. et al. (2006) Combined transcript and metabolite profiling of Arabidopsis leaves reveals fundamental effects of the thiol-disulfide status on plant metabolism. Plant Physiology 141, 412–422.

Kristensen, C., Morant, M., Olsen, C.E. et al. (2005) Metabolic engineering of dhurrin in transgenic Arabidopsis plants with marginal inadvertent effects on the metabolome and transcriptome. Proceedings of the National Academy of Sciences of the United States of America 102, 1779–1784.

Kusano, M., Fukushima, A., Arita, M. et al. (2007) Unbiased characterization of genotype-dependent metabolic regulations by metabolomic approach in Arabidopsis thaliana. BMC Systems Biology 1, 53.

Kusnierczyk, A., Winge, P., Jørstad, T. et al. (2008) Towards global understanding of plant defence against aphids – timing and dynamics of early Arabidopsis defence responses to cabbage aphid (Brevicoryne brassicae) attack. Plant Cell Environment 31, 1097–1115.

Lake, D.E. (2009) Nonparametric entropy estimation using kernel densities. Methods in Enzymology 467, 531–546.

Lawrence, S. (2006) Trends in biotech literature 2005. Nature Biotechnology 24, 380.

Lee, J.M., Gianchandani, E.P., Eddy, J.A. et al. (2008) Dynamic analysis of integrated signaling, metabolic, and regulatory networks. PLoS Computational Biology 4, e1000086.

Lenz, E.M. and Wilson, I.D. (2007) Analytical strategies in metabonomics. Journal of Proteome Research 6, 443–458.

Liang, K.C. and Wang, X. (2008) Gene regulatory network reconstruction using conditional mutual information. EURASIP Journal on Bioinformatics and Systems Biology 2008, 253894.

Lippold, F., Sanchez, D.H., Musialak, M. et al. (2009) AtMyb41 regulates transcriptional and metabolic responses to osmotic stress in Arabidopsis. Plant Physiology 149, 1761–1772.

Liu, R.H., Lin, D.L., Chang, W.T. et al. (2002) Isotopically labeled analogues for drug quantitation. Analytical Chemistry 74, 618A–626A.

Luo, J., Fuell, C., Parr, A. et al. (2009a) A novel polyamine acyltransferase responsible for the accumulation of spermidine conjugates in Arabidopsis seed. Plant Cell 21, 318–333.

Luo, R., Wei, H., Ye, L. et al. (2009b) Photosynthetic metabolism of C3 plants shows highly cooperative regulation under changing environments: a systems biological analysis. Proceedings of the National Academy of Sciences of the United States of America 106, 847–852.

Ma, H.W. and Zeng, A.P. (2003) The connectivity structure, giant strong component and centrality of metabolic networks. Bioinformatics 19, 1423–1430.

Mahadevan, R., Edwards, J.S. and Doyle, F.J., 3rd (2002) Dynamic flux balance analysis of diauxic growth in Escherichia coli. Biophysical Journal 83, 1331–1340.

Margolin, A., Nemenman, I., Basso, K. et al. (2006) ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics 7, S7.



Martins, A.M., Camacho, D., Shuman, J. et al. (2004) A systems biology study of two distinct growth phases of Saccharomyces cerevisiae cultures. Current Genomics 5, 649–663.

MASC (2007) The Multinational Coordinated Arabidopsis thaliana Functional Genomics Project – Annual Report 2007. Technical Report, The Multinational Arabidopsis Steering Committee.

Matsuda, F., Hirai, M.Y., Sasaki, E. et al. (2009) AtMetExpress development: a phytochemical atlas of Arabidopsis thaliana development. Plant Physiology 152, 566–578.

May, P., Christian, J.O., Kempa, S. et al. (2009) ChlamyCyc: an integrative systems biology database and web-portal for Chlamydomonas reinhardtii. BMC Genomics 10, 209.

May, P., Wienkoop, S., Kempa, S. et al. (2008) Metabolomics- and proteomics-assisted genome annotation and analysis of the draft metabolic network of Chlamydomonas reinhardtii. Genetics 179, 157–166.

Meek, C. (1995) Causal inference and causal explanation with background knowledge. In: Uncertainty in Artificial Intelligence: Proceedings of the Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann Pub, San Francisco. p. 403.

Merchant, S.S., Prochnik, S.E., Vallon, O. et al. (2007) The Chlamydomonas genome reveals the evolution of key animal and plant functions. Science 318, 245–250.

Meyer, P., Lafitte, F. and Bontempi, G. (2008) minet: a R/Bioconductor package for inferring large transcriptional networks using mutual information. BMC Bioinformatics 9, 461.

Meyer, P.E., Kontos, K., Lafitte, F. et al. (2007) Information-theoretic inference of large transcriptional regulatory networks. EURASIP Journal on Bioinformatics and Systems Biology 2007, 79879.

Mintz-Oron, S., Aharoni, A., Ruppin, E. et al. (2009) Network-based prediction of metabolic enzymes’ subcellular localization. Bioinformatics 25, i247–i252.

Mintz-Oron, S., Mandel, T., Rogachev, I. et al. (2008) Gene expression and metabolism in tomato fruit surface tissues. Plant Physiology 147, 823–851.

Moco, S., Forshed, J., De Vos, R.C.H. et al. (2008) Intra- and inter-metabolite correlation spectroscopy of tomato metabolomics data obtained by liquid chromatography-mass spectrometry and nuclear magnetic resonance. Metabolomics 4, 202–215.

Morgenthal, K., Weckwerth, W. and Steuer, R. (2006) Metabolomic networks in plants: transitions from pattern recognition to biological interpretation. Biosystems 83, 108–117.

Morioka, R., Kanaya, S., Hirai, M.Y. et al. (2007) Predicting state transitions in the transcriptome and metabolome using a linear dynamical system model. BMC Bioinformatics 8, 343.

Mounet, F., Moing, A., Garcia, V. et al. (2009) Gene and metabolite regulatory network analysis of early developing fruit tissues highlights new candidate genes for the control of tomato fruit composition and development. Plant Physiology 149, 1505–1528.

Muller-Linow, M., Weckwerth, W. and Hutt, M.T. (2007) Consistency analysis of metabolic correlation networks. BMC Systems Biology 1, 44.

Murphy, K. and Mian, S. (1999) Modelling Gene Expression Data using Dynamic Bayesian Networks. Technical Report, University of California, Berkeley.



Nagrath, D., Avila-Elchiver, M., Berthiaume, F. et al. (2007) Integrated energy and flux balance multiobjective framework for large-scale metabolic networks. Annals of Biomedical Engineering 35, 863–885.

Ni, Y., Su, M., Qiu, Y. et al. (2007) Metabolic profiling using combined GC-MS and LC-MS provides a systems understanding of aristolochic acid-induced nephrotoxicity in rat. FEBS Letters 581, 707–711.

Nikiforova, V.J., Daub, C.O., Hesse, H. et al. (2005) Integrative gene-metabolite network with implemented causality deciphers informational fluxes of sulphur stress response. Journal of Experimental Botany 56, 1887–1896.

Obayashi, T., Hayashi, S., Shibaoka, M. et al. (2008) COXPRESdb: a database of coexpressed gene networks in mammals. Nucleic Acids Research 36, D77–D82.

Obayashi, T., Kinoshita, K., Nakai, K. et al. (2007) ATTED-II: a database of co-expressed genes and cis elements for identifying co-regulated gene groups in Arabidopsis. Nucleic Acids Research 35, D863–D869.

Obayashi, T., Hayashi, S., Saeki, M. et al. (2009) ATTED-II provides coexpressed gene networks for Arabidopsis. Nucleic Acids Research 37, D987–D991.

Ong, E.S., Chor, C.F., Zou, L. et al. (2009) A multi-analytical approach for metabolomic profiling of zebrafish (Danio rerio) livers. Molecular BioSystems 5, 288–298.

Paley, S.M. and Karp, P.D. (2006) The Pathway Tools cellular overview diagram and Omics Viewer. Nucleic Acids Research 34, 3771–3778.

Pearl, J. (1998) Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, Menlo Park.

Poolman, M.G., Miguet, L., Sweetlove, L.J. et al. (2009) A genome-scale metabolic model of Arabidopsis thaliana and some of its properties. Plant Physiology 151, 1570–1581.

Ramakrishna, R., Edwards, J.S., McCulloch, A. et al. (2001) Flux-balance analysis of mitochondrial energy metabolism: consequences of systemic stoichiometric constraints. The American Journal of Physiology - Regulatory, Integrative, and Comparative Physiology 280, R695–R704.

Rawat, A., Seifert, G. and Deng, Y. (2008) Novel implementation of conditional co-regulation by graph theory to derive co-expressed genes from microarray data. BMC Bioinformatics 9, S7.

Redestig, H., Kusano, M., Matsuda, F. et al. (2010) Consolidating metabolite identifiers to enable contextual and multi-platform metabolomics. BMC Bioinformatics 11, 214.

Redestig, H., Fukushima, A., Stenlund, H. et al. (2009) Compensation for systematic cross-contribution improves normalization of mass spectrometry based metabolomics data. Analytical Chemistry 81, 7974–7980.

Roessner, U., Luedemann, A., Brust, D. et al. (2001) Metabolic profiling allows comprehensive phenotyping of genetically or environmentally modified plant systems. Plant Cell 13, 11–29.

Roussel, S., Bellon-Maurel, V., Roger, J. et al. (2003) Fusion of aroma, FT-IR and UV sensor data based on the Bayesian inference. Application to the discrimination of white grape varieties. Chemometrics and Intelligent Laboratory Systems 65, 209–219.

Rowe, H.C., Hansen, B.G., Halkier, B.A. et al. (2008) Biochemical networks and epistasis shape the Arabidopsis thaliana metabolome. Plant Cell 20, 1199–1216.

Saito, K., Hirai, M.Y. and Yonekura-Sakakibara, K. (2008) Decoding genes with coexpression networks and metabolomics – ‘majority report by precogs’. Trends in Plant Science 13, 36–43.



Sanchez, D.H., Szymanski, J., Erban, A. et al. (2009) Mining for robust transcriptional and metabolic responses to long-term salt stress: a case study on the model legume Lotus japonicus. Plant Cell Environment, NA, NA.

Sawada, Y., Kuwahara, A., Nagano, M. et al. (2009a) Omics-based approaches to methionine side chain elongation in Arabidopsis: characterization of the genes encoding methylthioalkylmalate isomerase and methylthioalkylmalate dehydrogenase. Plant and Cell Physiology 50, 1181–1190.

Sawada, Y., Toyooka, K., Kuwahara, A. et al. (2009b) Arabidopsis bile acid: sodium symporter family protein 5 is involved in methionine-derived glucosinolate biosynthesis. Plant and Cell Physiology 50, 1579–1586.

Schmid, M., Davison, T.S., Henz, S.R. et al. (2005) A gene expression map of Arabidopsis thaliana development. Nature Genetics 37, 501–506.

Schuetz, R., Kuepfer, L. and Sauer, U. (2007) Systematic evaluation of objective functions for predicting intracellular fluxes in Escherichia coli. Molecular Systems Biology 3, 119.

Schuster, S., Pfeiffer, T. and Fell, D.A. (2008) Is maximization of molar yield in metabolic networks favoured by evolution? Journal of Theoretical Biology 252, 497–504.

Schafer, J. and Strimmer, K. (2005) An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics 21, 754–764.

Shipley, B. (2002) Cause and Correlation in Biology: A User’s Guide to Path Analysis, Structural Equations and Causal Inference. Cambridge University Press, Cambridge, UK.

Smilde, A.K., Van Der Werf, M.J., Bijlsma, S. et al. (2005) Fusion of mass spectrometry-based metabolomics data. Analytical Chemistry 77, 6729–6736.

Soranzo, N., Bianconi, G. and Altafini, C. (2007) Comparing association network algorithms for reverse engineering of large-scale gene regulatory networks: synthetic versus real data. Bioinformatics 23, 1640–1647.

Spirtes, P., Glymour, C. and Scheines, R. (2000) Causation, Prediction and Search. The MIT Press, Cambridge, USA.

Steinmetz, V., Sevila, F. and Bellon-Maurel, V. (1999) A methodology for sensor fusion design: application to fruit quality assessment. Journal of Agricultural Engineering Research 74, 21–31.

Stelling, J., Klamt, S., Bettenbrock, K. et al. (2002) Metabolic network structure determines key aspects of functionality and regulation. Nature 420, 190–193.

Steuer, R., Kurths, J., Daub, C.O. et al. (2002) The mutual information: detecting and evaluating dependencies between variables. Bioinformatics 18, S231–S240.

Steuer, R., Kurths, J., Fiehn, O. et al. (2003) Observing and interpreting correlations in metabolomic networks. Bioinformatics 19, 1019–1026.

Styczynski, M.P., Moxley, J.F., Tong, L.V. et al. (2007) Systematic identification of conserved metabolites in GC/MS data for metabolomics and biomarker discovery. Analytical Chemistry 79, 966–973.

Suzuki, M., Kusano, M., Takahashi, H. et al. (2010) Rice-Arabidopsis FOX line screening with FT-NIR-based fingerprinting for GC-TOF/MS-based metabolite profiling. Metabolomics 6, 137–145.

Sysi-Aho, M., Katajamaa, M., Yetukuri, L. et al. (2007) Normalization method for metabolomics data using optimal selection of multiple internal standards. BMC Bioinformatics 8, 93.



Szymanski, J., Jozefczuk, S., Nikoloski, Z. et al. (2009) Stability of metabolic correlations under changing environmental conditions in Escherichia coli – a systems approach. PLoS ONE 4, e7441.

Taroncher-Oldenburg, G. and Marshall, A. (2007) Trends in biotech literature 2006. Nature Biotechnology 25, 961.

Teusink, B., Passarge, J., Reijenga, C.A. et al. (2000) Can yeast glycolysis be understood in terms of in vitro kinetics of the constituent enzymes? Testing biochemistry. European Journal of Biochemistry 267, 5313–5329.

Thimm, O., Blasing, O., Gibon, Y. et al. (2004) MAPMAN: a user-driven tool to display genomics data sets onto diagrams of metabolic pathways and other biological processes. Plant Journal 37, 914–939.

Thomas-Chollier, M., Sand, O., Turatsinze, J.V. et al. (2008) RSAT: regulatory sequence analysis tools. Nucleic Acids Research 36, W119–W127.

t’Kindt, R., Morreel, K., Deforce, D. et al. (2009) Joint GC-MS and LC-MS platforms for comprehensive plant metabolomics: repeatability and sample pre-treatment. Journal of Chromatography B, NA, NA.

Tohge, T., Nishiyama, Y., Hirai, M.Y. et al. (2005) Functional genomics by integrated analysis of metabolome and transcriptome of Arabidopsis plants over-expressing a MYB transcription factor. Plant Journal 42, 218–235.

Tokimatsu, T., Sakurai, N., Suzuki, H. et al. (2005) KaPPA-view: a web-based analysis tool for integration of transcript and metabolite data on plant metabolic pathway maps. Plant Physiology 138, 1289–1300.

Toufighi, K., Brady, S.M., Austin, R. et al. (2005) The Botany Array Resource: e-Northerns, Expression Angling, and promoter analyses. Plant Journal 43, 153–163.

Trygg, J. and Wold, S. (2002) Orthogonal projections to latent structures (O-PLS). Journal of Chemometrics 16, 119–128.

Urano, K., Maruyama, K., Ogata, Y. et al. (2008) Characterization of the ABA-regulated global responses to dehydration in Arabidopsis by metabolomics. Plant Journal 57, 1065–1078.

Urbanczyk-Wochniak, E., Luedemann, A., Kopka, J. et al. (2003) Parallel analysis of transcript and metabolic profiles: a new approach in systems biology. EMBO Reports 4, 989–993.

Usadel, B., Nagel, A., Thimm, O. et al. (2005) Extension of the visualization tool MapMan to allow statistical analysis of arrays, display of corresponding genes, and comparison with known responses. Plant Physiology 138, 1195–1204.

van den Berg, R.A., Hoefsloot, H.C.J., Westerhuis, J.A. et al. (2006) Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC Genomics 7, 142.

Van Der Werf, M.J., Overkamp, K.M., Muilwijk, B. et al. (2007) Microbial metabolomics: toward a platform with full metabolome coverage. Analytical Biochemistry 370, 17–25.

van Kampen, N. (1992) Stochastic Processes in Physics and Chemistry. Elsevier, Amsterdam.

Vojinovic, V. and von Stockar, U. (2009) Influence of uncertainties in pH, pMg, activity coefficients, metabolite concentrations, and other factors on the analysis of the thermodynamic feasibility of metabolic pathways. Biotechnology and Bioengineering 103, 780–795.



Wang, H., Schauer, N., Usadel, B. et al. (2009) Regulatory features underlying pollination-dependent and -independent tomato fruit set revealed by transcript and primary metabolite profiling. Plant Cell 21, 1428–1452.

Wang, R., Guegler, K., LaBrie, S.T. et al. (2000) Genomic analysis of a nutrient response in Arabidopsis reveals diverse expression patterns and novel metabolic and potential regulatory genes induced by nitrate. Plant Cell 12, 1491–1510.

Weckwerth, W. (2003) Metabolomics in systems biology. Annual Review of Plant Biology 54, 669–689.

Weckwerth, W., Loureiro, M.E., Wenzel, K. et al. (2004) Differential metabolic networks unravel the effects of silent plant phenotypes. Proceedings of the National Academy of Sciences of the United States of America 101, 7809–7814.

Weckwerth, W., Tolstikov, V. and Fiehn, O. (2001) Metabolomic characterization of transgenic potato plants using GC/TOF and LC/MS. In: Proceedings of the 49th ASMS Conference on Mass Spectrometry and Allied Topics. pp. 1–2.

Weigelt, K., Kuster, H., Rutten, T. et al. (2009) ADP-glucose pyrophosphorylase-deficient pea embryos reveal specific transcriptional and metabolic changes of carbon–nitrogen metabolism and stress responses. Plant Physiology 149, 395–411.

Wentzell, A.M., Rowe, H.C., Hansen, B.G. et al. (2007) Linking metabolic QTLs with network and cis-eQTLs controlling biosynthetic pathways. PLoS Genetics 3, e162.

Westerhuis, J., Kourti, T. and MacGregor, J. (1998) Analysis of multiblock and hierarchical PCA and PLS models. Journal of Chemometrics 12, 301–321.

Williams, R., Lenz, E.M., Wilson, A.J. et al. (2006) A multi-analytical platform approach to the metabonomic analysis of plasma from normal and Zucker (fa/fa) obese rats. Molecular BioSystems 2, 174–183.

Yeung, M.K., Tegner, J. and Collins, J.J. (2002) Reverse engineering gene networks using singular value decomposition and robust regression. Proceedings of the National Academy of Sciences of the United States of America 99, 6163–6168.

Yonekura-Sakakibara, K., Tohge, T., Matsuda, F. et al. (2008) Comprehensive flavonol profiling and transcriptome coexpression analysis leading to decoding gene-metabolite correlations in Arabidopsis. Plant Cell 20, 2160–2176.

Yonekura-Sakakibara, K., Tohge, T., Niida, R. et al. (2007) Identification of a flavonol 7-O-rhamnosyltransferase gene determining flavonoid pattern in Arabidopsis by transcriptome coexpression analysis and reverse genetics. Journal of Biological Chemistry 282, 14932–14941.

Zhang, P., Foerster, H., Tissier, C.P. et al. (2005) MetaCyc and AraCyc. Metabolic pathway databases for plant research. Plant Physiology 138, 27–37.

Zimmermann, P., Hirsch-Hoffmann, M., Hennig, L. et al. (2004) GENEVESTIGATOR. Arabidopsis microarray database and analysis toolbox. Plant Physiology 136, 2621–2632.
