12
Computers and Chemical Engineering 29 (2005) 565–576 Large-scale inference of the transcriptional regulation of Bacillus subtilis Anshuman Gupta a , Jeffrey D. Varner b , Costas D. Maranas a,a Department of Chemical Engineering, The Pennsylvania State University, University Park, PA 16802, USA b Process Science Group, Genencor International In, 925 Page Mill Rd, Palo Alto, CA 94304, USA Received 19 December 2003; received in revised form 22 March 2004 Available online 12 October 2004 Abstract This paper addresses the inference of the transcriptional regulatory network of Bacillus subtilis. Two inference approaches, a linear, additive model and a non-linear power-law model, are used to analyze the expression of 747 genes from B. subtilis obtained using Affymetrix GeneChip ® arrays under three different experimental conditions. A robustness analysis is introduced for identifying confidence levels for all inferred regulatory connections. Both the linear and non-linear methods produce candidate networks that share a scale-free or a “hub- and-spoke” topology with a small number of global regulator genes influencing the expression of a large number of target genes. The two computational approaches in tandem are able to identify known global regulators with a high level of confidence. The linear model is able to identify the interactions of highly expressed genes, particularly those involved in genetic information processing, energy metabolism and signal transduction. Conversely, the non-linear power-law approach tends to capture development regulation and specific carbon and nitrogen regulatory interactions. © 2004 Elsevier Ltd. All rights reserved. Keywords: Bacillus subtilis; Transcriptional regulation; Signal transduction 1. Introduction The data explosion currently overwhelming biology presents a challenging paradox: “you can see everything, but understand very little”. Sequencing a genome, which only a few years ago was a tremendous feat, is now routine. Many examples of transcript array analysis can be found in the lit- eratures (Chu et al., 1998; Cohen, Pilpel, Mitra, & Church, 2002; Gardner, di Bernardo, Lorenz, & Collins, 2003; Lee, Rinaldi, & Robert, 2002; Spellman et al., 1998; Wen et al., 1998); an exponential increase since the ground breaking work of Brown and co-workers on S. cerevisiae (DeRisi, Iyer, & Brown, 1997). Genomics, proteomics and metabolomics are now the cutting edge of physiological analysis (Gill et al., 2002; Herrgard, Covert, & Palsson, 2003; Ideker et al., 2001; Misra et al., 2002; Oh, Rohlin, Kao, & Liao, 2002; Stephanopoulos, Hwang, Schmitt, Misra, & Stephanopoulos, Corresponding author. Tel.: +1 814 863 9958; fax: +1 814 865 7846. E-mail address: [email protected] (C.D. Maranas). 2002). These technologies when married with physiologi- cal measurements, genome-wide transcription measurements and genetic sequence generate huge tracts of integrated data. Buried inside this vast amount of interconnected information lay the output of millions of years of evolution. Gene expres- sion, the process by which a genetic blueprint is turned into a working component of the organism through an interme- diate mRNA message, is a meticulously managed endeavor. Understanding the control of gene expression has long been an important challenge because its understanding is central to our ability to steer an organism in directions other than for which it was programmed. Expression data alone does not hold the answer to this puzzle. However, if no transcript is present, there will be no protein (the reverse is not always true). Thus, understanding the regulation of gene expression networks is a necessary though not sufficient first step towards elucidating the flow of information in biological systems. A number of studies have been conducted and frameworks proposed for the purpose of extracting regulatory networks from gene expression data. Most early network inference 0098-1354/$ – see front matter © 2004 Elsevier Ltd. All rights reserved. doi:10.1016/j.compchemeng.2004.08.030

Large-scale inference of the transcriptional regulation of Bacillus subtilis

Embed Size (px)

Citation preview

Page 1: Large-scale inference of the transcriptional regulation of Bacillus subtilis

Computers and Chemical Engineering 29 (2005) 565–576

Large-scale inference of the transcriptional regulation ofBacillus subtilis

Anshuman Guptaa, Jeffrey D. Varnerb, Costas D. Maranasa,∗a Department of Chemical Engineering, The Pennsylvania State University, University Park, PA 16802, USA

b Process Science Group, Genencor International In, 925 Page Mill Rd, Palo Alto, CA 94304, USA

Received 19 December 2003; received in revised form 22 March 2004Available online 12 October 2004

Abstract

This paper addresses the inference of the transcriptional regulatory network ofBacillus subtilis. Two inference approaches, a linear,additive model and a non-linear power-law model, are used to analyze the expression of 747 genes fromB. subtilisobtained using AffymetrixGeneChip® arrays under three different experimental conditions. A robustness analysis is introduced for identifying confidence levels forall inferred regulatory connections. Both the linear and non-linear methods produce candidate networks that share a scale-free or a “hub-and-spoke” topology with a small number of global regulator genes influencing the expression of a large number of target genes. The twoc del is ablet olism ands and nitrogenr©

K

1

pufee2R1w&aa2S

ogi-mentsd data.tionres-intome-avor.been

entralfor

s nott isays

ssionards

.orks

orksnce

0d

omputational approaches in tandem are able to identify known global regulators with a high level of confidence. The linear moo identify the interactions of highly expressed genes, particularly those involved in genetic information processing, energy metabignal transduction. Conversely, the non-linear power-law approach tends to capture development regulation and specific carbonegulatory interactions.

2004 Elsevier Ltd. All rights reserved.

eywords: Bacillus subtilis; Transcriptional regulation; Signal transduction

. Introduction

The data explosion currently overwhelming biologyresents a challenging paradox: “you can see everything, butnderstand very little”. Sequencing a genome, which only a

ew years ago was a tremendous feat, is now routine. Manyxamples of transcript array analysis can be found in the lit-ratures (Chu et al., 1998; Cohen, Pilpel, Mitra, & Church,002; Gardner, di Bernardo, Lorenz, & Collins, 2003; Lee,inaldi, & Robert, 2002; Spellman et al., 1998; Wen et al.,998); an exponential increase since the ground breakingork of Brown and co-workers onS. cerevisiae(DeRisi, Iyer,Brown, 1997). Genomics, proteomics and metabolomics

re now the cutting edge of physiological analysis (Gill etl., 2002; Herrgard, Covert, & Palsson, 2003; Ideker et al.,001; Misra et al., 2002; Oh, Rohlin, Kao, & Liao, 2002;tephanopoulos, Hwang, Schmitt, Misra, & Stephanopoulos,

∗ Corresponding author. Tel.: +1 814 863 9958; fax: +1 814 865 7846.E-mail address:[email protected] (C.D. Maranas).

2002). These technologies when married with physiolcal measurements, genome-wide transcription measureand genetic sequence generate huge tracts of integrateBuried inside this vast amount of interconnected informalay the output of millions of years of evolution. Gene expsion, the process by which a genetic blueprint is turneda working component of the organism through an interdiate mRNA message, is a meticulously managed endeUnderstanding the control of gene expression has longan important challenge because its understanding is cto our ability to steer an organism in directions other thanwhich it was programmed. Expression data alone doehold the answer to this puzzle. However, if no transcrippresent, there will be no protein (the reverse is not alwtrue). Thus, understanding the regulation of gene exprenetworks is a necessary though not sufficient first step towelucidating the flow of information in biological systems

A number of studies have been conducted and framewproposed for the purpose of extracting regulatory netwfrom gene expression data. Most early network infere

098-1354/$ – see front matter © 2004 Elsevier Ltd. All rights reserved.oi:10.1016/j.compchemeng.2004.08.030

Page 2: Large-scale inference of the transcriptional regulation of Bacillus subtilis

566 A. Gupta et al. / Computers and Chemical Engineering 29 (2005) 565–576

methods relied primarily on clustering genes on the basisof their expression profiles (D’Haeseleer, Wen, Fuhrman, &Somogyi, 1999; Dougherty et al., 2002; Eisen, Spellman,Brown, & Botstein, 1998; Wen et al., 1998). Recently, therehas been considerable interest in developing computationaltools that go beyond answering the question of whether twoor more genes have similar expression profiles. Instead, thecentral question that is being raised is whether we can un-cover, hidden within gene expression data, the signature,extent and directionality of interactions between differentgenes. In other words, rather than simply grouping geneswith similar expression profiles, new methods attempt tolearn gene regulatory patterns from expression data. Broadly,these methods can be classified into two distinct categoriesbased on their fundamental treatment of gene interactions.Deterministic model-basedmethods assume there exists adeterministic formalismY = f(X) that captures the effectof expression level of geneX on geneY. Different choicesfor the functionf(·) (e.g., linear, sigmoidal, etc.) give riseto many versions of model-based methods (D’Haeseleer etal., 1999; Gardner et al., 2003; Holter, Maritan, Cieplak,Fedoroff, & Banavar, 2001; Weaver, Workman, & Stormo,1999; Zak, Gonye, & Schwaber, 2003). Conversely,stochas-tic model-basedmethods start by postulating that experimen-tally observed gene expression profiles correspond to samplesd u-t fora intc ex-p 0P

r re-s mplic-i elsm ome-w ares stici e theyt sitivet orkm f cy-c s. Int hes,a r theu ls

eth-o odelt adeb e for-m ismspK i& edt hipsb gical

functions (such as AND, OR, NOT, etc.). More recently,an extension of this approach to account for uncertainty inexpression data has been proposed in the form of probabilis-tic Boolean networks (Akutsu, Miyano, & Kuhara, 2000;Shmulevich, Dougherty, Kim, & Zhang, 2002; Shmulevich,Lahdesmaki, Dougherty, Astola, & Zhang, 2003). How-ever, in most real gene expression settings, Boolean ideal-izations may not be appropriate as genes are expressed atcontinuously varying intermediate expression levels (Jong,2002). Consequently, more general approaches have beenproposed which model mRNA expression level as a contin-uously varying quantity. These include linear weight mod-eling (D’Haeseleer et al., 1999; Weaver et al., 1999), ordi-nary differential equations (Chen, He, & Church, 1999) andS-systems (Akutsu et al., 2000; Kikuchi, Tominaga, Arita,Takahashi, & Tomita, 2003; Maki, Tominaga, Okamoto,Watanabe, & Eguchi, 2001; Savageau, 1998). In this work,we use a continuous description of gene expression since wedo not observe any natural threshold values in our experi-mental data that can be used to unambiguously discretize theexpression states.

2. Systems and methods

2

vedi -t ssiond ionsv s.s n-it teria.I roB n,& , &N ,& ta

em-p e-w ent( nta-t keni ntialg ent( e ofA me-r nalsa mea-s in-v wnt

rawn from an unknown multivariate probability distribion. Bayesian networks provide a popular alternativechieving this objective by postulating a multivariate joonditional probability model that explains the observedression data (Friedman, Linial, Nachman, & Pe’er, 200;e’er, Regev, Elidan, & Friedman, 2001).Both deterministic and stochastic models have thei

pective advantages and disadvantages. The relative sity and computational tractability of deterministic mod

akes them amenable to inference of large-scale genide transcriptional networks. However, these modelsensitive to over-fitting and prediction artifacts. Stochanference models such as Bayesian networks, becausake a probabilistic view of gene expression, are less seno over-fitting. However, the application of Bayesian netwodels is limited due to their cumbersome treatment o

les and their relatively large computational requirementhe light of the advantages/limitations of the two approacdeterministic model-based approach is adopted to infenderlying regulatory network ofBacillus subtilison a globacale.

In addition to classifying gene network inference mds based upon the mathematical formalism used to m

he regulation process, a further distinction can be mased upon how gene expression is handled within thesalisms. Boolean networks were among the first formalroposed to model gene interactions (Akutsu, Miyano, &uhara, 1999; Ideker, Thorsson, & Karp, 2000; SomogySniegoski, 1996). In this approach, genes are assum

o be either ON or OFF and the input–output relationsetween them are modeled through deterministic lo

.1. Scope of investigation

The regulatory network formed by 747 genes involn the central metabolism ofB. subtilisduring fed-batch proease production is resolved from time series gene expreata collected from three different experimental conditia two different deterministic model-based approacheB.ubtilis and relatedBacilli are industrial workhorse orgasms used for, among other things, protein production.B. sub-ilis is arguably the best characterized gram-positive bact has been sequenced (Kunst et al., 1997) and a large numbef quantitative physiological studies ofB. subtilisand relatedacilli are present in the literature (Christiansen, ChristenseNielsen, 2002; Christiansen, Michaelsen, Wumpelmannielsen, 2003; Dauner & Sauer, 2000, 2001; Dauner, BaileySauer, 2001a; Dauner, Storni, & Sauer, 2001b; Dauner e

l., 2002; Sauer et al., 1997).The three experimental conditions used for inference

loy Affymetrix GeneChip® arrays for obtaining genomide expression data for: (i) a cradle-to-grave experim

20 time points taken over the entire course of the fermeion); (ii) an amino acid pulse experiment (9 time points tammediately after an amino acid pulse in mid-exponerowth); and (iii) an exponential growth phase experim5 time points taken during exponential growth). The usffymetrix arrays implies that our expression data are ti

esolved absolute (non-condition scaled) transcript sigs opposed to the relative expression changes typicallyured with cDNA arrays. The gene list includes genesolved in primary metabolism in addition to some knoranscriptional regulators.

Page 3: Large-scale inference of the transcriptional regulation of Bacillus subtilis

A. Gupta et al. / Computers and Chemical Engineering 29 (2005) 565–576 567

Two alternative model formulations, with different levelsof computational complexity, are used to identify regulatoryconnections. The first approach, based upon modeling thedynamics of gene expression as a first-order, linear process(D’Haeseleer et al., 1999), assumes the expression level of agene at a particular time-point is modeled as a linear combi-nation of the concentration of all other genes at the previoustime-point. The network connectivity, encoded by the coeffi-cients in the linear combinations, is determined by minimiz-ing the error between the predicted and experimental geneexpression values. The second methodology generalizes thelinear approach by postulating a non-linear differential equa-tion model (Varner, 2000). In the non-linear framework afeedback law representing the control instructions governingexpression is identified using tools from non-linear controltheory. Putative connections between genej and genek arecaptured using a power-law expansion of the identified con-trol function where the power-law exponents are determineddirectly from time-resolved gene expression data. A robust-ness analysis is carried out on the identified connections ofboth models to determine their level of confidence. The un-derlying principle of the robustness analysis is to assign aconfidence level to each inferred connection by comparingthe likelihood of inferring a regulatory coefficient of a givenmagnitude from real versus randomized data. We first exploret ng an

2

d by( sedb ,2 averep NAs

w iesi f-fiw

w iono iedb nee db

Note that a discrete (forward) difference is employed for es-timating the rate of change of expression (D’Haeseleer etal., 1999; Yeung et al., 2002). Alternatively, we also used aninterpolation procedure for estimating the rate of change ofgene expression. However, no significant differences in theresults were observed over the simpler, forward differenceapproach. Eq.(2.2) is recast into matrix notation to yield

ZN×(T−1) = WN×NZN×(T−1) (2.3)

For most microarray time course experiments, the total num-ber of genes investigatedN, is much larger than the numberof time-pointsT. This implies that the above system of equa-tions is underdetermined because there areN(T − 1) equa-tions andN2 variables leading to multiple solutions. Singularvalue decomposition (SVD) is used to obtain the entire fam-ily of solutions that is consistent with the hypothesized linearmodel (Yeung et al., 2002). To this end, the transpose of Eq.(2.3) is taken in order to recast the system of equations instandard form (i.e.,Ax = b):

(ZT)(T−1)×N(WT)N×N = (ZT)(T−1)×N (2.4)

Subsequently, SVD is applied toZT yielding

(

ws xc gt ularsg

(

w sσ ls ionsr

(

ws rkc dataa s-e cientmi ing

he linear model followed by a complementary study usion-linear model.

.2. Linear model

The linear model considered here was first employeD’Haeseleer et al., 1999) and has been subsequently uy a number of other researchers (Alter, Brown, & Botstein000; Gardner et al., 2003; Holter et al., 2000, 2001; Wet al., 1999; Yeung, Tegner, & Collins, 2002). In this ap-roach, the rate of change of concentration of the mRpeciesi is given by

dzi(t)

dt=

N∑j=1

wijzj(t) ∀i = 1,2, . . . , N (2.1)

herezi(t) (pmol/gdw) is the concentration of mRNA specas measured at time-pointt andwij is the regulatory coecient capturing the regulatory effect of genej on genei. Ifij > 0 then genej up-regulates (activates) genei while ifij < 0 then, genej down-regulates (inhibits) the expressf genei. If wij = 0, then no regulatory connection is impletween genesi andj. Given the discrete nature of the gexpression time series data, Eq.(2.1) can be approximatey the following set of linear algebraic equations:

zi(t + 1) − zi(t)

�t=

N∑j=1

wijzj(t) ∀ i = 1,2, . . . , N,

t = 1,2, . . . , T − 1 (2.2)

ZT)(T−1)×N = U(T−1)×(T−1)�(T−1)×N(VT)N×N (2.5)

here� is a diagonal matrix containing theT− 1 non-zeroingular valuesσ1, σ2, . . . , σT−1 andV an orthogonal matriontaining the singular vectorsv1, v2, . . . , vN correspondinoall (both zero and non-zero) singular values. The particolution for Eq.(2.4) that minimizes the L2-norm is theniven by

WT)N×N = VN×N�−1

N×(T−1)UT(T−1)×(T−1)(Z

T)(T−1)×N

(2.6)

here �−1 is a diagonal matrix with value−11 , σ−1

2 , . . . , σ−1T−1 (Yeung et al., 2002). The genera

olution for the underdetermined system of equatepresented by Eq.(2.4) is given by

WT)N×N = (WT)N×N + CN×(N−T+1)(V

T)(N−T+1)×N

(2.7)

hereV is the null-space matrix andC a matrix of arbitrarycalars (Yeung et al., 2002). All possible alternate netwoonfigurations that are consistent with the experimentalre embedded within Eq.(2.7). From this family, the sparst network is determined by choosing the scalar coeffiatrixC such that the number of zero entries inWT is max-

mized. This is achieved by solving the Linear Programm

Page 4: Large-scale inference of the transcriptional regulation of Bacillus subtilis

568 A. Gupta et al. / Computers and Chemical Engineering 29 (2005) 565–576

(LP) problem:

minimize∑i,j

(w+ij + w−

ij )

subject towij +N−T+1∑k=1

cjkvki = w+ij − w−

ij ∀i = 1,2, . . . , N,

j = 1,2, . . . , N, w+ij ≥ 0, w−

ij ≥ 0

∀i = 1,2, . . . , N, j = 1,2, . . . , N

wherewij is the (i,j)th element ofWT, vki the (k,i)th element

of VT

andw+ij + w−

ij is the (i,j)th element ofWT. The right-hand side of the equality constraint quantifies the absolutedeviation of the (i,j)th element ofWT from zero. The objec-tive function minimizes the total of all such deviations so thatsparsity can be achieved. A key feature of the LP optimiza-tion model as formulated above is that it decomposes over thej index. This implies that instead of solving one large-scaleoptimization problem involving all genes, it can be solvedsequentially for each gene. This decomposable structure ofthe problem can be exploited for (a) parallelizing the solu-tion algorithm and (b) limiting the amount of computationalresources expended if only a sub-network involving a sub-setof all genes needs to be inferred (Yeung et al., 2002).

Using basic LP principles, it can be shown that at theo latedbtt hb ,o ya anyp ing( t fora lesc earlyd ctionw g.

2

con-c(

y

w res-s e ofc ificd s-s andr rateo NA

degradation are unknown and are assumed to be randomlydistributed within physiologically possible ranges. These dis-tributions are sampled through a Monte Carlo procedure (seeSection3). The specific growth rate is measured from fermen-tation data. The quantityyM

zj(t) denotes the raw array signal

for transcriptj at timet andfzj (z(t), k) represents the rela-tionship between intracellular concentration and array signalfor transcriptj, wherek denotes a vector of parameters con-tained in the functionfzj (z(t), k).

The quantityuzj (t) in Eq. (2.8)denotes the control inputput forth by the organism to regulate the expression of genej. If a mechanistic understanding of the regulation of genejwere known, it could be used to capture how the state of theorganism affects the expression of genej. Many examples ofthis approach exist in the literature for both deterministic aswell as stochastic gene expression scenarios (Arkin, Ross, &McAdams, 1998; Bailey et al., 1983; Lee & Bailey, 1984a,1984b; McAdams & Arkin, 1997, 1998, 1999; Rao & Arkin,2001; Rao, Wolf, & Arkin, 2002; Wolf & Arkin, 2002; Wong,Gladney, & Keasling, 1997). However, we assume that we donot have the control mechanism and as such are not able tobuild a mechanistic-based representation ofuzj (t). Rather, weuse the transcript profiling data to identify a feedback controllaw uM

zj(t) that is guaranteed to produce a model estimated

array signal that tracks the measured value (Csete & Doyle,2

l fort ay

s

ε

W neard

w -s

E l in-p nt

-d andm thef ays sesi ntm 5 to1 lysis.T

ptimal solution of the LP model, each gene can be reguy at most (T − 1) genes. This is because for a givenj, the

otal number of equations in the LP model isN, while theotal number of variables is2N + (N − T + 1). Since eacasic feasible solution for the LP must haveNbasic variablesf which (N − T + 1) will be thecjk variables (since there free variables and are not being directly forced inarticular direction by the objective function), the remainT− 1) variables will be the absolute deviations. Note thagiven (i,j)th element, only one of the two deviation variaban be non-zero as otherwise the basis would have linependent columns. Thus, a particular regulatory conneill be inferred exclusively as being activating or inhibitin

.3. Non-linear model

The mass balance equations governing the specificentration of the jth mRNA species, denoted aszj(t)pmol/gdw) is given by

dzj(t)

dt= rT,zj (t)uzj (t) − (rg(t) + βzj )zj(t),

Mzj

(t) = fzj (z(t), k), j = 1,2, . . . , N (2.8)

hererT,zj (t) denotes the maximum specific rate of expion of genej (specific expression rate in the absencontrol input),βzj the rate constant governing the specegradation of transcriptj (specific rate of degradation aumed to be first-order with transcript concentration)

ˆg(t) the specific growth rate. The maximum specificf transcription and the rate constants governing mR

002).Define the error between the experimental array signa

ranscriptj at time t (yEzj

(t)) and the model predicted arr

ignal for transcriptj at timet yMzj

(t) as

j(t) ≡ yMzj

(t) − yEzj

(t) (2.9)

e propose the prediction error be governed by the liynamics:

dεjdt

= −λjεj(t) ∀j (2.10)

hereλj ≥ 0 for everyj. Differentiating Eq.(2.9) and subtituting Eqs.(2.8) and (2.10)yields

N

q=1

∂fj

∂zq[rT,zq (t)uzq (t) − (rg(t) + βzq )zq(t)] =

dyEj

dt

− λjεj(t) ∀j (2.11)

q. (2.11)describes the relationship between the controut governing the expression of genej and the error betwee

he estimated and measured array signals.It is possible to solve foruzj (t) as a function of the pre

iction error if the relationship between the array signalRNA concentration were known for each gene, i.e.,

unction fzj (z(t), k). The exact relationship relating arrignal to concentration is not known. However, in all can vitro transcripts consisting of a cocktail of five differe

RNA species of known concentration ranging from 0.60 pM were added to all samples directly before the anahis internal standard (referred to as IVTs) provides avery

Page 5: Large-scale inference of the transcriptional regulation of Bacillus subtilis

A. Gupta et al. / Computers and Chemical Engineering 29 (2005) 565–576 569

Fig. 1. IVT probe set signals vs. chip concentration for the cradle-to-grave experiment. The open circles denote a probe set signal for a given concentration ateach time point. For example, at 40 pM, a single transcript species was measured at each time point (total of 20 measurements for the cradle-to-grave run). Acorrection factor was introduced to convert chip concentration to intracellular concentration.

approximate means of determining the relationship betweenchip concentration and array signal (once we have an esti-mate of chip concentration it is possible to back calculate anapproximate physiological concentration). As there is con-siderable variance amongst the 5 IVT species at a singleconcentration (seeFig. 1), we use a Monte Carlo approachto estimate a set of possiblefzj (z(t), k)’s. We approximatefzj (z(t), k) as a set of piece-wise linear functions:

yMzj

(t) = m(θzj(t)) + b,

(m, b) =

(m1, b1), 0 ≤ yEzj

(t) ≤ S1...

...(mn, bn), Sn−1 < yE

zj(t) ≤ Sn

(2.12)

where the slope andy-intercept (mj ,bj) are determined byfitting a line between sequential IVT concentration pairs andtheir corresponding signals (one line per signal region periteration) andθ denotes the conversion between physiologicaland chip concentration. Given Eq.(2.12)we can solve Eq.(2.11)for uzq (t)

uMzj

(t) = 1

rT,zj (t)

{δzj (t)

(yMj (t) − b

)+ 1

mθ(yE(t + �t) − yE(t − �t)

)}

δ

A central difference is used to approximate the array sig-nal derivative using interpolated array data (step size is�t =0.1 h). Eq.(2.13)represents the feedback input that is guar-anteed to produce a simulated array signal that will convergeto the experimental signal ast → ∞. Thus, it approximatesthe control input used by the organism to produce the tran-scriptional program captured in the experiment.

To probe the connectivity of the expression network, weexpanduM

zj(t) in terms of the estimated specific mRNA con-

centration vectorz. In contrast with the linear model thatlooks for interactions that lie on or near a hyperplane in ex-pression space, the non-linear model utilizes the power-lawexpansion:

uzj (t) ≡ 1

zmax

N∏q=1

zγjqq (t) (2.15)

which represents a curvilinear hypersurface. As was true inthe linear case, the expansion exponentγ jq denotes the sen-sitivity of the control of expression of genej to the scaledconcentration of transcriptq, where the scaling factorzmax

denotes the maximum estimated concentration determinedover all expressed species and time. Taking the natural log ofEq.(2.15)yields the linear function:

ntityhesitiv-

× j j

2�t− λ(yM

j (t) − yEj (t))

(2.13)

whereδzj (t) is defined as

zj (t) ≡ (rg(t) + βzj ) (2.14)

uzj (t) = η +N∑q=1

γjqzq(t) (2.16)

where ˆ· denotes the natural log of the corresponding quain Eq. (2.15). Eq.(2.16)represents a family of planes in tlog transformed control-transcript space where the sen

Page 6: Large-scale inference of the transcriptional regulation of Bacillus subtilis

570 A. Gupta et al. / Computers and Chemical Engineering 29 (2005) 565–576

ity coefficientsγ jq and interceptη can be determined usingmultivariate regression.

Note that the networks inferred from both the linear andnon-linear models are correlational in nature and not causal.This stems from the fact that the various regulatory relation-ships as inferred by the two models are determined purely byminimizing the error between the predicted expression pro-files (under the particular model definition) and the experi-mentally obtained expression profiles. No mechanistic detailis taken into account in terms of whether a particular regula-tory interaction is biologically feasible at the transcriptionallevel. In the spirit of previous works on inferring gene net-works from microarray data (Alter et al., 2000; D’Haeseleeret al., 1999; Holter et al., 2000, 2001; Tegner, Yeung, Hasty,& Collins, 2003; Weaver et al., 1999; Yeung et al., 2002),our goal is to efficiently infer networks on a global scaleand generate a “rough draft” of the network topology, usingwhich more detailed and local analysis can be conducted inthe future.

2.4. Permutation-based significance analysis

Both modeling approaches, by minimizing the errorbetween the predicted gene expression value and the exper-imental value, generate a list of putative gene–gene regula-t redc f thei e lin-e st (− y ise icalj bye s forw ienti dereds y adh old ise on-n encee prob-a iond ;P eeng ofw

c

w ofr izeddF -t encel thea

Fig. 2. The distributions obtained for the actual and randomized data setsare superimposed on top of each other. The confidence level for a givenregulatory coefficient valuew∗ is then determined by computing the ratio ofthe number of inferred connections from the actual data that have a regulatorycoefficient value greater thanw∗ to the total (sum of actual and randomized)number of regulatory connections with value greater thanw∗.

inferred transcriptional network would thus be characterizedby regulatory connections having different confidence levels.The confidence level metric can be used as an additional de-gree of freedom for studying the topological properties of theinferred network. Increasing the confidence level and analyz-ing the regulatory connections that survive yields hypothesesregarding the biological processes being captured in the ex-pression data on the time-scale of the sampling frequency.

3. Discussion

The network topology inferred for the three different ex-perimental data sets using the linear model is shown inFig. 3.These plots correspond to the sparsest regulatory networks.Rows corresponding to genes with no regulatory effect havebeen eliminated. The most striking feature of the inferred net-works is the existence of regulatory bands, a characteristic,which indicates the existence of a small number of globalregulators or “hubs” that influence a large number of othergenes. Such “hub-and-spoke” topologies (also referred to asscale-free) have been observed for metabolic networks (Fell& Wagner, 2000; Jeong, Tombor, Albert, Oltvai, & Barabasi,2000) and protein–protein interaction networks (Jeong,Mason, Barabasi, & Oltvai, 2001). Two key features of suchn e oft be-tt lysisf entsw nfi-d ofa ex-p rvedi vent evelr net-

ory interactions. However, it is unclear which of the inferonnections are real and which are simply artifacts onference process. For instance, it can be shown for thar model, that each gene will be regulated by at moT1). Clearly, this property arises from the way sparsit

nforced through the LP model and there is no biologustification for it. Such model bias is typically addressednforcing threshold (cut-off) values whereby connectionhich the absolute value of the inferred regulatory coeffic

s higher than some pre-specified threshold are consiignificant. The choice of the threshold value is usualloc and symmetric in the sense that the same threshnforced on both activating and inhibiting regulatory cections. To address this issue, we calculate a confidstimate on the inferred connections based upon thebility that the connection will arise in random expressata, more exactly, row-column permuted data (Good, 2000esarin, 2001). Thus, the confidence of a connection betwenesi andj with an inferred regulatory coefficient value∗, denoted ascij(w∗), is defined as

ij(w∗) = Na(wij ≥ w∗)

Na(wij ≥ w∗) + Nr(wij ≥ w∗)(2.17)

hereNa(wij ≥ w∗), Nr(wij ≥ w∗) denote the numberegulatory connections inferred from the actual/randomata with a regulatory coefficient value greater thanw∗ (seeig. 2). For instance, if for a given value ofw∗, no connec

ions are inferred from the randomized data, then a confidevel of 100% is assigned to all connections inferred fromctual data with a regulatory coefficient higher thanw∗. The

etworks are (i) their robustness to the random failurhe nodes of the network and (ii) relatively short pathsween any two nodes in the network.Figs. 4 and 5highlighthe results of the permutation-based significance anaor the linear and non-linear model. Regulatory coefficiith higher absolute value are inferred with higher coence as shown inFig. 4. Fig. 5 indicates the numberrcs that are inferred for a given confidence level. Asected, the fraction of inferred connections that are prese

s inversely proportional to confidence level. However, ehough fewer number of arcs survive strict tolerance lequirement, the underlying scale-free topology of the

Page 7: Large-scale inference of the transcriptional regulation of Bacillus subtilis

A. Gupta et al. / Computers and Chemical Engineering 29 (2005) 565–576 571

Fig. 3. Network topology inferred for the three experiments using the linear model. White signifies activation or positive regulation whereas grey denotesinhibition or negative regulation. Note that rows corresponding to genes not regulating any other genes are eliminated for clarity of presentation.

Fig. 4. Permutation-based significance analysis results for the linear model. For a given confidence level, asymmetric thresholds are obtained for activation andinhibition. For instance, for a minimum confidence level of 90% for the cradle-to-grave experiment, the activation and inhibition regulation thresholds are 0.416and−0.324, respectively. Such thresholds are expected to be different for different experimental conditions as shown in the figure. However, the inhibitionregulation threshold is seen to be smaller in absolute value than the activation regulation threshold for all three experiments.

work is still preserved as shown inFig. 6. Also, the scale-freetopology that is observed in the real expression data is notpreserved when the data is randomized. The degree distribu-tions for the randomized data sets for all three experimen-tal conditions are closer to uniform distributions rather thanexponential distributions implying that the observed scale-

Fig. 5. Fraction of regulatory connections preserved as minimum confidencelevel is increased. A relatively small fraction of the total inferred connections(15% for cradle-to-grave, 8% for amino acid pulse and 3% for exponentialgrowth phase) survive when a high level of confidence (>90%) is imposed.

free topologies are not artifacts of the proposed inferenceprocedure.

Table 1lists some of the global regulators identified bythe linear model from the cradle-to-grave study. All the reg-ulator genes identified in the base case regulatory networkare retained even at higher confidence levels though with re-duced out-degrees. A number of genes involved in amino acidmetabolism are identified as regulator genes. These includegbsA, argG, ytcF, trpE, rocF, ysiB and cysK. In addition,genes participating in carbohydrate metabolism, in particu-lar glycolysis/gluconeogenesis and the citrate cycle (TCA)are also inferred as regulatory genes. The glycolysis genesincludepdhC, gap, andacoLwhile the TCA cycle genes in-cludesucC, citB, sdhA, odhBandpckA. A number of genesthat are involved in more than one functional category are alsouncovered by the linear model. For instance, thegap geneis involved in both glycolysis and amino acid metabolismwhile acoL is involved in the TCA cycle as well as aminoacid metabolism in addition to glycolysis. Energy metabolismgenes involved in oxidative phosphorylation such asctaE,atpB and sdhAare also inferred as regulator genes. Vari-ous genetic information processing genes are also inferred asregulator genes. These includesecE(protein export, sorting

Page 8: Large-scale inference of the transcriptional regulation of Bacillus subtilis

572 A. Gupta et al. / Computers and Chemical Engineering 29 (2005) 565–576

Fig. 6. Number of connectionsK vs. the normalized probability of at leastK connections for the network inferred from the cradle-to-grave expressiontime series using the linear model (top panel) and non-linear model (bottom panel). The parameterK is the total-degree (sum of in- and out-degrees) of aparticular node andp(K) is the probability of finding a node with at leastK connections (total-degree). The linear inference appears approximately linear ona log–log scale for all confidence levels suggesting a power-law relationship that is characteristic of scale-free networks. The non-linear distribution exhibitsagreement with a power-law distribution for <10 connections after which the probability plateaus and then decays for a large number of connections. Theprobability of finding a connection in the non-linear regime is almost always lower than the sparse linear model, with the majority of genes having lessthan10 connections.

Table 1Identified global regulators inferred from the cradle-to-grave data set by thelinear model

Gene Function

secE Preprotein translocase SecE subunitphrE Regulator of the activity of phosphatase rapEctaE Cytochrome caa3 oxidase subunit IIIgbsA Glycine betaine aldehyde dehydrogenasesucC Succinyl-CoA synthetase beta chainargG Argininosuccinate synthaseatpB ATP synthase a chainpdhC Pyruvate dehydrogenasesigW RNA polymerase ECF-type sigma factorytcF S-adenosylmethionine decarboxylasegapA Glyceraldehyde 3-phosphate dehydrogenasetrpE Anthranilate synthase component 1citB Aconitate hydratasesdhA Succinate dehydrogenase flavoprotein subunitcsn ChitosanaserocF ArginaseodhB 2-Oxoglutarate dehydrogenasepckA Phosphoenolpyruvate carboxykinasehpr Transcriptional regulator for sporulation initiationacoL Acetoin dehydrogenase E3 componentrbsK RibokinaseysiB Enoyl-CoA hydratasephrA Phosphatase rapA inhibitorald Stage V sporulation protein NcysK Cysteine synthetase ArapA Response regulator aspartate phosphatasesigH Sporulation-specific sigma factor

and degradation),sigW(RNA polymerase sigma factor),hpr(transcriptional regulator for peptide transport and sporula-tion initiation) andald (stage V sporulation protein N).

Comparison of the networks inferred for the three experi-mental conditions identified several consensus regulatory re-lationships. Specifically, the two genesyloH andphrA areidentified as a global activator and inhibitor, respectively,from all three datasets by the linear model.yloH is a key com-ponent of the transcription machinery as it encodes for the�-subunit of the RNA polymerase. This gene is found to ac-tivate 22 other genes belonging to a wide range of functionalcategories including amino acid metabolism, carbohydrateand complex lipid metabolism and oxidative phosphoryla-tion. In addition, this gene is found to up-regulate other ge-netic information processing genes, particularly those codingfor aminoacyl-tRNA synthetases (serS,gltXandthrS) that arerequired for translation and DNA polymerase subunits (dnaEandyorL) that are required for DNA replication and repair.ThephrAgene encodes for a 44 amino acid signaling proteininvolved in extracellular signaling that is required for timingthe cell’s decision to choose a particular physiological statesuch as growth or sporulation (Jiang, Grau, & Perego, 2000;McQuade, Comella, & Grossman, 2001; Phillips & Strauch,2002). This gene is found to down-regulate 18 other genes,of which 4 genes (kinA, sigF, spo0Aandspo0F) are key par-t ,1 h the

icipants in the initiation of sporulation (Stragier & Losick996). These consensus results indicate that even thoug
Page 9: Large-scale inference of the transcriptional regulation of Bacillus subtilis

A. Gupta et al. / Computers and Chemical Engineering 29 (2005) 565–576 573

Table 2Identified global regulators of known function inferred from the cradle-to-grave data set by the non-linear model

Gene Function

spo0J Stage 0 sporulation proteinspo0B Sporulation initiation phosphotransferasespo0F Two-component response regulator involved in the

initiation of sporulationspoIIID Transcriptional regulator of sigma-E and sigma-K

dependent genesynzD Hypothetical protein similar tospo0EtnrA Transcriptional pleiotropic regulator involved in

global nitrogen regulationphoR/phoP Two-component sensor histidine kinase. Potential

cognate response regulator is PhoPcggR Repressor ofgapAexpressionccpA Transcriptional regulator mediating carbon

catabolite repression (Lacl family)kina Two-component sensor histidine kinase involved in

the initiation of sporulationkinC Two-component sensor histidine kinase involved in

the initiation of sporulationsigY RNA polymerase ECF-type sigma factorrapG(F,K,D,I) Response regulator aspartate phosphatasegerE Transcriptional regulator required for the expression

of late spore coat geneslexA Transcriptional repressor of the SOS regulonglnR Transcriptional repressor of the glutamine

synthetase geneoppC(D,F,A) Oligopeptide transport system proteinshprP P-Ser-HPr phosphatasesinR Transcriptional regulator of post-exponential-phase

responses genesald Alanine dehydrogenase (stage V sporulation protein

N)secE(F) Preprotein translocase subunit

linear model takes a relatively simplistic view of gene regu-lation, it can indeed uncover biologically relevant regulatoryrelationships.

The network topology recovered from the cradle-to-graveexperiment using the non-linear model as a function of con-fidence level is shown inFig. 6. As in the linear case, theinferred network is banded (indicative of a “hub-and-spoke”architecture). The non-linear inference produces a networkin which the bulk of genes have less than 10 connections.

Fig. 7. Average concentrations of the reg

There is approximately a 10% chance of finding a gene withmore than 10 connections. The chance of finding a gene witha total degree of >100 is small (seeFig. 6). The non-linearmodel is able to capture regulatory interactions involved inthe initiation of sporulation as well as global nitrogen andcarbon metabolism. Some prominent regulatory genes in-ferred from the cradle-to-grave experiment via the non-linearmodel are listed inTable 2. Several known sporulation con-trol genes,spo0J, spo0Band kinC are inferred as regula-tors with high confidence (Sonenshein, 2000). In additionto sporulation control, several potential nutritional regulatorsare also identified such astnrA, a transcriptional pleiotropicregulator involved in global nitrogen metabolism (Beier,Nygaard, Jarmer, & Saxild, 2002; Brandenburg et al., 2002;Ferson, Wray, & Fisher, 1996; Fisher, 1999; Fisher, Bran-denburg, & Wray, 2002; Fisher & Wray, 2002; Robichonet al., 2000; Wray, Ferson, & Fisher, 1997; Wray, Ferson,Rohrer, & Fisher, 1996; Wray, Zalieckas, Ferson, & Fisher,1998; Wray, Zalieckas, & Fisher, 2001) andahrC, a tran-scriptional regulator involved in the metabolism of arginine(Czaplewski, North, Smith, Baumberg, & Stockley, 1992;Dennis, Glykos, Parsons, & Phillips 2002; Holtham et al.,1999; Klingel, Miller, North, Stockley, & Baumberg, 1995;Miller, Baumberg, & Stockley, 1997; Stockley et al., 1998).

The non-linear model is able to capture known regula-t Con-sn edw . Oft minoa otherpc lismg Ap ns ton con-nB 00Wpk .,

ulator genes inferred by the two models.

ory interactions at a confidence level above random.ider the nitrogen metabolism regulatorstnrAandahrC. Theon-linear model estimates thattnrAexpression is connectith 61 genes (out-degree) at a confidence level of 60%

he 61, 20 genes are involved in nitrogen metabolism/acid biosynthesis or peptidoglycan biosynthesis, 7 arerobable regulatory genes (rapG, rapF, spo0E, oppC, lexA,omPandcomX) and the remainder are carbon metaboenes, for example,glcK, gapB, PTS system genes or DNolymerases and ribosomal genes. Of the connectioitrogen/amino acid biosynthetic genes, three predictedections are known (ureB, gltA and glnA (Fisher, 1999;elitsky, Wray, Fisher, Bohannon, & Sonenshein, 20;ray et al., 1996, 1997, 1998)) with a fourth gene,gabD,

utatively linked togabP, a permease downstream ofgabD,nown to be very strongly regulated bytnrA (Ferson et al

Page 10: Large-scale inference of the transcriptional regulation of Bacillus subtilis

574 A. Gupta et al. / Computers and Chemical Engineering 29 (2005) 565–576

1996; Wray et al., 1996, 1998). Consistent with the litera-ture,ahrC is predicted to act as both an activator and repres-sor of arginine metabolism but also appears to be involvedwith the regulation of the metabolism of other amino acids(Czaplewski et al., 1992; Dennis et al., 2002; Holtham et al.,1999; Klingel et al., 1995; Miller et al., 1997; Stockley et al.,1998). ahrC is predicted to up-regulateargD, pheA, hisBandgltA and down-regulateargG, tyrA, hisF, cysE, hisH, glyQandtrpA.

Both models arrive at the same putative banded or “hub-and-spoke” architecture for the gene expression network ofB. subtilis. However, the genes identified as hubs in the net-work by the respective models are quite different. The linearmodel seems much better able to capture interactions thatoccur among highly expressed genes (seeFig. 7). Moreover,the linear model is better able to capture self-regulatory in-teractions. By contrast, the non-linear model is better ableto capture developmental regulation as well as specific car-bon and nitrogen regulatory interactions, most of which takeplace by genes expressed at much lower levels. Hence, thereseems to be gene expression regions where each inferenceapproach performs best.

4. Conclusions

rip-t ngt ddi-t a onw istedo ackedo ed datas ge ofs

ap-p sys-t iza-t pledt neare lvingt ition( rdert net-w dataa y hy-p LP)-b tion.I ackt theB ingn forv be-t yinga ism

was decomposed into the transcriptional program throughthe application of systems-theoretic tools and multivariateregression.

A robustness analysis was introduced for assigning con-fidence levels to all inferred regulatory connections. The un-derlying idea of this analysis was that by randomizing theexpression data and then using the scrambled data in the in-ference procedure, any underlying model bias could be de-tected and eliminated. This bias elimination was achieved byimposing systematic, as opposed to arbitrary, cut-offs on thevalues of the inferred regulatory interactions.

Both inference methodologies were shown to result intranscriptional networks that exhibited scale-free, “hub-and-spoke” topologies. This corresponded to the existence of arelatively small number of global regulator genes that regu-lated the expression of a large number of target genes. Thescale-free topology was found to be preserved even when veryhigh confidence level requirements were imposed. The twomodeling approaches were identified to be complementarywith respect to their applicability in different gene expressionregimes. Specifically, the linear model was able to identifyinteractions between highly expressed genes while the non-linear model was able to resolve the interactions betweenlow expression genes. This observation highlights the factthat a number of alternative inference methodologies shouldb reg-u riousc , wea ilityo mesb inga thatt levelo inearm s oft t.

R

A eticer theon

A gu-

A om-eling.

A nal-cted

B oficro-

B rip-n

In this work, the large-scale inference of the transcional regulatory network ofB. subtiliswas addressed usiwo alternative computational methodologies; a linear, aive model and a non-linear, power-law model. The dathich the two inference techniques were applied consf a selected set of 747 genes whose expression was trver time using Affymetrix GeneChip® arrays under threifferent experimental conditions. The three time seriesets with 5, 9 and 20 time points, spanned a wide ranampling frequencies.

The linear model extracted network connectivity byroximating the gene expression dynamics with a linear

em of ordinary differential equations (ODEs). Discretion of this system of ODEs at the experimentally samime points resulted in an underdetermined system of liquations. A two-step procedure was employed for so

his system of equations. First, singular value decomposSVD) was performed on the gene expression matrix in oo construct a general formalism satisfied by all possibleorks that were consistent with both the experimentalnd the chosen linear model. Subsequently, a sparsitothesis was enforced through a linear programming (ased model for obtaining a particular network configura

n contrast, a power-law model was used to explicitly trhe non-linearities and their evolution through time in. subtilis transcriptional system. In addition to modelon-linearities, the power-law methodology accountedariability in the kinetic parameters and the relationshipween array signal and transcript concentration by emploMonte Carlo procedure. The control input of the organ

e used in tandem for uncovering the wide spectrum oflatory interactions that can be expected to exist at vaoncentration/temporal scales. In terms of future workre currently investigating the reason(s) for the applicabf the two models in the two distinct concentration regiy systematically probing the two model formalisms usrtificially generated data. Preliminary results indicate

he non-linear model is able to magnify the expressionf lowly expressed genes to a larger extent than the lodel. Detailed computational and theoretical analyse

he two models will form the focus of a future manuscrip

eferences

kutsu, T., Miyano, S., & Kuhara, S. (1999). Identification of gennetworks from a small number of gene expression patterns undBoolean network model.Proceedings of the Pacific SymposiumBiocomputing (PSB 1999), 4, 17.

kutsu, T., Miyano, S., & Kuhara, S. (2000). Inferring qualitative relations in genetic networks and metabolic pathways.Bioinformatics,16, 727.

lter, O., Brown, P. O., & Botstein, D. (2000). Singular value decposition for genome-wide expression data processing and modPNAS, 97, 10101.

rkin, A., Ross, J., & McAdams, H. H. (1998). Stochastic kinetic aysis of developmental pathway bifurcation in phage lambda-infeEscherichia colicells.Genetics, 149, 1633–1648.

ailey, J. E., Hjortso, M., Lee, S. B., & Srienc, F. (1983). Kineticsproduct formation and plasmid segregation in recombinant mbial populations.Annals of New York Academic Science, 413, 71–87.

eier, L., Nygaard, P., Jarmer, H., & Saxild, H. H. (2002). Transction analysis of theBacillus subtilisPucR regulon and identificatio

Page 11: Large-scale inference of the transcriptional regulation of Bacillus subtilis

A. Gupta et al. / Computers and Chemical Engineering 29 (2005) 565–576 575

of a cis-acting sequence required for PucR-regulated expression ofgenes involved in purine catabolism.Journal of Bacteriology, 184,3232–3241.

Belitsky, B. R., Wray, L. V., Jr., Fisher, S. H., Bohannon, D. E., &Sonenshein, A. L. (2000). Role of TnrA in nitrogen source-dependentrepression ofBacillus subtilis glutamate synthase gene expression.Journal of Bacteriology, 182, 5939–5947.

Brandenburg, J. L., Wray, L. V., Jr., Beier, L., Jarmer, H., Saxild, H. H.,& Fisher, S. H. (2002). Roles of PucR, GlnR, and TnrA in regulat-ing expression of theBacillus subtilisure P3 promoter.Journal ofBacteriology, 184, 6060–6064.

Chen, T., He, H. L., & Church, G. M. (1999). Modeling gene expressionwith differential equations.Proceedings of the Pacific Symposium onBiocomputing (PSB 1999), 4, 29.

Christiansen, T., Christensen, B., & Nielsen, J. (2002). Metabolicnetwork analysis ofBacillus clausii on minimal and semirichmedium using (13)C-labeled glucose.Metabolic Engineering, 4, 159–169.

Christiansen, T., Michaelsen, S., Wumpelmann, M., & Nielsen, J. (2003).Production of savinase and population viability ofBacillus clausiidur-ing high-cell-density fed-batch cultivations.Biotechnology and Bio-engineering, 83, 344–352.

Chu, S., DeRisi, J., Eisen, M. B., Mulholland, J., Botstein, D., Brown, P.O., & Herskowitz, I. (1998). The transcriptional program of sporula-tion in budding yeast.Science, 282, 699.

Cohen, B. A., Pilpel, Y., Mitra, R. D., & Church, G. M. (2002). Dis-crimination between paralogs using microarray analysis: Applicationto the Yap1p and Yap2p transcriptional networks.Molecular Biologyof the Cell, 13, 1608.

Csete, M. E., & Doyle, J. C. (2002). Reverse engineering of biological

C ley,The

D with

D pidly

D forr-

D , K.,rbon-y

Dulture.

D 02).froms-

D olic

D nearand(PSB

D , Y.,ap-l

Eisen, M. B., Spellman, P. T., Brown, P. O., & Botstein, D. (1998). Clusteranalysis and display of genome-wide expression patterns.PNAS, 95,14863.

Fell, D. A., & Wagner, A. (2000). The small world of metabolism.NatureBiotechnology, 18, 1121.

Ferson, A. E., Wray, L. V., Jr., & Fisher, S. H. (1996). Expression oftheBacillus subtilisgabP gene is regulated independently in responseto nitrogen and amino acid availability.Molecular Microbiology, 22,693–701.

Fisher, S. H. (1999). Regulation of nitrogen metabolism inBacillus sub-tilis: vive la difference!.Molecular Microbiology, 32, 223–232.

Fisher, S. H., Brandenburg, J. L., & Wray, L. V., Jr. (2002). Mutations inBacillus subtilisglutamine synthetase that block its interaction withtranscription factor TnrA.Molecular Microbiology, 45, 627–635.

Fisher, S. H., & Wray, L. V., Jr. (2002). Mutations in theBacillus sub-tilis glnRA operon that cause nitrogen source-dependent defects inregulation of TnrA activity.Journal of Bacteriology, 184, 4636–4639.

Friedman, N., Linial, M., Nachman, I., & Pe’er, D. (2000). UsingBayesian networks to analyze expression data.Journal of Compu-tational Biology, 7, 601.

Gardner, T. S., di Bernardo, D., Lorenz, D., & Collins, J. J. (2003).Inferring genetic networks and identifying compound mode of actionvia expression profiling.Science, 301, 102–105.

Gill, R. T., Katsoulakis, E., Schmitt, W., Taroncher-Oldenburg, G., Misra,J., & Stephanopoulos, G. (2002). Genome-wide dynamic transcrip-tional profiling of the light-to-dark transition inSynechocystissp.strain PCC 6803.Journal of Bacteriology, 184, 3671–3681.

Good, P. I. (2000).Permutation tests: A practical guide to resamplingmethods for testing hypotheses. New York, NY: Springer Verlag.

Herrgard, M. J., Covert, M. W., & Palsson, B. O. (2003). Reconcilingtwork

H R.

H do-ssion

H rg,otic

I Eng,, L.tically

I torysign.000)

J ality

J 0).

J of

J sys-

K 3).and

K g,

K G.,ert,

complexity.Science, 295, 1664–1669.zaplewski, L. G., North, A. K., Smith, M. C., Baumberg, S., & Stock

P. G. (1992). Purification and initial characterization of AhrC:regulator of arginine metabolism genes inBacillus subtilis. MolecularMicrobiology, 6, 267–275.

auner, M., Bailey, J. E., & Sauer, U. (2001). Metabolic flux analysisa comprehensive isotopomer model inBacillus subtilis. Biotechnologyand Bioengineering, 76, 144–156.

auner, M., & Sauer, U. (2000). GC–MS analysis of amino acids raprovides rich information for isotopomer balancing.BiotechnologyProgress, 16, 642–649.

auner, M., & Sauer, U. (2001). Stoichiometric growth modelriboflavin-producingBacillus subtilis. Biotechnology and Bioengineeing, 76, 132–143.

auner, M., Sonderegger, M., Hochuli, M., Szyperski, T., WuthrichHohmann, H. P., Sauer, U., & Bailey, J. E. (2002). Intracellular cafluxes in riboflavin-producingBacillus subtilisduring growth on twocarbon substrate mixtures.Applied and Environmental Microbiolog,68, 1760–1771.

auner, M., Storni, T., & Sauer, U. (2001).Bacillus subtilismetabolismand energetics in carbon-limited and excess-carbon chemostat cJournal of Bacteriology, 183, 7308–7317.

ennis, C. C., Glykos, N. M., Parsons, M. R., & Phillips, S. E. (20The structure of AhrC, the arginine repressor/activator proteinBacillus subtilis. Acta Crystallographica Section D: Biological Crytallography, 58, 421–430.

eRisi, J. L., Iyer, V. R., & Brown, P. O. (1997). Exploring the metaband genetic control of gene expression on a genomic scale.Science,278, 680.

’Haeseleer, P., Wen, X., Fuhrman, S., & Somogyi, R. (1999). Limodeling of mRNA expression levels during CNS developmentinjury. Proceedings of the Pacific Symposium on Biocomputing1999), 4, 41.

ougherty, E. R., Barrera, J., Brun, M., Kim, S., Cesar, R. M., ChenBittner, M., & Trent, J. M. (2002). Inference from clustering withplications to gene-expression microarrays.Journal of ComputationaBiology, 9, 105–126.

gene expression data with known genome-scale regulatory nestructures.Genome Research, 13, 2423–2434.

olter, N. S., Maritan, A., Cieplak, M., Fedoroff, N. V., & Banavar, J.(2001). Dynamic modeling of gene expression data.PNAS, 98, 1693.

olter, N. S., Mitra, M., Maritan, A., Cieplak, M., Banavar, J. R., & Feroff, N. V. (2000). Fundamental patterns underlying gene expreprofiles: Simplicity from complexity.PNAS, 97, 8409.

oltham, C. A., Jumel, K., Miller, C. M., Harding, S. E., BaumbeS., & Stockley, P. G. (1999). Probing activation of the prokaryarginine transcriptional regulator using chimeric proteins.Journal ofMolecular Biology, 289, 707–727.

deker, T. E., Thorsson, V., Ranish, J. A., Christmas, R., Buhler, J.,J. K., Bumgarner, R., Goodlett, D. R., Aebersold, R., & Hood(2001). Integrated genomic and proteomic analyses of a systemaperturbed metabolic network.Science, 292, 929.

deker, T. E., Thorsson, V., & Karp, R. M. (2000). Discovery of regulainteractions through perturbation: Inference and experimental deProceedings of the Pacific Symposium on Biocomputing (PSB 2,5, 302.

eong, H., Mason, S. P., Barabasi, A. L., & Oltvai, Z. N. (2001). Lethand centrality in protein networks.Nature, 411, 41.

eong, H., Tombor, B., Albert, R., Oltvai, Z. N., & Barabasi, A. L. (200The large-scale organization of metabolic networks.Nature, 407, 651.

iang, M., Grau, R., & Perego, M. (2000). Differential processingpropeptide inhibitors of Rap phosphatases inBacillus subtilis. Journalof Bacteriology, 182, 303.

ong, H. D. (2002). Modeling and simulation of genetic regulatorytems: A literature review.Journal of Computational Biology, 9, 67.

ikuchi, S., Tominaga, D., Arita, M., Takahashi, K., & Tomita, M. (200Dynamic modeling of genetic networks using genetic algorithmS-system.Bioinformatics, 19, 643–650.

lingel, U., Miller, C. M., North, A. K., Stockley, P. G., & BaumberS. (1995). A binding site for activation by theBacillus subtilisAhrCprotein, a repressor/activator of arginine metabolism.Molecular &General Genetics, 248, 329–340.

unst, F., Ogasawara, N., Moszer, I., Albertini, A. M., Alloni,Azevedo, V., Bertero, M. G., Bessieres, P., Bolotin, A., Borch

Page 12: Large-scale inference of the transcriptional regulation of Bacillus subtilis

576 A. Gupta et al. / Computers and Chemical Engineering 29 (2005) 565–576

S., Borriss, R., Boursier, L., Brans, A., Braun, M., Brignell, S. C.,Bron, S., Brouillet, S., Bruschi, C. V., Caldwell, B., Capuano, V.,Carter, N. M., Choi, S. K., Codani, J. J., Connerton, I. F., Danchin,A., et al. (1997). The complete genome sequence of the gram-positivebacteriumBacillus subtilis. Nature, 390, 249–256.

Lee, S. B., & Bailey, J. E. (1984a). A mathematical model for lambdadv plasmid replication: Analysis of copy number mutants.Plasmid,11, 166–177.

Lee, S. B., & Bailey, J. E. (1984b). A mathematical model for lambdadv plasmid replication: Analysis of wild-type plasmid.Plasmid, 11,151–165.

Lee, T. I., Rinaldi, N. J., & Robert, F. (2002). Transcriptional regulatorynetwork inSaccharomyces cerevisiae. Science, 298, 799.

Maki, Y., Tominaga, D., Okamoto, M., Watanabe, S., & Eguchi, Y. (2001).Development of a system for the inference of large scale geneticnetworks.Proceedings of the Pacific Symposium on Biocomputing(PSB 2001), 6, 446.

McAdams, H. H., & Arkin, A. (1997). Stochastic mechanisms in geneexpression.Proceedings of the National Academic Science USA, 94,814–819.

McAdams, H. H., & Arkin, A. (1998). Simulation of prokaryotic geneticcircuits.Annual Review of Biophysics and Biomolecular Structure, 27,199–224.

McAdams, H. H., & Arkin, A. (1999). It’s a noisy business! Geneticregulation at the nanomolar scale.Trends in Genetics, 15, 65–69.

McQuade, R. S., Comella, N., & Grossman, A. D. (2001). Control of afamily of phosphatase regulatory genes (phr) by the alternate sigmafactor Sigma-H ofBacillus subtilis. Journal of Bacteriology, 183,4905–4909.

Miller, C. M., Baumberg, S., & Stockley, P. G. (1997). Operator interac-velvator

M ulos,oar-

O ionl

P ub-

P in

P

R la-.

R nd

R M.,newe

S ., &

S8)

S ba-ene

Shmulevich, I., Lahdesmaki, H., Dougherty, E. R., Astola, J., & Zhang,W. (2003). The role of certain post classes in Boolean network modelsof genetic networks.PNAS, 100, 10734–10739.

Somogyi, R., & Sniegoski, C. A. (1996). Modeling the complexity ofgenetic networks: Understanding multigenic and pleitropic regulation.Complexity, 1, 45.

Sonenshein, A. L. (2000). Control of sporulation initiation inBacillussubtilis. Current Opinion in Microbiology, 3, 561–566.

Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K.,Eisen, M. B., Brown, P. O., Botstein, D., & Futcher, B. (1998). Com-prehensive identification of cell cycle-regulated genes of the yeastSaccharomyces cerevisiaeby microarray hybridization.Molecular Bi-ology of the Cell, 9, 3273.

Stephanopoulos, G., Hwang, D., Schmitt, W., Misra, J., & Stephanopou-los, G. (2002). Mapping physiological states from microarray expres-sion measurements.Bioinformatics, 18, 1054–1063.

Stockley, P. G., Baron, A. J., Wild, C. M., Parsons, I. D., Miller, C. M.,Holtham, C. A., & Baumberg, S. (1998). Dissecting the moleculardetails of prokaryotic transcriptional control by surface plasmon res-onance: The methionine and arginine repressor proteins.Biosensors& Bioelectronics, 13, 637–650.

Stragier, P., & Losick, R. (1996). Molecular genetics of sporula-tion in Bacillus subtilis. Annual Review of Genetics, 30, 297–341.

Tegner, J., Yeung, M. K. S., Hasty, J., & Collins, J. J. (2003). Reverseengineering gene networks: Integrating genetic perturbations with dy-namical modeling.PNAS, 100, 5944–5949.

Varner, J. D. (2000). Large-scale prediction of phenotype: Concept.Biotechnology and Bioengineering, 69, 664–678.

Weaver, D. C., Workman, C. T., & Stormo, G. D. (1999). Modelingc

W aker,pres-

W of

W el ofiauxic

W n ofu-

W A, a

W 98).-

Wgh a

Y eringegres-

Z 03).ion infromh

tions by theBacillus subtilisarginine repressor/activator, AhrC: Nopositioning and DNA-mediated assembly of a transcriptional actiat catabolic sites.Molecular Microbiology, 26, 37–48.

isra, J., Schmitt, W., Hwang, D., Hsiao, L., Gullans, S., StephanopoG., & Stephanopoulos, G. (2002). Interactive exploration of micrray gene expression patterns in a reduced dimensional space.GenomeResearch, 12, 1112–1120.

h, M., Rohlin, L., Kao, K. C., & Liao, J. C. (2002). Global expressprofiling of acetate-grownEscherichia coli. The Journal of BiologicaChemistry, 277, 13175–13183.

e’er, D., Regev, A., Elidan, G., & Friedman, N. (2001). Inferring snetworks from perturbed expression profiles.Bioinformatics, 17(S1),215.

esarin, F. (2001).Multivariate permutation tests with applicationsbiostatistics. New York, NY: Wiley.

hillips, Z. E. V., & Strauch, M. A. (2002).Bacillus subtilissporulationand stationary phase gene expression.Cellular and Molecular LifeSciences, 59, 392–402.

ao, C. V., & Arkin, A. P. (2001). Control motifs for intracellular regutory networks.Annual Review of Biomedical Engineering, 3, 391–419

ao, C. V., Wolf, D. M., & Arkin, A. P. (2002). Control, exploitation atolerance of intracellular noise.Nature, 420, 231–237.

obichon, D., Arnaud, M., Gardan, R., Pragai, Z., O’Reilly,Rapoport, G., & Debarbouille, M. (2000). Expression of aoperon fromBacillus subtilis, ykzB–ykoL, under the control of thTnrA and PhoP–phoR global regulators.Journal of Bacteriology, 182,1226–1231.

auer, U., Hatzimanikatis, V., Bailey, J. E., Hochuli, M., Szyperski, TWuthrich, K. (1997). Metabolic fluxes in riboflavin-producingBacillussubtilis. Nature Biotechnology, 15, 448–452.

avageau, M. A. (1998). Rules for the evolution of gene circuitary.Pro-ceedings of the Pacific Symposium on Biocomputing (PSB 199, 3,54.

hmulevich, I., Dougherty, E. R., Kim, S., & Zhang, W. (2002). Probilistic Boolean networks: A rule based uncertainty model for gregulatory networks.Bioinformatics, 18, 261.

regulatory networks with weight matrices.Proceedings of the PacifiSymposium on Biocomputing (PSB 1999), 4, 112.

en, X., Fuhrman, S., Michaels, G. S., Carr, D. B., Smith, S., BJ. L., & Somogyi, R. (1998). Large-scale temporal gene exsion mapping of central nervous system development.PNAS, 95,334.

olf, D. M., & Arkin, A. P. (2002). Fifteen minutes of fim: Controltype 1 pili expression inE. coli. Omics, 6, 91–114.

ong, P., Gladney, S., & Keasling, J. D. (1997). Mathematical modthe lac operon: Inducer exclusion, catabolite repression, and dgrowth on glucose and lactose.Biotechnology Progress, 13, 132–143.

ray, L. V., Jr., Ferson, A. E., & Fisher, S. H. (1997). Expressiothe Bacillus subtilisureABC operon is controlled by multiple reglatory factors including CodY, GlnR, TnrA, and Spo0H.Journal ofBacteriology, 179, 5494–5501.

ray, L. V., Jr., Ferson, A. E., Rohrer, K., & Fisher, S. H. (1996). Tnrtranscription factor required for global nitrogen regulation inBacillussubtilis. Proceedings of the National Academic Science of USA, 93,8841–8845.

ray, L. V., Jr., Zalieckas, J. M., Ferson, A. E., & Fisher, S. H. (19Mutational analysis of the TnrA-binding sites in theBacillus subtilis nrgAB and gabP promoter regions.Journal of Bacteriology, 180,2943–2949.

ray, L. V., Jr., Zalieckas, J. M., & Fisher, S. H. (2001).Bacil-lus subtilisglutamine synthetase controls gene expression throuprotein–protein interaction with transcription factor TnrA.Cell, 107,427–435.

eung, M. K. S., Tegner, J., & Collins, J. J. (2002). Reverse enginegene networks using singular value decomposition and robust rsion.PNAS, 99, 6163.

ak, D. E., Gonye, G. E., Schwaber, J. S., & Doyle, F. J., III. (20Importance of input perturbations and stochastic gene expressthe reverse engineering of genetic regulatory networks: Insightsan identifiability analysis of anin silico network.Genome Researc,13, 2396–2405.