8/9/2019 Biotechnol J 2010 Song
Biotechnology Journal
DOI 10.1002/biot.201000059 Biotechnol. J. 2010, 5, 768–780
© 2010 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim
Research Article

Ensembles of signal transduction models using Pareto Optimal Ensemble Techniques (POETs)

Sang Ok Song, Anirikh Chakrabarti and Jeffrey D. Varner

School of Chemical and Biomolecular Engineering, Cornell University, Ithaca, NY, USA

Mathematical modeling of complex gene expression programs is an emerging tool for understanding disease mechanisms. However, identification of large models sometimes requires training using qualitative, conflicting or even contradictory data sets. One strategy to address this challenge is to estimate experimentally constrained model ensembles using multiobjective optimization. In this study, we used Pareto Optimal Ensemble Techniques (POETs) to identify a family of proof-of-concept signal transduction models. POETs integrate Simulated Annealing (SA) with Pareto optimality to identify models near the optimal tradeoff surface between competing training objectives. We modeled a prototypical signaling network using mass-action kinetics within an ordinary differential equation (ODE) framework (64 ODEs in total). The true model was used to generate synthetic immunoblots from which the POET algorithm identified the 117 unknown model parameters. POET generated an ensemble of signaling models, which collectively exhibited population-like behavior. For example, scaled gene expression levels were approximately normally distributed over the ensemble following the addition of extracellular ligand. Also, the ensemble recovered robust and fragile features of the true model, despite significant parameter uncertainty. Taken together, these results suggest that experimentally constrained model ensembles could capture qualitatively important network features without exact parameter information.

Keywords: Mathematical modeling · Robustness and fragility · Systems biology

Correspondence: Professor Jeffrey D. Varner, School of Chemical and Biomolecular Engineering, 244 Olin Hall, Cornell University, Ithaca, NY 14853, USA
E-mail: [email protected]
Fax: +1-607-255-9166

Abbreviations: ODE, ordinary differential equation; POET, Pareto Optimal Ensemble Technique; SA, Simulated Annealing

Received 11 May 2010
Revised 14 June 2010
Accepted 21 June 2010

Supporting information available online

1 Introduction

Mathematical modeling of signal transduction and gene expression programs is an emerging tool for understanding disease mechanisms. Kitano [1] suggested that analysis of molecular networks using predictive computer models will play an increasingly important role in biomedical research. However, conventional wisdom suggests that the data requirement to identify and validate complex mechanistic models is too large. Molecular network models often exhibit complex behavior [2]. Typically, it is not possible to uniquely identify model parameters, even with extensive training data and perfect models [3]. Thus, despite identification standards [4] and the integration of model identification with experimental design [5], parameter estimation remains challenging even with structurally complete models. This reality has brought into the foreground a number of interesting questions. For example, do we actually need exact parameter knowledge to predict qualitatively important properties of a molecular network? Or can we estimate which components and connections are central to network function given only limited parameter information?

Two schools of thought have emerged on how uncertain models can be used to understand molecular network function. Bailey hypothesized that qualitative properties of metabolic or signaling networks could be determined using network structure without parameter knowledge [6]. Certainly, there is literature evidence supporting the Bailey hypothesis in metabolic networks [7]. Studies exploring network modularity [8] have also identified recurrent motifs that betray natural design principles. Alternatively, ensemble approaches, which use uncertain model families, have also emerged to deal with uncertainty in systems biology and other fields like weather prediction [9–13]. Their central value has been the ability to quantify simulation uncertainty and to constrain model predictions. For example, Gutenkunst et al. [14] showed that predictions were possible using ensembles of signal transduction models despite sometimes only order-of-magnitude parameter estimates. Beyond their ability to robustly describe data, uncertain deterministic ensembles might be a coarse-grained strategy to explore population dynamics when stochastic simulation is too expensive.

There are several techniques to generate parameter ensembles. Battogtokh et al. [10] and later Brown et al. [12] generated experimentally constrained parameter ensembles using a Metropolis-type random walk through parameter space. Moles et al. [15] contrasted evolutionary and deterministic optimization techniques, any one of which could be adapted for ensemble generation. However, the unifying component of these previous identification strategies has been the minimization of a single objective function.
In this study, we used Pareto Optimal Ensemble Techniques (POETs) to identify a family of proof-of-concept signal transduction models. Our objectives were to test a modification to the original POET algorithm published by Song et al. [9] and to more deeply explore the properties of model ensembles. The motivation for POETs is practical. The identification of models with hundreds, thousands or even tens of thousands of parameters requires that we use measurements from multiple laboratories or even different cell lines. These training data can contain conflicts or can sometimes even be contradictory. Thus, a central challenge when identifying large models is the ability to balance conflicts in diverse training data. POETs, which integrate Simulated Annealing (SA) and multiobjective optimization through the notion of Pareto rank, find solutions that optimally balance these tradeoffs. The modified POETs strategy described here improved the performance of the original algorithm using a local parameter refinement step. Interestingly, the model ensemble generated using POET exhibited coarse-grained heterogeneity, suggesting that deterministic ensembles could perhaps be used to model heterogeneous populations.

A secondary challenge was the subsequent characterization of network features in a family of models, using sensitivity analysis. Sensitivity analysis has enabled the investigation of robustness and fragility in molecular networks (see [9, 16–19]). Sensitivity analysis has also been crucial to model identification, discrimination and experimental design [3, 20–23]. However, sensitivity analysis, using first-order sensitivity coefficients, is a function of the model parameters. Thus, another open question explored here was whether qualitative properties estimated by sensitivity analysis were recovered by the ensemble. We demonstrate that model ensembles recovered highly robust and fragile features of the true model, despite significant parameter uncertainty.
2 Materials and methods
2.1 Formulation, solution and analysis of the model equations
We identified a family of models describing a growth factor-induced three-gene transcriptional program (Fig. 1). The model is available in SBML format in the supplemental materials. The model was formulated as a set of coupled ordinary differential equations (ODEs):

$$\frac{d\mathbf{x}}{dt} = \mathbf{S}\cdot\mathbf{r}(\mathbf{x},\mathbf{k}), \qquad \mathbf{x}(t_o) = \mathbf{x}_o, \qquad \mathbf{y}(t) = \mathbf{Y}\,\mathbf{x}(t) \tag{1}$$

where $\mathbf{x}$ denotes the species concentration vector (64 × 1), $\mathbf{k}$ denotes the parameter vector (117 × 1) and $\mathbf{r}(\mathbf{x},\mathbf{k})$ denotes the vector of reaction rates (117 × 1). The symbol $\mathbf{S}$ denotes the stoichiometric matrix (64 × 117). The (i,j) element of $\mathbf{S}$, denoted by $\sigma_{ij}$, described the relationship between protein i and rate j. If $\sigma_{ij} < 0$, protein i was consumed by $r_j$; if $\sigma_{ij} > 0$, protein i was produced by $r_j$. Lastly, if $\sigma_{ij} = 0$, protein i was not involved in rate j. The symbol $\mathbf{y}$ denotes the model output vector, where $\mathbf{Y}$ denotes the measurement selection matrix.

We assumed mass-action kinetics for each interaction in the network. The rate expression for reaction q was given by:

$$r_q(\mathbf{x}, k_q) = k_q \prod_{j \in \{\mathbf{R}_q\}} x_j^{-\sigma_{jq}} \tag{2}$$

The quantity $\{\mathbf{R}_q\}$ denotes the set of reactants for reaction q, while $k_q$ denotes the rate constant governing reaction q. The symbols $\sigma_{jq}$ denote the stoichiometric coefficients (elements of $\mathbf{S}$) for the reactants involved with reaction q. All reversible interactions were split into two irreversible steps; thus, every interaction in the model was non-negative. Inactive or infrastructure proteins and macromolecules (R1, A1, A2, iTF, iK, EXPORT, IMPORT, PH and PH-TF), RNAP and ribosomes were assumed to have zero-order production rates and first-order degradation rates. These rate constants were estimated along with the binding and catalytic model parameters. All initial conditions were zero except gene 1, 2, and 3 (1 if present, 0 if absent). We accounted for membrane, cytosolic and nuclear proteins and mRNA by explicitly defining separate species in each of these compartments.
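To make the structure of Eqs. (1) and (2) concrete, the following minimal Python sketch evaluates mass-action rates and the product of the stoichiometric matrix with the rate vector for a hypothetical two-reaction binding network (L + R → C and C → L + R). The species, rate constants and function names are assumptions chosen for illustration; they are not taken from the 64-species model.

```python
# Toy mass-action model: L + R -> C (rate constant k[0]) and C -> L + R (k[1]).
# Species order: [L, R, C]. All names and values are illustrative assumptions.

# Stoichiometric matrix (species x reactions): negative = consumed, positive = produced.
S = [
    [-1, 1],   # L
    [-1, 1],   # R
    [1, -1],   # C
]

def reaction_rates(x, k, S):
    """Mass-action rates: r_q = k_q * product over reactants j of x_j ** (-S[j][q])."""
    rates = []
    for q, k_q in enumerate(k):
        r_q = k_q
        for j, x_j in enumerate(x):
            if S[j][q] < 0:                  # species j is a reactant of reaction q
                r_q *= x_j ** (-S[j][q])
        rates.append(r_q)
    return rates

def mass_balances(x, k, S):
    """Right-hand side of the ODE system: dx/dt = S . r(x, k)."""
    r = reaction_rates(x, k, S)
    return [sum(S[i][q] * r[q] for q in range(len(r))) for i in range(len(x))]

x = [2.0, 3.0, 1.0]      # concentrations of L, R, C (arbitrary units)
k = [0.5, 0.1]           # association and dissociation rate constants
print(reaction_rates(x, k, S))
print(mass_balances(x, k, S))
```

Splitting the reversible binding step into two irreversible reactions, as in the sketch, keeps every rate non-negative and lets a single loop over the columns of S generate all rate expressions.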
Mass-action kinetics, while expanding the dimension of the model, regularized its mathematical structure. This allowed automatic generation of the model code using the UNIVERSAL code generation tool. UNIVERSAL, an open source Java code generator, supports the generation of model code from text and SBML files. UNIVERSAL currently supports multiple code types (Matlab/Octave-M, Octave-C, Sundials-C, GSL-C and Scilab) and it is extensible with a simple plugin API. UNIVERSAL is freely available as a Google Code project. Model code was generated as a C++ Octave module and solved using the LSODE routine of Octave (www.octave.org). When calculating the response of the model to ligand, we ran the model to steady-state and then simulated the addition of ligand. The steady-state was estimated numerically by repeatedly solving the model equations and estimating the difference between subsequent time points:

$$\left\| \mathbf{x}(t + \Delta t) - \mathbf{x}(t) \right\|_2 \le \gamma \tag{3}$$

The quantities $\mathbf{x}(t)$ and $\mathbf{x}(t + \Delta t)$ denote the simulated concentration vector at time $t$ and $t + \Delta t$, respectively. The L2 vector-norm was used as the distance metric. We used $\Delta t$ = 100 s and $\gamma$ = 0.01 for all simulations.
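The steady-state test can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: a fixed-step fourth-order Runge-Kutta integrator stands in for LSODE, the model is a hypothetical single protein with zero-order production and first-order degradation, and the names `run_to_steady_state`, `delta_t` and `gamma` are ours. A stiff 64-species model would need an implicit solver instead.

```python
import math

def rk4_step(f, x, h):
    """One classical fourth-order Runge-Kutta step for dx/dt = f(x)."""
    k1 = f(x)
    k2 = f([xi + 0.5 * h * ki for xi, ki in zip(x, k1)])
    k3 = f([xi + 0.5 * h * ki for xi, ki in zip(x, k2)])
    k4 = f([xi + h * ki for xi, ki in zip(x, k3)])
    return [xi + h / 6.0 * (a + 2 * b + 2 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]

def run_to_steady_state(f, x0, delta_t=100.0, gamma=0.01, n_sub=100, max_windows=1000):
    """Integrate in windows of length delta_t until ||x(t + delta_t) - x(t)||_2 <= gamma."""
    x = list(x0)
    h = delta_t / n_sub
    for _ in range(max_windows):
        x_next = x
        for _ in range(n_sub):
            x_next = rk4_step(f, x_next, h)
        distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_next, x)))
        x = x_next
        if distance <= gamma:
            return x
    raise RuntimeError("steady state not reached within max_windows")

# Hypothetical protein with zero-order production and first-order degradation.
k_prod, k_deg = 0.2, 0.01              # assumed values, arbitrary units
f = lambda x: [k_prod - k_deg * x[0]]
x_ss = run_to_steady_state(f, [0.0])
print(x_ss)                            # close to k_prod / k_deg
```

The stopping rule bounds the change over one window, not the distance to the true steady state, so the tolerance should be chosen with the slowest time scale of the model in mind.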
Figure 1. Schematic of the prototypical signaling network used in this study. Extracellular ligand L binds surface receptor R1, driving the phosphorylation of transcription factor TF. TF up-regulates gene 1 expression. Gene 1 then initiates a cascade resulting in the expression of gene 2 and gene 3. Gene 3 down-regulates the expression of gene 1. The model is available in SBML format in the supplemental materials.

Sensitivity analysis was used to estimate which network components were fragile or robust. First-order sensitivity coefficients at time $t_q$:

$$s_{ij}(t_q) = \left.\frac{\partial x_i}{\partial k_j}\right|_{t_q} \tag{4}$$

were computed by solving the kinetic-sensitivity equations [24]:

$$\frac{d\mathbf{s}_j}{dt} = \mathbf{A}(t)\,\mathbf{s}_j + \mathbf{b}_j(t), \qquad j = 1, 2, \ldots, P \tag{5}$$

subject to the initial condition $\mathbf{s}_j(t_0) = \mathbf{0}$. The quantity j denotes the parameter index, P denotes the number of parameters in the model, $\mathbf{A}$ denotes the Jacobian matrix, and $\mathbf{b}_j$ denotes the jth column of the matrix of first derivatives of the mass balances with respect to the parameters. Sensitivity coefficients were calculated by repeatedly solving the extended kinetic-sensitivity system for each parameter using the LSODE routine of Octave (www.octave.org) over a sparse sampling (approximately 10%) of the ensemble (see Fig. 3). The Jacobian $\mathbf{A}$ and the $\mathbf{b}_j$ vector were calculated at each time step using their analytical expressions generated by UNIVERSAL. The resulting sensitivity coefficients were then scaled and time-averaged (trapezoid rule):

$$N_{ij} = \frac{1}{T}\int_0^T dt\, \left|\alpha_{ij}(t)\, s_{ij}(t)\right| \tag{6}$$

where T denotes the final simulation time and $\alpha_{ij} = 1$ (unscaled) or $\alpha_{ij}(t) = k_j/x_i(t)$ (scaled). The scaled time-averaged sensitivity coefficients were then organized into an array for each ensemble member:

$$\mathbf{N}(\epsilon) = \begin{pmatrix} N_{11}^{(\epsilon)} & N_{12}^{(\epsilon)} & \cdots & N_{1j}^{(\epsilon)} & \cdots & N_{1P}^{(\epsilon)} \\ N_{21}^{(\epsilon)} & N_{22}^{(\epsilon)} & \cdots & N_{2j}^{(\epsilon)} & \cdots & N_{2P}^{(\epsilon)} \\ \vdots & \vdots & & \vdots & & \vdots \\ N_{M1}^{(\epsilon)} & N_{M2}^{(\epsilon)} & \cdots & N_{Mj}^{(\epsilon)} & \cdots & N_{MP}^{(\epsilon)} \end{pmatrix}, \qquad \epsilon = 1, 2, \ldots, N_{\epsilon} \tag{7}$$

where $\epsilon$ denotes the index of the ensemble member, P denotes the number of parameters, $N_{\epsilon}$ denotes the number of ensemble samples and M denotes the number of model species. The $\mathbf{B}_i$ matrix contained the time-averaged sensitivities for a single species for each parameter (rows) as a function of the ensemble (columns):

$$\mathbf{B}_i = \begin{pmatrix} N_{i1}^{(1)} & N_{i1}^{(2)} & \cdots & N_{i1}^{(N_{\epsilon})} \\ N_{i2}^{(1)} & N_{i2}^{(2)} & \cdots & N_{i2}^{(N_{\epsilon})} \\ \vdots & \vdots & & \vdots \\ N_{iP}^{(1)} & N_{iP}^{(2)} & \cdots & N_{iP}^{(N_{\epsilon})} \end{pmatrix}, \qquad i = 1, 2, \ldots, M \tag{8}$$

To estimate the relative fragility or robustness of species and reactions in the network, we decomposed the $\mathbf{N}(\epsilon)$ or the $\mathbf{B}_i$ matrices using Singular Value Decomposition (SVD):

$$\mathbf{N}(\epsilon) = \mathbf{U}(\epsilon)\,\boldsymbol{\Sigma}(\epsilon)\,\mathbf{V}^T(\epsilon), \qquad \mathbf{B}_i = \mathbf{U}_i\,\boldsymbol{\Sigma}_i\,\mathbf{V}_i^T \tag{9}$$

Coefficients of the left (right) singular vectors corresponding to the largest singular values of $\mathbf{N}(\epsilon)$ were rank-ordered to estimate important species (reaction) combinations. Only coefficients with magnitude greater than a threshold (0.1) were considered. The fraction of the vectors in which a reaction or species index occurred was used to rank its importance. Similarly, the left singular vectors of $\mathbf{B}_i$ showed which reaction combinations were important for species i, while the right singular vectors rank-ordered which ensemble members contributed most significantly to the sensitivity of species i.
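The kinetic-sensitivity calculation can be illustrated on a one-state model. The Python sketch below is a hedged example, not the implementation used in the study: it integrates a hypothetical mass balance dx/dt = k_prod − k_deg·x together with its sensitivity equation for k_deg, where the Jacobian is −k_deg and the parameter derivative is −x, and then time-averages the scaled coefficient with the trapezoid rule. A fixed-step Runge-Kutta integrator replaces LSODE, the absolute value in the average is an assumption, and all names and values are illustrative.

```python
def extended_rhs(z, k_prod, k_deg):
    """Mass balance plus kinetic-sensitivity equation for parameter k_deg."""
    x, s = z
    dxdt = k_prod - k_deg * x        # mass balance
    jacobian = -k_deg                # derivative of the mass balance with respect to x
    b = -x                           # derivative of the mass balance with respect to k_deg
    dsdt = jacobian * s + b          # sensitivity equation for one state and one parameter
    return [dxdt, dsdt]

def rk4_step(f, z, h):
    """One classical fourth-order Runge-Kutta step for dz/dt = f(z)."""
    k1 = f(z)
    k2 = f([zi + 0.5 * h * ki for zi, ki in zip(z, k1)])
    k3 = f([zi + 0.5 * h * ki for zi, ki in zip(z, k2)])
    k4 = f([zi + h * ki for zi, ki in zip(z, k3)])
    return [zi + h / 6.0 * (a + 2 * b + 2 * c + d)
            for zi, a, b, c, d in zip(z, k1, k2, k3, k4)]

def time_averaged_sensitivity(k_prod, k_deg, x0, t_final, n_steps, scaled=True):
    """Return the time-averaged |alpha * s| and the final state [x, s]."""
    h = t_final / n_steps
    z = [x0, 0.0]                    # sensitivity starts at zero
    rhs = lambda state: extended_rhs(state, k_prod, k_deg)

    def integrand(state):
        alpha = k_deg / state[0] if scaled else 1.0
        return abs(alpha * state[1])

    total = 0.0
    previous = integrand(z)
    for _ in range(n_steps):
        z = rk4_step(rhs, z, h)
        current = integrand(z)
        total += 0.5 * h * (previous + current)   # trapezoid rule
        previous = current
    return total / t_final, z

n_scaled, (x_end, s_end) = time_averaged_sensitivity(0.2, 0.01, 1.0, 300.0, 3000)
print(n_scaled, x_end, s_end)
```

A quick way to check such a calculation is to compare the final sensitivity with a central finite difference of the simulated state with respect to the parameter; the two should agree to several digits.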
2.2 POETs
POETs integrate SA with Pareto optimality to estimate parameter sets on or near the optimal tradeoff surface between competing training objectives (Fig. S1). Here, we modified the original algorithm [9] to improve its convergence properties. Denote a candidate parameter set at iteration i + 1 as $\mathbf{k}_{i+1}$. The squared error for $\mathbf{k}_{i+1}$ for training set j was defined as:
(10)
The symbol $M_{ij}$ denotes scaled experimental observations (from training set j), while the symbol $y_{ij}$ denotes the scaled simulation output (from training set j). The quantity i denotes the sampled time index and $T_j$ denotes the number of time points for experiment j. We assumed only immunoblots were available for training, with the exception of a single qRT-PCR or ELISA measurement of the highest intensity band. The first term in the objective function quantified the relative simulation error. The read-out from the training immunoblots was band intensity, where we assumed intensity was only loosely proportional to concentration. Suppose we have the intensity for species x at time i = {t1, t2, ..., tn} in condition j. The scaled-value measurement would then be given by:

(11)

Under this scaling, the lowest intensity band equaled zero, while the highest intensity band equaled one. A similar scaling was defined for the simulation output. The second term in the objective function quantified the error in the estimated concentration scale. We assumed only the highest intensity bands were quantified absolutely (denoted by $M_{ij}$) and compared with the simulation. However, if these measurements were not available, the second term could be adjusted to ensure the model operated on physiologically relevant concentration scales.
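One scaling consistent with the description above (lowest band equal to zero, highest band equal to one) is min-max normalization over the time course. The Python sketch below assumes that form; it is an illustration and should not be read as the exact expression used in the study. The function names and the example intensities are hypothetical.

```python
def scale_bands(intensities):
    """Min-max scale band intensities so the lowest maps to 0 and the highest to 1."""
    lo, hi = min(intensities), max(intensities)
    if hi == lo:
        raise ValueError("all bands have the same intensity; scaling is undefined")
    return [(v - lo) / (hi - lo) for v in intensities]

def relative_error(measured, simulated):
    """Sum of squared differences between scaled measurements and scaled simulation."""
    m_scaled = scale_bands(measured)
    y_scaled = scale_bands(simulated)
    return sum((m - y) ** 2 for m, y in zip(m_scaled, y_scaled))

bands = [120.0, 480.0, 900.0, 660.0]       # hypothetical band intensities over time
simulated = [0.1, 0.5, 1.1, 0.8]           # hypothetical simulated concentrations
print(scale_bands(bands))
print(relative_error(bands, simulated))
```

Because both series are scaled independently, this term compares only the shape of the time course; the absolute concentration scale has to be constrained separately, which is the role of the second term in the objective function.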
We computed the Pareto rank of $\mathbf{k}_{i+1}$ by comparing the simulation error at iteration i + 1 against the simulation archive $\mathbf{K}_i$. We used the Fonseca and Fleming ranking scheme [25]:

$$\mathrm{rank}(\mathbf{k}_{i+1} \mid \mathbf{K}_i) = p \tag{12}$$

where p denotes the number of parameter sets that dominate parameter set $\mathbf{k}_{i+1}$. Parameter sets on or near the optimal trade-off surface have small rank ( […]

[…] was discretized into 10 quanta between $T_o$ and $T_f$ and adjusted according to the schedule $T_k = \beta^k T_o$ where $\beta$ was defined as:

$$\beta = \left(\frac{T_f}{T_o}\right)^{1/10} \tag{14}$$

The epoch-counter k was incremented after the addition of 50 members to the ensemble. Thus, as the ensemble grew, the likelihood of accepting parameter sets with a large Pareto rank decreased. To generate parameter diversity, we randomly perturbed each parameter by 50%. However, in addition to a random-walk strategy (previous algorithm), we performed a local pattern search every q steps to minimize the residual for a single randomly selected objective. The local pattern-search algorithm has been described previously [26, 27]. The parameter ensemble used in the simulation and sensitivity studies was generated from the low-rank parameter sets in $\mathbf{K}_i$.
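The ranking and cooling steps can be sketched as follows. This Python example is a minimal illustration under stated assumptions: the Pareto rank counts the archive members that dominate a candidate, and the cooling factor is chosen so that ten multiplicative steps take the temperature from its initial to its final value. The Boltzmann-type acceptance rule in `acceptance_probability` is an assumption added for illustration, and all names and numbers are hypothetical.

```python
import math

def dominates(a, b):
    """True if error vector a is no worse than b in every objective and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_rank(candidate, archive):
    """Fonseca-Fleming style rank: number of archive members that dominate the candidate."""
    return sum(1 for member in archive if dominates(member, candidate))

def cooling_factor(t_initial, t_final, n_quanta=10):
    """Factor such that t_initial * factor ** n_quanta equals t_final."""
    return (t_final / t_initial) ** (1.0 / n_quanta)

def acceptance_probability(rank, temperature):
    """ASSUMED acceptance rule: low-rank candidates are accepted more readily."""
    return math.exp(-rank / temperature)

archive = [(1.0, 4.0), (2.0, 2.0), (4.0, 1.0)]    # hypothetical two-objective errors
print(pareto_rank((3.0, 3.0), archive))           # dominated only by (2.0, 2.0)
print(pareto_rank((0.5, 5.0), archive))           # not dominated by any archive member

factor = cooling_factor(t_initial=1.0, t_final=0.1)
temperatures = [1.0 * factor ** k for k in range(11)]
print(temperatures[-1])
```

With a rule of this kind, a rank-zero candidate is always accepted, while the chance of accepting a dominated candidate shrinks as the temperature falls, which matches the behavior described for the growing ensemble.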
3 Results

3.1 Summary

We identified and analyzed a family of canonical signal transduction models using POETs and sensitivity analysis. POET has previously been used to identify molecular models of pain signaling [9]. We modified the original algorithm by integrating a local pattern-search routine, which better controlled the absolute error in the ensemble identification. The original and modified algorithms were used to estimate an ensemble of signaling models. The model, which was assumed to have a known network structure, described the integration of extracellular signals with kinase activation, the phosphorylation of transcription factors, and the up-regulation of an associated transcriptional program (Fig. 1). Thus, while not specific to a particular growth factor, signaling cascade or expression program, it contained many of the general features encountered when identifying specific models. We modeled the molecular interactions in the prototypical signaling network using mass-action kinetics within an ODE framework. ODEs and mass-action kinetics are common methods of modeling biological pathways [9, 16–18, 28–32]. We assumed spatial homogeneity but differentiated between cytosolic, membrane and nuclear localized processes. The true model (known parameters) was used to generate synthetic data from which we tested the POET algorithm. Each synthetic measurement was assumed to be a Northern or Western blot. Thus, we knew only relative amounts of protein or mRNA for any specific condition or time. To constrain the absolute concentration scale, we assumed a single ELISA or qRT-PCR measurement for the highest intensity band in each case. Lastly, we limited our training data to 20 samples per experiment (an upper limit on the lanes available on a Western blot).
The modified POET algorithm performed better than the original implementation and generated an ensemble that collectively exhibited population-like behavior. First, the ODE model used here was deterministic and did not describe stochastic gene expression fluctuations. However, because many different parameter sets were sampled, the deterministic ensemble exhibited population-like behavior. For example, scaled gene expression levels were approximately normally distributed following the addition of extracellular ligand. Thus, while gene expression was not described at a single-cell level, the ensemble captured coarse-grained expression heterogeneity. This suggested that deterministic ensembles could perhaps be used to model heterogeneous populations. Second, the model ensemble captured the robust and fragile features of the true model, despite significant parameter uncertainty. Edge (interactions between species) and node (species) ranks computed over the ensemble using sensitivity analysis were consistent with the true rankings, at least for highly fragile and robust network components. This suggested that, in practice, results from sensitivity analysis obtained by analyzing model ensembles could represent true behavior to a high degree of certainty, at least for highly fragile or robust network features. The true model is available in SBML format in the supplemental materials.
3.2 Estimating an ensemble of models using multiobjective optimization

We estimated an ensemble of signal transduction models from synthetic data sets using POET (Fig. S1). The canonical model had 117 unknown kinetic constants, primarily of three types (association, dissociation or catalytic rate constants). Because we used mass-action kinetics, every network interaction was governed by a single parameter. Using the true model, we generated 24 synthetic data sets using a (3,2,2,2)-level factorial design. The design variables considered were the level of ligand stimulation (L = 0, L = 10 and L = 50) and the presence and absence of gene 1, 2 and 3. In each data set, we assumed inactivated/activated kinase (cytosol), inactivated/activated transcription factor (cytosol), mRNA for protein 1 (cytosol) and the cytosolic level of protein 1 were measured at 20 points equidistant over the time-course of the experiment (approximately 3 h). Each synthetic data set became an objective in the optimization calculation from which we estimated the model ensemble (24 objectives in total).
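The 24 training conditions follow directly from the factorial design. The short Python sketch below enumerates them; the dictionary keys and the enumeration order are illustrative assumptions and do not necessarily match the numbering of the objectives used in the study.

```python
from itertools import product

ligand_levels = [0, 10, 50]          # ligand stimulation levels
gene_states = [0, 1]                 # 0 = gene deleted, 1 = gene present

# Full (3,2,2,2) factorial design: one synthetic data set (objective) per combination.
design = [
    {"ligand": ligand, "gene1": g1, "gene2": g2, "gene3": g3}
    for ligand, g1, g2, g3 in product(ligand_levels, gene_states, gene_states, gene_states)
]

print(len(design))     # 3 * 2 * 2 * 2 combinations
print(design[0])
```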
The POET algorithm with local parameter refinement performed better than the original implementation (Fig. 2). Both implementations started from the same randomized parameter seed, used the same software libraries and were run over a 72-h period on the same hardware. Both implementations used a maximum acceptable Pareto rank of three or less. The modified algorithm generated 2882 ranked sets, of which 1062 had a Pareto rank equal to zero (Fig. 2, black circles). On the other hand, the original POET implementation generated 20645 ranked sets, of which 1538 had a Pareto rank equal to zero (Fig. 2, gray circles). While local refinement required additional function evaluations, the median training residuals were lower than those of the original implementation (Fig. S2). The quality of the resulting ensemble generated with local refinement was also higher. Approximately 47% of the model parameters (55 of 117) were constrained with a coefficient of variation (CV) of less than or equal to one (Fig. 3A). In comparison, the minimum CV produced by the original implementation was 1.7 (Fig. 3B). The top five constrained parameters were protein 1 (cytosol), RNAP and EXPORT degradation (all 0.64), the degradation of mRNA for gene 3 (0.65; negative regulator of P1 expression) and the constitutive expression of gene 1 (0.67). The top five least-constrained parameters were associated with kinase regulation or regulated gene 1 expression (CV > 2). Well-constrained parameters were pseudo-normally distributed with a strong positive skew, while parameters with a high CV were approximately exponentially distributed (Fig. S3). Analysis of the residuals produced by POET gave insight into relationships in the training data (Fig. 2). For example, O6 O2 and similar- […]
Figure 3. Coefficient of variation (CV) of model parameters estimated using POET with local parameter refinement (A) and the original implementation (B). Both panels plot the parameter CV against the sorted parameter index. The solid line denotes the mean CV calculated over the entire ensemble, while the points denote the CV of the ensemble sample used in the sensitivity analysis calculations. Approximately 47% or 55 of 117 parameters had CV ≤ 1 for POET with local refinement. The minimum CV obtained using the original POET implementation was 1.7.
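The CV summary shown in Fig. 3 can be reproduced for any parameter ensemble with a few lines of Python. The sketch below is illustrative only: the ensemble values are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
import statistics

def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(samples) / statistics.mean(samples)

# Hypothetical ensemble: each row is one parameter set, each column one parameter.
ensemble = [
    [0.9, 0.010, 5.0],
    [1.1, 0.200, 5.5],
    [1.0, 0.001, 4.5],
    [1.2, 1.500, 5.2],
]

cv_per_parameter = [coefficient_of_variation(column) for column in zip(*ensemble)]
sorted_cv = sorted(cv_per_parameter)             # ordering used for a sorted-index plot
n_constrained = sum(cv <= 1.0 for cv in cv_per_parameter)
print(sorted_cv)
print(n_constrained, "of", len(cv_per_parameter), "parameters have CV <= 1")
```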
Figure 2. Objective function array for parameter sets with Pareto rank = 0 for the original POET implementation (gray circles) and POET with local parameter refinement (black circles). Eight objectives are shown from the 24 objectives used in the model identification. The symbol $O_j$ indicates the jth objective function. Points indicate the error associated with ensemble parameter sets. Objectives were defined using a (3,2,2,2)-level factorial design (ligand, gene 1, gene 2, gene 3): O1 = (2,2,1,1), O2 = (2,2,1,2), O3 = (2,2,2,1), O4 = (2,2,2,2), O5 = (3,2,1,1), O6 = (3,2,1,2), O7 = (3,2,2,1) and O8 = (3,2,2,2). Design levels: ligand (1,2,3) = (0,10,50) and gene j (1,2) = (deleted, present).
[…] parameter ensemble for two key catalytic reactions in our network, namely, the activation of kinase by activated receptor and the phosphorylation of transcription factor by activated kinase, to determine if the Michaelis–Menten assumption was valid. We considered parameter sets from the locally refined parameter ensemble with Pareto rank ≤ 3. For these reactions, the on- and catalytic rate constants had a CV ≤ 1, while the off-rates were not well constrained (CV > 2). On average, the Michaelis–Menten assumption was violated by 35% of the ensemble, suggesting that we could possibly reduce model complexity by changing the kinetics. However, mass-action kinetics have the advantages of regularized mathematical structure and simplicity, which offset the added complexity.
3.3 Rank-based assessment of nodes and edges was conserved by the ensemble

A key question when using model ensembles is whether the rank-based assessment of critical network components is correct, given significant parametric diversity. Previously, we approached this question by comparing the nodes or edges predicted to be important in a variety of models with literature [9, 17, 19, 33]. However, these comparisons were imperfect. Many factors were likely different between the experimental and modeling studies. Moreover, these comparisons were only as reliable as the underlying literature search, which was not exhaustive. In this study, we validated the classification of nodes and edges as fragile or robust by comparing the true model with models from the ensemble.
Local processes such as transcription factorregulation and global infrastructure like RNAP, nu-clear transport and translation were the most frag-ile components of the prototypical-signaling net-work.First-order sensitivity coefficients were com-puted for the true parameters and the ensemble.These coefficients were then time-averaged toform the Nand Barrays (see Materials and meth-ods). The magnitude of the coefficients of the left
0
1
2
3
4
5
6
7
0 0.5 1 1.5 2 2.5 3
0
2
4
6
8
10
12
0 0.5 1 1.5 2 2.5 30
1
2
3
4
5
6
7
0 0.5 1 1.5 2 2.5 3
0
0.5
1
1.5
2
2.5
3
3.5
4
4.5
0 0.5 1 1.5 2 2.5 3
Time
Time Time
Time
Protein1cytosol
(p1C)(A.U)
P
rotein2cytosol
(p2C)(A.U)
Protein3cytosol
(p3C)(A.U)
mRNAProtein2
Cytosol(A.U)
(A) (B)
(C) (D)
O7
O8
O8
O8
Figure 5. Model predictions following the addition of ligand L versus modified synthetic data. The dashed lines denote the mean simulated value over the
ensemble; the gray region denotes the 95% confidence interval. The points denote the mean synthetic data used to validate the model. The validation data
was generated from the training data by adding a background level of the ligand (L =1) and by considering species not used for training (with the exception
of protein 1). (A) Cytosolic levels of protein 1 versus time. Points denote the O7 = (3,2,2,1) data set in the presence of background ligand (L =1). (B) Cytoso-
lic levels of protein 3 versus time. Points denote the O8 = (3,2,2,2) data set in the presence of background ligand (L =1). (C) Cytosolic levels of protein 2 ver-
sus time. Points denote the O8 = (3,2,2,2) data set in the presence of background ligand (L =1). (D) Cytosolic levels of mRNA for protein 2 versus time.
Points denote the O8 = (3,2,2,2) data set in the presence of background ligand (L =1).
8/9/2019 Biotechnol J 2010 Song
9/13
BiotechnologyJournal
Biotechnol. J. 2010, 5, 768780
776 2010 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim
(right) singular vectors corresponding to largest singular values ofNwere used to rank-order theimportance of the nodes (edges) in the model(Fig.7).The most sensitive node combinations with=1 involved the regulation of activated transcrip-tion factor (aTF) and the transport of aTF into thenucleus (Fig. 7, top). Similarly, the most sensitiveedges involved PH-TF regulation of aTF, the pro-duction, degradation and regulation of the specifickinase for TF (iK/aK), the production and degrada-
tion of iTF and the production/degradation of PH-TF. Analysis of additional singular vectors (in-creased ) highlighted the role of global infrastruc-ture like RNAP, nuclear transport (IMPORT/EX-PORT) and translation (Fig. 7, middle and bottom).Analysis of the left singular vectors of the BaTFma-trix also supported these findings. On the otherhand, the most robust species and reaction combi-nations involved the assembly of the adaptor com-plex and the basal expression of gene 1, 2 and 3.Subpopulations in the ensemble behaved differ-ently. Analysis of the right singular vectors ofBaTF
suggested which ensemble elements most influ-enced a particular species. For example, examina-tion of the top and bottom three ranked ensemblemembers,estimated from the right singular vectorsof BaTF , showed the highest ranked ensemblemembers had similar aTF trajectories (Fig. S5, sol-id-lines). Conversely, the lowest three had widely varying aTF levels (Fig. S5, dashed-lines). Thus,subpopulations with qualitatively distinct behaviorwere present in the ensemble and decomposing theB
array could identify these elements.Edge and node ranks computed over the en-semble recovered the true rankings for highly frag-ile and highly robust network components (Fig. 8).We compared the node (species) and edge (inter-action) ranks computed using sensitivity analysisfor the true parameter set with the ensemble (=1).The Kendall and Spearman rank correlations wereused to quantify the agreement between the trueand estimated ranked lists (Table 1).The Spearmanand Kendall correlation coefficients were approxi-mately normally distributed for both node and edge
0
200
400
600
00
50
100
150
200
-0.4 -0.2 0 0.2 0.4 0.6 0.8 1
0
40
80
120
-0.2 0 0.2 0.4 0.6 0.8 10
40
80
120
0 0.2 0.4 0.6 0.8 1 1.2
0
40
80
120
0.2 0.4 0.6 0.8 1 1.20
50
100
150
200
0.2 0.4 0.6 0.8 1 1.2
0
100
200
300
400
0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1.1 1.20
100
300
500
0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1.1
t = 0.10 hr t = 0.20 hr
t = 0.25 hr t = 0.30 hr
t = 0.40 hr t = 0.50 hr
t = 0.75 hr t = 1.0 hr
NumberofCells
Scaled protein concentration (A.U)
Figure 6. Distributions for the scaled cytosolic protein 1 concentration as a function of time following the addition of extracellular ligand L. Bars denote
expression bins for protein 1 (expression levels were sub-divided into 10 bins). The solid line denotes a normal distribution fit to the histogram (histfit
function of Octave). Initially, the ensemble was synchronized with low scaled protein 1 expression (upper left-hand plot). After the addition of the ligand L
the distribution of cells expressing protein 1 shifted to the right (progressing through an approximately normal distribution during active expression of pro-
tein 1). After t =1.0 h, the bulk of the cells reached their maximum cytosolic levels of protein 1 (lower right-hand corner).
8/9/2019 Biotechnol J 2010 Song
10/13
2010 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim 777
Biotechnol. J. 2010, 5, 768780 www.biotechnology-journal.com
fragility over the model ensemble (data not shown). Ranks estimated using unscaled sensitivity coefficients gave the best correlation with the true parameter values. The Kendall correlation between the true node rank and that estimated from the ensemble was 0.57 ± 0.15, while the mean edge rank correlation was 0.72 ± 0.09. The mean Spearman rank correlation for node rank was 0.73 ± 0.16, while the mean correlation for edge rank was 0.87 ± 0.08. Additionally, when we computed the correlation between the true rank and the mean node/edge rank (mean rank calculated over the ensemble before the rank correlation test), the Spearman correlation for nodes and edges increased to 0.91 and 0.97, respectively. Both correlation metrics and visual inspection (Fig. 8, control versus POET) suggested that edge rank was recovered better than node rank. In addition to the rank correlation, we calculated the fraction of the ensemble in which an edge or node was ranked the same as the true parameter set (Fig. 8, bottom). Interestingly, both highly fragile and highly robust network features were recovered for edges (Fig. 8, bottom left) and nodes (Fig. 8, bottom right). For example, the highest and lowest ranked edges were recovered in more than 95% of the ensemble. However, minor network features were not similarly recovered (worst case recovery of only 20%). This suggested that we could expect to recover at least highly fragile or robust network features when using parametrically uncertain ensembles.
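The rank comparisons described above can be illustrated with a short Python sketch. It is not the authors' code: the five-item rankings are invented for illustration, the function names are hypothetical, and the formulas assume tie-free rankings (Kendall tau-a and the classical Spearman formula). It shows the three quantities discussed in the text: the per-member rank correlation, the correlation of the ensemble-mean rank, and the fraction of ensemble members that recover each true rank exactly.

```python
from itertools import combinations
from statistics import mean


def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two tie-free rankings of the same items."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def kendall_tau(rank_a, rank_b):
    """Kendall tau-a: (concordant - discordant pairs) / total pairs, tie-free."""
    pairs = list(combinations(range(len(rank_a)), 2))
    score = 0
    for i, j in pairs:
        product = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        score += 1 if product > 0 else -1
    return score / len(pairs)


def to_ranks(values):
    """Convert scores to ordinal ranks (1 = smallest value); assumes no ties."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0] * len(values)
    for rank, k in enumerate(order, start=1):
        ranks[k] = rank
    return ranks


def exact_recovery_fraction(true_rank, ensemble_ranks):
    """Per item, the fraction of ensemble members whose rank equals the true rank."""
    n_members = len(ensemble_ranks)
    return [
        sum(member[i] == true_rank[i] for member in ensemble_ranks) / n_members
        for i in range(len(true_rank))
    ]


# Hypothetical data: the true ranking and four ensemble members (assumed values).
true_rank = [1, 2, 3, 4, 5]
ensemble_ranks = [
    [1, 2, 3, 4, 5],
    [1, 3, 2, 4, 5],
    [1, 2, 4, 3, 5],
    [2, 1, 3, 4, 5],
]

# Correlation computed per member, then averaged over the ensemble.
mean_rho = mean(spearman_rho(true_rank, m) for m in ensemble_ranks)

# Rank averaged over the ensemble first, then correlated with the true rank.
mean_rank = [mean(m[i] for m in ensemble_ranks) for i in range(len(true_rank))]
rho_of_mean_rank = spearman_rho(true_rank, to_ranks(mean_rank))

recovery = exact_recovery_fraction(true_rank, ensemble_ranks)
print(mean_rho, rho_of_mean_rank, recovery)
```

In this toy case, averaging the rank over the ensemble before the correlation test gives a higher correlation than averaging the per-member correlations, which is the same direction of effect reported in the text.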
4 Discussion
Mathematical modeling of complex gene expression programs is an emerging tool for understanding disease mechanisms. However, identification of large models with many unknown parameters requires that we use diverse training data. Training data taken from many sources can contain conflicts, for example different time scales, or can sometimes even be contradictory. Parameter estimation techniques that balance these conflicts might lead to robust model performance. POET has previously been used to identify molecular models of pain signaling [9]. We modified the original algorithm by incorporating a local parameter refinement step which generated candidate parameter sets with better error properties. Using the modified POET algorithm, we identified an ensemble of
Figure 7. Comparison of the species (node) fragility estimated from the ensemble versus the true parameter set for different values (= 0.1). The fraction of the top-modes in which a species was present was calculated for the true model (left) and the model ensemble (right). Panels: control ensemble (perfect information, left) and POET ensemble (uncertain parameters, right), shown for = 1, = 10 and = 20. Color scale: fragile to robust.
Table 1. Summary of the rank correlation for node and edge ranking between the ensemble and true parameter set

Method          Node           Edge
Scaled
  Kendall       0.51 ± 0.18    0.36 ± 0.11
  Spearman      0.65 ± 0.22    0.51 ± 0.15
Unscaled
  Kendall       0.57 ± 0.15    0.72 ± 0.09
  Spearman      0.73 ± 0.16    0.87 ± 0.08
parameter sets from synthetic data generated using the true parameters. We assumed that immunoblot training data (Western or Northern blots) were available to estimate the model ensemble. We introduced a systematic procedure to incorporate these types of experimental measurements into model identification. We characterized the parameter ensemble generated by POET by exploring the behavioral diversity of models in the ensemble and by examining how the fragility of nodes or edges varied over the ensemble.
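The idea of balancing conflicting training objectives can be illustrated with a small Python sketch of Pareto ranking. This is an illustration under stated assumptions rather than the POET implementation: the rank used here (the number of other candidates that dominate a candidate) is one common definition, the error vectors are invented, and `select_ensemble` with its `max_rank` cutoff is a hypothetical helper.

```python
def dominates(error_a, error_b):
    """True if a is no worse than b in every objective and strictly better in one."""
    no_worse = all(x <= y for x, y in zip(error_a, error_b))
    strictly_better = any(x < y for x, y in zip(error_a, error_b))
    return no_worse and strictly_better


def pareto_rank(errors):
    """Rank of each candidate = number of candidates that dominate it (0 = non-dominated)."""
    return [sum(dominates(other, e) for other in errors) for e in errors]


def select_ensemble(candidates, errors, max_rank=1):
    """Keep candidates whose Pareto rank is at or below max_rank (assumed cutoff)."""
    ranks = pareto_rank(errors)
    return [c for c, r in zip(candidates, ranks) if r <= max_rank]


# Hypothetical candidates: each has one error per training objective
# (for example, one error per simulated blot data set).
candidates = ["set_A", "set_B", "set_C", "set_D", "set_E"]
errors = [(1.0, 4.0), (2.0, 2.0), (4.0, 1.0), (3.0, 3.0), (5.0, 5.0)]

ranks = pareto_rank(errors)
ensemble = select_ensemble(candidates, errors, max_rank=1)
print(ranks, ensemble)
```

Candidates A, B and C trade one objective against the other and none dominates the rest, so all three have rank 0. Keeping near-optimal ranks as well (here rank 1) yields an ensemble rather than a single best-fit parameter set.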
Figure 8. Comparison of the reaction (edge) and species (node) rank estimated from the ensemble versus the true parameter set for = 1. The ordinal rank of the magnitude of the left (right) singular vector corresponding to the largest singular value was computed for the true model (top) and the model ensemble (middle). The fraction of trials in which a species or reaction was ranked exactly correctly was used to calculate the correct classification percentage (bottom). Panels: reactions (left) and species (right); control ensemble (perfect information) and POET ensemble (uncertain parameters). Axes: reaction or species index versus ensemble index (top, middle); percentage correct classification (20–100%) versus sorted reaction or species index (bottom). Color scale: robust to fragile.

The deterministic ensemble exhibited heterogeneous population-like behavior. In this study, we suggested that deterministic ensembles could be used to model heterogeneous populations in situations where stochastic computation was not feasible. There is a rich and growing literature exploring the role of stochastic fluctuations in biological processes such as gene expression [34]. Today, stochastic gene expression models are not computationally feasible except for small networks. However, as stochastic simulation algorithms continue to improve, for example with hybrid [35] or leaping strategies [36], fully stochastic simulations will become tractable. Currently, the simulation of moderate to large problems typically relies on the population-averaged descriptions provided by ODEs. Within an ODE framework, we showed population-like effects using model ensembles. Population heterogeneity using deterministic model families was also recently explored for bacterial growth in batch cultures [37]. Distributions were generated because the model parameters varied over the ensemble, i.e., extrinsic noise led to population heterogeneity. Parameters controlling physical interactions, such as dissociation rates, or the rate of assembly or degradation of macromolecular machinery, such as ribosomes, were widely distributed over the ensemble. However, population heterogeneity can also arise from intrinsic noise [38]. Thus, deterministic ensembles, which do not capture intrinsic thermal fluctuations, provide a coarse-grained or extrinsic-only ability to simulate population diversity. Taken together, these studies motivate a deeper question as to whether a unique parameter set exists in biology. These results suggest that not just variation in the copy number of infrastructure like ribosomes or RNAP, but also distributions in the strength of biophysical interactions, could drive population heterogeneity. More studies are required to explore these questions and to test the notion that ensembles can model population heterogeneity. One concrete next step could be to try to recapitulate experimentally measured distributions, for example, flow cytometry measurements of protein markers. Longer term, coarse-grained deterministic ensembles might be a strategy to explore drug effects across cell populations [1].
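The link between parameter variation over an ensemble and population-like distributions can be sketched in a few lines of Python. Everything in this example is an assumption made for illustration: the one-state model dP/dt = k_syn - k_deg P stands in for the full network, the nominal rate constants and the log-normal spread are invented, and the normal fit is reduced to a sample mean and standard deviation. Each ensemble member is treated as one deterministic "cell", and protein levels are scaled by the largest value at the final time before being sorted into 10 bins.

```python
import math
import random
from statistics import mean, stdev


def protein_level(t, k_syn, k_deg):
    """Analytical solution of dP/dt = k_syn - k_deg * P with P(0) = 0."""
    return (k_syn / k_deg) * (1.0 - math.exp(-k_deg * t))


def sample_ensemble(n_members, spread=0.3, seed=0):
    """Draw parameter sets by log-normal perturbation of assumed nominal values."""
    rng = random.Random(seed)
    nominal = {"k_syn": 10.0, "k_deg": 2.0}  # hypothetical rate constants (1/h)
    return [
        {name: value * math.exp(rng.gauss(0.0, spread)) for name, value in nominal.items()}
        for _ in range(n_members)
    ]


def histogram(values, n_bins=10, lo=0.0, hi=1.0):
    """Count values in n_bins equal-width bins on [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        index = int((v - lo) / width)
        counts[min(max(index, 0), n_bins - 1)] += 1
    return counts


ensemble = sample_ensemble(500)
final_time = 1.0
# Scale by the largest protein level reached at the final time point.
scale = max(protein_level(final_time, **member) for member in ensemble)

summary = {}
for t in (0.1, 0.5, 1.0):
    scaled = [protein_level(t, **member) / scale for member in ensemble]
    summary[t] = {"counts": histogram(scaled), "mean": mean(scaled), "std": stdev(scaled)}

for t, stats in summary.items():
    print(f"t = {t:.2f} h  mean = {stats['mean']:.3f}  std = {stats['std']:.3f}  bins = {stats['counts']}")
```

Because each member is deterministic, all of the spread in the histograms comes from the parameter distribution, which corresponds to the extrinsic-only variability discussed above. Intrinsic fluctuations would require a stochastic simulation.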
Sensitivity-based metrics, calculated from uncertain models, are often used to estimate which components of networks are fragile or robust. Thus, a reasonable question is whether the classification of nodes (species) and edges (interactions) as fragile or robust in uncertain models is correct. We explored this question by comparing nodes or edges estimated to be fragile or robust in the true model with those of the model ensemble. We showed that both locally and globally important network features were conserved across the ensemble. The most important local feature of our canonical network was transcription factor activation. Transcription factor regulation is a well-known integration layer in gene-expression architectures. For example, Bhardwaj et al. [39] showed in a range of networks that midlevel regulators, such as transcription factors, have the highest collaborative propensity. Thus, transcription factor regulation is perhaps one of the bow-ties described by Csete and Doyle [40]. Sensitivity analysis suggested that global infrastructure such as RNAP, nuclear transport and translation initiation were also fragile. The fragility of transcription and translation infrastructure has also been reported by Stelling et al. [16] exploring the robustness properties of Drosophila clock architectures, in cell-cycle architectures [19], and in growth factor signaling in LNCaP sub-clones [33], to cite just a few examples. Interestingly, highly fragile or robust network features were conserved across the ensemble. This suggested, as Bailey hypothesized, that analysis of experimentally constrained model ensembles could generate a reasonable estimate of what was important in a network without detailed parametric knowledge [6]. However, sensitivity analysis does not evaluate network performance following structural or operational perturbations [41]. Thus, an open question (yet to be explored) is whether an ensemble of models captures the fault tolerance or disturbance rejection properties of molecular networks.
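A sensitivity-based fragility ranking of the kind discussed here can be sketched as follows. The example is a minimal illustration and not the procedure used in the study: the two-species steady-state model, its parameter values and all function names are hypothetical, sensitivities are estimated by central finite differences, and the dominant right singular vector is obtained by power iteration on S^T S. Rows of S are species and columns are parameters, so in this sketch the right singular vector ranks parameters.

```python
import math


def steady_state(params):
    """Hypothetical cascade: steady-state mRNA (m) and protein (p) levels."""
    k_tx, k_dm, k_tl, k_dp = params
    m = k_tx / k_dm
    p = k_tl * m / k_dp
    return [m, p]


def sensitivity_matrix(model, params, rel_step=1e-6, scaled=False):
    """Central-difference estimate of S[i][j] = d x_i / d p_j.

    With scaled=True each entry is multiplied by p_j / x_i.
    """
    base = model(params)
    S = [[0.0] * len(params) for _ in base]
    for j, pj in enumerate(params):
        h = rel_step * abs(pj)
        up, down = list(params), list(params)
        up[j], down[j] = pj + h, pj - h
        x_up, x_down = model(up), model(down)
        for i in range(len(base)):
            derivative = (x_up[i] - x_down[i]) / (2.0 * h)
            S[i][j] = derivative * pj / base[i] if scaled else derivative
    return S


def dominant_right_singular_vector(S, iterations=200):
    """Power iteration on S^T S for the largest singular value's right vector."""
    n_rows, n_cols = len(S), len(S[0])
    v = [float(j + 1) for j in range(n_cols)]  # non-uniform start vector
    for _ in range(iterations):
        Sv = [sum(S[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        w = [sum(S[i][j] * Sv[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("start vector is orthogonal to the row space of S")
        v = [x / norm for x in w]
    return v


def ordinal_rank(values):
    """Rank 1 = largest magnitude (most fragile in this sketch)."""
    order = sorted(range(len(values)), key=lambda k: -abs(values[k]))
    ranks = [0] * len(values)
    for rank, k in enumerate(order, start=1):
        ranks[k] = rank
    return ranks


names = ["k_tx", "k_dm", "k_tl", "k_dp"]
params = [2.0, 0.5, 4.0, 1.0]  # assumed values, arbitrary units
S = sensitivity_matrix(steady_state, params)  # unscaled coefficients
v = dominant_right_singular_vector(S)
ranks = ordinal_rank(v)
print(dict(zip(names, ranks)))
```

Repeating the calculation for every parameter set in an ensemble gives one ranking per member, which can then be compared with the ranking from a reference parameter set.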
The project described was supported by Award Number #U54CA143876 from the National Cancer Institute. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Cancer Institute or the National Institutes of Health. We also acknowledge the generous support of the Office of Naval Research #N000140610293 to J.V. for the support of S.S.
The authors have declared no conflict of interest.
5 References
[1] Kitano, H., A robustness based approach to systems-oriented drug design. Nat. Rev. Drug Discov. 2007, 6, 202–210.
[2] Hornberg, J. J., Binder, B., Bruggeman, F. J., Schoeberl, B. et al., Control of mapk signalling: from complexity to what really matters. Oncogene 2005, 24, 5533–5542.
[3] Gadkar, K. G., Varner, J., Doyle, F. J., Model identification of signal transduction networks from data using a state regulator problem. Syst. Biol. (Stevenage) 2005, 2, 17–30.
[4] Gennemark, P., Wedelin, D., Benchmarks for identification of ordinary differential equations from time series data. Bioinformatics 2009, 25, 780–786.
[5] Bandara, S., Schlöder, J., Eils, R., Bock, H. G., Meyer, T., Optimal experimental design for parameter estimation of a cell signaling model. PLoS Comput. Biol. 2009, 5, e1000558.
[6] Bailey, J. E., Complex biology with no parameters. Nat. Biotechnol. 2001, 19, 503–504.
[7] Covert, M., Knight, E., Reed, J., Herrgard, M., Palsson, B., Integrating high-throughput and computational data elucidates bacterial networks. Nature 2004, 429, 92–96.
[8] Shen-Orr, S. S., Milo, R., Mangan, S., Alon, U., Network motifs in the transcriptional regulation network of Escherichia coli. Nat. Genet. 2002, 31, 64–68.
[9] Song, S. O., Varner, J., Modeling and analysis of the molecular basis of pain in sensory neurons. PLoS One 2009, 4, e6758.
[10] Battogtokh, D., Asch, D. K., Case, M. E., Arnold, J., Schuttler, H. B., An ensemble method for identifying regulatory circuits with special reference to the qa gene cluster of Neurospora crassa. Proc. Natl. Acad. Sci. USA 2002, 99, 16904–16909.
[11] Kuepfer, L., Peter, M., Sauer, U., Stelling, J., Ensemble modeling for analysis of cell signaling dynamics. Nat. Biotechnol. 2007, 25, 1001–1006.
[12] Brown, K. S., Sethna, J. P., Statistical mechanical approaches to models with many poorly known parameters. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 2003, 68, 021904.
[13] Palmer, T., Shutts, G., Hagedorn, R., Doblas-Reyes, F. et al., Representing model uncertainty in weather and climate prediction. Annu. Rev. Earth Planetary Sci. 2005, 33, 163–193.
[14] Gutenkunst, R. N., Waterfall, J. J., Casey, F. P., Brown, K. S. et al., Universally sloppy parameter sensitivities in systems biology models. PLoS Comput. Biol. 2007, 3, 1871–1878.
[15] Moles, C. G., Mendes, P., Banga, J. R., Parameter estimation in biochemical pathways: a comparison of global optimization methods. Genome Res. 2003, 13, 2467–2474.
[16] Stelling, J., Gilles, E. D., Doyle, F. J., Robustness properties of circadian clock architectures. Proc. Natl. Acad. Sci. USA 2004, 101, 13210–13215.
[17] Luan, D., Zai, M., Varner, J. D., Computationally derived points of fragility of a human cascade are consistent with current therapeutic strategies. PLoS Comput. Biol. 2007, 3, e142.
[18] Chen, W. W., Schoeberl, B., Jasper, P. J., Niepel, M. et al., Input-output behavior of erbb signaling pathways as revealed by a mass action model trained against dynamic data. Mol. Syst. Biol. 2009, 5, 239.
[19] Nayak, S., Salim, S., Luan, D., Zai, M., Varner, J. D., A test of highly optimized tolerance reveals fragile cell-cycle mechanisms are molecular targets in clinical cancer trials. PLoS One 2008, 3, e2016.
[20] Kholodenko, B. N., Kiyatkin, A., Bruggeman, F. J., Sontag, E. et al., Untangling the wires: a strategy to trace functional interactions in signaling and gene networks. Proc. Natl. Acad. Sci. USA 2002, 99, 12841–12846.
[21] Kremling, A., Fischer, S., Gadkar, K. G., Doyle, F. J. et al., A benchmark for methods in reverse engineering and model discrimination: Problem formulation and solutions. Genome Res. 2004, 14, 1773–1785.
[22] Gutenkunst, R. N., Waterfall, J. J., Casey, F. P., Brown, K. S. et al., Universally sloppy parameter sensitivities in systems biology. PLoS Comput. Biol. 2007, 3, e198.
[23] Casey, F. P., Baird, D., Feng, Q., Gutenkunst, R. N. et al., Optimal experimental design in an EGFR signaling and down-regulation model. IET Syst. Biol. 2007, 1, 190–202.
[24] Dickinson, R. P., Gelinas, R. J., Sensitivity analysis of ordinary differential equation systems – A direct method. J. Comp. Phys. 1976, 21, 123–143.
[25] Fonseca, C., Fleming, P. J., Genetic algorithms for multiobjective optimization: Formulation, discussion and generalization, in: Proceedings of the 5th International Conference on Genetic Algorithms, Morgan Kaufmann, San Mateo 1993, pp. 416–423.
[26] Gadkar, K. G., Doyle, F. J. 3rd, Crowley, T. J., Varner, J. D., Cybernetic model predictive control of a continuous bioreactor with cell recycle. Biotechnol. Prog. 2003, 19, 1487–1497.
[27] Varner, J. D., Large-scale prediction of phenotype: Concept. Biotechnol. Bioeng. 2000, 69, 664–678.
[28] Fussenegger, M., Bailey, J., Varner, J., A mathematical model of caspase function in apoptosis. Nat. Biotechnol. 2000, 18, 768–774.
[29] Schoeberl, B., Eichler-Jonsson, C., Gilles, E. D., Müller, G., Computational modeling of the dynamics of the map kinase cascade activated by surface and internalized egf receptors. Nat. Biotechnol. 2002, 20, 370–375.
[30] Li, H., Ung, C. Y., Ma, X. H., Liu, X. H. et al., Pathway sensitivity analysis for detecting pro-proliferation activities of oncogenes and tumor suppressors of epidermal growth factor receptor-extracellular signal-regulated protein kinase pathway at altered protein levels. Cancer 2009, 115, 4246–4263.
[31] Stites, E. C., Trampont, P. C., Ma, Z., Ravichandran, K. S., Network analysis of oncogenic ras activation in cancer. Science 2007, 318, 463–467.
[32] Helmy, M., Gohda, J., Inoue, J. I., Tomita, M. et al., Predicting novel features of toll-like receptor 3 signaling in macrophages. PLoS One 2009, 4, e4661.
[33] Tasseff, R., Nayak, S., Salim, S., Kaushik, P. et al., Analysis of the molecular networks in androgen dependent and independent prostate cancer revealed fragile and robust subsystems. PLoS One 2010, 5, e8864.
[34] Elowitz, M. B., Levine, A. J., Siggia, E. D., Swain, P. S., Stochastic gene expression in a single cell. Science 2002, 297, 1183–1186.
[35] Iyengar, K. A., Harris, L. A., Clancy, P., Accurate implementation of leaping in space: The spatial partitioned-leaping algorithm. J. Chem. Phys. 2010, 132, 094101.
[36] Cao, Y., Petzold, L. R., Rathinam, M., Gillespie, D. T., The numerical stability of leaping methods for stochastic simulation of chemically reacting systems. J. Chem. Phys. 2004, 121, 12169–12178.
[37] Lee, M. W., Vassiliadis, V. S., Park, J. M., Individual-based and stochastic modeling of cell population dynamics considering substrate dependency. Biotechnol. Bioeng. 2009, 103, 891–899.
[38] Swain, P. S., Elowitz, M. B., Siggia, E. D., Intrinsic and extrinsic contributions to stochasticity in gene expression. Proc. Natl. Acad. Sci. USA 2002, 99, 12795–12800.
[39] Bhardwaj, N., Yan, K. K., Gerstein, M. B., Analysis of diverse regulatory networks in a hierarchical context shows consistent tendencies for collaboration in the middle levels. Proc. Natl. Acad. Sci. USA 2010, 107, 6841–6846.
[40] Csete, M., Doyle, J., Bow ties, metabolism and disease. Trends Biotechnol. 2004, 22, 446–450.
[41] Shoemaker, J. E., Doyle, F. J., Identifying fragilities in biochemical networks: Robust performance analysis of fas signaling-induced apoptosis. Biophys. J. 2008, 95, 2610–2623.