Meta-stochastic Simulation of Biochemical Models for Systems …ico2s.org/data/papers/Sanassy2014.pdf · 2014. 9. 2. · Infobiotics Workbench (1) and have been used for theoretical

Meta-stochastic Simulation of BiochemicalModels for Systems and Synthetic Biology

Daven Sanassy, Pawe l Widera, and Natalio Krasnogor∗

Interdisciplinary Computing and Complex BioSystems (ICOS) research group, School ofComputing Science, Newcastle University, UK

E-mail: [email protected]

Abstract

Stochastic simulation algorithms (SSAs) areused to trace realistic trajectories of biochemi-cal systems at low species concentrations. Asthe complexity of modelled bio-systems in-creases, it is important to select the best per-forming SSA. Numerous improvements to SSAshave been introduced but they each only tend toapply to a certain class of models. This makesit difficult for a systems or synthetic biologistto decide which algorithm to employ when con-fronted with a new model that requires simula-tion. In this paper, we demonstrate that it ispossible to determine which algorithm is bestsuited to simulate a particular model, and thatthis can be predicted a priori to algorithm exe-cution. We present a web based tool ssapredictthat allows scientists to upload a biochemicalmodel and obtain a prediction of the best per-forming SSA. Furthermore, ssapredict gives theuser the option to download our high perfor-mance simulator ngss preconfigured to performthe simulation of the queried biochemical modelwith the predicted fastest algorithm as the sim-ulation engine. The ssapredict web applicationis available at http://ssapredict.ico2s.org.It is free software and its source code is dis-tributed under the terms of the GNU AfferoGeneral Public Licence.

∗To whom correspondence should be addressed

Keywords

stochastic simulation, web application, modelproperties, algorithm performance, systems bi-ology, synthetic biology

1 Introduction

Simulation of mathematical and computationalmodels of reaction networks is an invaluabletool for scientists aiming to understand the dy-namic behaviour of complex biochemical sys-tems. In the fields of Systems and SyntheticBiology, repeated rounds of model-driven hy-pothesis generation, validated or refuted by wetlab experimentation, lead to refined quantita-tive and predictive models, and ultimately bet-ter bio-designs. In silico experimentation withthese models is cheaper, faster, and more re-producible than its real world counterpart (i.e.wet lab). In order to maximise the potentialof in silico experimentation, it is important toensure that (1) simulation algorithms can scaleefficiently with the class and size of problem and(2) optimised reference code is readily available,well documented and easy to reuse.

Stochastic Simulation Algorithms (SSAs) arethe primary means of simulating naturally dis-crete cellular systems affected by stochasticnoise. They generate multiple realistic trajec-tories of molecular quantities over time given aninitial state (e.g. species counts), a set of reac-tions (with associated stochastic rate constants)and stopping criteria. SSAs are an important

1

tool in systems biology software suites such asInfobiotics Workbench (1 ) and have been usedfor theoretical work (2–4 ).Exact SSAs, introduced by Gillespie (5 ), gen-

erate trajectories that are demonstrably equiva-lent to the Chemical Master Equation and mustsimulate each and every reaction in the sys-tem. SSA algorithmic complexity being O(M)(where M is the set of reactions), and concomi-tant generation of pseudo-random numbers toemulate stochasticity for each reaction event,makes simulating ever larger reaction networksincreasingly intractable despite continued ad-vances in computational power. This has ledto many studies addressing point (1) above,namely how to improve the scalability of theSSA. For example, approximate SSAs have beenintroduced that conditionally apply multiple re-actions at each step (6 ) as well as optimised ex-act variants that use improved data structuresto accelerate computations (7–12 ).

However, the availability of multiple variantsof the SSA comes at the cost of a lack ofclarity as to which one to use for a particu-lar biochemical model. More specifically, manypublished SSAs are tested with an insufficientnumber of models, mostly tailored to proper-ties of the newly introduced algorithm. Conse-quently, it is hard to extrapolate or compareperformance between algorithms as each willoften be benchmarked against competitors’ al-gorithms using only biochemical models thatperform favourably with the newly introducedalgorithm. When considering these variants,a scientist may wish to know which SSA willbe the fastest for simulating their particularmodel(s).

Currently, it is common to execute reactionnetworks with a single SSA implementation, forexample Next Reaction Method (NRM) (7 ).Due to the lack of model and algorithmic anal-ysis available, scientists are unaware that a dif-ferent algorithm may perform orders of magni-tude faster than their “default” algorithm. Tocompound this issue, whilst several stochasticsimulators are freely available (13–16 ), theirselection of algorithms is limited. Such a sit-uation would result in a scientist limiting thecomplexity of their model to obtain a tractable

simulation time. Therefore, it is preferable thatscientists are provided with tools that matchthe best algorithm for their model and allowfor better performing simulations. If simula-tion time can remain tractable in spite of in-creasing model complexity, the development offiner grained biological knowledge is possible.

The cost of simulating a system with oneSSA variant or another depends on the prop-erties of the underlying network of the modeland the states reached during the simulation.Each biological model exhibits characteristicsthat may be suited to a particular simulation al-gorithm. Thus effective discrimination betweenSSAs should be based on matching model char-acteristics to algorithmic performance. Net-work analysis is an important aspect of Systemsand Synthetic Biology, but Roy (17 ) notes thatusually only a few “handpicked” network prop-erties are considered to determine the influenceof the network structure. As an example, theStochKit2 (13 ) simulation software selects algo-rithm based on model “properties” but actuallyonly considers the number of reactions in themodel and is therefore very limited in discrimi-nation. A prediction based on a comprehensiveevaluation of network properties is required todiscriminate between the significant number ofSSA variants that exist as a function of theirperformance with specific biochemical models.

To address these problems, we have createda complete meta-stochastic simulation solutionimplemented as a user-friendly web applicationcalled ssapredict. Our tool allows a scientist toupload their model and automatically predictsthe best algorithm to use. Once a predictionhas been made, the user can download ngss, ourportable high performance simulator, precon-figured to run their model with the predictedfastest algorithm. Our web application tacklesthree issues: (a) it simplifies the decision mak-ing process for the scientist, (b) it produces anacceptably accurate prediction of SSA-to-modelmatch, and (c) it standardises SSA source codecontributing to reproducibility of in silico ex-periments.

The ngss simulator and ssapredict web appli-cation are under continuous development andwe are open to integrating new algorithmic

2

methods as well as to extend the set of mod-els we use for training. In the long term, wewould like to lead an effort, in collaborationwith other scientists in the field, to construct anopen benchmark composed of both algorithmsand models. This will not only result in im-proved prediction accuracy but also increasedsimulation speed for a wide range of models.

2 Results and Discussion

2.1 Stochastic Simulation

Our experimental analysis included eight exactSSA formulations that we had implemented.These are Direct Method (DM) (18 ) and FirstReaction Method (FRM) (5 ), Next ReactionMethod (NRM) (7 ), Optimised Direct Method(ODM) (10 ), Sorting Direct Method (SDM)(8 ), Logarithmic Direct Method (LDM) (9 ),Partial Propensities Direct Method (PDM)(12 ) and Composition Rejection (CR) (11 ). Anapproximate algorithm, Tau Leaping (TL) (6 )was also implemented and analysed. A briefdescription of these algorithms is provided inSection 3.1.

In our experiments, we used models takenfrom the BioModels database (19 ) and theperformance (reactions executed per second ofCPU time) of each of the algorithms was mea-sured for every model. The distribution ofbest performing algorithms across all models isshown in Table 1.

Table 1: Distribution of best performing algo-rithms for all 380 models from the BioModelsdataset. The first row shows how many times aparticular algorithm was the fastest for a modelin the dataset. The same value is also displayedas a percentage of the total (second row).

CR DM NRM FRM LDM SDM TL ODM PDM

0 1 1 2 9 43 75 87 1620.00% 0.26% 0.26% 0.53% 2.37% 11.32% 19.74% 22.9% 42.63%

To put this result into perspective, we com-pared the performance of each algorithm to thebest algorithm in the group for each of the mod-els. Figure 1 shows that three frequent winners

PDM, ODM and SDM, have very similar per-formance profiles (they are all improved vari-ants of DM) with the notable exception of afew models for which PDM performs badly. TL,another algorithm in the top 4, performs excep-tionally well for about 20% of the models, but isoutperformed by other algorithms for the restof the dataset. For the worst performers, CRand FRM, there is a clearly visible gap thatseparates their performance from the best.

In Table 2, we show how consistent the top 4algorithms are. For each algorithm in the top4, we measure how many times it was rankedbelow the top 4. ODM was the algorithm thatmost consistently remained in the top 4 (378out of 380 models), but was closely followed bySDM (368 out of 380 models). On the otherhand, PDM and TL were ranked below the top4 many times, including being ranked as theworst algorithm for some models. TL in partic-ular remained in the top 4 for only 80 models.

Table 2: Number of times one of the 4 bestperforming algorithms (See Table 1) was rankedbelow the top 4 for any of the 380 models fromthe BioModels dataset. Each row shows thetotal number of models for which an algorithmwas ranked under a given threshold. The lowestpossible rank was 9.

rank ODM SDM PDM TL

> 4 2 12 42 300> 5 0 1 22 287> 6 0 0 17 205> 7 0 0 7 51= 9 0 0 6 8

2.2 Performance Prediction

We define meta-stochastic simulation as an au-tomatic methodology for selecting the best al-gorithm for a given model. Different SSAs havevarying performance profiles dependent on aparticular model’s properties. Stochastic mod-els can be represented as graphs of dependen-cies between reactions or species. Using thesegraphs we are able to build a topological profile

3

104

105

106

107

108 DM

105

106

107

108 LDM

104

105

106

107

108 FRM

105

106

107

108 ODM

105

106

107

108 SDM

105

106

107

108 NRM

105

106

107

108 CR

105

106

107

108 PDM

104

105

106

107

108 TL

models

reac

tions

per

sec

ond

Figure 1: Comparison of the performance of each algorithm against the best performance for eachmodel. The red data points are the algorithm performance values for each model. The grey datapoints show the best performance for each model. The models on the horizontal axis are orderedby best performance.

of each model. We use the topological prop-erties to learn how to predict the fastest SSA(from a pool of nine algorithms) for a givenmodel.

An example of what we include in the modeltopological profile is the number of nodes orthe number of connections in the reaction orspecies dependency graph. Other propertiescontain information about the connections ofindividual nodes (node degree), the entire graph(graph density) or the existence of mutual de-pendencies (reciprocity). All topological prop-erties used in our analysis are discussed in Sec-tion 3.3 and listed in Table 5.

To provide the reader with an intuition ofwhich graph topological properties might beuseful in algorithm performance estimation, we

performed a small case study. For two top 4 al-gorithms, PDM and TL, we selected 5 modelsfor which their performance is the worst and 5models for which it is the best. We then mea-sured a median value of each topological prop-erty for the two sets (worst and best) of models.Finally, we selected 10 properties for which theabsolute difference between median(worst) andmedian(best) is the highest. We found that theworst models for PDM had high reaction graphdensity and total degree, high species graph mindegree and medium species reciprocity. Inter-estingly for TL, the worst models had low reac-tion graph density and low species graph mindegree, which is the opposite of the worst mod-els for PDM.

In Figure 2, we show the distribution of nor-

4

-2 -1 0 1 2 4 8 16

density (reaction)

reciprocity (reaction)

mean out degree (species)

mean in degree (species)

min total degree (species)

mean total degree (species)

density (species)

reciprocity (species)

Figure 2: Distribution of values of selected topological properties. For all properties, the meanvalue is equal to 0 and the standard deviation is equal to 1. The whiskers represent the 1.5 IQR(interquartile range) past the closest quartile (left/right edge of the box). Observations outside thisrange are marked as outliers. Horizontal axis is on a logarithmic scale.

malised values of properties selected for both al-gorithms. The species graph min degree standsout as having a dominating value (here 1) withmany outliers. For other properties the range ofvalues is much broader but outliers remain fre-quent, indicating possible hard to predict edgecases.

2.2.1 Performance Estimation

Linear regression (ordinary least squaresmethod) was used to fit a linear model esti-mating the performance. Regression was per-formed on a per algorithm basis with resultsshown in Table 3. The coefficient of determina-tion (R2) is close to 1 (perfect fit) for 7 out of9 algorithms. PDM and TL algorithms are theexception with R2 of 0.71 and 0.6 respectively,which indicates that their performance is moredifficult to estimate with a linear function.

Table 3: The linear regression fit of algorithmperformance on all models measured with thecoefficient of determination (R2).

TL PDM NRM CR FRM SDM ODM DM LDM

0.600 0.712 0.971 0.973 0.979 0.985 0.986 0.989 0.991

2.2.2 Prediction of the Fastest SSA

Employing 10-fold cross-validation, we aimedto determine the quality of algorithmic perfor-mance predictors that could be generated fromthe BioModels dataset using model topologicalproperties.

Some of the analysed model properties areimpractical for use in a prediction tool. Thisis because with large reaction networks thoseproperties can require even more computationaltime to compute than the simulation time ofthe model. A topological features set thatwas fast to compute (as these would be thestrongest candidates to be used in a tool thatcould predict algorithm performance of a par-ticular model a priori to simulation) was there-fore identified (see Table 5 in Section 3.3).

We used two variants of a random predictorand four predictors based on well known meth-ods (linear regression, logistic regression, sup-port vector classifier and a nearest neighbourclassifier). For each predictor we performeda 10-fold cross-validation experiment and mea-sured the mean accuracy and standard devia-tion of the predictions. The accuracy of eachpredictor is tested with four tolerance thresh-olds (ε = [0%, 1%, 5%, 10%]). In this scenario,

5

0 0.2 0.4 0.6 0.8 1

random (blind)

random (informed)

linear regression

logistic regression

k-NN vote (k=5)

linear SVC

ε=0%

0 0.2 0.4 0.6 0.8 1

ε=1%

0 0.2 0.4 0.6 0.8 1

ε=5%

0 0.2 0.4 0.6 0.8 1

ε=10%

prediction accuracy

Figure 3: Results of the 10-fold cross-validation classification experiment using a reduced set ofproperties (computational complexity ≤ O(V + E)) with increasing tolerance threshold ε.

0 0.2 0.4 0.6 0.8 1

random (blind)

random (informed)

linear regression

k-NN vote (k=5)

logistic regression

linear SVC

ε=0%

0 0.2 0.4 0.6 0.8 1

ε=1%

0 0.2 0.4 0.6 0.8 1

ε=5%

0 0.2 0.4 0.6 0.8 1

ε=10%

prediction accuracy

Figure 4: Results of the 10-fold cross-validation classification experiment using the total set ofavailable properties with increasing tolerance threshold ε.

if the performance of a best algorithm selectedby a predictor for a given model lies within thetolerance threshold of the performance of theactual best algorithm, the prediction is scoredas correct because the performance of the pre-dicted best SSA only differed from the actualbest SSA performance by an acceptable amount(e.g. 10%).

For the first experiment, we decided to as-sess the performance of predictors when giventhe full set of 32 computationally inexpensive(fast) properties. Results displayed in Figure 3demonstrate that all classifiers perform betterthan a random selection. k-Nearest Neighbourand linear SVC had the same prediction accu-racy (63% with ε = 0%), which is a signifi-cant improvement over a blind random selec-tion (12% with ε = 0%). However, from eval-uating the prediction quality at differing toler-ance thresholds, it appears that linear SVC wasthe strongest predictor (85% with ε = 10%).For comparison, a blind random selection atε = 10% results in half the accuracy at 42%.

For a comparative analysis, we subsequently

tested the prediction accuracy with the entireset of 100 slow and fast topological features,results are shown in Figure 4. Prediction ac-curacy overall was similar to the previous (fastproperties) experiment but demonstrated someslight improvement. Linear SVC was still thestrongest predictor and had improved from 63%to 65% (85% to 86% with ε = 10%). As ex-pected, higher quality predictions occur whenmore properties are introduced in the cross-validation experiments (effectively testing onthe training set). However, it is important tonote that using just the computationally inex-pensive properties produced results of similarquality to the full set of properties. Therefore,the strong performance of the predictors withfast properties means that we can create a toolthat produces an accurate prediction quickly.

2.2.3 Prediction Inaccuracies

Often the failures of a tool are as important asits successes. To demonstrate the consequencesof mispredictions of ssapredict, we ran another

6

10-fold cross-validation experiment. This timewe measured the related relative performanceloss for each inaccurate prediction. Figure 5shows the distribution of these values for thebest predictor in the group (linear SVC) com-pared to the random predictors. The medianvalue for the best predictor was only 5.4% andthe interquartile range was lower than 20%.This means that most of the mispredictionscorrespond to less than a 20% performanceloss. However, we also found several outliersfor which the performance drop was large (upto 86%).

Random (blind) Random (informed) LinearSVC0.0

0.2

0.4

0.6

0.8

1.0

Figure 5: Distribution of relative performanceloss caused by mispredictions for the best (lin-ear SVC) and worst predictors (two variants ofrandom choice). The whiskers represent the 1.5IQR (interquartile range) past the closest quar-tile (top/bottom edge of the box). Observationsoutside this range are marked as outliers.

2.2.4 Large Scale Model Experiments

In our final experiment, we investigatedwhether the predictors trained on small modelswould be accurate for large models. Figure 6shows the algorithmic performance for the Es-cherichia coli quorum sensing model instan-tiated on 1 × 1, 10 × 10 and 100 × 100 two-dimensional lattices (see Table 4 in Section 3.2for the model network sizes).

The linear SVC predictor selected TL for thesingle cell lattice and PDM for the 10× 10 and100× 100 lattices. Whilst the TL prediction is

accurate, for the large lattices CR is the fastestalgorithm. The PDM algorithm is 13% slowerthan CR for the 10× 10 lattice and 82% slowerfor 100× 100.

2.3 Web Application

Our meta-stochastic simulation solution, ss-apredict, is implemented as a web application.Ssapredict is designed to be easy to use and re-ceiving a prediction only requires the user topress an upload button and select the biochem-ical model of interest (in SBML (20 ) format)that resides on their computer. Once the pre-diction has been made, the user can downloadour portable high performance simulator, ngss(next generation stochastic simulator) which ispreconfigured to run the model with the pre-dicted fastest algorithm.

Figure 7 shows a screen shot of the results af-ter ssapredict model analysis. Model propertiesare displayed along with the predicted fastestalgorithm. The simulate model button allowsthe user to download the preconfigured ngss bi-nary for their platform.

2.4 Discussion

Our proof-of-concept SSA performance predic-tor has demonstrated that even using a limitedset of topological properties it is possible to picka fast algorithm for a given model. The pre-diction accuracy was found to be much higherthan a random choice. In addition, this workhas highlighted that no single SSA is superiorto all others for all models. For example, formodels with high min degree in species graphTL outperforms PDM, but when that degree islow, PDM outperforms TL.

Looking back at the distribution of fastestalgorithms (Table 1 and Figure 1), some re-sults are to be expected. For example, FRMbeing a weak performer, and a more sophisti-cated algorithm such as PDM being a strongperformer. However, we see some unexpectedresults too, such as one of the most sophisti-cated algorithms, CR, never winning and NRMbeing a less frequent winner than FRM. Table 2demonstrates that whilst an algorithm may be

7

FRM NRM LDM SDM PDM DM CR ODM TL102

103

104

105

106re

actio

ns p

er s

econ

d1x1 lattice

NRM FRM SDM LDM DM ODM TL PDM CR

10x10 lattice

NRM FRM LDM ODM SDM TL DM PDM CR

100x100 lattice

Figure 6: Average performance for the Escherichia coli quorum sensing circuit. The model wasinstantiated on a square lattice with 1, 100 and 10 000 points. The algorithms on the horizontalaxis are sorted from the slowest to the fastest and uniquely coloured. The vertical axis showsalgorithmic performance in reactions per second and is on a logarithmic scale. The error bar showsthe standard deviation across 10 runs.

Figure 7: Screenshot displaying the results of analysis performed on a model by ssapredict. Modeltopological properties can be inspected and a prediction of fastest algorithm is displayed. There isan option to simulate the model.

8

considered the fastest for a significant numberof models, there are also models for which itmay perform poorly. TL was the fastest algo-rithm for 75 out of 380 models, but was alsoranked below the top 4 algorithms for 300 mod-els. This result highlights the importance of se-lecting an algorithm on a per model basis ratherthan using a single algorithm for all models.

The large scale model prediction experimentresults (see Figure 6) show that CR is thefastest algorithm for these larger models. Dueto lack of models of this scale in our trainingset, ssapredict is unable to predict that result.However, it does select PDM, the second fastestalgorithm in the group. This is one of the lim-itations in predictor accuracy — it is highlydependent on the size and variability of mod-els used for training. In order to train betterquality predictors, a larger number of models isneeded, preferably containing special instancesfor which each of the algorithms was designedto perform best with.

While there are many biochemical modelsavailable in the literature and online, there arefew that are specifically intended for stochasticsimulation. Most of the models we used fromthe BioModels dataset were deterministic mod-els and to simulate them stochastically we fixedrate constants. Whilst there are many com-plete deterministic models available from on-line databases, few complete curated stochasticmodels are freely available. Therefore, a futureanalysis featuring complete stochastic modelswill have to be preceded by the creation of adataset with a large number of curated stochas-tic models.

Initial species amounts were set to a constant100 to complement the static topological anal-ysis. A major limitation of our current anal-ysis is that it will not be able to account fortransient changes that occur within a simulatedstochastic model. This is important becausehigh copy numbers of chemical species can cre-ate intractable simulation conditions for exactalgorithms.

With a larger and more diverse training set,we should be able to minimise the cost of mis-predictions. Although currently it is often small(median value of the relative performance drop

was 5.4%), there are edge cases where the pre-dictor is choosing an algorithm with 15% of theperformance of the best algorithm in the group.

We believe that future work in this area mustinclude a comprehensive benchmark of algo-rithms with a large set of models that exhibitvaried characteristics. This work can be consid-ered a precursor to producing a meta-stochasticsimulator that can dynamically switch algo-rithms when transient changes that affect per-formance occur in a biochemical system. AsSynthetic Biology aims to compose large sys-tems from smaller biological components, it isexpected that the size of models will increaserapidly in the future. These large systems willbe intractable for simulation without furtheralgorithmic innovation. In our opinion, meta-stochastic simulation is one of the ways to keepup with the growing requirements of SyntheticBiology.

We invite scientists to contribute biologicalmodels and stochastic simulation algorithmsthat we could use to improve ssapredict. Sub-mission instructions are available at the ssapre-dict website: http://ssapredict.ico2s.org.We intend to periodically re-train the predictorwith a growing set of models and to add morealgorithms to the ngss simulator.

3 Methods

3.1 Simulation Algorithms

In this section we will give a brief review of thealgorithms available for prediction in ssapredictand consequent simulation in ngss. FRM wasthe first SSA introduced by Gillespie in 1976.FRM is a simple algorithm that requires a ran-dom number generated for each reaction in thesystem every algorithmic iteration (5 ). DM wasintroduced to replace FRM and only requiredtwo random numbers generated per algorithmiciteration. DM is the de facto standard SSA for-mulation and is still commonly used for simu-lation (18 ).

NRM (based on FRM) used a large numberof optimisations including the use of a reactiondependency graph to minimise propensity recal-

9

culation and a priority queue to find the nextreaction to fire. Furthermore, NRM only re-quired one random number generated per iter-ation (7 ). ODM builds on DM with a reactiondependency graph and reduces reaction searchdepth by performing a pre-run and sorting reac-tions by their likelihood of firing (10 ). SDM issimilar to ODM but dynamically sorts reactionsduring algorithm execution. This means thatSDM performs well with simulations that ex-perience significant transient fluctuations (8 ).LDM is similar to ODM and SDM but uses abinary search to reduce reaction search depth(9 ).

PDM builds on DM but uses a species depen-dency graph and factors out reaction propensi-ties by species. PDM claims to be superior forhighly coupled reaction networks (12 ). CR isanother DM variant which uses rejection sam-pling to select the next reaction and thus claimsconstant time scaling. Therefore, in theory, CRshould outperform other exact algorithms whenpresented with large reaction networks. CRuses composition (reaction propensity group-ing) to reduce the number of rejections per al-gorithmic iteration (11 ).

In addition to the exact SSAs, there are manyapproximate and hybrid algorithms (6 , 21–23 ).At this stage, we focused on exact methods andhave only included one approximate method(explicit TL). We selected TL, as it is the firsthybrid method introduced by Gillespie and canbe considered as the de facto standard hybridSSA formulation. TL is an approximate algo-rithm that applies many reactions per algorith-mic step (rather than one per step as in theexact formulations). This is dependent on thepropensities of the system being relatively sta-ble, otherwise the algorithm reverts to DM (6 ).We used an updated version of TL that incor-porates optimised step size selection (24 ).

As more algorithms are added to the ngsssimulator in the future, we will also retrain ourpredictor to offer the best simulation option forthe user’s model.

3.2 Biochemical Models

We used 380 models retrieved from the BioMod-els database (19 ) in SBML (20 ) format. Thesemodels describe various different biological pro-cesses and are curated from peer reviewed liter-ature. This gives us access to a large number ofmodels used by scientists with variation in re-action network sizes. In Figure 8, a histogramis shown displaying the spread of model sizewithin the dataset, quantified by the number ofreactions in the model. It can be seen from thehistogram that the vast majority of BioModelshave a reaction network size of 50 reactions orless, but there is also a small number of largermodels (up to 1800 reactions).

0 200 400 600 800 1000 1200 1400 1600 1800size of reaction network

1

2

4

8

16

32

64

128

256

512

frequ

ency

Figure 8: Distribution of the size of model re-action network across the BioModels dataset.The bin size is 25. The vertical axis is on alogarithmic scale.

Our proof-of-concept algorithmic perfor-mance predictors only consider static struc-tural model topological properties to test if wecould make accurate predictions with that data.Therefore, alterations were made to models tofocus on the model structure analysis. In orderto simplify models and remove extra variablesthat cannot be captured by the dependencygraph analysis, the amounts of all species wereset to 100 and remain constant throughoutsimulation. The BioModels usually contain de-terministic rate functions instead of stochasticrate constants, and thus, a decision was madeto set the stochastic rate constants of all re-actions to 1.0. This is technically justified as

10

it turns the deterministic models into stochas-tic ones, and because the property analysis isperformed upon the unweighted dependencygraphs derived from these models (i.e. depen-dency graphs independent of reaction rates).

As the BioModels are limited in size, we wereinterested to test the accuracy of the predictorsalgorithmic performance when presented witha much larger model. We used the Escherichiacoli AI-2 quorum signal circuit from Li et al.(25 ). Whilst this model only contains 25 reac-tions and 21 species, we scaled it up by instan-tiating it on each point of a hypothetical two-dimensional lattice. We generated 10× 10 and100 × 100 models, adding transport reactionsbetween adjacent lattice points to incorporatequorum signal molecule transfer (see Table 4).

Table 4: Reaction and species graph sizes fordifferent versions of the stochastic Escherichiacoli AI-2 quorum signal circuit model from Liet al. The model size is the number of pointson a 2D lattice the model was instantiated on.

model size reactions species

1× 1 25 2110× 10 2860 2100

100× 100 289600 210000

3.3 Topological Analysis

Ssapredict performs a model property analysisand uses the results to predict the fastest SSAfor that model. After parsing the model, a reac-tion dependency graph and species dependencygraph are generated. In a reaction dependencygraph, each vertex corresponds to a unique re-action, hence the number of vertices in a reac-tion dependency graph is equal to the numberof reactions in the model. A directed edge isplaced from vertex Vi to vertex Vj if the firingof reaction Ri changes the propensity of reac-tion Rj. In a species dependency graph, eachvertex corresponds to a unique species, and sothe number of vertices in a species dependencygraph is equal to the number of species in themodel. A directed edge is drawn from vertexVi to vertex Vj if for any reaction species Si is

a reactant and species Sj is a product. Anyduplicate edges are removed from each graph.

54 properties were generated for each of thereaction and species dependency graphs of amodel. An additional property, reaction stiff-ness ratio, was also calculated bringing the to-tal number of properties to 109. For some mod-els certain properties were not possible to com-pute, for example, when a division by zero oc-curred. We replaced all missing values withzeros. Nine model properties were found tobe constant for all models and therefore wouldbe of no use as performance indicators andthus were removed from the data set, resultingin 100 model properties available for analysis.Ssapredict analyses a restricted set of 32 fast(computational complexity ≤ O(V +E)) prop-erties ensuring that the graph analysis com-pletes in a timely manner (see Table 5).

3.4 Algorithm Performance

The performance metric used to measure algo-rithmic computational speed was reactions persecond (rps) of CPU time. Rps allows one tocompare algorithm performance in a mannerthat ignores simulation run time. This meansthat algorithm performance can be comparedbetween two models that take vastly differingamounts of time to execute. Using rps alsoimproves comparative accuracy, if we wish torun an algorithm for x seconds, and measurehow many reactions are executed, the amountof time elapsed would almost certainly not beexactly x seconds, but a number extremely closeto x seconds. If this was compared to anotherrun of x seconds, neither run would be exactlythe same amount of time, and thus a compari-son in this manner would lose accuracy. Divid-ing the number of reactions executed by the ex-act simulation time to get a result in rps leavesus with a value that is appropriate for compar-ison. All runs were executed on a single core ofan otherwise idle benchmarking machine thatpossessed an Intel i7 2600K CPU with 16GBRAM and was running Ubuntu 11.04. The largeamount of RAM available and size of models in-volved meant that all simulations could be runin memory and thus avoid performance deteri-

11

Table 5: Summary of model topological properties analysed. Complexity relates to worst case timecomplexity for the computation of the propensity, where V is vertices, E is edges, and d is theaverage node degree. Properties marked with † have constant or linear scaling and are used for therestricted set of fast properties.

computational complexity graph property

O(1)† number of edges, number of vertices, density of graph

O(V )† min/mean/max outgoing edges, min/mean/max incomingedges, min/mean/max all edges

O(V + E)† weakly connected components, articulation points, bi-connected components, reciprocity of directed graph

O(V E) average geodesic length, longest geodesic length,min/mean/max outgoing closeness, min/mean/max incomingcloseness, min/mean/max closeness in undirected graph,min/mean/max betweenness, min/mean/max betweennessin undirected graph, min/mean/max edge betweenness,min/mean/max edge betweenness undirected graph

O(V (V + E)) min/mean/max shortest path in undirected graph,min/mean/max shortest incoming path, min/mean/maxshortest outgoing path

O((V + E)2) girth of undirected graph

O(V d2) transitivity of graph vertices, average local transitivity

O(V 4) min edge connectivity

O(V 5) min vertex connectivity

orations caused by memory paging.Each of the BioModels were executed for 10

seconds of CPU time for all 9 algorithms. Eachalgorithm was run 10 times on each model,hence a total of 90 rps values for each model.Because the BioModels were void of parameterssuch as simulation time and species amounts,it was decided that 10 seconds of CPU timefor each model/algorithm combination wouldbe enough to determine an accurate result foralgorithm performance.

3.5 Prediction Methods

We used two variants of a random predictor, aclassifier based on a set of linear regression es-timators trained separately for each algorithm,logistic regression (26 ), support vector classifierwith linear kernel (27 ) and a nearest neighbourclassifier (28 ) using a vote of 5 nearest mod-

els. For each predictor we performed a 10-foldcross-validation experiment and measured themean accuracy and standard deviation.

The two random predictors used differentamounts of information. First was a blind ran-dom predictor which assumed that each algo-rithm is equally probable to perform best. Theprobability of such a blind guess to be correctis equal to 1

9(see Equation 1).

p =∑i

wipi =1

9

∑i

pi =1

9(1)

The second random predictor assumed thateach algorithm is as probable as it was observedin the training set. Then, roulette wheel selec-tion was used to make a prediction. In the idealcase, the informed random guess would assigna weight equal to the true probability of win-ning for each algorithm (see Equation 2). Given

12

NGSS

MODEL

CONFIGMODEL

PREDICTOR

TOPOLOGICALANALYSIS

properties

SSAPREDICT ZIP ARCHIVE

input output

Figure 9: Structural diagram of ssapredict architecture and work-flow.

the distribution of winners in our benchmark weexpect the probability of a correct guess to bethree times greater than in the case of the blindguess.

p =∑i

wipi =∑i

p2i ' 0.27 (2)

3.6 Application Architecture

The ssapredict web application is built withpython using the web2py framework (29 ). Fig-ure 9 visualises the structure and work-flow ofssapredict. Models are uploaded to ssapredict inSBML (20 ) format and parsed using libSBML(30 ). After producing dependency graphs fromthe model data, a model topological analysis isperformed. The model property analysis soft-ware is written in C++ with performance asa design priority and calls functions from theigraph library (31 ) to compute the graph prop-erties.

The application then performs fast propertyanalysis of the supplied model using the setof 32 computationally inexpensive properties.The result of this model analysis is delivered toa linear SVC classifier which makes a predictionof the fastest algorithm to simulate the model.Linear SVC was the best performing classifica-tion method identified by the cross-validationexperiments and is trained with the algorithmicperformance data and fast set of properties forall models. The linear SVC classifier is writtenin Python using scikit-learn (32 ).

After a prediction has been made, ssapredictallows users to simulate their models with the

predicted fastest algorithm. The simulator ngssis written in C++ with an emphasis on per-formance and released under the terms of theGNU General Public Licence. Ngss is portableand will operate on GNU/Linux, Mac OS Xand Windows platforms. Ngss is distributedsuch that no extra dependencies need to be in-stalled by the user. Ngss supports OpenMP,allowing parallel runs (stochastic trajectories)on multi-core computers. Ssapredict simulationfunctionality is provided to the user through azip archive that contains the uploaded modeland simulator executable for their platform. Anauto-generated simulation parameter file thatconfigures ngss is also contained in the archive.By simply extracting the files and entering aone line command (instructions provided inarchive), the simulator will generate stochastictrajectories as time-series in CSV format.

Acknowledgement The authors thankJamie Twycross and Jonathan Blakes for theircollaboration on models processing. This workwas supported by the Engineering and Physi-cal Sciences Research Council [EP/I031642/1,EP/J004111/1, EP/L001489/1].

4 Author Contributions

D. Sanassy developed the ngss stochastic sim-ulator. D. Sanassy and P. Widera developedthe ssapredict web application. P. Widera, D.Sanassy and N. Krasnogor analysed the data.N. Krasnogor conceived the meta-stochasticsimulator concept. D. Sanassy, P. Widera andN. Krasnogor wrote the manuscript.

13

References

1. Blakes, J., Twycross, J., Romero-Campero, F. J., and Krasnogor, N. (2011)The Infobiotics Workbench: an integratedin silico modelling platform for Systemsand Synthetic Biology. Bioinformatics 27,3323–3324.

2. Twycross, J., Band, L., Bennett, M.,King, J., and Krasnogor, N. (2010) Stochas-tic and deterministic multiscale models forsystems biology: an auxin-transport casestudy. BMC Systems Biology 4, 34.

3. Cao, H., Romero-Campero, F., Heeb, S.,Camara, M., and Krasnogor, N. (2010)Evolving cell models for systems and syn-thetic biology. Systems and Synthetic Biol-ogy 4, 55–84.

4. Romero-Campero, F. J., Twycross, J.,Camara, M., Bennett, M., Gheorghe, M.,and Krasnogor, N. (2009) Modular Assem-bly of Cell Systems Biology Models UsingP Systems. International Journal of Foun-dations of Computer Science 20, 427–442.

5. Gillespie, D. T. (1976) A general methodfor numerically simulating the stochastictime evolution of coupled chemical reac-tions. Journal of Computational Physics22, 403–434.

6. Gillespie, D. T. (2001) Approximate accel-erated stochastic simulation of chemicallyreacting systems. The Journal of ChemicalPhysics 115, 1716–1733.

7. Gibson, M. A., and Bruck, J. (2000) Effi-cient Exact Stochastic Simulation of Chem-ical Systems with Many Species and ManyChannels. The Journal of Physical Chem-istry A 104, 1876–1889.

8. McCollum, J. M., Peterson, G. D.,Cox, C. D., Simpson, M. L., and Sam-atova, N. F. (2006) The sorting directmethod for stochastic simulation of bio-chemical systems with varying reaction ex-ecution behavior. Computational Biologyand Chemistry 30, 39–49.

9. Li, H., and Petzold, L. Logarithmic directmethod for discrete stochastic simulation ofchemically reacting systems.; 2006.

10. Cao, Y., Li, H., and Petzold, L. (2004) Ef-ficient formulation of the stochastic sim-ulation algorithm for chemically reactingsystems. The Journal of Chemical Physics121, 4059–4067.

11. Slepoy, A., Thompson, A. P., and Plimp-ton, S. J. (2008) A constant-time kineticMonte Carlo algorithm for simulation oflarge biochemical reaction networks. TheJournal of Chemical Physics 128, 205101.

12. Ramaswamy, R., Gonzalez-Segredo, N.,and Sbalzarini, I. F. (2009) A new classof highly efficient exact stochastic simula-tion algorithms for chemical reaction net-works. The Journal of Chemical Physics130, 244104.

13. Sanft, K. R., Wu, S., Roh, M., Fu, J.,Lim, R. K., and Petzold, L. R. (2011)StochKit2: software for discrete stochas-tic simulation of biochemical systems withevents. Bioinformatics 27, 2457–2458.

14. Hoops, S., Sahle, S., Gauges, R., Lee, C.,Pahle, J., Simus, N., Singhal, M., Xu, L.,Mendes, P., and Kummer, U. (2006) CO-PASI — a COmplex PAthway SImulator.Bioinformatics 22, 3067–3074.

15. Mauch, S., and Stalzer, M. (2011) Efficientformulations for exact stochastic simulationof chemical systems. IEEE/ACM Transac-tions on Computational Biology and Bioin-formatics 8, 27–35.

16. Chandran, D., Bergmann, F., and Sauro, H.(2009) TinkerCell: modular CAD tool forsynthetic biology. Journal of Biological En-gineering 3, 19.

17. Roy, S. (2012) Systems biology beyond de-gree, hubs and scale-free networks: the casefor multiple metrics in complex networks.Systems and Synthetic Biology 6, 31–34.

14

18. Gillespie, D. T. (1977) Exact stochasticsimulation of coupled chemical reactions.The Journal of Physical Chemistry 81,2340–2361.

19. Li, C., Donizelli, M., Rodriguez, N.,Dharuri, H., Endler, L., Chelliah, V.,Li, L., He, E., Henry, A., Stefan, M. I.,Snoep, J. L., Hucka, M., Le Novere, N., andLaibe, C. (2010) BioModels Database: Anenhanced, curated and annotated resourcefor published quantitative kinetic models.BMC Systems Biology 4, 92.

20. Hucka, M. et al. (2003) The systems biologymarkup language (SBML): a medium forrepresentation and exchange of biochemicalnetwork models. Bioinformatics 19, 524–531.

21. Cao, Y., Gillespie, D. T., and Petzold, L. R.(2005) The slow-scale stochastic simula-tion algorithm. The Journal of ChemicalPhysics 122, 014116.

22. Rathinam, M., Petzold, L. R., Cao, Y., andGillespie, D. T. (2003) Stiffness in stochas-tic chemically reacting systems: The im-plicit tau-leaping method. The Journal ofChemical Physics 119, 12784–12794.

23. Gillespie, D. T. (2007) Stochastic simula-tion of chemical kinetics. Annual Review ofPhysical Chemistry 58, 35–55.

24. Cao, Y., Gillespie, D. T., and Petzold, L. R.(2006) Efficient step size selection for thetau-leaping simulation method. The Jour-nal of Chemical Physics 124, 044109.

25. Li, J., Wang, L., Hashimoto, Y., Tsao, C.-Y., Wood, T. K., Valdes, J. J., Zafiriou, E.,and Bentley, W. E. (2006) A stochas-tic model of Escherichia coli AI-2 quorumsignal circuit reveals alternative synthesispathways. Molecular systems biology 2, 67.

26. Yu, H.-F., Huang, F.-L., and Lin, C.-J.(2011) Dual coordinate descent methods forlogistic regression and maximum entropymodels. Machine Learning 85, 41–75.

27. Chang, C.-C., and Lin, C.-J. (2011) LIB-SVM: A library for support vector ma-chines. ACM Transactions on IntelligentSystems and Technology 2, 27:1–27:27.

28. Wu, X., Kumar, V., Ross Quinlan, J.,Ghosh, J., Yang, Q., Motoda, H., McLach-lan, G., Ng, A., Liu, B., Yu, P., Zhou, Z.-H.,Steinbach, M., Hand, D., and Steinberg, D.(2008) Top 10 algorithms in data mining.Knowledge and Information Systems 14, 1–37.

29. Di Pierro, M. (2011) Web2Py for scien-tific applications. Computing in Scienceand Engineering 13, 64–69.

30. Bornstein, B. J., Keating, S. M.,Jouraku, A., and Hucka, M. (2008)LibSBML: an API Library for SBML.Bioinformatics 24, 880–881.

31. Csardi, G., and Nepusz, T. (2006) Theigraph software package for complex net-work research. InterJournal Complex Sys-tems, 1695.

32. Pedregosa, F. et al. (2011) Scikit-learn:Machine Learning in Python. Journal ofMachine Learning Research 12, 2825–2830.

15

Documents

Meta-stochastic Simulation of Biochemical Models for Systems …ico2s.org/data/papers/Sanassy2014.pdf · 2014. 9. 2. · Infobiotics Workbench (1) and have been used for theoretical