The Raymond and Beverly Sackler Faculty of Exact Sciences

The Blavatnik School of Computer Science

Cross species modeling of bacterial metabolism reveals new insights about their intra-community and inter-host interactions

Thesis submitted for the degree of Doctor of Philosophy

by

Raphy Zarecki

This work was carried out under the supervision of

Professor Eytan Ruppin

Submitted to the Senate of Tel Aviv University

November 2013

This work is dedicated to the pursuit after the understanding

of the interplay between living species.

Acknowledgements

I thank my advisor Professor Eytan Ruppin for his help, endless patience and

friendship.

I thank my Mother for her endless love.

I thank my family and especially my wife for their support in this adventure.

I thank all those people who shared this scientific road with me, specifically Shiri

Freilich, Matthew A. Oberhardt and Omer Eilam.

I thank my luck for living in exciting times.

This work was supported by the Edmond J. Safra Program in Tel-Aviv University.

This work was also supported by grants from the Israeli Ministry of Science and

Technology and the James McDonnell Foundation.

Abstract

Genome-scale metabolic models have proven their value in predicting organism

phenotypes from genotypes. But despite their numerous potential applications,

efforts to develop new models have failed to keep pace with genome sequencing.

This, together with the lack of standard nomenclatures for metabolites and reactions

in manually curated models, has restricted the focus of most research to date to

isolated, non-interacting species. However, it is well known that prokaryotes live and

thrive in dense communities, and the interactions of community members with each

other as well as with the environment determine much of the functionality,

adaptability, and capabilities of the whole group. While individual genome-scale

models are adequate to predict the behavior of cells in pure cultures, most natural

systems on earth require modeling of metabolic interactivity between species in order

to capture the most relevant biology.

The appearance of semi-automatic tools for the generation of metabolic models, and

the subsequent availability of models for thousands of sequenced prokaryotes, have

recently opened up the possibility of tackling this issue. Although automatically

generated models are still not of equivalent quality to manually curated ones, they do

open new opportunities in the types of questions we may ask. Importantly, these

models solve the problem of differing metabolite and reaction nomenclature between

models -- a major technical hurdle in the past -- and thus enable seamless comparison

and modeling at the community level.

The research presented in this dissertation harnesses the newly emerging automatic

models of bacterial species to address some of the fundamental challenges involved

in modeling bacterial communities: The work includes a study of interactions

between different microbial species, an extraction of prevalent features across

species that are hidden when each species is examined alone, and a study of

interactions between bacterial communities and their hosts. In sum, this work

demonstrates the potential of multi-species metabolic modeling in advancing our

understanding of these issues, at the level of basic science and at the level of its

potential bio-engineering ramifications. Specifically, some of the major results are: a
large-scale, ecology-based validation of a method for predicting the
cooperation/competition relationships between bacterial community members based
on their metabolic models; a novel, easy-to-calculate method for predicting bacterial
growth rate based on metabolic models; and a novel algorithm for predicting the
glycan metabolism of the mammalian gut microbiota.


Contents

1. INTRODUCTION
1.1. Systems biology
1.2. Metabolic networks analysis
1.2.1. Biological background
1.2.2. Mathematical metabolic modeling
1.2.3. Construction of a metabolic model
1.2.4. Potential application of metabolic modeling
1.3. Focusing on modeling bacterial metabolism
1.3.1. Semi automatic tools for metabolic model generation
1.3.2. Large scale analysis of a collection of prokaryote metabolic models
1.3.3. Metabolic simulating of prokaryotes communities
1.3.4. Finding hidden qualities from large scale analysis of metabolic models

2. COMPETITIVE AND COOPERATIVE METABOLIC INTERACTIONS IN BACTERIAL COMMUNITIES
2.1. Introduction
2.2. Results
2.2.1. In silico and in vivo description of co-growth patterns
2.2.2. Systematic predictions of the competitive potential
2.2.3. Systematic predictions of the cooperative potential
2.2.4. Patterns of interactions across ecological samples
2.3. Discussion
2.4. Methods
2.4.5. Calculating the resource overlap within a species pair

3. MAXIMAL SUM OF METABOLIC EXCHANGE FLUXES OUTPERFORMS BIOMASS YIELD AS A PREDICTOR OF GROWTH RATE OF MICROORGANISMS
3.1. Introduction
3.2. Results and Discussion
3.3. Conclusions
3.4. Materials and Methods
3.4.1. Models
3.4.2. Implementation of growth rate predictors
3.4.3. Building NCI60 cancer cell models
3.4.4. Growth experiments of 6 organisms on 3 defined IMM media (ds18)

4. GLYCAN DEGRADATION (GLYDE) ANALYSIS PREDICTS MAMMALIAN GUT MICROBIOTA ABUNDANCE AND HOST DIET-SPECIFIC ADAPTATIONS
4.1. Introduction
4.2. Results
4.2.1. The construction of the Glycan Degradation (GlyDe) pipeline
4.2.2. The usage of the GlyDe pipeline
4.2.3. Validating the GlyDe pipeline
4.2.4. Characterization of glycan degradation patterns across the major gut bacterial phyla
4.2.5. Glycan degradation patterns can be used to predict bacterial abundance
4.2.6. Glycan degradation profiles of mammalian species are associated with their diet
4.3. Discussion
4.4. Methods
4.4.1. Data Retrieval
4.4.2. Construction of the CAZyme table (a key step in the GlyDe pipeline)
4.4.3. Data Analysis

5. DISCUSSION
5.1. Answering new types of questions with the large number of available bacterial metabolic models
5.1.1. Metabolic Interaction within Bacterial communities
5.1.2. Extracting cell qualities from a large scale metabolic analysis across a large number of species
5.1.3. Investigating the relations between gut bacterial communities and their hosts
5.2. Future directions
5.2.1. New methods for simulating communities
5.2.2. Specialized bacterial communities
5.2.3. Examples of organisms' traits that can be extracted using the large number of prokaryotes metabolic models
5.2.4. Investigating relationships of specific communities of bacteria and their host
5.3. Summary

APPENDIX 1. SUPPLEMENTARY DATA FOR CHAPTER 2
A.1.1 Supplementary Figures
A.1.2 Supplementary Tables
A.1.3 Supplementary Methods
A.1.3.1 Computing the Maximal Biomass Production Rate (MBR) of species
A.1.3.2 Generation of a multi-species system metabolic model
A.1.3.3 Computing a Competition inducing Medium (COMPM) for single species and multi-species systems
A.1.3.4 Computing a Cooperation-inducing Medium (COOPM)
A.1.3.5 Experimental and computational co-growth analysis
A.1.3.6 Finding close cooperative loops in real and random networks of give-take interactions and in real and randomly drawn communities
A.1.4 Supplementary Notes
A.1.4.1 Supplementary Note 1: Experimental and computational co-growth analyses for 10 bacterial pairs in interaction-specific media.
A.1.4.2 Supplementary Note 2: using systematic data sources for estimating the ecological relevance of win-lose predictions
A.1.4.3 Supplementary Note 3: Simulating co-growth of Salinibacter ruber and Haloquadratum walsbyi.
A.1.4.4 Supplementary Note 4: Relating the designed media to true ecological conditions
A.1.4.5 Supplementary Note 5: The use of various thresholds for determining a feasible growth solution in minimal, cooperation-inducing, media
A.1.4.6 Supplementary Note 6: Frequency of directional give-take relationships across bacterial families (top 10 combinations).
A.1.4.7 Supplementary Note 7: Experimental and computational growth analyses of Listeria innocua and Agrobacterium tumefaciens across pre-designed media

A.2.1 List of Abbreviations
A.2.2 Supplementary results
A.2.2.1 Sensitivity analysis of SUMEX and Biomass
A.2.2.2 Expanded analysis of obligate fermenters and respirers in ds66
A.2.2.3 The relationship between flux and molecular weight in SUMEX
A.2.2.4 Ranging of biomass% lower bound
A.2.2.5 Network flexibility in SUMEX as biomass lower bound approaches 100%
A.2.2.6 Summing exchange fluxes in the optimal biomass solution space predicts growth rate
A.2.2.7 Gene Expression of pathways contributing to SUMEX
A.2.2.8 Correlation of SUMEX and Biomass:
A.2.3 Supplementary Methods:
A.2.3.1 Models
A.2.3.2 General methods
A.2.3.3 Reactions constraints and optimal environment setting
A.2.3.4 Building NCI60 cancer cell models
A.2.3.5 Computation of metrics
A.2.3.6 Growth experiments of 6 organisms on 3 defined IMM media (ds18)
A.2.4 Tables

A.3.1 Figures
A.3.2 Tables


List of Tables

Chapter 4::Table 1: Mammalian host diet prediction by GlyDe profiles.
Appendix 1::Supplementary Table S1. Description of model species and selected properties.
Appendix 1::Supplementary Table S6. The list of EnvO niches used in the analysis and the number of assigned samples.
Appendix 1::Table SM-1. IMM defined medium and its in silico representation.
Appendix 1::Table SN1-1. Predicted and observed co-growth shifts.
Appendix 1::Table SN1-2 Calculated values for predicted and observed co-growth shifts.
Appendix 1::Table SN3-1. Interactions between Salinibacter ruber and Haloquadratum walsbyi across different media.
Appendix 1::Table SN4-1. Characterization of species-specific metabolic computationally-designed environments
Appendix 1::Table SN4-2. Characterization of pair-specific metabolic environments
Appendix 1::Table SN5-1. Frequency of symmetrical interaction events under minimal growth media with different thresholds for biomass production of the system.
Appendix 1::Table SN5-2. Frequency of symmetrical interaction events under minimal growth media with different thresholds for biomass production of the system and the compartments in the system.
Appendix 1::Table SN6-1. Frequency of inter-family give-take interactions
Appendix 1::Table SN7-1. Computational predictions for the effect of reducing and removing computationally-predicted limiting factors from IMM media.
Appendix 1::Table SN7-2. Predicted and observed growth and co-growth shifts.
Appendix 1::Table SN7-3. Observed growth and co-growth shifts. Values indicate the maximal OD in the experiments.
Appendix 2::Table S1. Statistics of GEM sensitivity analysis.
Appendix 2::Table S2: Analysis of ds24.
Appendix 2::Table S4: Description of ds66.
Appendix 2::Table S5: in vitro growth experiments (i.e., ds18).
Appendix 2::Table S6: IMM defined medium.
Appendix 3::Supplementary Table 1: CAZymes degradation rules.
Appendix 3::Supplementary Table 2: Comparison between KEGG and non KEGG GlyDe scores.
Appendix 3::Supplementary Table 3: The GlyDe outputs for all the HMP taxa.
Appendix 3::Supplementary Table 4: The GlyDe scores of 8 Human Milk Oligosaccharides (HMOs) available in KEGG.
Appendix 3::Supplementary Table 5: The CAZyme table.
Appendix 3::Supplementary Table 6: A detailed account of the glycans used throughout this analysis.
Appendix 3::Supplementary Table 7: The full list of monosaccharides and other basic chemical entities used as nodes in the graphs representing glycan structures in KEGG and incorporated into our system.
Appendix 3::Supplementary Table 8: An OTU table representing HMP taxa.
Appendix 3::Supplementary Table 9: The HMP bacterial reference genomes glycan degradation (GlyDe) matrix.
Appendix 3::Supplementary Table 10: The Muegge et al. samples CAZymes matrix.
Appendix 3::Supplementary Table 11: The Yatsunenko et al. samples CAZymes matrix.
Appendix 3::Supplementary Table 12: The Muegge et al. samples GlyDe matrix.
Appendix 3::Supplementary Table 13: The Yatsunenko et al. samples GlyDe matrix.
Appendix 3::Supplementary Table 14: The Muegge et al. samples GlyDe output report.
Appendix 3::Supplementary Table 15: The Yatsunenko et al. samples GlyDe output report.
Appendix 3::Supplementary Table 16: Host diet predictions for 18 human samples taken from Muegge et al.
Appendix 3::Supplementary Table 17: Taxa with highly predictable abundance.


List of figures

Chapter 2::Figure 1: Metabolic modeling in a multi-species system.
Chapter 2::Figure 2: Metabolic modeling of pairwise growth on a competition-inducing media (COMPM).
Chapter 2::Figure 3: Distribution of competition and cooperation values.
Chapter 2::Figure 4: Predicted competitive and cooperative interactions across different ecological groups.
Chapter 3::Figure 1: Correlation of different metrics to growth rate.
Chapter 3::Figure 2: Component-wise analysis of SUMEX
Chapter 3::Figure 3: Prediction of growth in Respirers vs. Fermenters in ds66
Chapter 3::Figure 4: NCI60 cancer cell line growth rates predicted by SUMEX.
Chapter 4::Figure 1: The Glycan Degradation (GlyDe) platform.
Chapter 4::Figure 2: Glycan Degradation of the gut microbiota reference genomes.
Chapter 4::Figure 3: The connection between glycan degradation and diet.
Appendix 1::Supplementary Figure S1. Cooperation and competition levels of the ecological groups at different levels of competition and resource overlap.
Appendix 1::Supplementary Figure S2: The frequency of resource overlap values between ecologically associated (black) and non-associated (white) species pairs.
Appendix 1::Figure NS1-1. Growth curves of individual and pair-wise combinations across different media.
Appendix 2::Figure S1. Sensitivity analysis of GEM bounds.
Appendix 2::Figure S2: Effect of biomass lower bound on SUMEX.
Appendix 2::Figure S3: Flux variability in SUMEX solution as function of biomass lower bound.
Appendix 2::Figure S4: Extrapolating bounds for biomass.
Appendix 2::Table S3: Association of global gene expression with SUMEX.
Appendix 2::Figure S5: Correlation of Biomass with SUMEX.
Appendix 2::Figure S6: Schematic of SUMEX.
Appendix 3::Supplementary Figure 1: KEGG Glycans.
Appendix 3::Supplementary Figure 2: Glycan Degradation of the gut microbiota reference genomes.
Appendix 3::Supplementary Figure 3: The connection between glycan degradation and diet.


Chapter 1

1. Introduction

This chapter presents general background and reviews previous work related to the

studies described in this thesis. In particular, it includes an overview of systems

biology approaches for modeling of metabolic and protein networks. Further detailed

introductions precede each study in the following chapters.

1.1. Systems biology

For a very long time, the use of computers in biology was mostly dedicated to
processing collected biological data, with the results then returned to biologists for
interpretation.

The introduction of the concept of 'Systems Biology' drastically changed the
relationship between computer scientists and biologists. In this paradigm,
sophisticated computational models are used to simulate biological phenomena in
great detail, and biologists are then asked to validate these models. Systems biology
thus involves an iterative interplay between high-throughput and high-content wet-lab
experiments, technology development, theory, and computational modeling. The
involvement of computational modeling in the process sets systems biology apart
from the more traditional and more reductionist approaches of molecular biology,
which dominated biological study for the second half of the last century [1].


1.2. Metabolic networks analysis

This dissertation belongs to the field of Systems Biology. More specifically, it
focuses on the modeling of prokaryote metabolic networks. Using the Systems
Biology approach, models describing the entire metabolic network of an organism of
interest are built. This approach is well defined, and stable, computer-readable,
generalized representations of metabolic and biochemical models, such as SBML [2],
exist. There are also tools and methods to measure concentrations of a broad
range of metabolites (metabolomics), as well as to collect other large-scale 'omics-type
data (e.g., proteomics, transcriptomics, fluxomics), which enable us to validate
some of our predictions [1].

Understanding the metabolic processes within living cells is of great potential
economic importance, with industrial, biomedical, and bio-remediation
applications. In industry, metabolic processes are relevant to the production of foods,

fuel, antibiotics, and amino-acids (as food supplements). In the context of healthcare,

metabolism plays a central role in many human diseases, especially with the

emergence of metabolic diseases such as diabetes and obesity as top sources of

morbidity and mortality.

1.2.1. Biological background

Cellular metabolism refers to the set of chemical transformations of substances that

take place within a cell. Most of the chemical reactions within the cell are catalyzed

by specific proteins called enzymes. These reactions typically convert several

metabolites, called reactants (or substrates), into several other product metabolites.

These collected reactions form a highly complex metabolic network.


1.2.2. Mathematical metabolic modeling

Mathematical modeling of metabolism can be presented in different ways.

Kinetic models

The kinetic approach is focused on the description of stationary states and time

courses[3]. Kinetic models are commonly formulated as a set of differential

equations that compute the time derivative of metabolite concentrations depending

on reaction rates (which, in turn, depend on the concentrations of some of the

metabolites and enzymes). The major limitation of such kinetic models is that

reaction rate equations contain many parameters whose values are unknown and are

experimentally hard to collect. This is why the applicability of kinetic models is still

limited to relatively small-scale systems.
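
As a minimal illustration of the kinetic formulation (a sketch that is not taken from this work), the following code integrates a single hypothetical mass-action reaction A -> B with forward Euler steps. The reaction, the rate constant `k`, the step size, and the function name `simulate_mass_action` are all assumptions chosen for illustration. Even this toy system needs a rate constant that would have to be measured, which hints at why parameter collection becomes the bottleneck at genome scale.

```python
def simulate_mass_action(a0, b0, k, dt, steps):
    """Integrate dA/dt = -k*A and dB/dt = +k*A with forward Euler steps.

    a0, b0: initial concentrations of A and B (hypothetical units)
    k:      assumed first-order rate constant
    dt:     time step; steps: number of Euler steps
    """
    a, b = a0, b0
    trajectory = [(0.0, a, b)]
    for step in range(1, steps + 1):
        rate = k * a        # reaction rate depends on the substrate concentration
        a -= rate * dt      # A is consumed
        b += rate * dt      # B is produced
        trajectory.append((step * dt, a, b))
    return trajectory


if __name__ == "__main__":
    for t, a, b in simulate_mass_action(a0=1.0, b0=0.0, k=0.5, dt=0.1, steps=5):
        print(f"t={t:.1f}  [A]={a:.4f}  [B]={b:.4f}")
```
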

Constraint Based Modeling

An alternative approach to kinetic modeling, when handling large scale metabolic

networks, is called Constraint Based Modeling (CBM). This approach is the one used

in this work. CBM is based on the observation that cells are subject to various

physical constraints that limit their behavior [4]. By enforcing these constraints on
the space of possible metabolic behaviors in a large-scale model, it is possible to
determine which metabolic states are valid and which are not, to select states
according to defined criteria, and to find what constraints those states imply
on the model. CBM has been shown to successfully predict gene essentiality, growth

rate, nutrient uptake rates and product secretion rates in a variety of studies [4].

In CBM, the metabolic network is represented as a matrix:

$S \in \mathbb{R}^{m \times n}$,

in which n is the number of reactions in the model and m is the number of

metabolites that participate in the reactions of the model. Each column j encodes
reaction j as a linear equation. For example, the column for a reaction consuming 2
metabolites of type A and producing 1 metabolite of type B would have -2 and +1,
respectively, in the matrix rows representing metabolites A and B. Each cell $S_{i,j}$
holds the stoichiometric coefficient of metabolite i in reaction j. Reactants have
negative stoichiometric values and products have positive values.
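
The matrix construction described above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the helper name `build_stoichiometric_matrix` and the dictionary-based reaction format are assumptions, not part of any tool used in this work. The single reaction is the 2 A -> B example from the text.

```python
def build_stoichiometric_matrix(metabolites, reactions):
    """Return S as a list of rows (one per metabolite), one column per reaction.

    reactions is a list of dicts mapping metabolite name -> stoichiometric
    coefficient (negative for reactants, positive for products).
    """
    row_of = {name: i for i, name in enumerate(metabolites)}
    s = [[0] * len(reactions) for _ in metabolites]
    for j, stoichiometry in enumerate(reactions):
        for metabolite, coefficient in stoichiometry.items():
            s[row_of[metabolite]][j] = coefficient
    return s


metabolites = ["A", "B"]
reactions = [
    {"A": -2, "B": +1},  # 2 A -> B, the example from the text
]
S = build_stoichiometric_matrix(metabolites, reactions)
print(S)  # row for A holds -2, row for B holds +1
```

Metabolites that do not take part in a reaction keep a coefficient of 0, which is why genome-scale stoichiometric matrices are typically very sparse.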

Using linear programming methods along with this network representation, we can

predict metabolic states as represented by the feasible flux distribution through all

reactions in the network. The constraints imposed on the matrix that represents the
metabolic model are:

Mass balance – We assume quasi-steady state, i.e., that there is neither

accumulation nor depletion of metabolites within the metabolic network

(exchanged metabolites are allowed to deplete or accumulate outside of the

system). This means that the production rate of each metabolite is equal to its

consumption rate. This balance is formulated mathematically based on a

stoichiometric matrix. The mass balance constraint is enforced by the equation:

$S \cdot V = 0$, in which $V$ is the set of all metabolic reaction fluxes that fall within a
sub-space of $\mathbb{R}^n$. As noted above, the model allows for uptake and secretion of

metabolites in and out of the model. This is done via exchange reactions that are

added to the model for this purpose, and which, unlike reactions within the

model, do not need to adhere to mass balance, as they represent ultimate external

sinks or sources.

Thermodynamic limitations - The model supports the directionality of reactions.

Directionality of many biochemical reactions is limited based on thermodynamic

constraints. For these reactions flux can only go in one direction, while for other

reactions flux can go from reactants to products or vice versa.

Flux rates - For some reactions, the maximum possible flux rate can be estimated

based on cell physiology data. These constraints are imposed by setting upper

and lower bounds on the rate of specific reactions. We represent these limitations

this way: ∀ 𝑣 ∈ 𝑉 𝛼 ≤ 𝑣 ≤ 𝛽 , where V is the set of all reactions, and 𝛼,𝛽 are

the lower and upper bounds of the reactions fluxes.


Growth media – In order to predict the behavior of an organism under given

growth conditions, the model allows the definition of available external

metabolites existing in the organism's growth environment. This is achieved by

constraining the fluxes through exchange reactions, which represent the

availability of extra-cellular metabolites within the growth medium.
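As a minimal sketch of how the four constraint types above combine, the following Python snippet checks whether a candidate flux vector is valid for a toy network. The network, the bound values and the function `is_feasible` are assumptions made for illustration. For simplicity, uptake is written here as a positive flux through a source reaction, which differs from the sign convention used by many modeling toolboxes.

```python
import numpy as np

# Toy network (hypothetical): EX_A: -> A ; R1: 2 A -> B ; EX_B: B ->
S = np.array([
    [1.0, -2.0,  0.0],  # metabolite A
    [0.0,  1.0, -1.0],  # metabolite B
])

# Directionality: all three reactions are irreversible (lower bound 0).
lower = np.array([0.0, 0.0, 0.0])
# Growth medium: at most 10 units of A can be taken up (assumed value).
upper = np.array([10.0, 1000.0, 1000.0])

def is_feasible(S, v, lower, upper, tol=1e-9):
    """Check mass balance (S v = 0) and flux bounds for a flux vector v."""
    v = np.asarray(v, dtype=float)
    balanced = np.all(np.abs(S @ v) <= tol)
    within_bounds = np.all(v >= lower - tol) and np.all(v <= upper + tol)
    return bool(balanced and within_bounds)

print(is_feasible(S, [10, 5, 5], lower, upper))  # balanced and within bounds
print(is_feasible(S, [12, 6, 6], lower, upper))  # balanced, but exceeds the medium
print(is_feasible(S, [10, 4, 5], lower, upper))  # A accumulates, B is depleted
```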

Together, these constraints define a convex space of feasible flux distributions. The analysis of

these types of convex solution spaces is commonly done via linear programming

(LP) computational optimization methods. Examples include:

Flux Balance Analysis (FBA) [5], which searches for an optimal flux distribution

via a linear programming optimization. FBA assumes that the metabolism of an

organism is optimized for maximization of a certain objective function. The most

commonly used objective function for micro-organisms is biomass production[6].

Biomass production is calculated by adding a new pseudo-reaction representing

the production of essential biomass compounds from known metabolites acting

as reactants. The stoichiometric values for this reaction (Vbiomass) are based on

experimentally derived proportions of metabolic precursors to the parts of a cell

(e.g., lipid, sugar, amino acid), which can be measured as proportions of the dry

weight of a pure culture. Aside from biomass optimization, FBA can also be

used to examine production of other compounds of interest, such as ATP or

important external metabolites (or, indeed, any metabolite in the system). The

basic mathematical formulation of FBA is:

$$
\begin{aligned}
\min/\max \quad & \text{Objective\_function} \\
\text{subject to} \quad & S \cdot V = 0, \\
& v_{j,\min} \le v_j \le v_{j,\max} \quad \forall v_j \in V
\end{aligned}
$$

where S is the stoichiometric matrix representing the metabolic model at hand and V is

the vector of reaction fluxes. $v_{j,\min}$ and $v_{j,\max}$ represent the lower and upper

bounds on the possible flux of reaction j.


It should be noted that in many cases the optimal flux distribution is not unique;

characterizing the range of alternative optimal solutions requires flux variability analysis (described below).
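The FBA formulation can be illustrated with a small linear program. The sketch below uses `scipy.optimize.linprog` on a hypothetical four-reaction network. The network, the bounds and the helper name `run_fba` are assumptions for illustration and do not reproduce the solver setup used in this thesis.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy network with two parallel routes from A to B.
# Columns: uptake (-> A), R1 (A -> B), R2 (A -> B), biomass (B ->)
S = np.array([
    [1.0, -1.0, -1.0,  0.0],  # metabolite A
    [0.0,  1.0,  1.0, -1.0],  # metabolite B
])
bounds = [(0, 10), (0, 1000), (0, 1000), (0, 1000)]  # (v_min, v_max) per reaction
BIOMASS = 3  # index of the biomass pseudo-reaction

def run_fba(S, bounds, objective_index):
    """Maximize the flux through one reaction subject to S v = 0 and the bounds."""
    c = np.zeros(S.shape[1])
    c[objective_index] = -1.0  # linprog minimizes, so negate to maximize
    res = linprog(c, A_eq=S, b_eq=np.zeros(S.shape[0]),
                  bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x, -res.fun

fluxes, optimum = run_fba(S, bounds, BIOMASS)
print("optimal biomass flux:", optimum)
print("one optimal flux distribution:", fluxes)
```

In this toy network any split of the flux between R1 and R2 gives the same optimum, so the returned flux distribution is only one of many optimal solutions.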

Flux Variability Analysis (FVA), which searches for the range of alternative flux

solutions for a given set of constraints [7]. This method finds the minimal and

maximal possible fluxes for a given set of reactions, and thereby estimates

the size of the flux solution space under a given set of constraints.

Sampling methods. Randomly sampling the solution space for a given set of

constraints may reveal patterns in the distribution of allowed solutions [8].

Flux coupling, which can identify dependencies between sets of fluxes in the

solution space (such as 'always occur together' or 'are totally

independent/uncoupled') [9, 10].
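Of the methods listed above, FVA is the easiest to sketch on top of an FBA solve: the objective is first optimized, then each flux is minimized and maximized while the objective is held at (a fraction of) its optimum. The toy network, the `fraction` parameter and the function name `flux_variability` below are illustrative assumptions, not the implementation used in this work.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy network with two parallel routes from A to B.
# Columns: uptake (-> A), R1 (A -> B), R2 (A -> B), biomass (B ->)
S = np.array([
    [1.0, -1.0, -1.0,  0.0],  # metabolite A
    [0.0,  1.0,  1.0, -1.0],  # metabolite B
])
bounds = [(0, 10), (0, 1000), (0, 1000), (0, 1000)]
BIOMASS = 3

def flux_variability(S, bounds, objective_index, fraction=1.0):
    """Return (optimum, [(min_flux, max_flux), ...]) at >= fraction * optimum."""
    n = S.shape[1]
    b_eq = np.zeros(S.shape[0])
    c = np.zeros(n)
    c[objective_index] = -1.0
    best = linprog(c, A_eq=S, b_eq=b_eq, bounds=bounds, method="highs")
    if not best.success:
        raise RuntimeError(best.message)
    optimum = -best.fun
    # Keep the objective near its optimum: -v_obj <= -(fraction * optimum),
    # relaxed by a tiny tolerance to avoid numerical infeasibility.
    A_ub = np.zeros((1, n))
    A_ub[0, objective_index] = -1.0
    b_ub = np.array([-(fraction * optimum - 1e-9)])
    ranges = []
    for j in range(n):
        cj = np.zeros(n)
        cj[j] = 1.0
        lo = linprog(cj, A_ub=A_ub, b_ub=b_ub, A_eq=S, b_eq=b_eq,
                     bounds=bounds, method="highs")
        hi = linprog(-cj, A_ub=A_ub, b_ub=b_ub, A_eq=S, b_eq=b_eq,
                     bounds=bounds, method="highs")
        if not (lo.success and hi.success):
            raise RuntimeError("FVA sub-problem failed")
        ranges.append((lo.fun, -hi.fun))
    return optimum, ranges

optimum, ranges = flux_variability(S, bounds, BIOMASS)
for name, (vmin, vmax) in zip(["uptake", "R1", "R2", "biomass"], ranges):
    print(f"{name}: [{vmin:.2f}, {vmax:.2f}]")
```

Here R1 and R2 can each range from zero to the full uptake flux, while uptake and biomass are fixed at the optimum, which illustrates why a single FBA solution may be only one of many.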

In addition to LP, other optimization methods are often used in metabolic

modeling. Quadratic Programming (QP) is used, for example, in the minimization

of metabolic adjustment (MOMA) method, which predicts the flux distribution after a

perturbation in the metabolic model[11]. Mixed Integer Linear Programming (MILP)

is used, for example, in the regulatory on/off minimization (ROOM) algorithm, which

predicts metabolic flux changes after genetic perturbations [12].

1.2.3. Construction of a metabolic model

The construction of a large-scale metabolic network model is based on various

biological data sources, including genomic, biochemical, and physiological data. It

involves the definition of a set of biological functions, termed ontology, and the

association of the gene products (enzymes) with ontology terms. The most

comprehensive and commonly used ontology is the Gene Ontology (GO), consisting

of over 20,000 terms and numerous associated gene products [13]. Some genes

encode enzymes, the proteins that catalyze biochemical reactions; each such

association is annotated as a Gene-Protein-Reaction (GPR) triplet. These GPR sets

are the core of metabolic models. Well-known repositories for GPRs are KEGG [14],

MetaCyc [15] and 'The Seed' [16].


Model construction involves a well-defined protocol [17] consisting of a series of

iterations in which the model's predictions are experimentally tested and the results are then used to

improve it.

Manual construction of metabolic models

Up until a few years ago, genome-scale metabolic models were only constructed

manually. This was a labor-intensive process that required considerable time (~1-2 years of

work by a few people) to produce a working model. Many of the manually curated

large-scale models were constructed in Bernhard Palsson's lab at UCSD

(http://sbrg.ucsd.edu/Downloads), or in labs of his former students. These include

large-scale models for the bacterium E. coli [18], the yeast S. cerevisiae [19], and the

first large-scale human metabolic network model[20]. As of now there are fewer than

200 manually constructed models.

1.2.4. Potential application of metabolic modeling

Having a predictive model is essential for engineering new products. Metabolic

modeling has already proved its predictive capabilities for gene deletion and addition,

gene over- and under-expression, prediction of phenotypes based on changes in

media, and prediction of growth media that are optimal for different phenotypic

goals[21]. Many health-related applications belong to the biomarker family, where

the models can predict high or low concentrations of certain metabolites when certain

diseases occur[22, 23]. An example of a bioremediation application, in which metabolic modeling

helps remove uranium from a contaminated lake, can be found in [24].


1.3. Focusing on modeling bacterial metabolism

Prokaryotes are organisms whose cells lack a membrane-bound nucleus, including

both bacteria and archaea.

Prokaryotes have a fundamental role in the world's ecosystem. They were the first

form of life found on earth and are present in nearly every habitat on the planet.

There are approximately $5 \times 10^{30}$ prokaryotes on Earth, forming a biomass that far

exceeds that of all plants and animals together. A healthy human harbors

approximately ten times as many bacterial cells as his own cells. Bacteria are vital in

recycling nutrients, and while some bacteria are pathogenic, others can be exploited

for a wide range of applications, from food and fuel production to clinical uses and

bioremediation. Elucidating the way these species interact with their surroundings and

their neighbors is crucial for our understanding of their biology and ours.

From the research point of view, prokaryotes are relatively simple compared to

eukaryotes, and thus they are ideal for the pilot studies done in this thesis.

Extensive research has been done on many prokaryotes at the phenotypic and the

genotypic levels, and the relative simplicity of prokaryotes has led to the

development of semi-automated tools for the construction of metabolic models for

them.

1.3.1. Semi automatic tools for metabolic model generation

Genome-scale metabolic models have proven to be important resources for

predicting organism phenotypes from genotypes. However despite their numerous

applications, efforts to develop new models have failed to keep pace with genome

sequencing. To address this, new tools have been developed to aid with the creation

of metabolic models, automating the parts that could be automated. The leading tool

is 'Model Seed' (hereafter referred to as SEED) [25]. This is a web-based resource for

high-throughput generation, optimization, and analysis of draft prokaryotic genome-scale

metabolic models (available at http://www.theseed.org/models/). SEED


integrates existing methodologies and introduces new techniques to automate many

of the steps of the metabolic reconstruction process[17], enabling generation of

functioning draft models from assembled genome sequences in less than 48 hours. A

validation of 22 SEED-generated draft models was done against available gene

essentiality and Biolog data, with average model accuracy determined to be 66%

before optimization and 87% after optimization. The following chapters are based on

the models generated by SEED.

The major steps of the automatic model reconstruction in prokaryotes

Automatic model reconstruction for a given organism requires the following

information:

The genome sequence and its segmentation into genes

A GPR (Gene-Protein-Reaction) mapping repository for all known genes

A database of all known reactions, with their full stoichiometry and chemical

formulas

For higher species and better mapping we also need enzyme localization information,

i.e., the compartment in which each enzyme operates. In prokaryotes, SEED

assumes that all enzymes reside in the cytosol.

The steps in automatic reconstruction of a model include:

Mapping as many genes as possible from the genome to genes that have a GPR

mapping, and thus creating a 'core' set of GPR-mapped reactions.

Adding a biomass reaction containing all biomass pre-requisites as reactants.

Performing a Gap-Filling process in which a minimal set of reactions outside of

the 'core' (i.e., reactions that are not mapped to the organism's genes) is added to

the model in order to ensure that the organism can 'grow' (i.e., can have flux in

its biomass reaction while adhering to the model's constraints, including the

steady state assumption). The selection of Gap-filling reactions is usually done

by solving a MILP problem that tries to minimize the number of added reactions

while preferentially adding reactions belonging to established pathways enriched


with 'core' reactions, and especially minimizing the addition of exchange

reactions. As a technical alternative, some Gap-Filling algorithms aim also to

maximize the number of core reactions that can carry flux.

The output of such a process is a working model -- 'working' being defined by the

model's basic functioning and its ability to produce biomass -- that should later be

validated and corrected.
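The idea behind the Gap-Filling step can be sketched on a toy network. Note that the sketch below replaces the MILP formulation described above with a brute-force enumeration of candidate subsets, which is only practical for very small examples, and that it ignores the pathway-based preferences. The network, the indices and the function names are hypothetical.

```python
from itertools import combinations

import numpy as np
from scipy.optimize import linprog

# Hypothetical toy network. Columns:
# 0: uptake (-> A), 1: A -> B, 2: B -> C, 3: A -> C, 4: biomass (C ->)
S = np.array([
    [1.0, -1.0,  0.0, -1.0,  0.0],  # metabolite A
    [0.0,  1.0, -1.0,  0.0,  0.0],  # metabolite B
    [0.0,  0.0,  1.0,  1.0, -1.0],  # metabolite C
])
bounds = [(0, 10), (0, 1000), (0, 1000), (0, 1000), (0, 1000)]
core = [0, 4]           # reactions mapped to the organism's genes (assumed)
candidates = [1, 2, 3]  # reactions available for gap-filling (assumed)
BIOMASS = 4

def max_biomass(S, bounds, biomass_index):
    """Maximal biomass flux under S v = 0 and the given bounds."""
    c = np.zeros(S.shape[1])
    c[biomass_index] = -1.0
    res = linprog(c, A_eq=S, b_eq=np.zeros(S.shape[0]),
                  bounds=bounds, method="highs")
    return -res.fun if res.success else 0.0

def gap_fill(S, bounds, core, candidates, biomass_index, min_growth=1e-6):
    """Return a smallest set of candidate reactions that enables biomass flux."""
    for k in range(len(candidates) + 1):
        for added in combinations(candidates, k):
            active = set(core) | set(added)
            # Reactions outside the trial model are switched off.
            trial = [b if j in active else (0.0, 0.0)
                     for j, b in enumerate(bounds)]
            if max_biomass(S, trial, biomass_index) > min_growth:
                return list(added)
    return None

added = gap_fill(S, bounds, core, candidates, BIOMASS)
print("reactions added by gap-filling:", added)
```

In this example the core alone cannot produce C, and adding the single reaction A -> C is enough to allow flux through the biomass reaction.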

Following this process, 2500 metabolic models, representing nearly every sequenced

prokaryote in NCBI (and spanning both bacteria and archaea) [26], were built via

SEED.

1.3.2. Large scale analysis of a collection of prokaryote

metabolic models

Most of the work that is currently done with metabolic models is still performed at

the level of a single species. However, with the newly emerged resource just

described and with recent advances, interest, and focus on metagenomics, it is timely

to begin exploring multiple-organism analyses using these models. This analysis

includes a comparison of the models and extraction of common features as shown in

Chapters 2, 3, and 4, the construction of more complex structures such as

communities, and an analysis of the interactions between species.

1.3.3. Metabolic simulation of prokaryote communities

Most of the research using metabolic modeling has focused on single cell models,

and has assumed that each species is an individual entity, isolated from any

interaction with others. However, prokaryotes most typically live and thrive in dense

communities in which the interactions of community members with each other and with

the environment determine the functionality, adaptability and capabilities of the

group as a whole. While individual models are adequate to predict the behavior of

cells in pure cultures, simulation of any realistic community requires consideration of


possible metabolic interactions between different species. Although there have been

several previous attempts to model particular consortia [27, 28], the work done was

focused on specific pairs and did not include a global analysis of the interaction

between species. The work done in Chapter 2 suggests a computational platform,

protocols, and resources for creating and analyzing models of communities in an

easy and systematic way. For that purpose, we have used the models curated using

the SEED semi-automatic workflow, as the building blocks of our communities.

1.3.4. Finding hidden qualities from large scale analysis of

metabolic models

In Chapter 3, we aim to explain and predict the phenotypic feature of growth rate

using metabolic modeling across different species. Growth rate has long been

considered one of the most valuable phenotypes that can be measured in cells [29].

Aside from being highly accessible and informative in laboratory cultures, maximal

growth rate is often a prime determinant of cellular fitness [30, 31], and predicting

phenotypes that underlie fitness is key to both understanding and manipulating life

[32-34]. Despite this, current methods for predicting microbial fitness typically

focus on yields [e.g., predictions of biomass yield using GEnome-scale metabolic

Models (GEMs)] or notably require many empirical kinetic constants or substrate

uptake rates, which render these methods ineffective in cases where fitness derives

most directly from growth rate [34, 35]. In Chapter 3 we present a new method for

predicting cellular growth rate, termed SUMEX, which does not require any

empirical variables apart from a metabolic network (i.e., a GEM) and the growth

medium. SUMEX is calculated by maximizing the SUM of molar EXchange fluxes

(hence SUMEX) in a genome-scale metabolic model. SUMEX successfully predicts

the growth rate of microbes across species, environments, and genetic conditions,

outperforming traditional cellular objectives (most notably, the convention assuming

biomass maximization). The success of SUMEX suggests that the ability of a cell to

catabolize substrates and produce a strong proton gradient enables fast cell growth.

Easily applicable heuristics for predicting growth rate, such as what we demonstrate

with SUMEX, may contribute to numerous medical and biotechnological goals,


ranging from the engineering of faster-growing industrial strains and the modeling of mixed

ecological communities to the inhibition of cancer growth.

1.3.5. Using metabolic modeling for the analysis of Gut

bacterial communities

In Chapter 4, we use metabolic modeling in order to predict bacterial species

abundance and diet-specific adaptations in the mammalian gut microbiota, by

analyzing gut bacterial glycan metabolism. Glycans form the primary nutritional

source of microbes in the mammalian gut. Understanding the metabolism of

glycans by the microbiota is therefore a key target of microbiome research. In

Chapter 4 we present a novel computational pipeline for modeling Glycan

Degradation, providing a broad view of the usage of these compounds on genome

and metagenome scales. Our platform predicts, for the first time, the usage patterns

of thousands of glycans by all the sequenced individual gut bacteria deposited in the

Human Microbiome Project (HMP) database, giving a new metabolic view of the gut

community. Using our new platform we show that the ability of a bacterial species to

degrade polysaccharides is highly correlated with its abundance, suggesting a

potential selective advantage for primary glycan degraders. We further demonstrate

that differences in community composition carry functional importance, i.e., that the

microbiota of herbivores and carnivores have stronger affinities to plant- and animal-

derived glycans, respectively. We show that our platform can be used to train an

extremely accurate classifier to predict the diet type (plant vs. animal) of a host based

on its glycan degradation profile, going markedly beyond a classification based on

enzymatic content alone. Applying our classifier to microbiota samples from US

residents, we show they mostly favor animal-derived glycans, while those of

individuals from Malawi and Venezuela shift towards plant-derived glycans in

adulthood. Our platform opens the door for a systematic prediction of microbiota-

specific dietary patterns.


Chapter 2

2. Competitive and cooperative

metabolic interactions in bacterial

communities

Based on an article with the same title by the authors:

Shiri Freilich, Raphy Zarecki, Omer Eilam, Ella Shtifman Segal, Christopher S.

Henry, Martin Kupiec, Uri Gophna, Roded Sharan & Eytan Ruppin

In this article the first two authors contributed equally.

Published in: Nature Communications, Dec 13, 2011 [36]

2.1. Introduction

A fundamental question in ecology is how different species can co-exist in nature.

Darwin's famous documentation of the nutritional divergence within a family of

finches resulted in the principle of competitive exclusion, which asserts that co-

existence is made possible through divergence and the subsequent reduction in

resource overlap[37, 38]. However, the observed phenotypic similarity between co-

occurring species has led to renewed questioning about the role of competitive

interactions in shaping communities; it has been suggested that the carrying capacity

of many environments is sufficient to allow the co-existence of closely related


species[39]. In addition to competition, growing evidence supports the prevalence of

cooperative interactions between organisms [40-43]. Yet, despite their prevalence,

the consequences of cooperative interactions for species diversity are still poorly

understood[44].

The analysis of species' co-occurrence data has long been used by ecologists to

discern the forces that dictate community structure[45, 46]. Yet, to date empirical

records of species' distribution have been highly fragmented, and a systematic

approach for estimating the corresponding levels of inter-species competitive and

cooperative interactions has been lacking. Within bacterial communities, competitive

(where two species consume shared resources) and cooperative (where the

metabolites produced by one species are consumed by another and, potentially, vice

versa) interactions are to a large extent derived from metabolism. Stoichiometric-

based metabolic models were recently shown to provide accurate predictions for the

patterns of metabolic interactions in bacterial two-species systems[27, 28, 47],

making these approaches a useful tool for exploring ecological concepts[43]. Beyond

focusing on a few well-defined case studies, stoichiometric Constraint-Based

Modeling (CBM) was already used for the systematic design of cooperation-

supporting media for all pair-wise combinations formed between seven

microorganisms represented by genome-scale metabolic models[47]. Yet, the

relative scarcity of such manually curated models has precluded

larger-scale explorations. Moreover, the ecological significance of these interactions

has not been examined on a large scale. The publishing of an automatic high-

throughput reconstruction pipeline has generated more than 100 genome-scale

metabolic bacterial models spanning 13 bacterial divisions[25]. This development,

complemented by the accumulation of meta-genomics data from environmental

surveys, has provided a golden opportunity to perform systematic inter-species in

silico studies on an ecological scale.


Previous large-scale computational studies of microbial ecology and metabolism

relied solely on network representations of enzymes and reactions [48-50] (rather

than representation by an operative stoichiometric-based metabolic model) and such

studies lacked the tools for systematically describing pair-wise interactions in a

media-dependent manner. Presented in this chapter are the results of the first

integrative computational and ecological study that aims to provide a global-scale

description of bacterial metabolic interactions between geographically co-occurring,

mutually exclusive, and randomly-distributed species pairs. To this end, a conceptual

computational framework for characterizing the levels of metabolic competitive and

cooperative interactions between pairs of species was defined. Subsequently, the

distribution patterns of species, as derived from environmental samples, were

explored in order to relate their ecological co-occurrences to the types

of interactions inferred.

The ability to predict the type of relationships between community members can

help in bioremediation tasks[24]; it can also help fight pathogens by adding their

non-pathogenic competitors or by finding a medium that favors their non-pathogenic

neighbours. This can be used in human medicine and in plant pesticides[51].

2.2. Results

2.2.1. In silico and in vivo description of co-growth patterns

Starting from a collection of 118 genome-scale metabolic models of bacteria that

were automatically generated and published[25], CBM was used systematically to

compute the biomass production rate for each of the individual species and their

corresponding 6903 pair-wise combinations (Methods). Analogously to

the computation of genetic interactions[52], it was assumed that there are three types

of potential interactions (Chapter2::Figure 1): negative, where two species consume

shared resources (competition); positive, where the metabolites produced by one

species are consumed by another, and potentially vice versa (representing mutualism,

commensalism or parasitism -- that is positive/positive, positive/neutral or


positive/negative interactions), hence producing a synergic co-growth benefit; and

neutral, where co-growth has no net effect (Chapter2::Figure 1). As in genetic

interactions, the extent and type of interactions occurring between two species can be

described by comparing the total biomass production rate in the pairwise system to

the sum of the corresponding individual rates recorded in their individual growth.
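The comparison between co-growth and the sum of individual growth rates can be sketched as a small classifier. The relative tolerance `rel_tol` and the function name `classify_interaction` are illustrative assumptions and are not the criterion defined in Methods. The numeric inputs are the predicted values quoted in the caption of Chapter2::Figure 1.

```python
from math import comb

def classify_interaction(sum_individual, co_growth, rel_tol=0.08):
    """Label a pair as 'positive', 'negative' or 'neutral'.

    sum_individual: sum of the two biomass production rates when grown alone.
    co_growth: total biomass production rate in the pairwise system.
    rel_tol: assumed relative tolerance below which the pair is called neutral.
    """
    scale = max(sum_individual, co_growth)
    if scale == 0:
        return "neutral"  # neither setting supports growth
    diff = (co_growth - sum_individual) / scale
    if diff > rel_tol:
        return "positive"   # co-growth exceeds the individual sum
    if diff < -rel_tol:
        return "negative"   # co-growth falls short of the individual sum
    return "neutral"

# Number of unordered pairs among the 118 models.
n_pairs = comb(118, 2)
print(n_pairs)

# Predicted (SIG, CG) values from the caption of Chapter2::Figure 1.
print(classify_interaction(83.1, 87.5))   # original medium
print(classify_interaction(109.2, 97.3))  # competition-inducing medium
print(classify_interaction(0.0, 19.0))    # cooperation-inducing medium
```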

Chapter2::Figure 1: Metabolic modeling in a multi-species system.

The scheme on the left is an illustrative example of potential interaction types occurring between

species in a pairwise system. No interaction is expected when species A and species B use non-

overlapping resources of the corresponding environment; Negative interaction/Competition: decrease

in the overall growth is expected when species A and species B share the same resources; Positive

interaction/Cooperation: increase in the overall growth is expected when the products of one species

are the substrates of the second species. On the right: co-growth experiments of Listeria innocua and

Agrobacterium tumefaciens in three interaction-specific (no interaction, competition and cooperation),

computationally pre-designed media. Species were grown in a defined medium modified for Listeria

growth (Methods & supplementary section A.1.3.5). Computational predictions for the experiments:


no interactions SIG(83.1)=~CG(87.5); competition: SIG(109.2)>CG(97.3); cooperation:

SIG(0.0)<CG(19.0). SIG: Sum of Individual Growth; CG: Co-Growth. SIG(0.0) means no growth in the

given medium. OD represents optical density, which is used as a measure of growth rate.

Naturally, interactions between a pair of species are expected to vary significantly

depending on the given growth environment. Consequently, for a given pair of

species different media types were designed, and as expected different types of

interactions were revealed. The predictive power of our simulation in inducing shifts

from neutral to negative and positive interactions was experimentally tested for 10

bacterial pairs, representing all possible pair-wise combinations between five species

capable of growing in the same defined media (IMM, Methods). For all

combinations, we simulated co-growth in the original defined media as well as in a

range of modified media formed by the addition and subtraction of specific nutrient

combinations, leading to the selection of two media compositions that induce

maximal negative and positive shifts, respectively (Methods). Laboratory co-growth

experiments were then conducted for all species pairs across the three designed

media (original, negative and positive) where positive and negative shifts were

correctly predicted in 65% of the experiments (precision 0.75, recall 0.8,

Supplementary Note 1). The observed and predicted interactions between Listeria

innocua and Agrobacterium tumefaciens, demonstrating a close to neutral interaction

in the original defined media, are shown in Chapter2::Figure 1. As evident, shifts

from neutral to negative and positive interactions between the two species are

successfully induced in the designed media, testifying to the model's predictive

ability. Notably, one should bear in mind that our experiments only cover a small

subset of all potential pairwise interactions. Yet, our experiments, together with a

growing number of studies, testify to the ability of metabolic-driven

computational approaches to describe the metabolic interaction between two species

[27, 28, 47, 53].

2.2.2. Systematic predictions of the competitive potential


Since interactions are condition specific, and because nutrient concentrations in

specific natural niches are mostly unknown and subject to significant variations, we

next aimed to design simulated media that, for each given pair of

species, can efficiently uncover their potential capacity to compete or cooperate. To

design a medium that maximizes potential competitive interactions, we adopted the traditional

view of competition as a situation with a high level of overlap in resource

requirements. This approach precludes resource sharing [54-56] and yielded

6903 pair-specific in-silico minimal optimal media, termed Competition-inducing

Media (COMPM, Methods). For each pair, COMPM includes the minimal set of

metabolites, provided at their minimal quantity, yet still allowing each species to

individually grow at its maximal possible growth yield, leading to the full

consumption of external resources (Methods). Thus, when resources overlap, this

medium will uncover potential competition.

For each pair of species placed in its respective competition-inducing medium, the

win-lose relationship was predicted by comparing the individual

biomass production (growth yield) rates within the pair-wise system. Winners (faster

species in the pair-wise system) tend to be species with higher potential biomass

production rates (the latter determined in a single-species system, Figure 2A), in

accordance with the notion that faster species out-grow their competitors[57].

Looking at the identity of the frequent winners in-silico, it was observed that there

exists a clear correspondence between computed predictions and ecological data,

where winners include fast growing, ecologically versatile species such as

Escherichia coli, Salmonella typhimurium, Vibrio cholerae and Pseudomonas

aeruginosa (in accordance with earlier observations[50]). Similarly, in-silico losers

include slow growing specialists such as Mycoplasma genitalium and Buchnera

aphidicola. The identification of winners as species with higher individual growth

rates is also maintained when considering the experimentally recorded doubling

times (Figure 2B). In correspondence with the ecological observation that the faster

growing species are the ones exploiting the shared resources[57], Figure 2C shows


that the in-silico faster species tend to grow closer to their full capacity than the slow

growers.

Chapter2::Figure 2: Metabolic modeling of pairwise growth on a competition-inducing media (COMPM).

The matrices describe the outcome of competition between all species pairs. Rows and columns represent species (sorted

differently in each matrix) where each cell shows the win/lose outcome of the column species following co-growth with the row

species. (A, B) Green, red and blue represent win/lose/inconclusive outcome predictions, respectively. Briefly, the winner is

defined as the species with the higher predicted growth in a two species system (see Methods). (A) Species in rows and

columns are sorted according to their computed biomass production rates. Winner-loser relationships were determined for more

than 90% of the pair combinations. (B) Species in rows and columns are sorted according to their experimentally measured

doubling times (retrieved as described at Supplementary Note 2). The predicted win-lose division in B is found to be

significantly more distinct than in permuted matrices (P value 0.002, Supplementary Note 2). (C) The ratio of biomass

production rate of each species in the pairwise system relatively to its biomass production rate when grown alone (Methods).

Cells are sorted as in A. The full list of species (including the computed and measured doubling times) is provided at

Supplementary Table 1. Growth rates (computed) of each species across all pairwise combinations are provided in

Supplementary Table 2.


Going beyond win-lose predictions, a Potential Competition Score (PCMS, Methods)

was designed to quantify the level of competition predicted among the species in the

tested collection, by comparing their individual and combined biomass production

rates across simulated Competition-inducing Media (COMPM). A PCMS value of 0

represents no competition and PCMS of 1 indicates maximal competition, while

negative PCMS values denote cooperation and synergic co-growth. 98% of the

PCMS values are positive (competitive) with a mean PCMS of 0.77 (Figure 3A). As

expected, it was observed that PCMS values strongly correlate with the degree of in-

silico resource overlap, the latter determined by the level of intersection between the

minimal media sufficient for maximal growth rate of the two species (Figure 3B and

Methods).

2.2.3. Systematic predictions of the cooperative potential

Due to the rich nature of the competition-inducing media, which is likely to conceal

inter-species metabolite transfer and cooperation[28], only very few positive

interactions (negative PCMS values) are revealed (Figure 3A). For example, the

documented cooperative interaction between the two halophilic species Salinibacter

ruber and Haloquadratum walsbyi[58] is only revealed in a simulation setting when

reducing their in-silico growth medium, inducing the reported dependence of H.

walsbyi on S. ruber for the supply of dihydroxyacetone (DHA) (Supplementary Note

3). Thus for each pair of species an in-silico minimal medium was designed to

support a predetermined small level of growth of both species together, termed a

Cooperation-inducing Medium (COOPM) (Methods), taking a similar approach as

in[47]. Potential Cooperation Scores (PCPS) are then computed according to the

ratio between the sum of individual growth rates and the co-growth rate, where

positive values indicate cooperation and negative values indicate competition

(Methods). Whereas in rich in-silico media almost none of the pairs exhibit positive

interactions (negative PCMS), about 35% of the pairs show a cooperative potential

(positive PCPS) in the in-silico cooperation-inducing media (with scores > 0.05,

Figure 3C).

Unlike the monotonic association between similarity in media requirements and the

competitive potential described above (Figure 3B), resource overlap and cooperative

potential demonstrate an inverted-U relationship (Figure 3B), where a moderate

level of similarity in the required resources maximizes the potential for collaboration,

and the cooperative potential declines at higher levels of resource overlap. This is

likely to stem from the increasing competition on available resources, combined with

the scarcity of differing resources that can be shared. Typically, cooperation inducing

media lack amino-acids (Supplementary Note 4), enhancing the need to exchange

these metabolites, which were suggested to be transferred between species in

mutualistic interactions by[28]. At intermediate levels of competition, a moderate

association between competition and cooperation was observed (Figure 3D).

Chapter2::Figure 3: Distribution of competition and cooperation values.

(A) The distribution of predicted potential competition scores (PCMS) across the 6903 non-redundant species pairs grown in

competition-inducing environments (COMPM). (B) The relation between resource overlap and competition (white) and

cooperation (black) scores. Resource overlap and competition: Spearman rank correlation 0.4, P value < 2.2e-16. Resource

overlap and cooperation: correlation coefficient for a second order polynomial regression 0.3, P value < 2.2e-16. IS (the

extreme right bars) indicates Intra-Species interaction (competition and cooperation values recorded when a species is paired

with itself). (C) The distribution of predicted potential cooperation scores (PCPS) across the 6903 non-redundant species pairs

grown in cooperation-inducing media (COOPM). (D) The relation between cooperation and competition levels: The Spearman

correlation between competition and cooperation is significant but very low (0.04, P value 8e-4). When limiting to intermediate

competition values of 0.1<PCMS<0.8 this correlation is more substantial but still quite moderate (0.2, P value < 2.2e-16). The

computed PCMS, PCPS and resource overlap are provided at Supplementary Table 3, 4 and 5, respectively.

Interestingly, an inverted-U relationship between resource overlap and cooperation

has been reported in economic models describing the likelihood of forming inter-firm

alliances versus the corresponding degree of technological overlap. As suggested

here for bacterial communities, such economic models suggest that although some

degree of technological overlap is necessary to support a successful alliance, at some

point such overlap yields diminishing and perhaps even negative returns[59].

Notably, a cooperative potential denotes an overall gain at the pair-wise, system

level, though at the species level we can observe a benefit either for both species

(mutualism) or to only one of them. Examining the gain of each species in a pair-

wise system, it was observed that the large majority of in-silico cooperative

interactions are unidirectional, i.e., there is a single species that benefits from the

interaction, where the other species is not affected (commensalism, Methods).

Similar results were obtained when using alternative approaches for modeling

cooperation (Supplementary Note 5). This is in agreement with a recent

investigation of computationally predicted pair-wise interactions between seven

microbial species across a wide range of environments[47], and with numerous

experimental observations of syntrophic interactions[40, 58, 60, 61]. As displayed

in Supplementary Note 6, one can observe a high tendency of Clostridia species to

be involved in cooperative interactions as the giving side. Indeed, Clostridia are

known to be involved in the fermentative digestion of cellulose and lignin leading to

the subsequent release of easily degradable carbohydrates to other community

members[62, 63].

2.2.4. Patterns of interactions across ecological samples

Since the benefits to the giver in predicted unidirectional cooperative interactions are

not obvious, their relevance for species co-existence may be questioned. To directly

relate the computational predictions to patterns of species co-existence, 16S data

from environmental surveys across 2801 samples belonging to 59 different

ecological niches[64] was used. Two categories of ecologically-associated pairs

were defined: pair members that show a similar distribution pattern across the 59

ecological categories are termed niche-associated (648 pairs versus 2512 non niche-

associated pairs); some of the niche-associated pair-members further show a similar

distribution pattern across the 2801 individual samples composing the different

ecological categories (niches) and are termed co-occurring pairs (84 pairs, Methods).

Competition scores recorded for ecologically-associated, and in particular co-

occurring, species are significantly higher than those of non-associated pair members

(Figure 4A). This is in agreement with the dominant ecological perception of high

level of competition between neighboring species making use of the same

resources[39, 64]. We also observed a significantly higher rate of cooperative

give-take interactions between ecologically-associated (at both niche and sample

level) versus non-associated species (Figure 4B). This observation is retained when

compared at different levels of competition and resource overlap (Supplementary

Figure 1), in line with existing ecological theory[44].

Chapter2::Figure 4: Predicted competitive and cooperative interactions across different ecological groups.

The level of predicted interaction potential was calculated across randomly distributed and ecologically associated species pairs.

Three categories of ecological associations were considered (Methods): association at the level of ecological niche; association

at the level of the sample (co-occurring pairs) and an antagonistic pattern of distribution at the sample level (mutually exclusive

pairs). (A) Competition scores. (B) Cooperation scores. The difference between ecologically associated and non-associated

groups for both competition and cooperation is highly significant (P value < 2.2 e-16, one sided Kolmogorov-Smirnov test). (C)

The mean cumulative number of loops across 1000 reconstructions of the networks versus the number of species in a network

of give-take interactions, for networks of ecologically associated versus networks of non-associated species (Methods). (D)

Parent similarity, calculated as the fraction of common givers. Bars in (A, B, D) represent standard deviations. The ecological

association between species pairs is provided in Supplementary Table 6.

Although the accumulation of some end-product metabolites can be toxic, the

advantage for the giver species remains obscure. To explore the role of cooperative

interactions at the level of the community, we constructed the inter-species network

of predicted directional (give-take) interactions (Methods); within this network

motifs of closed cooperative loops were identified, e.g., A gives to B; B gives to C; C

gives to A (Methods). The occurrence of these closed cooperative loops across

natural communities (the 2801 samples described above) was compared to their

occurrence across randomly generated communities preserving the original size and

rank of species' distribution. Remarkably, the frequency of loops predicted in natural

communities (194) is an order of magnitude higher than in randomly drawn samples

(maximum 95 in 1000 random data sets, mean 10, Supplementary Note 7). Thus,

cooperative interactions in nature are likely to be beneficial, forming cooperative

cycles. Furthermore, there is a rapid increase in the number of cooperative loops as

more species are added in, in particular for ecologically-associated species (Figure

4C). This may suggest an explanation for the observed rise in population size

when species diversity increases[44, 65].

A closer examination of the cooperative loops found in natural communities sheds

light on how cooperation and competition are intricately intertwined: An illustrative

case is that of Pseudomonas putida and Nocardia farcinica, each forming an

analogous loop with Streptomyces coelicolor and Bacillus anthracis in two distinct

natural samples. As can be expected from their equivalent location inside the loop,

the literature suggests that P. putida exhibits a similar role to Nocardia species in the

degradation of oil contamination, where the synthetic introduction of P. putida

suppresses the enrichment of indigenous degraders such as Nocardia species[66]. To

systematically explore the consequences of analogous network-positioning for

species co-existence we defined a third group of ecologically associated pairs:

mutually exclusive species, referring to pairs of species whose level of co-existence

across samples is lower than expected by chance despite the fact they inhabit similar

niches (28 pairs, Methods). Figure 4 reveals some interesting trends: We observe that

mutually exclusive pairs exhibit high similarity in their network positioning

(competing for common givers, Figure 4D) as well as high levels of resource

competition (Figure 4A), providing systematic evidence for the association of

exclusion and competition[37]. Notably, co-occurrence and mutual-exclusion

relations may be interchangeable, and the choice between these contradictory fates is

determined by the carrying capacity of their environment[39]. Accordingly, similar

levels of competition are observed between mutually exclusive and co-occurring

pairs (Figures 4A, 4D). Strikingly, the highest level of cooperative interactions is

recorded between mutually exclusive pairs (Figure 4B, and Supplementary Note 6).

This may suggest that under true, natural conditions, cooperative potential,

describing the propensity of a species pair to be involved in a unidirectional give-

take interaction, might be obscured by competition even to the level of exclusion of

one of the pair members. Such is the case with Pseudomonas putida and

Acinetobacter sp., two highly competing species which were also predicted to have

cooperative potential; when these species were grown experimentally in a deprived

environment with benzyl alcohol as the sole carbon source, the benzoate excreted by

Acinetobacter sp. was used by P. putida, which subsequently suppressed the growth

of Acinetobacter [67].

2.3. Discussion

To date it has been difficult to predict which bacteria can stably co-exist, let alone

cooperate metabolically, making the artificial design of beneficial microbial

consortia extremely difficult. Here, we suggest a generic approach for the systematic

description of inter-species interactions, making use of recently available data. Our

approach is obviously not without limitations. First, it is solely aimed at the

metabolic dimension while putting aside regulation as well as the numerous

strategies that microorganisms have evolved to augment the acquisition of resources.

Antimicrobial production, motility and predation can tip the competitive balance,

resulting in outcomes that significantly differ from those predicted by simulations

restricted to passive nutrient consumption[43]. Moreover, several mechanisms for

nutrient sequestration function directly to actively restrict or remove a nutrient from

one organism and supply it to another[68]. Second, the analysis lacks information on

the true metabolic composition of the environments considered and hence focuses on

predicting the overall potential inter-species interactions, rather than providing a

direct account of their actual in-vivo communications in one specific environment.

Finally, although the automatic reconstruction procedure results in a significant

increase in the number of genome-scale metabolic models, and although such models

have proven useful in the prediction of a variety of phenotypes, they are

typically less accurate than manually curated models[25]. Yet, despite these

significant limitations, our generic approach succeeds in delineating clear differences

in the interaction patterns of ecologically associated and randomly distributed species

pairs, and in relating them to fundamental ecological principles in a systematic fashion. With the increasing efforts to provide

an a-biotic description of different environments, together with the expected rapid

rise in the number of metabolic models as well as the improvement in their quality,

the use of metabolic modeling within a community-level framework such

as the one laid down here provides a computational basis for many exciting future

applications. These include the artificial design of 'expert' communities for

bioremediation, where currently the selection of community species is done by

intelligent guesswork. Similarly our work may be applied to the rational design of

probiotic administration, as well as to the identification of species that may

metabolically out-compete pathogenic species. The ability to design and test novel

interactions, and to study existing ones, means that microbial experiments can be

used to complement and extend classical plant and animal ecology, in which many of

the principles of biological interactions were first described[61].

2.4. Methods

2.4.1. Metabolic simulations

118 operative metabolic models were retrieved from The Seed's metabolic models

section (http://seed-viewer.theseed.org/seedviewer.cgi?page=ModelViewer)[25]. The

models are automatically constructed by a pipeline that starts with a complete

genome sequence as an input and integrates numerous technologies such as genome

annotation, reaction network annotation and assembly, determination of reaction

reversibility, and model optimization to fit experimental data. The list of species and

their corresponding identifiers in the SEED database is provided in Supplementary

Table 1. Briefly, in these models, a stoichiometric matrix (S) is used to encode the

information about the topology and mass balance in a metabolic network, including

the complete set of enzymatic and transport reactions in the system and its biomass

reaction. Our approach for generating multi-species models follows the definition

employed by[27]. We converted the model of each organism into a compartment in a

multi-species system. Applying the multi-species system analysis to all possible

pairwise combinations we examined 6903 unique pairs whose growth can be

simulated under a range of environments. For a single species model A, the

competition-inducing medium (COMPM) is defined as the ranges of fluxes of the

exchange reactions that support its maximal biomass rate (MBR), when all

exchange metabolites are provided at the minimal required amount. For the multi-

species system of A and B, COMPM_AB allows A and B to reach their MBR at individual

growth. However, at co-growth, any resource overlap will prevent species A and B

from simultaneously reaching their MBR, and reveal potential sources of

competition. A cooperation-inducing medium (COOPM) for a multi-species system

is defined as a set of metabolites that allows the system to obtain a small positive

growth rate (above a certain predetermined threshold, which may yet be far from

optimal), and such that the removal of any metabolite from the set would force the

system to have no such solution. A feasible solution in this context is defined as one

achieving at least 10% of the joint MBR obtained when grown on a rich medium

(COMPM). The process of model selection, calculation of maximal biomass

production rate (MBR), construction of pair-wise systems and computation of pair-

specific environments (COMPM and COOPM) are fully described at Supplementary

Note 8. To relate the computed environments to real ecological conditions we

verified that species inhabiting similar environments tend to have similar metabolic

profiles, as previously demonstrated in[64]. As documented in many laboratory

experiments, typical limiting factors in COMPM environments include oxygen,

glucose and nitrogen sources (Supplementary Note 4). Finally, computational

simulations providing predictions for the effect of removal of chosen metabolites on

species growth were experimentally tested, supporting the ability of the models to

identify growth limiting factors (Supplementary Note 9).
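The definition of a cooperation-inducing medium given above (a metabolite set that still supports the threshold growth, and from which no metabolite can be removed) can be sketched as a greedy pruning loop. This is only an illustration, not the procedure of Supplementary Note 8: the function name and the `is_feasible` oracle are hypothetical, and the sketch assumes that adding metabolites never makes a feasible medium infeasible.

```python
from typing import Callable, Iterable, Set


def prune_to_minimal_medium(
    candidates: Iterable[str],
    is_feasible: Callable[[Set[str]], bool],
) -> Set[str]:
    """Greedily drop metabolites while the pair still reaches the growth threshold.

    `is_feasible(medium)` is a hypothetical oracle. In practice it would run the
    two-species model on `medium` and report whether joint growth reaches the
    required threshold (for example 10% of the joint MBR).
    """
    medium = set(candidates)
    if not is_feasible(medium):
        raise ValueError("the starting medium does not support the required growth")
    # A fixed iteration order keeps the result reproducible; other orders may
    # yield a different, equally irreducible medium.
    for metabolite in sorted(medium):
        trial = medium - {metabolite}
        if is_feasible(trial):
            medium = trial  # the metabolite was dispensable
    return medium
```

Because the result depends on the order in which metabolites are tried, a pair may admit several such media; the sketch returns just one of them.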

2.4.2. Experimental and computational co-growth analysis

Co-growth experiments were conducted between all co-growth combinations formed

between five species, all non-pathogenic and capable of growing in IMM. The

species and their seed models are the following: Listeria innocua Clip11262

(Core272626_1), Agrobacterium tumefaciens str. C58 (Core176299_3), Escherichia

coli K12 (Core83333_1), Pseudomonas aeruginosa PAO1 (Core208964_1), Bacillus

subtilis str. 168 (Opt224308_1). Experimental procedure and media selection are fully

described at Supplementary Note 1.

2.4.3. Determining win-lose and give-take relationships

In a multi-species system the CBM solver aims to maximize the total growth

potential of all species, and provides a range of potential solutions for the

contribution of each compartment to the total growth. We define:

Equation 1:

$$V_{BM,m} = \max \sum_{i \in \{A,B\}} v_{BM,i}$$

Subject to:

$$S \cdot V = 0$$

$$v_{j,\min} \le v_j \le v_{j,\max}, \quad v_j \in V$$

where VBM,m is the maximal biomass production rate in a system m, corresponding to

species A and B.

We define A as a winner when the lowest value of its predicted maximal growth is

higher than the highest value predicted for species B.

Equation 2:

$$A_{winner} \Leftrightarrow V_{min,BM,COMPM,compartment\_A} > V_{max,BM,COMPM,compartment\_B}$$

where V_{min,BM,COMPM,compartment_A} and V_{max,BM,COMPM,compartment_B} (the minimal biomass

production rate of organism A and the maximal biomass production rate of organism B,

respectively, in the multi-species system) are calculated by running flux variability

analysis (FVA) for the multi-species system while fixing V_{BM,AB} to

its maximal value on the given medium.
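As a minimal sketch of the winner rule in Equation 2, assuming the FVA step has already produced a (minimum, maximum) biomass flux range for each compartment; the function name and the `Range` alias are illustrative and not part of the original pipeline:

```python
from typing import Optional, Tuple

Range = Tuple[float, float]  # (minimum, maximum) biomass flux reported by FVA


def winner(range_a: Range, range_b: Range) -> Optional[str]:
    """Return "A" or "B" if one species wins under Equation 2, otherwise None."""
    min_a, max_a = range_a
    min_b, max_b = range_b
    if min_a > max_b:  # A's worst case still beats B's best case
        return "A"
    if min_b > max_a:  # the symmetric case for B
        return "B"
    return None  # the ranges overlap, so no winner is called
```

Pairs whose ranges overlap are left unclassified here, which mirrors the strict inequality in the definition.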

To determine give-take relationships in a multi species system of species A and B we

look at the individual benefit of each species/compartment when the species are

grown together. We define A as a "Taker" if its maximal growth in the multi-species

system is higher than its individual maximal growth in a minimal medium. In this

case we call B a "Giver".

Equation 3:

$$A_{taker}, B_{giver} \Leftrightarrow V_{max,BM,COOPM,compartment\_A} > V_{max,BM,COOPM,A}$$

It is of course possible within the same system that species A and B are both "givers"

and "takers" (a symmetrical interaction). Overall, we observe that 94% of the

interactions are unidirectional (commensalism) (Supplementary Note 6), i.e., only

one species benefits while the growth of the giver is unaffected (neutral for the giver).
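The give-take rule of Equation 3 can be sketched as a small classifier applied to both directions of a pair. The function name, the returned labels and the numerical tolerance `tol` are assumptions introduced for illustration; the thesis does not state how ties are handled.

```python
def classify_give_take(
    co_max_a: float,
    solo_max_a: float,
    co_max_b: float,
    solo_max_b: float,
    tol: float = 1e-9,
) -> str:
    """Classify a pair from maximal growth in co-culture versus alone (both in COOPM).

    co_max_x   maximal biomass flux of species x inside the two-species system
    solo_max_x maximal biomass flux of species x grown alone on the same medium
    tol        hypothetical tolerance guarding against solver round-off
    """
    a_takes = co_max_a > solo_max_a + tol
    b_takes = co_max_b > solo_max_b + tol
    if a_takes and b_takes:
        return "mutual"        # both species are givers and takers
    if a_takes:
        return "B gives to A"  # unidirectional: only A benefits
    if b_takes:
        return "A gives to B"  # unidirectional: only B benefits
    return "none"
```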

The directional network of give-take interactions is provided at Supplementary Table

11. Within this network we looked for closed cooperative loops (that is A->B, B->C,

C->A) up to a size of 4 species. We compared the number of loops found with the number

obtained when pairs were drawn at random. The number of random

pairs matched exactly the number of cooperating pairs found by our

method to predict cooperation.
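The search for closed loops in the give-take network can be sketched as a bounded depth-first search over a directed graph. This is an illustrative sketch rather than the code used for the analysis: the function name is hypothetical, and the default minimum loop size of 3 (so that reciprocal A->B, B->A pairs are not counted as loops) is an assumption.

```python
from typing import Dict, Iterable, List, Set, Tuple


def find_closed_loops(
    edges: Iterable[Tuple[str, str]],
    min_size: int = 3,
    max_size: int = 4,
) -> List[Tuple[str, ...]]:
    """List directed cycles (giver -> taker edges) with min_size..max_size species.

    Each loop is reported once, starting from its alphabetically smallest member.
    """
    graph: Dict[str, Set[str]] = {}
    for giver, taker in edges:
        graph.setdefault(giver, set()).add(taker)
        graph.setdefault(taker, set())

    loops: List[Tuple[str, ...]] = []

    def extend(path: List[str]) -> None:
        for nxt in sorted(graph[path[-1]]):
            if nxt == path[0]:
                if len(path) >= min_size:
                    loops.append(tuple(path))  # the path closes into a loop
            elif nxt > path[0] and nxt not in path and len(path) < max_size:
                # Only visit species ranked after the start, so that every
                # loop is discovered from a single starting point.
                extend(path + [nxt])

    for start in sorted(graph):
        extend([start])
    return loops
```

Running the same function on networks built from randomly drawn pairs gives the null distribution against which the observed loop count can be compared.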

2.4.4. Determining the level of competition and cooperation

Potential Competition Scores (PCMS) are calculated as:

Equation 4:

$$PCMS_{A,B} = 1 - \frac{V_{BM,COMPM,AB} - \max(V_{BM,COMPM,A}, V_{BM,COMPM,B})}{V_{BM,COMPM,A} + V_{BM,COMPM,B} - \max(V_{BM,COMPM,A}, V_{BM,COMPM,B})}$$

Potential Cooperation Scores (PCPS) are calculated as:

Equation 5:

$$PCPS_{A,B} = 1 - \frac{V_{BM,COOPM,A} + V_{BM,COOPM,B}}{V_{BM,COOPM,AB}}$$

where VBM,x,y is the flux through the biomass reaction of species y in medium x.

Computed PCMS and PCPS values are provided at Supplementary Table 3 and

Supplementary Table 4, respectively.
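The two scores follow directly from three biomass fluxes per pair, as the minimal sketch below shows. It assumes that the individual and joint biomass fluxes have already been computed on the relevant medium; the function names and the error handling for degenerate inputs are illustrative choices, not taken from the original analysis.

```python
def pcms(v_ab: float, v_a: float, v_b: float) -> float:
    """Potential competition score from biomass fluxes on COMPM.

    v_ab is the joint biomass flux of the pair; v_a and v_b are the fluxes of
    each species grown alone. 0 means no competition (v_ab == v_a + v_b),
    1 means maximal competition (v_ab == max(v_a, v_b)), negative values
    indicate cooperation.
    """
    best_single = max(v_a, v_b)
    denominator = v_a + v_b - best_single  # equals min(v_a, v_b)
    if denominator <= 0:
        raise ValueError("both species must grow individually on the medium")
    return 1.0 - (v_ab - best_single) / denominator


def pcps(v_ab: float, v_a: float, v_b: float) -> float:
    """Potential cooperation score from biomass fluxes on COOPM.

    Positive values mean the pair grows better together than the sum of the
    individual growth rates; negative values indicate competition.
    """
    if v_ab <= 0:
        raise ValueError("the pair must show positive co-growth on the medium")
    return 1.0 - (v_a + v_b) / v_ab
```

For instance, a pair whose joint flux equals the sum of the individual fluxes gets PCMS = 0, and a pair whose joint flux is twice that sum on a cooperation-inducing medium gets PCPS = 0.5.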

2.4.5. Calculating the resource overlap within a species pair

The resource overlap (RO) between a pair of species is calculated as the ratio

between the intersection and union sizes (Jaccard index) of the set of uptake

reactions included in their individual competition-inducing media (COMPM).

Equation 6:

$$RO_{A,B} = \frac{|COMPM_A \cap COMPM_B|}{|COMPM_A \cup COMPM_B|}$$

Computed RO values are provided at Supplementary Table 5.
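A minimal sketch of the resource overlap calculation, treating each competition-inducing medium as a set of uptake reaction identifiers. The function name, the example identifiers and the convention of returning 0 for two empty media are assumptions.

```python
from typing import Iterable


def resource_overlap(uptake_a: Iterable[str], uptake_b: Iterable[str]) -> float:
    """Jaccard index of the uptake reactions in two competition-inducing media."""
    a, b = set(uptake_a), set(uptake_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty media
    return len(a & b) / len(union)


# Hypothetical uptake sets: two shared reactions out of four in total.
example_ro = resource_overlap({"EX_glc", "EX_o2", "EX_nh4"}, {"EX_o2", "EX_nh4", "EX_so4"})
```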

2.4.6. Collection of ecological distribution data

Data of Operational Taxonomic Units (OTUs) distribution in environmental samples

were retrieved from[64], using their 97% identity threshold for sequence clustering.

Each sequence is mapped to a "sampling event" defined as the unique concatenation

of the three annotation fields "author" + "title" + "isolation_source". For example, 51

sampling events are mapped to the publication ' Microbial ecology: human gut

microbes associated with obesity' [69]. These sampling events refer to 15 different

individuals, each under a different diet, at different time points (considering the

beginning of the experiment). Notably, in samples from host-associated

metagenomic studies, an "isolation source" is individual specific, as in[69].

"isolation_source" fields are further mapped to an Environment Ontology

(EnvO)[70] (e.g., "agricultural soil" and "Rocky Mountain alpine soil" are mapped to

the term "soil"). Overall 2662 samples are mapped to 183 EnvO categories, termed

"niches". Since many niches contained only a few samples, we strived to group

similar niches together to obtain a better signal. Using the hierarchical clustering in

EnvO we automatically mapped samples from lower order niches to higher order

ones. This process continued iteratively until reaching a barrier of predefined niches

with no biological significance. Using this approach, the ultimate set contained 59

ecological niches (Supplementary Table 12). Full length 16S rRNA sequences

corresponding to the 118 species with metabolic models that were used throughout

the analysis were manually retrieved from the Kyoto Encyclopedia of Genes and

Genomes (KEGG)[71]. BLAST[72] was then used to map the models to OTUs,

requiring 97% sequence identity and 95% alignment overlap, considering the length

of the query sequence. In case of multiple matches for a given OTU, we map it to the

model represented by the highest-ranking sequence, thus resulting in a one-to-many

mapping between models and OTUs. That is, each OTU can only be mapped to a

single model, but a model can be mapped to many OTUs. Overall, 80 models were

identified across the environmental samples. Supplementary Table 13 lists the

samples tested, their mapping to niches and the detected array of species.
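The filtering and one-to-many assignment described above can be sketched as follows, assuming the BLAST hits have already been parsed into records. The `Hit` fields and the function name are hypothetical, and using the bit score as the ranking criterion is an assumption, since the text only refers to the highest-ranking sequence.

```python
from typing import Dict, Iterable, NamedTuple


class Hit(NamedTuple):
    model: str       # species model whose 16S sequence was used as the query
    otu: str         # OTU matched by the query
    identity: float  # percent sequence identity
    aln_len: int     # alignment length
    query_len: int   # length of the query sequence
    bitscore: float  # assumed ranking criterion


def map_otus_to_models(
    hits: Iterable[Hit],
    min_identity: float = 97.0,
    min_coverage: float = 0.95,
) -> Dict[str, str]:
    """Assign each OTU to at most one model; a model may receive many OTUs."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        coverage = hit.aln_len / hit.query_len  # overlap relative to the query
        if hit.identity < min_identity or coverage < min_coverage:
            continue
        current = best.get(hit.otu)
        if current is None or hit.bitscore > current.bitscore:
            best[hit.otu] = hit
    return {otu: hit.model for otu, hit in best.items()}
```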

2.4.7. Determining ecological association between species

To identify ecologically associated species we examined the distribution pattern of

the 80 OTU-mapped models across the 59 niches[70]. The probability that two

species co-occur at a rate higher than chance expectation was determined by

calculating a cumulative hypergeometric P-value. Significance cut-off was

determined by setting a False Discovery Rate threshold of 10%.
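A minimal sketch of this test is given below, under two assumptions that the text does not spell out: that the hypergeometric test counts the niches (or samples) containing each species, and that the False Discovery Rate is controlled with the Benjamini-Hochberg procedure. Function and parameter names are illustrative.

```python
from math import comb
from typing import List, Sequence


def hypergeom_upper_tail(total: int, with_a: int, with_b: int, both: int) -> float:
    """P(X >= both): chance that A and B share at least `both` units.

    total  number of units (niches or samples)
    with_a number of units containing species A
    with_b number of units containing species B
    both   number of units containing both species
    """
    upper = min(with_a, with_b)
    ways = sum(
        comb(with_a, k) * comb(total - with_a, with_b - k)
        for k in range(both, upper + 1)
    )
    return ways / comb(total, with_b)


def benjamini_hochberg(p_values: Sequence[float], fdr: float = 0.10) -> List[bool]:
    """Flag the p-values that pass the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= fdr * rank / m:
            cutoff_rank = rank  # largest rank that satisfies the BH bound
    significant = [False] * m
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant
```

The lower tail, needed for mutually exclusive pairs, can be obtained in the same way by summing from 0 up to the observed overlap.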

Similarly, we looked at the pattern of species' distribution across the 2662 samples.

We identified 111 non-redundant combinations of co-occurring species and 39 non-

redundant combinations of mutually exclusive species from the pairs – that is species

for which the co-existence in samples is higher or lower than expected by chance,

respectively. As can be expected, the large majority of co-occurring pairs is observed

between ecologically-associated species (84/111). Less trivial is the identification of

a significant part of the mutually exclusive species pairs (28/39) as ecologically

associated species, implying that the pattern of distribution of species in nature is far

from being random. Only niche-associated pairs are further analyzed as co-occurring

(84) or mutually-exclusive (28) combinations. The ecological association types

determined for the species pairs tested are provided at Supplementary Table 6. The

distribution of resource overlap values between ecologically associated and non-

associated pairs is shown at Supplementary Figure 2 demonstrating that ecologically

associated pairs differ in the pattern of distribution of their resource overlap values,

supporting both the observed high level of competitive and cooperative interactions.

The identification of closed cooperative loops in real and random networks of give-

take interactions and in real and randomly drawn communities is fully described at

Supplementary Notes.

Chapter 3

3. Maximal Sum of metabolic

exchange fluxes outperforms

biomass yield as a predictor of

growth rate of microorganisms

Based on an article with the same title by the authors:

Raphy Zarecki, Matthew A. Oberhardt, Keren Yizhak, Allon Wagner, Ella Shtifman

Segal, Shiri Freilich, Christopher S. Henry, Uri Gophna and Eytan Ruppin


The article was submitted to Genome Biology (currently under review), and was

presented in the conference on predicting cell metabolism and phenotypes in CA

USA (4-6/3/2013)

3.1. Introduction

In the data-rich landscape of present-day biology, large-scale network-based models

are being increasingly tapped to make sense of the deluge of available data. Towards

this end, genome-scale metabolic models (GEMs) have proven highly successful

[73]. Incorporating gene-protein-reaction associations and stoichiometric reaction

detail for the majority of known metabolic genes in an organism, GEMs have

achieved high accuracies in predicting essentiality of gene knockouts (~90%),

growth phenotypes on a variety of substrates (~90%) [34], growth yields, and

metabolic fluxes [74]. These predictions typically rely on an assumption that single-

celled organisms are optimized to maximize yield (for example: dry weight of

biomass per unit of glucose consumed), following deep-rooted theories about

evolutionary tuning towards optimal fitness [32], but as has been shown previously,

maximization of molar yield is by no means a universal principle [75].

Metabolic phenotypes in GEMs are typically computed by a linear optimization

method termed Flux Balance Analysis (FBA), in which a biomass objective is

optimized while various network-defined constraints are upheld. Non-biomass

objectives have also been tried, with varying powers of prediction [11, 12, 76, 77],

but these objective functions are common in that they link metabolic models to

growth yield or to a global flux distribution, rather than predicting growth rate.

Growth yield (units of [g biomass produced]/[g substrate consumed]) is different

from growth rate (units of 1/[hour]), although they are related by the substrate uptake

rates of an organism growing at steady state (for growth on a single carbon source,

for example, Growth rate = Substrate uptake rate * Yield). Prediction of yield using

GEMs applies most rigorously to highly defined conditions such as in a chemostat in

which one nutrient is limiting, and it is unclear how broadly applicable the

'maximization of yield' principle actually is [75]. In many conditions (including

standard laboratory batch growth, growth of cancer cells displaying the Warburg

effect, and competition of organisms for certain environmental niches), cells do not

necessarily maximize their yield, yet their growth rate cannot be predicted without

empirical data (e.g., substrate uptake rates). There is currently no framework for

predicting cellular growth rates akin to the GEM-based methods available for

predicting growth yields, which does not require extensive additional kinetic

parameters. It would therefore be of significant value if a predictor of growth rate

could be determined using genome-scale properties of GEMs that do not necessitate

the arduous measurement of substrate uptake rates. In a large number of conditions,

especially in competitive niches, growth rate is a better measure for fitness than

yield, so the ability to predict growth rates could significantly increase the utility of

GEMs.
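As a worked illustration of the relation Growth rate = Substrate uptake rate * Yield stated above, consider purely hypothetical numbers (not taken from any dataset in this work): a glucose uptake rate of 10 mmol per gram dry weight per hour, which is about 1.8 g glucose per gDW per hour at a molar mass of roughly 180 g/mol, and a yield of 0.5 g biomass per g glucose. The symbols $\mu$ (growth rate), $q_S$ (uptake rate) and $Y$ (yield) are notation introduced here for the example.

```latex
\mu = q_S \cdot Y
    = 1.8\ \frac{\text{g glucose}}{\text{gDW}\cdot\text{h}}
      \times 0.5\ \frac{\text{gDW}}{\text{g glucose}}
    = 0.9\ \text{h}^{-1}
```

The same yield combined with half the uptake rate would give half the growth rate, which is why yield alone cannot stand in for growth rate when uptake rates are unknown.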

3.2. Results and Discussion

In this study we explore novel large-scale methods to predict growth rates from

GEMs grown on rich or defined media, and in some cases with gene knockouts. We

focus on environments in which cells are expected to be optimizing their growth rate,

such as maximal listed growth rates for species in rich media, or careful growth rate

measurements of isogenic cultures in early exponential phase of batch growth. Our

approach was inspired by an article by Vieira-Silva and Rocha [78], which

investigated a number of bioinformatics-based measures for predicting the maximal

growth rate across species. Vieira-Silva and Rocha collected from the literature the

maximal growth rates in rich medium of over two hundred bacterial species, and then

searched for a genomic measure that correlated best with these data. The genomic

property of codon usage bias yielded their most promising correlation, but this

property is not dependent on the growth medium, so it will fail when assessing

growth rate of a species across media or other conditions. Furthermore, in cases of

different cells of the same organism, such as human cancer cells, the cells share the

same codons, and thus codon bias cannot be used to predict specific growth rate.

Analogous to Vieira-Silva and Rocha, we explore a new class of metabolic

objectives, related to maximizing the total metabolic secretion of a cell, which

predict growth rate directly from GEMs. We focus on exchange fluxes because they

are the missing gap between growth yield (which can be calculated relative to uptake

rates by a GEM, e.g., in [34]) and growth rate, and because there is an observed

strong positive correlation between cellular surface-to-volume ratio and growth rate,

as well as additional evidence suggesting that cell surface metabolism exerts most of

the control of a cell over growth rate [79]. The exemplar of predictors we tested is a

novel method called “SUMEX,” which predicts growth rates of cells under different

media conditions without requiring substrate uptake rates, kinetic constants, or any

other empirical parameters. SUMEX is computed by maximizing the total molar

output exchange minus input exchange of metabolites (which, given the sign

convention in GEMs that all exchange reactions point outwards, is calculated as the

'maximal SUM of EXchange fluxes'), while setting a nominal lower bound on

biomass production in order to ensure that some flux runs through biomass-

producing pathways (see Fig. S6). SUMEX represents a simple heuristic for

maximizing catabolic activity of a cell, focusing exclusively on exchange reactions,

and still ensuring a nominal production of biomass (we discuss a sensitivity analysis

of this and other necessary bounds later in the chapter, and in the supplement).

The SUMEX formulation is:

$$\max \sum_{j=1}^{n} V_j^{exchange}$$

Subject to:

$$S \cdot V = 0$$

$$V_{biomass} \ge V_{biomass}^{\min}$$

$$v_{j,\min} \le v_j \le v_{j,\max}, \quad v_j \in V$$

It is explained in greater detail in the methods part of the supplementary data.
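The formulation above is a linear program, so solving it requires an LP solver together with the model's stoichiometric matrix. The sketch below does not perform the maximization; it only checks the stated constraints for a candidate flux vector and evaluates the SUMEX objective for it. The function name, the tolerance and the four-reaction toy network are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def sumex_objective(
    S: Sequence[Sequence[float]],
    v: Sequence[float],
    bounds: Sequence[Tuple[float, float]],
    exchange_idx: Sequence[int],
    biomass_idx: int,
    min_biomass: float,
    tol: float = 1e-9,
) -> float:
    """Check the constraints for flux vector `v` and return the sum of exchange fluxes.

    Exchange reactions are assumed to point outwards, so secretion is positive
    and uptake is negative.
    """
    for row in S:  # steady state: S . v = 0 for every metabolite
        if abs(sum(c * f for c, f in zip(row, v))) > tol:
            raise ValueError("steady-state constraint violated")
    for flux, (lower, upper) in zip(v, bounds):
        if flux < lower - tol or flux > upper + tol:
            raise ValueError("flux bound violated")
    if v[biomass_idx] < min_biomass - tol:
        raise ValueError("biomass flux below the nominal lower bound")
    return sum(v[j] for j in exchange_idx)


# Toy network (hypothetical). Columns: EX_S, EX_P, R (S -> 2 P), BIO (P -> biomass).
toy_S: List[List[float]] = [
    [-1.0, 0.0, -1.0, 0.0],  # metabolite S
    [0.0, -1.0, 2.0, -1.0],  # metabolite P
]
toy_bounds = [(-10.0, 0.0), (0.0, 1000.0), (0.0, 1000.0), (0.0, 1000.0)]
toy_v = [-10.0, 19.0, 10.0, 1.0]
toy_score = sumex_objective(toy_S, toy_v, toy_bounds, [0, 1], 3, min_biomass=1.0)
```

In the toy case the cell takes up 10 units of S and secretes 19 units of P, so secretion minus uptake is 9.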

To test SUMEX and other methods, we collected two datasets of measured cellular

growth rates from the literature: the previously mentioned Vieira-Silva and Rocha

dataset of maximal growth rates on rich media reported for 66 organisms (ds66)

[78], and growth rates in early exponential phase of batch growth of 57 Escherichia

coli wild type (WT) and knockout (KO) strains evolved for growth on a number of

minimal media (ds57) [34]. We generated a third dataset in the lab, by measuring

growth rates in vitro in the early exponential phase of batch cultures of 6 organisms

on 3 defined media (ds18). Using automatically generated models from SEED [25],

we then computed various growth-rate predictors for each of the models and

conditions in these three datasets (ds66, ds57, and ds18). We compared SUMEX (as

the exemplar of exchange-based metrics we had experimented with) against several

metrics presented in a previous experimental study in E. coli of the optimal

objectives of GEMs for predicting metabolic flux distributions [77]. Strikingly,

SUMEX outperformed every previous metric in all three datasets in predicting

growth rates with only one exception in one dataset (codon usage bias from [78]

correlated better than SUMEX with growth rates in ds66, but was non-predictive in

the other datasets as it inherently cannot account for changes in the medium or gene

knockouts). Overall, SUMEX was the only metric among those tested to

significantly correlate with growth rate across all three datasets (see Fig. 1d).

Chapter 3::Figure 1: Correlation of different metrics to growth rate. (A-C) Spearman correlations of SUMEX vs.

growth rate in three datasets. Colors in (B) represent media (green triangles, IMMxt; blue diamonds, IMM; red

squares, IMM-gt; see Table S6 for details). Colors in (C) represent strains. Trend-lines in (C) are shown for strains that

individually show significance (*P≤5e-2, **P≤5e-3). Correlation values for SUMEX and Biomass vs. growth rate are

listed below. (D) Significant (P-val≤5e-2) Spearman correlations (i.e., ρ values) across three bacterial datasets for all

tested metrics (non-significant correlations are not shown). Metrics are listed in descending order of the sum of ρ across

the three datasets. Vertical lines denote rhos for SUMEX.

Notably, the maximization of biomass yield, the aforementioned fitness metric used

in hundreds of GEM studies, failed to significantly predict growth rates in two out of

the three datasets (ds18 and ds57). This is despite previously noted strong

correlations between GEM-predicted biomass yields and growth rates in ds57 when

accounting for experimentally measured glucose uptake rates [34], which emphasizes

the difference between predicting rate and predicting yield. In contrast, biomass

yield was predictive of growth rate in ds66 (although not as predictive as SUMEX).

This suggests that in rich media and when looking across a large range of organisms,

both the growth rate and yield depend greatly on the capacity of an organism to take

up many substrates -- an observation supported by the strong correlation between

“count of uptake exchange reactions” and growth rate, as well as by the strong

observed correlation between SUMEX and biomass yield, in ds66 (see Fig. 1a and

Fig. S5). Despite this, SUMEX correlates significantly with growth rate in ds66

even when controlling for biomass yield (ρ=0.38, P=1.6e-3 in partial Spearman

correlation), showing that SUMEX provides information beyond that obtained from

maximizing biomass. Surprisingly, maximization of ATP hydrolysis correlated

poorly with growth rate, even though it has been previously shown to be predictive

of intracellular fluxes in E. coli [77, 80]. These results suggest that while biomass

and ATP hydrolysis are appropriate for measuring growth yield, they are not

necessarily suited to measure growth rate using GEMs. A full description of metrics

we tested is provided in the Supplement.
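
The partial Spearman correlation mentioned above (SUMEX vs. growth rate, controlling for biomass yield) can be sketched as follows. This is a minimal standard-library version that assumes the usual first-order partial correlation formula applied to ranks; the three data vectors are hypothetical and are not values from ds66, and no P-value is computed.

```python
from math import sqrt


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def partial_spearman(x, y, z):
    """Spearman correlation of x and y, controlling for z."""
    rxy, rxz, ryz = spearman(x, y), spearman(x, z), spearman(y, z)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


# Hypothetical values for five organisms
growth = [0.2, 0.5, 0.9, 1.4, 2.0]
sumex_values = [10, 30, 25, 60, 80]
biomass = [1.0, 1.5, 3.0, 2.5, 4.0]
rho_partial = partial_spearman(sumex_values, growth, biomass)
```

A partial correlation that stays well above zero, as in this toy input, is the pattern the text describes: the predictor carries information about growth rate beyond what it shares with biomass yield.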

As previously mentioned, SUMEX requires no kinetic parameters, substrate uptake

rates, or other empirical values to predict growth rate. To further benchmark

SUMEX, we also tested it against previous methods for predicting growth rates that

do require empirical parameters. A few such methods, which include several

hundred kinetic constants or molecular crowding constraints, were introduced in

recent years for E. coli [35, 81]. We tested the ability of SUMEX to predict growth

rates reported in [35] for E. coli grown on 24 minimal media (henceforth: ds24), and

achieved equivalent results to the state of the art (for consistency with the previous

analyses, SUMEX was calculated for this dataset on the manually curated model,

iAF1260 [18]; SUMEX and MOMENT, the method described in [35] and achieving

the best previous result, each attained ρ=0.47 and P=0.02 in 2-sided Spearman tests;

see Table S2). Because SUMEX uses only the stoichiometry of metabolic reactions

but no empirical parameters, it has the clear advantage that it can be easily computed

across many species (if their metabolic models are available), as shown in the

analyses of ds66 and ds18.

To understand in more detail the mechanisms linking SUMEX to growth, we studied

the relative contributions of different exchanged compounds to SUMEX. We did

this by analyzing the effect of either leaving out or of individually optimizing the

flux of each individual exchange metabolite. We found that the compounds that

contribute most to SUMEX (those shown in Fig. 2) are H+ and several TCA-cycle

intermediates, in addition to CO2. CO2, the main product of cellular catabolism, was

necessarily released from the cell in nearly all conditions when SUMEX was

optimized (Fig. 2C).

Chapter 3::Figure 2: Component-wise analysis of SUMEX. (A-B) Spearman correlations of SUMEX versus growth rate

(GR) across the 3 bacterial datasets when different exchange reactions are (A) removed from SUMEX or (B) optimized

individually. Horizontal lines and rightmost set of columns show SUMEX ρ values. The components presented are all of

those whose removal affected SUMEX ρ by >5% or that came within 5% of the SUMEX rho when maximized alone, for

any of the 3 datasets. (C) The difference between the percent of models (per dataset) that must uptake vs. that must

excrete a component in order to achieve maximal SUMEX.

[Figure 2 plots. Panel titles: (a) Leaving components out of SUMEX; (b) Optimizing individual components of SUMEX; (c) Directions of allowed flux in optimal SUMEX. Y-axes: Spearman's rho vs. GR in (a) and (b); % conditions (taking up - secreting) this compound in (c). Series: ds18, ds66, ds57.]

Interestingly, the removal of proton exchange from the SUMEX objective reduces

the correlation of SUMEX with growth rate more than removal of any other

component (it severely reduced the predictiveness of SUMEX in both ds18 and ds57

datasets – see Fig. 2A). Additionally, we found that maximizing the production of

protons alone is nearly as predictive as SUMEX across the three bacterial datasets

(see Fig 2B). Protons are the smallest metabolites in the metabolic models and can

be readily produced from many different sources, and thus can account for a large

portion of the total SUMEX flux (as we confirmed by flux variability analysis [7]).

The strong correlation between maximal proton production and growth rate led us to

hypothesize that if a cell has abundant resources for producing free extracellular

protons, the strong resulting pH gradient may help drive ATP synthesis and gradient-

driven transport, increasing overall growth rate and thereby also contributing to the

predictive power of SUMEX. It has been shown in E. coli and other species that

when flux ranges are below saturation, the rate of ATP synthesis relates

approximately linearly to the electrochemical gradient, which in respiring bacteria is

determined primarily by the proton (i.e., pH) gradient [82, 83]. Therefore, we would

expect the proton-related contribution to SUMEX to be more predictive in respirers

than in obligate fermenters, for which the production of ATP does not depend on the

membrane gradient.

To test the fermenters vs. respirers hypothesis, we categorized the organisms in ds66

into two groups: 9 obligate fermenters (ds66f) and 57 organisms that can respire

(ds66r). We found that the correlation of SUMEX with growth rate is stronger

among only the respirers than among all organisms in ds66 (see Fig. 3a), that

SUMEX is not significantly predictive of growth rate for obligate fermenters (also

Fig. 3a), and that these same trends also apply when we instead compare

maximization of proton production (PMAX) vs. growth rate (Fig. 3b). PMAX

correlates strongly with SUMEX in models of both respiring and obligate fermenting

organisms, despite the observation that neither is predictive of growth rate for

obligate fermenters (see Fig 3c). This emphasizes the strong interdependency of

SUMEX and PMAX.

Chapter 3::Figure 3: Prediction of growth in Respirers vs. Fermenters in ds66. Maximization of (A) SUMEX or (B) H+

production is plotted against growth rate for ds66 organisms, categorized into obligate fermenters (blue diamonds) and

respirers (red circles) with trendlines shown. Rho and pvals are for 2-sided Spearman correlations. (C) Maximization

of proton gradient correlates strongly with SUMEX in both respirers and fermenters. (D) SUMEX and Biomass as

calculated on obligate fermenters are plotted vs. GR. Trendlines and Spearman correlations (1-sided) exclude L.

plantarum, which can respire in the presence of heme and menaquinone (L. plantarum is shown on the plot as an orange

asterisk (SUMEX) and a green “X” (Biomass)).

[Figure 3 plots. Panel titles: (a) SUMEX vs. GR; (b) max H+ production vs. GR; (c) max H+ production vs. SUMEX; (d) Biomass and SUMEX vs. GR for fermenters. Spearman values overlaid on the plots, in extraction order: respirers rho=0.97, p=1.2e-6 and obligate fermenters rho=0.70, p=0.04; SUMEX* rho=0.73, p=2.4e-2 and Biomass* rho=0.92, p=2.7e-3 (*excluding L. plantarum); respirers rho=0.60, p=8e-7; respirers rho=0.53, p=1.7e-5.]

When we remove a borderline case from the set of obligate fermenters (Lactobacillus

plantarum, which has been shown to respire if provided heme and menaquinone

[84]), both SUMEX and biomass maximization became predictive of fermenter

growth rates (Fig. 3d). Therefore, a larger dataset of growth rates of obligate

fermenters than currently at our disposal will be needed to unequivocally determine

whether SUMEX can be used to predict growth rates of obligate fermenters. See our

continued analysis in the Supplementary data.

SUMEX activates a variety of pathways not predicted by the typical biomass yield

objective. To test whether these pathways are reflected in global gene expression, we

analyzed the relationship between (a) the measured expression of a gene on 5

different growth media [85] and (b) the medium-dependent contribution of reactions

associated with the gene to a cellular objective, comparing between biomass yield

and SUMEX (a reaction's contribution was determined by measuring its effect on the

cell objective when forcing an incremental flux through it; see Supplement). We

found that genes that contribute positively to SUMEX have significantly higher

expression levels than genes detrimental to SUMEX across 4 of the 5 media (with

borderline significance for the 5th medium), whereas no significant association was

seen in the same analysis done using the biomass objective (p≤0.05 on 4 media and

p=0.06 on the fifth for SUMEX, vs. p≥0.30 on all media for biomass, in 1-sided

ranksum tests; see Supplement). Furthermore, SUMEX outperformed biomass yield

in the prediction of highly-active genes both in terms of precision and recall on all

media (avg. precision and recall were 0.19 and 0.63 for SUMEX, vs. 0.08 and 0.01

for biomass – see Table S3). This strongly suggests that many of the genes predicted

to be active by SUMEX, but which are detrimental or indifferent with respect to

biomass yield, are actually important for unaccounted-for cellular processes.

As a final test of SUMEX, we wished to assess its ability to predict growth rates of

NCI60 human cancer cell lines, as cancer cells are expected to maximize growth as a

fitness objective. To do this, we produced GEMs for 60 NCI60 cancer cell lines

based on the full human metabolic model [20], by altering bounds of reactions based

on the expression of 222 genes that had significant correlation with growth (see

Supplement for details of how the models were built). As a basic validation of the

models, we checked the correlation of the biomass yield objective against published

NCI60 growth rates [86], and found that it indeed correlated highly significantly

(ρ=0.68, P=2.7e-9). Finally we checked the ability of SUMEX to predict these

growth rates, and found that it obtained even slightly higher correlations (ρ=0.74,

P=2.6e-11) (see Fig. 4), thus extending our results to cancer cell lines and

emphasizing the predictive power of SUMEX.

Chapter 3::Figure 4: NCI60 cancer cell line growth rates predicted by SUMEX. Maximization of (A) biomass yield and

(B) SUMEX both correlate highly significantly with growth rate. Spearman correlation values vs. growth rates are

overlaid on plots.

Limitations must be set on certain reaction bounds in a GEM in order to obtain

feasible solutions (we used standard flux bounds of -50 for all allowed uptakes in

SUMEX), which is a confounding factor in any attempt to produce parameter-less

metrics in GEMs. Therefore, in order to ensure that the results seen for SUMEX

were not simply due to the particular bounds we chose, we performed a sensitivity

analysis. This test revealed that the correlation of SUMEX with growth rate was

highly robust even to random variations of 50% (or more) imposed across all uptake

(or secretion) bounds; we furthermore found that biomass is significantly less robust

than SUMEX in 2 of the 3 datasets (see Fig. S1 and Table S1). In addition to these

bounds set on exchange reactions, we performed a sensitivity analysis on the nominal

lower bound set on biomass production within SUMEX, and found that the results

were relatively stable for large ranges of this bound (see Fig. S2).

[Figure 4 plots. (a) Biomass vs. GR, NCI60 (ρ = 0.68, P = 2.7e-9); (b) SUMEX vs. GR, NCI60 (ρ = 0.74, P = 2.6e-11). X-axes: growth rate (1/h); y-axes: biomass or SUMEX (arbitrary units).]

3.3. Conclusions

SUMEX represents a maximization of cellular catabolic activity as a cellular

optimality principle, as outlined at the start of this chapter. Notably, despite its

simplicity, SUMEX predictions correlate significantly with growth rate on every

suitable dataset we were able to find in the literature, as well as on a set of growth rate

data we measured ourselves. Accurately predicting cell growth rate is critical for

understanding the dynamics of microorganism-dominated ecosystems, and may lead

to improved biotechnology applications and perhaps even to new functional insights

into cancer, as well as filling in a basic gap in our understanding of cellular function.

This exploration of a promising alternate objective for predicting cell growth rate

will hopefully stimulate future research in this area, and lead to better predictive

models in the future.

3.4. Materials and Methods

3.4.1. Models

Unless otherwise noted, analyses were done on genome-scale metabolic

reconstructions (GEMs) as obtained from SEED [25], at http://seed-viewer.theseed.org/.

The 66 organisms in ds66 were chosen because (1) their GEMs

were available from SEED and published in [25], and (2) their optimal doubling

times were available from [78]. For analysis of ds24, the iAF1260 E. coli model was

used, and the NCI60 cancer cell line analysis used custom-made models based on the

generic human model (see Supplement IIId for full details). Table S5 lists the names

of the ds66 models and organisms.

3.4.2. Implementation of growth rate predictors

Optimizations were run in in silico environments consistent with the known media,

in which all exchange metabolites for a given species were available at a fixed rate of

-50.0 (with output bounds of 1000). A sensitivity analysis was done to determine if

these bounds affected the performance of SUMEX, and SUMEX was found to be

robust to random changes in the bounds (and significantly more robust than biomass

yield optimization; see Fig. S1 and Table S2). In the case of ds66, the environment

was 'rich', so we allowed uptake flux in all exchange reactions present in each

organism.

By convention, exchange fluxes denoting entrance of a metabolite into the cell

(uptake) are negative valued, while exchanges denoting exit of a metabolite from the

cell (output / secretion) are positive valued. Therefore, maximizing the total

exchange flux (i.e. the SUMEX metric) would denote maximizing the output at the

expense of the input (output exchanges – input exchanges).

For simulation of maximal proton production (PMAX) (e.g., in Fig. 3 and for the

NCI60 models), we increased the upper bound on proton production to +inf in order

to avoid capping total protons produced. Manipulating this bound while running

SUMEX did not significantly affect SUMEX results (data not shown). More details

of the model constraints are provided in Supplementary methods.

3.4.3. Building NCI60 cancer cell models

Reconstructing the NCI60 cancer cell lines required several key inputs: (a) the

generic human model [20], (b) gene expression data for each cancer cell line from

[87], and (c) growth rate measurements (Note: the growth rates were used only to

determine which genes should be used in constraining the models, in order to obtain

models that were as physiologically relevant as possible; they were not used to

determine the actual bounds). Specific metabolic models were produced for each

cell line by modifying the upper bounds of reactions in accordance with the

microarray expression values of the individual genes. See Supplement IIId for more

details.

3.4.4. Growth experiments of 6 organisms on 3 defined IMM media (ds18)

To validate SUMEX, we performed in vitro experiments to measure the growth rates

of a number of organisms (listed in Table S6) in multiple environments. Growth

experiments were conducted in 96-well plates at 30°C, with continuous shaking,

using a Biotek ELX808IU-PC microplate reader, on variants of IMM medium, as

detailed in Table S7. Optical density was measured every 15 minutes at a

wavelength of 595nm. Growth rates were determined during early to mid

exponential growth phase by taking the slope of a linear fit through the natural log of

the data.
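
The growth-rate estimate described above (slope of a linear fit through the natural log of the optical density) can be sketched with a plain least-squares fit. This is a minimal illustration, not the analysis script used for ds18: the readings are synthetic, and choosing the early-to-mid exponential window is left to the caller.

```python
from math import exp, log


def growth_rate(times_h, od):
    """Slope of the least-squares line through (t, ln OD), in 1/h."""
    y = [log(v) for v in od]
    n = len(times_h)
    mean_t, mean_y = sum(times_h) / n, sum(y) / n
    sxy = sum((t - mean_t) * (v - mean_y) for t, v in zip(times_h, y))
    sxx = sum((t - mean_t) ** 2 for t in times_h)
    return sxy / sxx


# Synthetic OD595 readings every 15 minutes (0.25 h), generated with a
# hypothetical growth rate of 0.6 1/h so the fit can be checked.
times = [0.25 * i for i in range(9)]
od = [0.05 * exp(0.6 * t) for t in times]
mu = growth_rate(times, od)
print(f"estimated growth rate: {mu:.3f} 1/h")
```

On real plate-reader data the fit should be restricted to readings inside the exponential phase, since lag and stationary phases flatten the slope.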

Chapter 4

4. Glycan Degradation (GlyDe) analysis predicts mammalian gut microbiota abundance and host diet-specific adaptations

Based on an article with the same name by the authors:

Omer Eilam, Raphy Zarecki, Matthew Oberhardt, Martin Kupiec, Uri Gophna &

Eytan Ruppin


The article was submitted to Genome Research (currently under review) and has

been presented at the conference: Exploring human host-microbiome interactions in

health and disease (2012)

4.1. Introduction

The human gastrointestinal tract harbors an extensive array of commensal

microorganisms. Species composition is highly diverse both within and between

individuals [88] and the activities of these organisms affect the host through many

pathways, including the production of short chain fatty acids (SCFA) that regulate

epithelial cell growth and immune system development, displacement of potential

pathogens, detoxification of protein fermentation products, and gas production [89-

93]. The beneficial or detrimental outcomes of these effects depend largely on the

community structure, environmental factors, diet and the genetic background of the

host [94, 95]. Large-scale metagenomic studies have uncovered associative

relationships between these factors, yet typically provide limited insights into the

underlying mechanisms [96]. Simplified in-vitro models aim to bridge this gap, but

the reliance of these models on a limited number of strains and on results from

defined culture media make them difficult to relate to the complexities of the actual

gut environment [97, 98].

While host tissues and other substrates of endogenous origin such as mucins are

continually being broken down and recycled by intestinal bacteria, the composition

and metabolic activities of gut bacterial communities are primarily determined by our

diet [89]. Since the efficiency of the digestive system is remarkably high, very few

simple metabolites escape digestion in the small intestine [99]. Therefore, complex

carbohydrates and their derivatives, collectively termed glycans, which are not

digested by the host's endogenous pathways higher in the gastrointestinal tract, are

the predominant nutrients for microbes in the colon [100, 101]. Modern Western

diets incorporate a large variety of food sources, resulting in a nutrient-rich colonic

environment that supports a tremendous diversity of species [102]. With recent in-

vitro studies showing that the breakdown of a given substrate can be highly species-

specific [103], the overall picture of the human gut is that of a diverse bacterial

community in which different microbial groups occupy distinct metabolic niches.

While some human colonic bacteria simply require acetate or branched chain fatty

acids [104], the detailed growth requirements for the majority of gut bacteria remain

unknown [103]. Characterizing these requirements will shed light on the different

metabolic niches organisms fill, and may enable the design of dietary interventions

that promote growth of particular beneficial microbes, an approach collectively

termed “prebiotics” [105]. While several glycans are currently marketed around the

world as prebiotics, few have been validated through high-quality human trials [105,

106]. Furthermore, dietary enrichment of a specific prebiotic compound may permit

preferential expansion of a microbial group that is well adapted to its use, but the

outcomes for the gut community as a whole can be unpredictable [107].

In this study we investigate the connections between diet and glycan metabolism of

the human gut microbiota. Whereas the study of the metabolic activity conducted by

gut microbiota is at the focal point of a wide range of computational studies [96,

108], current approaches have been highly limited in their ability to analyze glycan

degradation. We present a novel algorithm (termed GlyDe) and computational

pipeline for predicting the glycan degradation patterns of bacteria based on nearly

150 Carbohydrate Active enZymes (CAZymes) and 10,000 glycans, and apply it to a

cohort of 203 microbial genomes and nearly 10,000 glycan structures. Given a

particular bacterial (meta-) genome, GlyDe can 'reverse-engineer' the predicted

efficiency by which it degrades a variety of different glycans. These predictions

correlate with known KEGG reactions and expand upon the limited, previously

available glycan degradation data 100-fold. We determine that the microbiota of

herbivores and carnivores have stronger degradation affinities for plant-derived and

animal-derived glycans respectively, and that a Western diet in humans correlates

more strongly with meat-derived glycans than a non-Western diet. Finally, we show

that species-specific glycan degradation profiles are associated with and can be used

to predict that species' abundance, making GlyDe a valuable tool for the future

rational design of novel prebiotics, by deliberately manipulating the microbiome

based on nutrient availability.

4.2. Results

4.2.1. The construction of the Glycan Degradation (GlyDe) pipeline

Although the exact biochemistry of glycan degradation is missing from all currently

available databases, considerable knowledge is embedded in the descriptions of

Carbohydrate-active Enzymes (CAZymes) that catalyze these degradation reactions,

and is typically represented by Enzyme Commission (EC) numbers. We leveraged

this knowledge to develop a new computational pipeline that uses enzymatic and

structural data sources to predict the degradation of every glycan in KEGG [109]

given a sequenced (meta-) genome. That is, given (meta-) genomic data as input,

GlyDe yields phenotypic (glycan degradation) data as output. The construction of the

pipeline comprises two steps: (1) The first step relies on a novel algorithm that we

developed which we term Glycan Degradation (GlyDe). The algorithm takes as input

a manually curated annotation of all the rules defining for each known CAZyme its

capabilities for performing glycan degradation, and a graph representation of the

structure of all the glycans in KEGG, in which the nodes are the monosaccharides

and the edges are the glycosidic linkages (Chapter4::Figure 1a). We convert the

CAZyme annotations to computer-based rules dictating their mechanism for breaking

a given glycan into two sub-components. The manual curation of this critical step

was done with the help of experts in the biochemistry of glycan

metabolism. GlyDe then executes these rules recursively on all the glycans to

generate 141,561 GlyDe reactions, each linking a specific enzyme to a glycan

substrate and its products (an example reaction is given in Figure 1a and a more

detailed explanation is provided in the Methods). (2) In the second step GlyDe

reactions are mapped back to CAZymes in order to produce a table where the rows

are CAZymes, the columns are glycans, and each entry contains a CAZyme score for

CAZyme i and glycan j, calculated as follows: if CAZyme i is unable to break glycan

j then the score is 0, otherwise the score is

$$\mathrm{CAZyme\ score}_{ij} = \frac{1}{g_i}$$

where gi is the number of glycans that are broken by CAZyme i. The entire

construction process is summarized in Figure 1b. A CAZymes table which contains

all of the CAZyme scores can be found in Supplementary Table 5.
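
The two construction steps can be illustrated with a toy sketch. It assumes a much simpler representation than the one described above: glycans are linear chains written as tuples, and each rule releases the first residue when it is attached through one given linkage. The rule table, the linkage labels, and the "toy beta-glucosidase" enzyme are hypothetical; only the Kojitriose example (EC 3.2.1.115 yielding Kojibiose and Glucose) comes from the text.

```python
from fractions import Fraction

# Toy linear glycans: residues alternate with the linkage that joins them.
KOJITRIOSE = ("Glc", "a1-2", "Glc", "a1-2", "Glc")
CELLOBIOSE = ("Glc", "b1-4", "Glc")

# Hypothetical rules: enzyme -> linkage it cleaves at the first residue.
RULES = {"EC 3.2.1.115": "a1-2", "toy beta-glucosidase": "b1-4"}


def degrade(glycan, rules, reactions):
    """Step 1: recursively apply matching rules, recording each reaction."""
    if len(glycan) < 3:              # a single residue cannot be split
        return
    for enzyme, linkage in rules.items():
        if glycan[1] != linkage:
            continue
        released, rest = glycan[:1], glycan[2:]
        reaction = (enzyme, glycan, (released, rest))
        if reaction not in reactions:
            reactions.add(reaction)
            degrade(rest, rules, reactions)


def cazyme_scores(reactions):
    """Step 2: score 1/g_i for every (enzyme i, glycan j) with a reaction."""
    broken = {}
    for enzyme, substrate, _products in reactions:
        broken.setdefault(enzyme, set()).add(substrate)
    return {(e, g): Fraction(1, len(gs))
            for e, gs in broken.items() for g in gs}


reactions = set()
for glycan in (KOJITRIOSE, CELLOBIOSE):
    degrade(glycan, RULES, reactions)
scores = cazyme_scores(reactions)
```

Pairs absent from `scores` correspond to the zero entries of the table. In this toy input EC 3.2.1.115 breaks two glycans (Kojitriose and the Kojibiose it produces), so each of its entries is 1/2.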

4.2.2. The usage of the GlyDe pipeline

Microbial (meta-) genomes are annotated for CAZymes using BLAST [110] against

three reference databases: The Carbohydrate Active enZymes (CAZy) Database

[111], the Seed - RAST annotation [112], and KEGG [71] (Methods). Then,

CAZyme scores can be assigned to genes and a (meta-) genome-specific GlyDe

score calculated for each glycan as follows:

$$\mathrm{GlyDe\ score}_{ij} = \sum_{k} \frac{n_{ik}}{g_k}, \quad \forall\, e_k$$

where e_k is an enzyme that can degrade glycan j, n_ik is the number of genes in the

(meta-) genome i which translate to enzyme e_k, and g_k is the number of glycans broken

by enzyme e_k.

The GlyDe score represents the predicted efficiency with which the glycan can be

degraded by that (meta-) genome, taking into account how many CAZymes can

degrade the glycan, and decrementing the score of promiscuous enzymes with low

specificities (Methods). For example, an organism containing three enzymes that

degrade maltotetraose, each of them degrading also four other glycans, would have a

GlyDe score of 3/5 (see two examples in Figure 1c). The use of GlyDe is captured in

Figure 1d.
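
The worked examples above can be checked with a short sketch of the GlyDe score, assuming the score is the sum of n_ik / g_k over the enzymes that degrade the glycan (which is what the 3/5 and 7/12 examples imply). The enzyme names and dictionaries are hypothetical stand-ins for the annotated genome and the CAZyme table.

```python
from fractions import Fraction


def glyde_score(gene_counts, degraders, glycans_broken):
    """Sum n_ik / g_k over the enzymes e_k that degrade the glycan.

    gene_counts:    enzyme -> number of genes in the (meta-) genome (n_ik)
    degraders:      enzymes able to degrade the glycan of interest
    glycans_broken: enzyme -> number of glycans that enzyme breaks (g_k)
    """
    return sum(
        (Fraction(gene_counts.get(e, 0), glycans_broken[e]) for e in degraders),
        Fraction(0),
    )


# Three enzymes degrade maltotetraose, each also degrading four other glycans.
broken = {"e1": 5, "e2": 5, "e3": 5}
maltotetraose = glyde_score({"e1": 1, "e2": 1, "e3": 1}, broken, broken)

# Figure 1c, bottom: two enzymes that break 3 and 4 glycans respectively.
broken_c = {"eA": 3, "eB": 4}
purple = glyde_score({"eA": 1, "eB": 1}, broken_c, broken_c)
```

Exact fractions are used so the results can be compared directly with the values quoted in the text and in Figure 1c.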

Chapter 4::Figure 1: The Glycan Degradation (GlyDe) platform. (a) A visual representation of the glycan degradation

reaction performed by EC 3.2.1.115 breaking down Kojitriose into Kojibiose and Glucose. (b) A schematic

representation of the construction of the computational pipeline. Information is taken from multiple databases and

analyzed as follows: Step 1 (left red arrow) using CAZyme information and the GlyDe algorithm, glycan degradation

reactions are reconstructed. Step 2 (right red arrow) a CAZyme table is constructed that represents the efficiency in

which different CAZymes break different glycans. (c) GlyDe score calculation. Top: The organism has one enzyme

(yellow pacman) dedicated to the degradation of one glycan (purple), therefore the GlyDe score for the purple glycan

equals 1. Bottom: The organism has two enzymes capable of degrading 3 and 4 glycans respectively, therefore the GlyDe

score for the purple glycan equals 7/12. (d) GlyDe utilization: (meta-) genomes are annotated for CAZymes using CAZy,

SEED and KEGG databases, and using the CAZyme table a GlyDe score can be calculated, reflecting the capacity of a

(meta-) genome to degrade a specific glycan. GlyDe: Glycan Degradation. CAZymes: Carbohydrate Active Enzymes.

4.2.3. Validating the GlyDe pipeline

To assess the biological relevance of GlyDe, we performed a cross-validation

procedure that examines its consistency in capturing known degradation reactions in

KEGG (Methods). We found that the products of GlyDe reactions were highly

enriched with known rather than hypothetical glycans (p-value = 10^-19 in a

hypergeometric test; see Supplementary Fig. 2a). As further validation, we compared the

predicted genome-specific GlyDe scores of each bacterial strain with the glycans

that, according to KEGG, that strain is able to break. Since the above information

from KEGG was not used to construct the set of GlyDe reactions, a circular

argument is avoided. We find a significantly higher mean GlyDe score for KEGG

glycans across all strains when comparing it to the mean GlyDe score of non-KEGG

glycans (Figure 2a). Notably, our analysis produced GlyDe scores for over 100x the

number of unique glycan degradation reactions that are currently reported in KEGG

(116,388 vs. 1374), highlighting the limited scope of glycan metabolism captured in

the KEGG database.

4.2.4. Characterization of glycan degradation patterns across the major gut bacterial phyla

We first applied GlyDe to a cohort of 203 reference gut microbial genomes retrieved

from the Human Microbiome Project (HMP [113]). We used GlyDe to study the

extent to which different microbial phyla metabolize different glycans. Initially, we

examined whether phylogenetic clusters are reflected in glycan degradation patterns.

We therefore computed for each of the HMP genomes a GlyDe profile, i.e. a vector

of its GlyDe scores for all glycans (Methods). This species-specific GlyDe 'signature'

describes the efficiency by which a given species can catabolize each of the ~10,000

reference glycans in our database, and hence provides an overall view of its glycan

utilization capabilities. We then mapped each species to its respective phylum and

performed Principal Coordinates Analysis (PCoA) on the Bray-Curtis dissimilarities

between the species GlyDe profiles (Methods). This yielded clusters of phyla that

were statistically distinct (Supplementary Fig. 2b, MANOVA test, Wilks'

Lambda<0.001). Still, there are apparent significant differences in the average glycan

degradation capacities of genera belonging to a given phylum (Supplementary Fig.

2c).
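
The dissimilarity step that feeds the PCoA can be sketched as follows. This minimal example computes only the pairwise Bray-Curtis dissimilarities between GlyDe profiles; the ordination itself is not shown, and the three profiles and their labels are hypothetical.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative profiles."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den if den else 0.0


# Hypothetical GlyDe profiles (one score per glycan) for three species
profiles = {
    "species_A": [1.0, 2.0, 3.0],
    "species_B": [2.0, 2.0, 1.0],
    "species_C": [0.0, 0.0, 4.0],
}

names = sorted(profiles)
distances = {
    (p, q): bray_curtis(profiles[p], profiles[q])
    for p in names for q in names
}
```

The resulting symmetric matrix, with zeros on the diagonal, is the kind of input that a PCoA routine would take to place the species in a low-dimensional space.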

Inspecting individual phyla, Bacteroidetes display the highest GlyDe scores over a

range of glycan categories (Figure 2b), and all nineteen of the highest GlyDe score

ranking species belong to the Bacteroides genus, consistent with their known role as

primary glycan degraders in the gut [114-116]. Recent papers have shown that


glycans found in human milk, i.e. Human Milk Oligosaccharides (HMOs), are

utilized mainly by a few Bifidobacterium and several Bacteroides species [114, 117].

According to GlyDe, 21 out of 23 HMP genomes capable of degrading HMOs

belonged to Bacteroides species (the other degraders were Parabacteroides sp. D13

and Bifidobacterium bifidum).

We next analyzed the relative efficiency of the different bacterial genera in

degrading glycans of various degrees of polymerization (Supplementary Fig. 2e). We

found a large variability in the predicted GlyDe profiles of different genera within

each phylum. For example, within the Bacteroidetes phylum, species belonging to

the Bacteroides genus are predicted to be far better glycan degraders than those

belonging to the genera Parabacteroides or Prevotella. Similarly, while Firmicutes

are generally poor glycan degraders, Roseburia species are predicted to be highly

efficient in breaking down polysaccharides. Notably, Roseburia intestinalis has one

of the highest predicted GlyDe scores, an observation also supported by recent

literature [100]. Thus, it is important to assess the glycan degradation capacity of

individual taxa within the larger community.

4.2.5. Glycan degradation patterns can be used to predict

bacterial abundance

We studied the relationship between the glycan degradation scores of a given

bacterial taxon and its abundance in the gut. We matched the abundance of 16S

rRNA marker gene sequences from 325 human individual gut samples found in the

HMP database with the aforementioned 203 microbial reference genomes (Methods).

For each taxon we extracted 6 features characterizing its glycan degradation capacity

(Methods), including: Plant-specific, Animal-specific, Disaccharides,

Oligosaccharides, Short Polysaccharides and Long Polysaccharides. Each feature

represents the sum of GlyDe scores for the glycans that belong in the class. Based on

these features we built a linear regression model for the abundance of these taxa in

the samples. In order to apply the linear regression model we filtered out taxa that


were not detected in any sample and taxa that were highly varied (see Methods for

criteria), resulting in 48 predictable taxa for the analysis. This regression yielded a

correlation coefficient of 0.46 (Figure 2c), a score markedly higher than the

correlation achieved using a model based on CAZymes abundances in a genome

(r=0.11, Methods). We next built similar regression models independently for each

class of bacteria. Remarkably, the Clostridia class had the strongest combination of

correlation (R=0.76) and significance (p=0.0001), while other classes with significantly

predictive models were Bacteroidia and Fusobacteria (Figure 2d). These results suggest that

glycan supplements can be tailored to control the abundance of certain species, especially

those of potentially pathogenic Clostridia.
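The feature extraction described above can be sketched as follows. This is an illustrative sketch only: the glycan identifiers, scores and class assignments are hypothetical, and the helper `pearson_r` assumes that the reported correlation coefficient is a Pearson correlation between observed and predicted abundances, which the text does not state explicitly.

```python
from math import sqrt

FEATURES = ["Plant-specific", "Animal-specific", "Disaccharides",
            "Oligosaccharides", "Short Polysaccharides",
            "Long Polysaccharides"]


def glyde_features(scores, classes):
    """Sum the GlyDe scores of one taxon over each glycan class.

    scores:  glycan id -> GlyDe score of the taxon for that glycan
    classes: glycan id -> iterable of class names the glycan belongs to
    """
    features = dict.fromkeys(FEATURES, 0.0)
    for glycan, score in scores.items():
        for cls in classes.get(glycan, ()):
            if cls in features:
                features[cls] += score
    return features


def pearson_r(xs, ys):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy input for a single taxon.
scores = {"G1": 2.0, "G2": 3.0, "G3": 1.5}
classes = {
    "G1": ["Plant-specific", "Disaccharides"],
    "G2": ["Plant-specific", "Long Polysaccharides"],
    "G3": ["Animal-specific", "Oligosaccharides"],
}
features = glyde_features(scores, classes)
print(features)
```

The six sums form one row of the regression design matrix; the fit itself can be done with any ordinary least-squares routine.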

In an effort to include taxa that were initially omitted in the analysis above due to

their high variation, we first clustered the HMP samples according to their 16S

rRNA data into 2 main groups using k-means (Methods), and recalculated the

average taxa abundance separately for each cluster. The same procedure for building

predictors of bacterial taxa abundance based on their genome-specific GlyDe

features was then applied, yielding 53 predictable taxa in the first cluster and 71

predictable taxa in the second cluster, with concomitant increases in the correlation

coefficients (0.51 and 0.57, respectively). Based on the two clusters we assembled a

list of 25 strains with highly predictable abundance (Methods), and with only one

exception, all the strains belong either to the Bacteroidia or Clostridia classes (see

Discussion). Notably, based on the regression formulas of all three models, the

degradation capacity of long polysaccharides had the highest effect on bacterial

abundance. It is therefore likely that because most long polysaccharides are not

digested by the human host prior to reaching the colon, the ability to degrade them

provides a significant selective advantage for gut microbes.


Chapter 4::Figure 2: Glycan Degradation of the gut microbiota reference genomes. (a) Distribution of species-specific GlyDe

scores (y axis) for all the glycans in KEGG. GlyDe scores with corresponding reactions in KEGG appear on the left while those

with no KEGG reactions appear on the right (Student's t=6.14, p<0.0001). (b) The bar plot compares the glycan degradation

potential of the different bacterial phyla for different glycan dietary categories. Each bar depicts the sum of GlyDe scores of

organisms belonging to the respective phylum. Red indicates glycans derived from animals, blue indicates glycans derived

from bacteria and green indicates glycans derived from plants. The purple bar represents the overall number of CAZymes. The

height of the colored bar represents the median while the error bars reflect the lower and upper quartiles. Asterisks denote

significant p-values when comparing Bacteroidetes to the other phyla. (c) The log-log scatter plot shows the average abundance

of 48 HMP strains within 325 human fecal samples (Y axis) and the linear regression-predicted abundance of each individual

strain (X axis). (Linear Regression correlation coefficient=0.46, p=0.0016). (d) A bar chart denoting the correlation value

(height of the bar) between actual and predicted abundance from linear regression models built for each class of bacteria (X

axis) based on its taxa's GlyDe features. The color of the bar reflects the number of species in the class. The feature extraction is

explained in the methods.

4.2.6. Glycan degradation profiles of mammalian species are

associated with their diet

Because diet is the prime determinant of colonic glycan composition and gut

microbiota vary according to general dietary patterns [118], we expected that glycan

degradation would systematically vary between the microbiota of different

mammalian hosts based on their diet. To test this, we analyzed variation in diet and


glycan degradation profiles across different mammalian species, using metagenomic

sequencing data from 57 fecal samples across 34 different species, including 18

human samples [119] (Methods). According to the host's diet, each sample is

characterized as being either herbivorous, carnivorous or omnivorous. To correct for

research biases arising from uneven annotations of CAZymes between species, we

normalized each sample by the total number of CAZymes in it before running PCoA

on the GlyDe profiles (Methods). The PCoA analysis revealed a clear spectrum of

samples over the first principal coordinate, from herbivores, through omnivores to

carnivores (Supplementary Fig. 3a). Using a subset of glycans that were categorized

into either plant-derived or animal-derived (Methods), we discovered a striking

relationship between the diet of a host organism and the glycans predicted to be

degraded by its gut microbiota: microbiota from herbivores tend to degrade plant-

derived glycans (p=0.04 compared to carnivores using Wilcoxon test, Figure 3a),

while microbiota from carnivores prefer animal-derived glycans (p=0.0001 compared

to herbivores using Wilcoxon test, Figure 3a). Interestingly, the degradation

efficiencies of omnivore and human gut microbiota place them as intermediates

between herbivores and carnivores (Figure 3a). To further explore where humans

stand with respect to the dietary spectrum, we trained an SVM classifier to

distinguish between herbivore and carnivore samples based on their inferred glycan

degradation profiles (Methods). The classifier predicted all but one sample correctly

in a leave-one-out cross validation (AUC=0.93, F=0.96), and notably outperformed a

classifier based only on the abundance of CAZymes found in each sample (which

had three misclassifications: AUC=0.71, F=0.84), confirming the added predictive

value of GlyDe. We next applied this classifier to the 11 available non-human

omnivore samples, and classified 6 and 5 of the samples as herbivores and

carnivores, respectively. These samples lack direct dietary labels; however,

a comparison of these classifications with the Fiber Index of these mammals [120]

shows a good correspondence with the predicted dietary regimes of the animals

(Table 1). We next applied the classifier to predict the dietary habits of the human

samples, which are unknown, resulting in 15 out of the 18 samples labeled as

carnivores. Thus, at least in the small population sample analyzed here, humans may

be closer to carnivores in some functional aspects of their gut microbiota.
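The leave-one-out evaluation scheme used above can be sketched as follows. Note the substitution: to keep the example dependency-free, a nearest-centroid rule stands in for the SVM that was actually trained, so the sketch illustrates only the cross-validation loop, not the classifier itself. The two-dimensional profiles are hypothetical toy data.

```python
def centroid(rows):
    """Component-wise mean of a list of equally long vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict(train, sample):
    """Nearest-centroid stand-in classifier (NOT the SVM from the text)."""
    by_label = {}
    for vector, label in train:
        by_label.setdefault(label, []).append(vector)
    centroids = {label: centroid(vs) for label, vs in by_label.items()}
    return min(centroids, key=lambda lab: squared_distance(centroids[lab], sample))


def leave_one_out_accuracy(data):
    """Hold out each sample once, train on the rest, and count the hits."""
    hits = 0
    for i, (vector, label) in enumerate(data):
        train = data[:i] + data[i + 1:]
        hits += predict(train, vector) == label
    return hits / len(data)


# Hypothetical (plant score, animal score) profiles with known diets.
data = [
    ([9.0, 1.0], "herbivore"), ([8.0, 2.0], "herbivore"), ([7.5, 1.5], "herbivore"),
    ([1.0, 9.0], "carnivore"), ([2.0, 8.0], "carnivore"), ([1.5, 7.0], "carnivore"),
]
print(leave_one_out_accuracy(data))
print(predict(data, [5.0, 6.0]))  # an unlabeled, omnivore-like sample
```

Once validated in this way, the classifier trained on all labeled samples is applied to the unlabeled omnivore and human samples, as in the last line.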


Sample    Mammalian Species           SVM-Predicted Diet   Fiber Index   Correspondence
4461343   Hamadryas baboon            Herbivore            50-500        ✓
4461344   Hamadryas baboon            Herbivore            50-500        ✓
4461347   North American black bear   Carnivore            0-50          ✓
4461348   Black lemur                 Carnivore            N/A           N/A
4461351   Goeldi's marmoset           Carnivore            0-50          ✓
4461353   Chimpanzee                  Herbivore            50-500        ✓
4461354   Chimpanzee                  Herbivore            50-500        ✓
4461374   Ring-tailed lemur           Herbivore            50-500        ✓
4461375   White-faced saki            Herbivore            0-50          ✗
4461376   Spectacled bear             Carnivore            50-500        ✗
4461378   Prevost's squirrel          Carnivore            0-50          ✓

Chapter 4::Table 1: Mammalian host diet prediction by GlyDe profiles. An SVM classifier was trained based on the

GlyDe profiles of herbivores and carnivores. A diet Fiber Index for these species was obtained from Ley et al. [120],

defining the percentages in each diet of acid-detergent fiber (ADF) and neutral-detergent fiber. A higher index suggests a

more plant-based diet. The last column displays the correspondence between the GlyDe predicted diet of the animal and

its Fiber Index, where a value of 0-50 corresponds to carnivores and a value of 50-500 corresponds to herbivores.

We next explored whether humans who live in different geographies with markedly

different diets exhibit different GlyDe profiles. We analyzed the Yatsunenko et al.

dataset [121], which contains metagenomic sequences from fecal samples of 110

human individuals who live in Venezuela, Malawi and the USA. Malawian and

Venezuelan diets are dominated by plant-derived polysaccharides, while typical US

diets contain large quantities of meat [121]. As before, we ran GlyDe and performed

PCoA on all the GlyDe profiles. Because infants display large variability over the

first coordinate (Supplementary Fig. 3b), and have an unusual diet relative to adults,

we filtered out all samples from individuals younger than 2 years old. This led to a

clear separation over the first coordinate between low meat consumers (Malawians

and Venezuelans) and high meat consumers (Americans) (Figure 3b). The ratio of

animal to plant-specific GlyDe scores revealed significant differences between

samples from different countries of origin (ANOVA's F=6.56, p<0.005), with a

larger animal/plant ratio in USA (p<0.003) and Venezuela (p<0.03) compared to

Malawi. The ratio in USA was slightly but not significantly higher than in Venezuela

(Tukey-Kramer HSD, Supplementary Fig. 3c).


Chapter 4::Figure 3: The connection between glycan degradation and diet. (a) GlyDe profiling analysis of the Muegge

dataset. Bars showing the average sum of Plant-specific and Animal-specific GlyDe scores of the samples grouped

according to their hosts' diet and normalized by the number of CAZymes in each sample. A fourth group is created to

segregate humans from all other omnivores. The plant- and animal- specific GlyDe scores of herbivores and carnivores

are significantly different (p=0.04 and p=0.0001, respectively). (b) The Yatsunenko dataset. A scatter plot showing the

samples' projection on the first principal coordinate and colored according to the country of origin. Samples from

individuals younger than 2 years old were omitted (see text).

4.3. Discussion

In this analysis we aimed to determine the association between diet and microbial

glycan metabolism in the gut. We detected diet-driven adaptations at both the level

of single species (Figure 2b) and of communities (Figure 3a). We found species

belonging to Bacteroidetes to be the most efficient degraders of animal-derived

glycans and human milk oligosaccharides. While this trend is apparent in both the


Bacteroides and Parabacteroides genera it is absent from Prevotella, another key

member of that phylum. Diets that are high in animal protein have been associated

with high levels of Bacteroides, whereas enrichment of Prevotella was associated

with diets rich in plant-derived carbohydrates and very low in animal protein [121-

123]. Given that many dietary animal glycans are derived from proteins (i.e.

glycoproteins and proteoglycans), we propose that the high capability of Bacteroides

and Parabacteroides to degrade animal glycans can explain why their abundance is

increased in Westerners [123, 124].

The plethora of novel glycans and their predicted glycan degradation efficiencies

supplied by our method may prove to be a highly important tool for designing

prebiotic interventions. As a striking example, a linear regression model based on

GlyDe-related features was capable of accurately predicting the abundance of

bacterial strains that displayed low inter-samples variance. Of the features in the

regression model, degradation of long polysaccharides was the most predictive, an

unsurprising result considering the importance of these glycans as the main carbon

and energy source for colonic bacteria. Finally, our results were improved

significantly when dividing the HMP samples into two clusters and re-analyzing each

cluster individually. This supports the notion that microbiome analysis should not be

general, but rather be based carefully on the background community structure.

Our GlyDe profiling revealed that the relative abundance of many taxa, especially

those of Clostridia, is significantly correlated with their ability to degrade glycans. It

was recently shown that Clostridium difficile and other pathogenic gut bacteria rely

on microbiota-liberated mucosal glycans during their expansion in the gut following

antibiotic treatment [125]. Thus, it may be possible to design prebiotics that help

increase the levels of beneficial Clostridia and prevent the expansion of pathogenic

strains. More generally, since the breakdown of a given substrate can be highly

species-specific [103], the prediction of bacteria's glycan degradation efficiencies

may prove to be an important tool for designing nutritional interventions to help alter

microbial communities.


In analyzing mammalian fecal samples data from Muegge et al. [118], we

demonstrated that differences in microbial community composition carry functional

importance -- namely, that the microbiota of herbivores and carnivores have stronger

affinities to plant- and animal-derived glycans, respectively. To the best of our

knowledge, this is the first time that a computational framework has been able to

provide such observations. The lack of large scale in-vitro glycan utilization assays

makes straightforward validation of many of our predictions difficult at present.

Nevertheless, our ability to train an accurate classifier to predict the diet of a host

based on its microbiota glycan degradation profile, and the correspondence between

the classifier's predictions and animal nutrition (Table 1), both provide strong

operative testimony to the veracity and utility of the GlyDe pipeline.

Although humans are generally thought of as omnivores, there is an ongoing debate

on the subject of our dietary history and adaptations. Tackling this question through

the lens of our microbiota, we used the aforementioned binary herbivore-carnivore

classifier in order to classify humans. Remarkably, the classifier predicted 15 out of

18 human subjects to be carnivores. This result is less surprising considering that all

of the human subjects were US residents, and that the US is the most meat-

consuming country per capita in the world [126]. In contrast to the US population,

the diets of individuals from Malawi and Venezuela mainly include plant-derived

polysaccharides (they consume, on average, 8.3 and 76.8 kilograms of meat per year,

as opposed to 120.2 in the US [126]). We consequently find a lower ratio of animal-

to plant- degradation efficiency in the microbiota of individuals from these countries

(Supplementary Fig. 4c). Notably, GlyDe does not predict a reduced efficiency of

plant degradation within the US population. Therefore, it seems that the capacity of

Western individuals to degrade glycans has not diminished over the course of

evolution, but merely shifted towards the direction of carnivores.

Taken together, these results further advance our understanding of human diet-

specific adaptations but, as always, conclusions must be drawn with caution. First,

extrapolating from animal data to humans is problematic because of countless

genetic and environmental factors. Secondly, the data we rely upon is often

incomplete. For instance, only 74 out of the 146 CAZymes in GlyDe were mapped to


at least one HMP genome (Supplementary Fig. 2c). Furthermore, 31 CAZymes were

not capable of breaking any glycan, either because some glycan structures are

missing from the database or because of inaccurate enzymatic annotation

(Supplementary Fig. 2c). Finally, the GlyDe platform does not take into account

many important factors such as enzyme transcription levels and downstream

biochemical pathways for glycan utilization. Nevertheless, GlyDe is the first

computational analysis framework that successfully enables one to directly model

how the microbiota can respond to dietary glycans from a mechanistic point of view.

We expect that future studies will integrate GlyDe into routine 16S rRNA analysis

(e.g. with the help of PICRUSt [127]), as well as incorporate GlyDe within the larger

framework of genome scale metabolic modeling (e.g. [108, 128-131]), further

advancing our understanding of human dietary needs and the design of novel

nutritional interventions.

4.4. Methods

4.4.1. Data Retrieval

Information about glycans and the enzymes that might break them is spread across

many databases and tools. In this section we list the sources of the data used later to

infer genome-based glycan degradation capacity.

Bacterial taxa

A catalog of 281 taxa was downloaded on 10/08/11 from The Human Microbiome

Project (HMP) website (http://www.hmpdacc.org/) using the following filters:

NCBI Superkingdom: Bacteria, HMP Isolation Body Site: Gastrointestinal Tract,

Project Status: Complete, NCBI Submission Status: annotation (and sequence)

public on NCBI site. The catalog contains the following annotation fields: HMP ID,

GOLD ID, Organism Name, Domain, NCBI Taxon ID, NCBI Superkingdom, NCBI

Phylum, NCBI Class, NCBI Order, NCBI Family, NCBI Genus, NCBI Species, All

Body Sites, All Body Subsites, Current Finishing Level, NCBI Project ID, Genbank


ID, Gene Count, Size (KB), GC Content, Greengenes ID, NCBI 16S ACCESSION,

Strain Repository ID, Oxygen Requirement, Cell Shape, Motility, Sporulation,

Temperature Range, Optimum Temperature, Gram Stain, and Type Strain.

Genome Annotations

All of the HMP taxa were searched against The Seed database

(http://pubseed.theseed.org/) using the key of NCBI Taxon ID as a cross-reference.

204 matches were detected and their RAST genome annotations were extracted using

the web services API.

Glycans

The entire KEGG Glycan database (http://www.genome.jp/kegg/glycan/) was

downloaded on 01/07/11. The database contains 10978 glycans. We used the

following annotation fields from the annotations: G number, Name, KCF file and

Class. An additional Biological Origin field was retrieved from an external source,

as described later.

The KCF file for each glycan describes a graphical representation of its 2D structure.

This representation takes into account the monomeric building blocks (nodes) and

the glycosidic linkages (edges) of the glycan. Textual and visual representations of

the KCF graph for glycan G00010 are given in Supplementary Figure 1a and 1b.

Glycan Filtering

The glycans database was subsequently filtered according to the following criteria:

Since the database contains more than 800 nodes denoting glycan-related "building

blocks", many of which are extremely rare, we chose to focus on a subset of 35

nodes corresponding to the most prominent sugar monosaccharides, prevalent

modifications, amino acids found in glycoproteins, and Ceramide found in

glycolipids. Therefore, we removed from the analysis all of the glycans which

contained nodes not part of this subset.


Similar to the nodes, an edge connecting two nodes in KEGG Glycan mostly has a

standard form denoting whether the sugar at the non-reducing side is in alpha or beta

conformation, as well as the number of the carbons participating in the glycosidic

linkage, e.g. "Glc a1-3 Glc" denoting Glucose alpha 1-3 Glucose. However, there are

some rare edges that have a different form. In order to maintain consistent reaction

rules we defined an edge to be legal if it has the common form of 'R z$-$ R' where:

R - is any node except for 'Thr' (Threonine), 'Ser' (Serine), 'Ser/Thr' (Serine or

Threonine), 'Asn' (Asparagine), 'S' (Sulfate) or 'P' (Phosphate),

z - is either 'a' or 'b'

$ - is any number.

Since threonine, serine, asparagine, sulfate and phosphate are not monosaccharides,

the glycosidic linkages they are involved in are not formed via a carbon atom and

therefore the edge description is different. In these cases the rule we used is

'R z$- R*', where R is a regular node and R* is a non-monosaccharide node. All other edges

were marked as illegal and their containing glycans were omitted from the analysis.
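The two edge rules above can be expressed as a single pattern check. The sketch below is one possible reading, not the original implementation: it assumes edges are available as strings of the form shown in the text (e.g. "Glc a1-3 Glc"), that the non-monosaccharide form is written with a blank acceptor position ("GlcNAc b1- Asn"), and that a non-monosaccharide may only appear on the right-hand side. The names `NON_SUGAR`, `EDGE` and `is_legal_edge` are hypothetical.

```python
import re

# Nodes that are not monosaccharides (the R* nodes of the second rule).
NON_SUGAR = {"Thr", "Ser", "Ser/Thr", "Asn", "S", "P"}

# <left node> <a|b><donor carbon>-<optional acceptor carbon> <right node>
EDGE = re.compile(
    r"^(?P<left>\S+) (?P<anomer>[ab])(?P<donor>\d+)-(?P<acceptor>\d*) (?P<right>\S+)$"
)


def is_legal_edge(edge):
    """Return True if the edge matches 'R z$-$ R' or 'R z$- R*'."""
    match = EDGE.match(edge)
    if match is None:
        return False
    if match["left"] in NON_SUGAR:
        return False
    if match["right"] in NON_SUGAR:
        # 'R z$- R*': no acceptor carbon is given for non-monosaccharides.
        return match["acceptor"] == ""
    # 'R z$-$ R': both carbon numbers must be present.
    return match["acceptor"] != ""


for example in ["Glc a1-3 Glc", "GlcNAc b1- Asn", "Glc a1-? Glc", "Ser a1-3 Glc"]:
    print(example, "->", is_legal_edge(example))
```

A glycan would then be kept only if every one of its edges passes this check.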

Some glycans in the database have different IDs but identical structures and therefore

we denote them as "Synonym Glycans". Synonym Glycans were grouped together

and one glycan from each group was chosen to represent the entire group for further

analyses.

Glycan Structure Definition

In order to process the database to conform to our subsequent Glycan Degradation

(GlyDe) reactions (see Section: Reconstruction of Glycan Degradation

Reactions) we identified several types of glycans:

Regular glycans: these glycans are of a fixed and known length (Supplementary Figure 1c).

Linear repeating glycans: these glycans are built completely from a repeating sugar segment.

Repeating parts are marked by (*) and [] (Supplementary Figure 1d).


Non Linear repeating glycans: these glycans have a repeating linear segment but also

modifications on some of the sugars, which makes them non-linear (Supplementary Figure 1e).

Polysaccharides: a glycan was defined as a polysaccharide if it met one of the following

conditions:

- The glycan is a repeating glycan.

- The glycan has the value "Polysaccharide" in its Class field in the KEGG Glycan

database.

- The glycan has more than 10 nodes.

Enzyme Commission (EC) Numbers and Glycan Degradation rules

We obtained a list of 146 Carbohydrate-Active EnZymes (CAZymes) with a textual

description of their enzymatic function using the Carbohydrate Active Enzymes

(CAZy) database (http://www.cazy.org/). CAZy describes families of structurally-

related catalytic and carbohydrate-binding modules (or functional domains) of

enzymes that degrade, modify, or create glycosidic bonds. We retrieved from the

database all the EC numbers that belong to the following families:

EC 2.3.1 – Acyltransferases, transferring groups other than amino-acyl groups;

EC 2.4.1 – Glycosyl transferases;

EC 3.1.1 – Carboxylic Ester Hydrolases;

EC 3.2.1 – Glycoside hydrolases;

EC 3.5.1 – Hydrolases acting on carbon-nitrogen bonds, other than peptide bonds, in

linear amides;

EC 4.2.2 – Polysaccharide lyases.

Based on the information available for these EC numbers in ExPASy

(http://expasy.org/) and KEGG (http://www.genome.jp/kegg/) we manually

generated a table that links each EC number with the following fields: EC number,

Enzyme Name, Linkages Broken, Contained Sub-glycan (linkage must be part of the


sub-glycan), Contains Only (nodes), Glycan Released, Endo vs. Exo, DP preference

(# of nodes), Terminal Side Preference, Enzymatic Reaction, and Comments.

These fields were later used to generate glycan degradation reactions by defining and

implementing a set of rules analyzing the KCF file of all the glycans (described in the

section: Reconstruction of Glycan Degradation Reactions). Below is a description of

the fields.

EC number – The number of the enzyme reflecting its catalytic activity.

Enzyme Name – The accepted name of the enzyme.

KEGG Reactions - The reactions from KEGG mapped to this EC number.

Linkages Broken – In the case of glycosidic linkages the value is a string

representing two nodes and an edge that connects them based on the KCF graph

representation of the glycan structure. In the case of deacetylation reactions the value

given is "Ac-R", denoting the removal of an acetyl group from node R.

Contained Sub-glycan - A G# identifier of a glycan whose structure must be

contained within the structure of a larger glycan, e.g. "Glc b1-4 Glc" is a sub-glycan

of "Man a1-3 Glc b1-4 Glc".

Contains Only (nodes) - A G# identifier for one or more nodes which the glycan

must contain and only contain.

Glycan Released - A G# identifier that defines a glycan that must be one of the

products after the reaction of this enzyme takes place.

Endo vs. Exo - "exo" enzymes can remove only terminal sugars (the edges of the

terminal nodes), "endo" enzymes can break all glycosidic bonds except for terminal

ones (remove all edges except the ones of the terminal nodes), and "both" enzymes

can break any bond (remove any edge).

DP preference - This field reflects the degree of polymerization of the glycans this

enzyme works upon. This number is also the exact/minimal ("+" sign)/ maximal ("-"


sign) number of nodes that the KCF graph denoting the glycan structure must

contain.

Terminal Side Preference – This field is unique for exo-acting enzymes and

describes their specificity towards the reducing or non-reducing end. The KCF graph

is directional, hence "reducing" means only the removal of the rightmost node is

allowed, "non-reducing" refers to the leftmost, and "both" allows the removal of both

edges. Notice that the terminal node of the reducing end is always at position 1 in the

KCF graph, except in repeating glycans.

Enzymatic Reaction - A textual description of the reaction performed by this EC

number taken from http://enzyme.expasy.org/

Comments - Specific comments about this EC number taken from

http://enzyme.expasy.org/

The above fields contain special characters which are described below:

$ - signifies any number

R - signifies any type of the following sugars (nodes): Ara, Araf, D/LAra, D/LAraf,

LAra, LAraf, D/LAraf, D/LAra, Api, Apif, D/LApi, D/LApif, 3,6-Anhydro-LGal,

L3,6-Anhydro-Gal, 3,6-Anhydro-Gal, GalA, D/LGalA, GalNAc, GalfNAc,

D/LGalNAc, GalN, GlcNAc, D/LGlcNAc, GlcA, D/LGlcA, GlcN, D/LGlcN, Glc,

Glcf, D/LGlc, Fru, Fruf, D/LFru, D-Fruf, Man, Manf, D/LMan, ManA, Rha,

D/LRha, LRha, D/LRha, Gal, Galf, D/LGal, D/LGalf, Fuc, D/LFuc, Fucf, LFuc,

D/LFuc, Xyl, D/LXyl, Xylf, Neu, Neu5Ac, Neu5Gc, MurNAc.

# - denotes an OR association

& - denotes an AND association
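Two of the fields described above, "Endo vs. Exo" and "DP preference", lend themselves to a short sketch. It assumes the glycan graph is given as a list of node-id pairs, that a terminal node is one with a single edge, and that the DP sign is written as a suffix (e.g. "5+"); the function names and the suffix encoding are illustrative assumptions, not the original implementation.

```python
from collections import Counter


def cleavable_edges(edges, mode):
    """Select the edges an enzyme may break, given its Endo vs. Exo value."""
    degree = Counter(node for edge in edges for node in edge)

    def touches_terminal(edge):
        return any(degree[node] == 1 for node in edge)

    if mode == "exo":    # only edges of terminal nodes
        return [e for e in edges if touches_terminal(e)]
    if mode == "endo":   # every edge except those of terminal nodes
        return [e for e in edges if not touches_terminal(e)]
    if mode == "both":   # any edge
        return list(edges)
    raise ValueError(f"unknown mode: {mode!r}")


def dp_allows(preference, n_nodes):
    """Check a DP preference: '5' exact, '5+' minimal, '5-' maximal."""
    preference = preference.strip()
    if preference.endswith("+"):
        return n_nodes >= int(preference[:-1])
    if preference.endswith("-"):
        return n_nodes <= int(preference[:-1])
    return n_nodes == int(preference)


# A linear tetrasaccharide: node 1 - node 2 - node 3 - node 4.
chain = [(1, 2), (2, 3), (3, 4)]
print(cleavable_edges(chain, "exo"))
print(cleavable_edges(chain, "endo"))
print(dp_allows("3+", 4), dp_allows("3-", 4), dp_allows("4", 4))
```

The Terminal Side Preference field would further restrict the "exo" result to one end of the directional graph; that step is omitted here.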




Overall, this workflow resulted in 141561 glycan degradation reactions, of which

9325 are reactions that degrade KEGG glycans and newly reconstructed glycans, and

132236 intermediate glycan degrading reactions.

CAZyme annotation

We used sequence similarity to match the genes which belong to the HMP taxa with

specific Carbohydrate Active enZymes (CAZymes). We therefore BLASTed all of

the genomes of the HMP taxa against the bacterial protein sequences found in the

CAZy database. Each enzyme family in CAZy contains a set of manually curated

enzymes determined to execute a specific catalytic function. We used the NCBI BLAST

utility and filtered out hits with an E-value above 10e-10 and matches below 97% identity.

At this point we had a mapping between genes in the HMP taxa and CAZyme

families. Because many families contain a one-to-many mapping between a family

and its associated EC numbers we had to refine this annotation. We therefore

extracted the genes' predicted enzymatic annotations from 'The SEED' and KEGG

databases. While the CAZyme annotations are more comprehensive they are

sometimes not as accurate as the manually curated ones. Thus, we integrated the

information obtained from all of these sources using the following logic: for proteins

that were mapped to families of one EC number in CAZy, we accepted this

annotation. For proteins that were mapped to families of multiple EC numbers, we

first checked if they had an available annotation in SEED or KEGG, and if that was

the case, then we checked if this annotation belonged to one of the multiple

annotations in CAZy. If it did then we accepted the KEGG/SEED annotation.
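The integration logic above can be sketched as a small function. The text does not say what happens when a multi-EC family has no confirming SEED or KEGG annotation; the sketch assumes such proteins are left without an EC assignment. The function name and the example EC sets are hypothetical.

```python
def resolve_ec(cazy_family_ecs, curated_ecs):
    """Choose the EC numbers to accept for one protein.

    cazy_family_ecs: set of EC numbers of the CAZy family the protein hit
    curated_ecs:     set of EC numbers from SEED/KEGG (may be empty)
    """
    if len(cazy_family_ecs) == 1:
        # Single-EC family: accept the CAZy annotation as is.
        return set(cazy_family_ecs)
    # Multi-EC family: accept only SEED/KEGG annotations that the
    # CAZy family also lists (empty set if none, by assumption).
    return set(cazy_family_ecs) & set(curated_ecs)


print(resolve_ec({"3.2.1.4"}, set()))
print(resolve_ec({"3.2.1.4", "3.2.1.8"}, {"3.2.1.8"}))
print(resolve_ec({"3.2.1.4", "3.2.1.8"}, {"2.4.1.1"}))
```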

Subcellular Localization annotation

To define the subcellular localization (SCL) of reactions we used the RAST genome

annotation as a first proxy. We mined the function and subsystem fields of the

annotation for special keywords. For our purposes we were interested only in whether

the enzyme exerts its function inside the cell or outside. Enzymes were defined as

intracellular if their associated genes contained one of the keywords: cytoplasm, cytosol or

cytoplasmic. Enzymes were defined as cross-membrane if their associated genes


contained one of the keywords: periplasm, periplasmic, inner membrane or

cytoplasmic membrane. And finally enzymes were defined as extracellular if their

associated genes contained one of the keywords: cellulosome, outer membrane,

secreted, cell wall, or extracellular. For enzymes that were not associated with any

meaningful keyword we took advantage of the LOCtree localization prediction

software (https://rostlab.org/owiki/index.php/LOCtree). LOCtree uses a protein

amino acid sequence to predict its SCL. It supplies five possible SCLs: cytosol, inner

membrane, periplasmatic, outer membrane, and secreted. Enzymes with the value

cytosol were classified as intracellular, enzymes with the values secreted and outer

membrane were classified as extracellular, and enzymes with the values

periplasmatic or inner membrane were classified as cross-membrane, i.e. enzymes

that exert their function on the cross-membrane between the cell and its environment.

To fix possible erroneous annotations we refined our localization selection based on

specific knowledge of the glycan degradation biochemistry. A literature survey

suggested that there are no polysaccharides within the bacterial cytoplasm (with

glycogen being the only exception). Thus, enzymes that were predicted as

intracellular or cross-membrane were filtered out if the glycan that they processed

was either repeating, defined as a polysaccharide or had more than 10 sub-

components.

Biological origin of glycans

We retrieved the CarbBank database [132] and mapped the KEGG glycans to it using

the KEGG Glycan ID (G number) as a cross reference. CarbBank contains

detailed descriptions of where a specific glycan can be found in nature. We parsed

these data in order to define certain glycans as being either plant-derived or animal-

derived.

Degree of polymerization of glycans

Glycans are routinely categorized into one of four degree-of-polymerization classes.

Glycans were defined as disaccharides if they contain 2 nodes, oligosaccharides if they

contain 3-10 nodes, short polysaccharides if they contain >10 nodes, and long

polysaccharides if they have a repeating structure.
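The four classes can be written as a small decision function. This is a sketch under one assumption that the text does not state: a repeating structure is tested first, so a repeating glycan is labeled a long polysaccharide regardless of how many nodes its drawn graph contains. The function name `dp_class` is ours.

```python
def dp_class(num_nodes: int, repeating: bool = False) -> str:
    """Degree-of-polymerization class of a glycan graph."""
    # Assumption: the repeating flag takes precedence over the node count.
    if repeating:
        return "long polysaccharide"
    if num_nodes == 2:
        return "disaccharide"
    if 3 <= num_nodes <= 10:
        return "oligosaccharide"
    if num_nodes > 10:
        return "short polysaccharide"
    raise ValueError("glycans with fewer than 2 nodes have no class here")
```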

4.4.2. Construction of the CAZyme table (a key step in the

GlyDe pipeline)

We manually curated all of the CAZymes (146 EC numbers) and mapped each one

to a set of computer-based rules dictating the mechanism by which it can break a

given glycan (i.e. split its graph into two separate components). These rules account

for structural features such as the glycosidic linkages the enzyme can break, the

cleavage mechanism, the chemical neighborhood, and the degree of polymerization

of the glycan. We then executed these rules on all the glycans that appear in the

KEGG Glycan database, which yielded 141,561 glycan degradation (GlyDe)

reactions. In the following section we describe the logic behind the reconstruction of

glycan degradation reactions: identifying, for each glycan, which enzymes are able

to break it and what the degradation reaction will look like. GlyDe reactions are then

mapped back to CAZymes in order to produce a table where the rows are CAZymes,

the columns are glycans, and each entry contains a CAZyme score, calculated as

follows:

$$\text{CAZyme score}_{ij} = \frac{1}{g_i}, \quad \forall\, e_i$$

where e_i is an enzyme that can degrade glycan j and g_i is the number of glycans that

it breaks. The entire construction process is summarized in Figure 1b.
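As an illustration of how such a table could be filled, the sketch below takes a mapping from each CAZyme to the set of glycans it can break and assigns every (CAZyme, glycan) entry the reciprocal of the size of that set. The function name `cazyme_score_table` and the identifiers used in the usage note are hypothetical, and entries for glycans an enzyme cannot break are simply left out.

```python
from typing import Dict, Set


def cazyme_score_table(breaks: Dict[str, Set[str]]) -> Dict[str, Dict[str, float]]:
    """Map each CAZyme to {glycan: score}, with score = 1 / (glycans it breaks)."""
    table: Dict[str, Dict[str, float]] = {}
    for enzyme, glycans in breaks.items():
        if not glycans:
            continue  # an enzyme that breaks no glycan has no row
        score = 1.0 / len(glycans)
        table[enzyme] = {glycan: score for glycan in glycans}
    return table
```

With `breaks = {"EC_A": {"G1", "G2"}, "EC_B": {"G1"}}`, the promiscuous enzyme `EC_A` scores 0.5 on each of its glycans while the specific enzyme `EC_B` scores 1.0 on `G1`, so every row sums to 1.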

Glycan Degradation (GlyDe) rules

A reaction is represented by its:

- Substrate(s).

- Product(s).

- The enzyme(s) responsible for the catalysis.

- A sub-cellular localization.

For the GlyDe reactions generation process we used all the computer-based glycan

breaking rules explained below. For an EC number-related rule to break a glycan,

the glycan and the resultant reaction must comply with all the limitations defined in

the fields of the given rule, namely:

- The glycan must contain at least one of the glycosidic linkages (or nodes containing an acetyl group in case of deacetylation reactions) described in the Linkages Broken field.

- The glycosidic linkage hydrolyzed must appear in the terminal edges of the glycan if the value in the Endo Vs. Exo field is set to Exo, and in its internal edges if the value is set to Endo.

- The number of nodes the glycan contains must conform to the value described in the DP Preference field.

- In case the Endo Vs. Exo field is set to Exo, the terminal side of the glycosidic linkage hydrolyzed must be located on the right side of the graph of the glycan if the value in the Terminal Side Preference field is set to Reducing, and on the left side if this field is set to Non-reducing. If this field is set to Both, the linkage may be located on either side.

- The glycan must contain the structure of a glycan (nodes and edges) described in the Contained Sub-glycan field, and the linkage being broken must also be part of this sub-glycan.

- The reaction must contain the glycan described in the Glycan Released field as one of its products.

Figure 1a gives an example of an Exo-acting enzyme breaking a regular glycan.
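To make the interplay of these fields concrete, the sketch below applies a simplified rule to a linear (unbranched) glycan. It is an illustration under several assumptions, not the rule engine used in GlyDe: glycans are reduced to an ordered list of linkage labels, the DP Preference is modeled as an inclusive range of node counts, Exo is taken in its usual sense of acting on terminal linkages, and the Contained Sub-glycan and Glycan Released checks are omitted. The names `Rule` and `cleavable_positions` and the linkage labels are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class Rule:
    linkages_broken: Set[str]      # Linkages Broken field, e.g. {"b1-4"}
    mode: str                      # Endo Vs. Exo field: "Endo" or "Exo"
    dp_range: Tuple[int, int]      # DP Preference as inclusive (min, max) nodes
    terminal_side: str = "Both"    # "Reducing", "Non-reducing" or "Both"


def cleavable_positions(linkages: List[str], rule: Rule) -> List[int]:
    """Indices of the linkages of a linear glycan that the rule may break.

    linkages[i] joins node i and node i + 1. Index 0 is the leftmost
    (non-reducing) linkage and the last index is the rightmost (reducing) one.
    """
    n_nodes = len(linkages) + 1
    low, high = rule.dp_range
    if not low <= n_nodes <= high:
        return []
    last = len(linkages) - 1
    positions = []
    for i, linkage in enumerate(linkages):
        if linkage not in rule.linkages_broken:
            continue
        if rule.mode == "Endo":
            # Endo-acting: internal linkages only.
            if i != 0 and i != last:
                positions.append(i)
            continue
        # Exo-acting: terminal linkages only, on the permitted side.
        if i == 0 and rule.terminal_side in ("Non-reducing", "Both"):
            positions.append(i)
        elif i == last and rule.terminal_side in ("Reducing", "Both"):
            positions.append(i)
    return positions
```

For a five-node chain `["a1-4", "b1-4", "b1-4", "a1-4"]`, an Endo rule for `b1-4` may cut at positions 1 and 2, whereas an Exo rule for the same linkage finds no terminal match.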

Deacetylation rules

Some of the EC numbers we analyzed have a deacetylation activity, i.e. they have the

capability to remove acetyl groups. In KEGG Glycan, monosaccharides containing

an acetyl group are described as a single unique node, e.g. the node GlcNAc

corresponds to N-acetyl-glucosamine. Therefore, if an enzyme has the capability to

remove an acetyl group, we simply remove the substring "Ac" from the label of the

node and make it the product of the reaction, e.g. GlcNAc → GlcN + Ac.
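The label manipulation is simple enough to show directly. In this sketch the function name `deacetylate` is ours, and removing the last occurrence of "Ac" in the label is an assumption made for labels in which the substring might appear more than once.

```python
from typing import Tuple


def deacetylate(node_label: str) -> Tuple[str, str]:
    """Return (deacetylated node label, released group) for an acetylated node."""
    head, sep, tail = node_label.rpartition("Ac")
    if not sep:
        raise ValueError(f"node {node_label!r} carries no acetyl group")
    # Assumption: only the last "Ac" in the label is removed.
    return head + tail, "Ac"
```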

Reconstruction of new glycans

We manually constructed a set of 107 glycans that we deemed important but that

were missing from the KEGG Glycan database. To distinguish these glycans from

the ones previously available in the database we gave them the prefix "TAU" instead

of "G" that was given by KEGG. Furthermore, in most cases the products of the

degradation reactions do not have a pre-existing G number, meaning they currently

do not exist in the KEGG Glycan database. Working under the assumption that most

glycans in nature are still uncharacterized in databases, we decided to add them

automatically. Thus, whenever a reaction produced a new glycan, we gave this

glycan a unique ID beginning with "TAUS" (to distinguish it from original glycans

beginning with "G" or "TAU").

4.4.3. Data Analysis

Single taxa data analysis.

Defining microbial genome-specific Glycan Degradation (GlyDe) scores.

After building the CAZyme table we associated these CAZymes with the genomes of

the HMP gut taxa. For every taxon-specific gene we calculated, based on its

enzymatic annotation and those enzymes' subcellular localization, a GlyDe score.

Given a bacterial taxon i and a glycan j, the GlyDe score is calculated as follows:

$$\text{GlyDe score}_{ij} = \sum_{k} \frac{n_{ik}}{g_k}, \quad \forall\, e_k$$

where e_k is an enzyme that can degrade glycan j, n_ik is the number of genes in the

genome of taxon i which translate to enzyme e_k, and g_k is the number of glycans broken by

enzyme e_k.

This metric down-weights the contribution of CAZymes that are more promiscuous

relative to those specifically geared to degrade the glycan in question (Figure 1b).
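A minimal sketch of the genome-level score follows, assuming the gene counts per CAZyme and the set of glycans broken by each CAZyme are already available as dictionaries. The function name `glyde_scores` and all identifiers in the usage note are hypothetical.

```python
from typing import Dict, Set


def glyde_scores(gene_counts: Dict[str, int],
                 breaks: Dict[str, Set[str]]) -> Dict[str, float]:
    """GlyDe score of every glycan for one taxon.

    gene_counts[e] is the number of genes of the taxon annotated with enzyme e;
    breaks[e] is the set of glycans that enzyme e can break.
    """
    scores: Dict[str, float] = {}
    for enzyme, n_genes in gene_counts.items():
        glycans = breaks.get(enzyme, set())
        for glycan in glycans:
            # Each gene copy contributes 1 / (number of glycans the enzyme breaks).
            scores[glycan] = scores.get(glycan, 0.0) + n_genes / len(glycans)
    return scores
```

With `breaks = {"EC_A": {"G1", "G2"}, "EC_B": {"G1"}}` and `gene_counts = {"EC_A": 2, "EC_B": 3}`, glycan `G1` receives 2/2 + 3/1 = 4 and `G2` receives 2/2 = 1. The three copies of the specific enzyme dominate the score of `G1`.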

For some categories of glycans such as "Long Polysaccharides" and "Plant-specific

glycans", we defined a category specific score GSic, which is the sum of GlyDe

scores for glycans that belong in that group:

$$\text{GlyDe score}_{ic} = \sum_{j \in c} GS_{ij}$$

where GS_ij is the GlyDe score for genome i and glycan j, and c is the collection of

glycans that belong to the category.

This scoring system has the feature that summing the GlyDe scores over all the

glycans in a given genome gives the total number of CAZymes in the genome:

$$\text{Total CAZymes}_i = \sum_{j \in J} GS_{ij}$$

where J is the set of all glycans, and i is the index of a specific taxon.
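The reason the scores add up to a gene count can be spelled out in one line. The derivation below is our own restatement, using n_ik for the number of genes of taxon i that translate to enzyme e_k and g_k for the number of glycans that e_k breaks. It assumes the sum runs only over enzymes that break at least one glycan.

```latex
\sum_{j \in J} GS_{ij}
  = \sum_{j \in J} \; \sum_{k \,:\, e_k \text{ degrades } j} \frac{n_{ik}}{g_k}
  = \sum_{k} \frac{n_{ik}}{g_k} \, \bigl| \{\, j \in J : e_k \text{ degrades } j \,\} \bigr|
  = \sum_{k} \frac{n_{ik}}{g_k} \, g_k
  = \sum_{k} n_{ik}
```

Exchanging the order of summation groups the terms by enzyme, and each enzyme then appears exactly g_k times with weight 1/g_k.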

Subsequently, we defined the GlyDe Profile of a bacterial taxon as:

$$\text{GlyDe profile}_i = \left( GS_{i1}, GS_{i2}, \ldots, GS_{i,|J|-1}, GS_{i,|J|} \right)$$

where J is the set of all glycans, and i is the index of a specific taxon.

GlyDe reaction consistency check - cross validation.

When applied to all the glycans available in KEGG, the GlyDe pipeline produced a

list of 114,573 intermediate glycan products, most of which were novel and thus do

not appear in the original KEGG database. To test how consistent GlyDe is, we

performed a cross-validation process where we picked a random subset of 1,000

glycans from KEGG and applied GlyDe to degrade them. We then tested whether the

products obtained from these 1,000 glycans were enriched with known versus novel

intermediate glycans. A hypergeometric test indicated that the products were highly

enriched for known glycans (p-value = 10^-19; see Supplementary Figure 2a). A

sensitivity analysis with subsets of different initial random sets and sizes still resulted

in highly significant enrichments (data not shown). This result testifies that GlyDe is

capable of recapitulating the biochemical knowledge imprinted in the CAZymes that

constitute its computational foundation.
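For readers unfamiliar with the enrichment test, the upper tail of the hypergeometric distribution can be computed exactly with integer arithmetic. The sketch below is generic: the function name `hypergeom_upper_tail` and its argument names are ours, and the counts that produced the reported p-value are not given in this section, so none are assumed here.

```python
from math import comb


def hypergeom_upper_tail(k: int, population: int, successes: int, draws: int) -> float:
    """P(X >= k) when `draws` items are sampled without replacement from a
    population of size `population` containing `successes` marked items."""
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, x) * comb(population - successes, draws - x)
        for x in range(max(k, 0), upper + 1)
    )
    return favourable / comb(population, draws)
```

In the setting above, the marked items would be the known glycans, the draws the products obtained from the random subset, and k the number of those products that are known.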

Principal Coordinates Analysis on GlyDe profiles

We calculated the pairwise Bray-Curtis dissimilarities between all the GlyDe profiles

and performed Principal Coordinates Analysis (PCoA) on the resulting dissimilarity

matrix to project the differences in degradation into two dimensions (Supplementary

Figure 2d).
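The dissimilarity step can be sketched in a few lines. The code below computes the Bray-Curtis dissimilarity for non-negative profiles and the full pairwise matrix. The subsequent PCoA step (double-centering the matrix and taking its leading eigenvectors) is not shown. Function names are ours, and treating two all-zero profiles as identical is a convention we assume.

```python
from typing import List, Sequence


def bray_curtis(u: Sequence[float], v: Sequence[float]) -> float:
    """Bray-Curtis dissimilarity between two non-negative profiles."""
    if len(u) != len(v):
        raise ValueError("profiles must have the same length")
    total = sum(u) + sum(v)
    if total == 0:
        return 0.0  # assumed convention for two all-zero profiles
    return sum(abs(a - b) for a, b in zip(u, v)) / total


def dissimilarity_matrix(profiles: List[Sequence[float]]) -> List[List[float]]:
    """Symmetric matrix of pairwise Bray-Curtis dissimilarities."""
    return [[bray_curtis(p, q) for q in profiles] for p in profiles]
```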

GlyDe-related features definition.

Based on GlyDe, we extracted 6 features that characterize several dimensions of the

glycan degradation potential of a (meta-)genome. These features are the Plant-

specific, Animal-specific, Disaccharides, Oligosaccharides, Short Polysaccharides

and Long Polysaccharides GlyDe scores. Each feature represents the sum of GlyDe

scores for the glycans that belong in the class.

16S rRNA sequence data analysis.

The Human Microbiome Project (HMP) dataset.

We retrieved the 16S rRNA sequence data and metadata from fecal samples

belonging to 325 healthy human individuals from the HMP Data Analysis and

Coordination Center (DACC) [133]. Because we were not interested in time series

data we only used samples from the initial time point. 16S rRNA sequences were

mapped to the HMP genomes based on sequence similarity and further used to build

an OTU table describing the abundances of the HMP taxa in each sample. For this

purpose we used the QIIME software [134] with these exact commands:

pick_otus.py -i HMP_samples_seqs -r HMP_taxa_ref_seqs -m uclust_ref -C

make_otu_table.py -i pick_otus_output

Metagenomics sequence data analysis.

We retrieved the Muegge et al. [118] and the Yatsunenko et al. [121] datasets

from MG-RAST [135]. For both datasets we downloaded the FragGeneScan gene

calling output which maps each original read to 0, 1, or more ORFs. This way many

reads could be assigned to a single ORF and so the abundance of each ORF was

taken into account. Next, ORFs were assigned to CAZymes and to specific

subcellular localizations. We then constructed a CAZyme abundance table

describing the abundances of all the CAZymes in each sample. In order to calculate

a sample-specific GlyDe score we used the sample's CAZyme abundance table and

the pre-calculated CAZyme score table. Thus, the GlyDe score GSkj of glycan j in

sample k is:

$$GS_{kj} = \sum_{m} \frac{n_{km}}{g_m} \cdot \frac{D_{max}}{D_k}, \quad \forall\, e_m$$

where e_m is an enzyme that can degrade glycan j, n_km is the number of genes in sample

k that map to enzyme e_m, and g_m is the number of glycans broken by enzyme e_m. D_max/D_k

is a normalization factor that denotes the ratio between the depth (i.e. the total

number of sequenced reads) of the sample with the maximum depth (D_max) and the

depth of the current sample D_k.
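The effect of the depth normalization is easiest to see in code. The sketch below scales each sample's contribution by the ratio of the largest sequencing depth to its own depth. The function name `sample_glyde_scores`, the dictionary layout, and the sample and enzyme identifiers in the usage note are hypothetical.

```python
from typing import Dict, Set


def sample_glyde_scores(abundance: Dict[str, Dict[str, float]],
                        breaks: Dict[str, Set[str]],
                        depths: Dict[str, float],
                        sample: str) -> Dict[str, float]:
    """Depth-normalized GlyDe scores of all glycans for one metagenomic sample.

    abundance[s][e] is the number of genes in sample s mapped to enzyme e,
    breaks[e] is the set of glycans enzyme e can break, and depths[s] is the
    total number of sequenced reads in sample s.
    """
    factor = max(depths.values()) / depths[sample]  # D_max / D_k
    scores: Dict[str, float] = {}
    for enzyme, n_genes in abundance[sample].items():
        glycans = breaks.get(enzyme, set())
        for glycan in glycans:
            scores[glycan] = scores.get(glycan, 0.0) + n_genes / len(glycans) * factor
    return scores
```

Two samples with identical CAZyme counts but depths of 50 and 100 reads therefore receive different scores: the shallower sample is scaled up by a factor of 2.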

Subsequently, we defined the GlyDe Profile of sample k as:

$$GP_k = \left( GS_{k1}, GS_{k2}, \ldots, GS_{k,|J|-1}, GS_{k,|J|} \right)$$

We calculated all the GlyDe scores for the Muegge et al. dataset and for the

Yatsunenko et al. dataset. The GlyDe algorithm output tables contain the following

fields describing each sample in the data: Unique CAZymes, Total CAZymes,

KEGG-derived Glycans Degraded, Total GlyDe Score, Plant-specific GlyDe Score,

Animal-specific GlyDe Score, Bacteria-specific GlyDe Score, Disaccharides,

Oligosaccharides, Short Polysaccharides, Long Polysaccharides.

Multivariate regression between GlyDe features and bacterial abundance.

We analyzed the 16S rRNA sequences from the HMP fecal samples in order to

determine the abundance of our 203 bacterial taxa in each sample. To increase the

signal to noise ratio (SNR) we excluded species with high abundance variability,

retaining only those that met the following criterion:

$$\text{SNR} = \frac{\text{mean abundance}}{\text{standard deviation}} > 0.3$$
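A sketch of this filter is given below. The text does not say whether the sample or the population standard deviation was used; the sample standard deviation is assumed here. The function names `snr` and `keep_taxa` are ours.

```python
from math import inf
from statistics import mean, stdev
from typing import Dict, List, Sequence


def snr(abundances: Sequence[float]) -> float:
    """Mean abundance divided by the (sample) standard deviation."""
    sd = stdev(abundances)  # assumption: sample standard deviation
    return inf if sd == 0 else mean(abundances) / sd


def keep_taxa(table: Dict[str, Sequence[float]], threshold: float = 0.3) -> List[str]:
    """Names of the taxa whose SNR across samples exceeds the threshold."""
    return [taxon for taxon, values in table.items() if snr(values) > threshold]
```

A taxon that is absent from fifteen samples and has abundance 10 in a sixteenth has a mean of 0.625 and a standard deviation of 2.5, giving an SNR of 0.25, so it would be excluded.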

The 6 features described in the previous section were used to build a linear

regression model with bacterial abundance as the dependent variable:

Bacterial abundance = -13.248 * Plant-specific GlyDe Score + 28.1822 * Disaccharides -

9.6206 * Oligosaccharides + 32.701 * Long Polysaccharides + 42.7354

To eliminate the possibility of over-fitting the data we used a standard 10-fold cross

validation method. All calculations were performed using WEKA [136].
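The fitted model can be applied to a feature vector as shown below. The coefficients and the intercept are copied from the equation above; the function name and the dictionary keys are ours, and the sketch assumes that features are supplied on the same scale as the one used for fitting, which is not restated here.

```python
from typing import Dict

# Coefficients and intercept as reported in the regression equation above.
COEFFICIENTS = {
    "Plant-specific GlyDe Score": -13.248,
    "Disaccharides": 28.1822,
    "Oligosaccharides": -9.6206,
    "Long Polysaccharides": 32.701,
}
INTERCEPT = 42.7354


def predict_abundance(features: Dict[str, float]) -> float:
    """Evaluate the reported linear model; missing features count as zero."""
    return INTERCEPT + sum(
        coefficient * features.get(name, 0.0)
        for name, coefficient in COEFFICIENTS.items()
    )
```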

To assess the added value of using the GlyDe features over genomic information

alone, we defined for each HMP taxon a vector containing the genomic copy number

of the CAZymes used in the analysis. We then built a similar linear regression model

with these 82 CAZymes used as features and bacterial abundance in the HMP

samples as the dependent variable.

Because of the high variability in bacterial taxa abundance across the samples, we

used the KMeans algorithm to cluster the samples. We chose to use 3 clusters

because this option resulted in the lowest Cubic Clustering Criterion (CCC).

However, one cluster was composed of only one outlier taxon, so we omitted it from

further analysis. We built a cluster specific linear regression model as described

above:

Cluster 1 Bacterial abundance = -40.6116 * Plant-specific GlyDe Score + 33.283 * Long

Polysaccharides + 31.6149

Cluster 2 Bacterial abundance = -15.9773 * Oligosaccharides + 69.8184 * Long

Polysaccharides + 70.697

We defined a list of 25 bacterial taxa whose abundance was predicted with high accuracy. A taxon was

included in the list if the relative error of its predicted abundance obeyed the

following rule for either cluster 1 or cluster 2:

$$\frac{\text{predicted abundance} - \text{actual abundance}}{\text{actual abundance}} < 1$$

Classification of dietary patterns using GlyDe scores.

The GlyDe scores of all herbivore and carnivore mammals from the Muegge et al.

dataset were used to train a binary Support Vector Machine (SVM) classifier. The

SMO implementation of this classification algorithm in WEKA [136] was used for

the computation. To estimate the accuracy of the classifier we used a standard

leave-one-out cross validation. To apply this classifier to the remaining human and non-

human omnivore samples, we hid the labels of the samples and classified them as

either carnivore or herbivore.
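The leave-one-out procedure itself is independent of the classifier. The sketch below shows the loop with a nearest-centroid classifier standing in for the SVM. This stand-in is our substitution to keep the example self-contained and is not the SMO implementation used in this work; all function names are ours.

```python
from typing import Callable, Dict, List


def leave_one_out_accuracy(samples: List[List[float]], labels: List[str],
                           fit: Callable, predict: Callable) -> float:
    """Fraction of samples classified correctly when each is held out in turn."""
    hits = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        hits += predict(model, samples[i]) == labels[i]
    return hits / len(samples)


def fit_centroids(samples: List[List[float]], labels: List[str]) -> Dict[str, List[float]]:
    """Stand-in classifier: the mean profile of each class."""
    groups: Dict[str, List[List[float]]] = {}
    for x, y in zip(samples, labels):
        groups.setdefault(y, []).append(x)
    return {y: [sum(col) / len(rows) for col in zip(*rows)]
            for y, rows in groups.items()}


def predict_centroid(centroids: Dict[str, List[float]], x: List[float]) -> str:
    """Assign x to the class with the nearest centroid (squared Euclidean)."""
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))
```

Once the accuracy estimate is satisfactory, the classifier is fitted on all labeled herbivore and carnivore samples and `predict_centroid` (or the SVM equivalent) is applied to the unlabeled omnivore samples.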

Chapter 5

5. Discussion

At the heart of systems biology is the increasing acknowledgement that biological

systems are highly interconnected, and that studies of biological 'nodes' in isolation

are not generally sufficient to recapitulate the complex emergent properties of a full

network. Genome-scale metabolic modeling (GSSM) is an implementation of this

holistic approach of systems biology within the realm of single-cell

metabolism. Genome-scale metabolic models have proven to be crucial resources for

predicting organism phenotypes from genotypes in numerous medical,

bioengineering and bio-remediation applications. The effort needed to manually

develop models for new organisms, together with the lack of standard

nomenclatures for metabolites and reactions in manually curated models, has

restricted the focus of most research to date to isolated, non-interacting species;

however, it is well known that prokaryotes live and thrive in dense communities, and

the interactions of community members with each other as well as with the

environment determine much of the functionality, adaptability, and capabilities of the

whole group. While individual genome-scale models are adequate to predict the

behavior of cells in pure cultures, most natural systems on earth require modeling of

metabolic interactivity between species in order to capture the most relevant biology.

The appearance of semi-automatic tools for generation of metabolic models, as well

as the subsequent availability of models for thousands of sequenced prokaryotes, has

made it possible to climb one step up in the 'holistic ladder', from Genome-scale

metabolic modeling to Community genome-scale metabolic modeling (C-GSSM).

Although automatically generated models are still not of equivalent quality to

manually curated ones, they nevertheless open new avenues in the types of questions we may

ask. Importantly, these models solve the problem of differing metabolite and

reaction nomenclature between models -- a major technical hurdle in the past -- and

thus enable seamless modeling at the community level and comparisons between its

members.

In this dissertation I focused on the computational study of some questions that can

now be addressed due to the availability of a large set of 'normalized' prokaryotic

metabolic models.

5.1. Answering new types of questions with the

large number of available bacterial metabolic

models

In this dissertation I tried to address different classes of 'community' related

questions that could now be answered.

The first class of questions dealt with the metabolic relations between different

bacterial species. In Chapter 2, I conducted the largest study to date of interactions

between different microbial species. The second group of questions dealt with the

comparison between GSSMs of different species and extraction of common features.

In Chapter 3, I found a way to predict the growth rates of prokaryotes using GSSMs

in a way that outperforms the currently commonly used GSSM based method of

measuring biomass yield. Prokaryotes often live within and interact with a larger

host. In Chapter 4, I started examining the interaction between the gut microbiota and

the human host.

5.1.1. Metabolic Interaction within Bacterial communities

It is well known that prokaryotes live and thrive in dense communities, and the

interactions of community members with each other as well as with the environment

determine much of the functionality, adaptability, and capabilities of the whole

group. To date it has been difficult to predict which bacteria can stably co-exist, let

alone cooperate metabolically, making the artificial design of beneficial microbial

consortia extremely difficult.

In Chapter 2, we focused on the metabolic interactions between species, and the way

they affect the actual in-vitro and in-vivo interactions. We suggested a generic

approach for the systematic description of inter-species interactions. Our method has

its limitations. It solely focuses on the metabolic dimension, ignoring regulation as

well as the numerous strategies that microorganisms have evolved to augment the

acquisition of resources. The analysis lacks information on the true metabolic

composition of the environments considered, and hence focuses on predicting the

overall potential inter-species interactions, rather than providing a direct account of

their actual in-vivo communications in one specific environment. And finally, the

automatic reconstruction procedure for the GSSMs results in a significant increase in

the number of genome scale metabolic models, which, while proving useful in

predicting a variety of phenotypes, are typically less accurate than manually curated

models [25]. Yet, despite these significant limitations, our generic approach succeeds

in delineating clear differences in the interaction patterns of ecologically associated

versus randomly associated communities, and reveals fundamental new ecological

principles.

With the increasing efforts to provide an a-biotic description of different

environments, together with the expected rapid rise in the number of metabolic

models as well as the improvement in their quality, a community-level metabolic

modeling framework such as the one laid down here

provides a computational basis for many exciting future applications. These include

the artificial design of 'expert' communities for bioremediation, where currently the

selection of community species is done by intelligent guesswork. We can look at the

construction of 'expert' communities as an alternative to large scale gene insertion into

a single species, as is done today with limited success in many bioengineering projects.

Similarly, our work may be applied to the rational design of probiotic administration,

as well as to the identification of species that may metabolically out-compete

pathogenic species.

5.1.2. Extracting cell qualities from a large scale metabolic

analysis across a large number of species

The availability of GSSMs for thousands of prokaryotes with standard nomenclatures

for metabolites and reactions made it possible to compare between the different

GSSMs, and to try to explain phenotypic behaviors across species using these

models. In Chapter 3 we aimed to explain the principles that determine cell growth

by analyzing the behavior of many bacterial species across different growing media.

Understanding cell growth rate is a long-standing scientific goal, of interest in

biology and especially in biotechnology when designing bacteria-based

manufacturing plants. In this work, we presented a new method for predicting

cellular growth rate, termed SUMEX, which does not require any empirical variables

apart from a metabolic network (i.e., a GSSM) and the growth medium. SUMEX is

calculated by maximizing the SUM of molar EXchange fluxes (hence SUMEX) in a

genome-scale metabolic model. SUMEX correlated significantly with the growth

rate of microbes across species, environments, and genetic conditions, outperforming

traditional cellular objectives (most notably, the convention assuming biomass

maximization). The success of SUMEX suggested that the ability of a cell to

catabolize substrates and produce a strong proton gradient enables fast cell growth.

Easily applicable heuristics for predicting growth rate, such as what we demonstrate

with SUMEX, may contribute to numerous medical and biotechnological goals,

ranging from the engineering of faster-growing industrial strains and the modeling of mixed

ecological communities to the inhibition of cancer growth.

We expect that future work based on the principles listed here will lead to improved

biotechnology applications and perhaps even to new functional insights into cancer,

as well as filling in a basic gap in our understanding of cellular function. The work

done in Chapter 3 was the first work (to the best of my knowledge) comparing GSSMs

across a large number of species in order to learn about nature. There are many

questions and phenomena that can be researched using the same general principles. I will

introduce some of them later in this chapter.

5.1.3. Investigating the relations between gut bacterial

communities and their hosts

Many bacterial communities operate within a larger host and interact with it. Our

initial goal when we started what turned out to be Chapter 4 was to build a metabolic

model of the human gut with its microbiota and use it to explain gut related diseases.

We found out that all the existing prokaryote GSSMs, both manually and

automatically generated, had hardly any reference to glycans. This finding forced us

to change our initial goal. We then decided to focus on glycans, which form the

primary nutritional source of microbes in the human gut, and on developing a way to

automatically integrate them into the GSSMs. In Chapter 4 we presented a novel

computational pipeline for modeling Glycan Degradation (GlyDe), providing a broad

view of the usage of these compounds on genome and metagenome scales, and

integrating them into the GSSMs.

GlyDe predicts the usage patterns of thousands of glycans by each of the sequenced

individual gut bacteria deposited in the Human Microbiome Project (HMP) database.

We aimed to determine the association between diet and microbial glycan

metabolism in the gut. We found diet-driven adaptations at both the level of single

species and of communities. We found species belonging to Bacteroidetes to be the

most efficient degraders of animal-derived glycans and human milk

oligosaccharides. While this trend is apparent in both the Bacteroides and

Parabacteroides genera it is absent from Prevotella, another key member of that

phylum.

Diets that are high in animal protein have been associated with high levels of

Bacteroides, whereas enrichment of Prevotella was associated with diets that are rich

in plant-derived carbohydrates and very low in animal protein [121-123]. Given that

many dietary animal glycans are derived from proteins (i.e. glycoproteins and

proteoglycans), we proposed that the high capability of Bacteroides and

Parabacteroides to degrade animal glycans can explain why their abundance is

increased in westerners [123, 124].

We expect that the plethora of novel glycans generated by GlyDe, together with the

glycan degradation efficiencies predicted by our method, will be a highly important

tool for designing prebiotic interventions. We showed, as an example, a strong

observed relationship between degradation of long polysaccharides and the

abundance of Clostridia, which may help identify prebiotic interventions that

prevent Clostridium difficile infections by increasing levels of non-pathogenic

clostridia. The identification of such degradation-abundance relationships by GlyDe

is highly promising for future applications.

To the best of our knowledge, this is the first time that a computational framework

has been able to discriminate between carnivores and herbivores

based on their glycan degradation profiles.

GlyDe also supplied an automatic tool that can add glycan degradation

support to GSSMs. Using GlyDe, we can now build a compound model of the human

gut with its microbiota and use it to further advance our understanding of human

dietary needs, to try and explain gut related diseases, and to suggest the design of

future nutritional interventions.

There are many other questions related to the relationship between a bacterial

community and a host, many of them are related to diseases. The availability of

GSSMs for many of the pathogens enables us to model such systems using the

C-GSSMs.

5.2. Future directions

This dissertation contains some pioneering work done using the newly available

large set of prokaryotic GSSMs. The dissertation only scratched the surface of the

opportunities that opened when these models became available. Chapters 2-4 each

suggest additional future research avenues within the scope of their respective studies.

In the next few paragraphs, I explore a few additional ideas I find worth pursuing

now that these metabolic models are available, building upon the work presented

here.

5.2.1. New methods for simulating communities

In this dissertation we focused on modeling communities in steady state in a

chemostat. There are additional methods of utilizing metabolic models which are

currently used for single species and can be extended to community research. One of

the methods that removes the requirement for steady state is 'dynamic flux balance

analysis' as described in [137], which is currently used only for single species

models. This method can be viewed as a part of the 'cellular automata' family [138].

This method is excellent for simulating batch processing conditions and can also

assist in simulating large communities in a chemostat when the existing FBA based

methods might fail due to the large size of the Linear Programming problem they create.

The problem with cellular automata based methods is the number of parameters that

need to be set in order to simulate an in-vivo test. However, these parameters may

be considerably fewer than the number of parameters needed when using full kinetic

methods.

5.2.2. Specialized bacterial communities

The ability to design and simulate the action of a given bacterial community opens

many opportunities in the fields of bio-engineering and bio-remediation.

Bacterial communities can be designed, using the methods used in this dissertation,

to manufacture materials without using complex gene knockouts or gene insertions.

Currently many bio-remediation operations are done by specific bacterial

communities; however these communities were mostly selected by guesswork. Using

the tools listed in this work, better communities may be designed so that they will

outperform the existing ones.

5.2.3. Examples of organisms’ traits that can be extracted

using the large number of prokaryotes metabolic models

In Chapter 3 of this dissertation, we investigated 'growth rate' across a large number

of prokaryote species. There are many interesting questions which can be

investigated using the existing models. Following is a small list of such questions:

Is there a propagation of metabolic pathways across the phylogenetic tree?

What is the correlation between the phylogenetic distance of species and their

growing environment?

Are there metabolic factors that play an important role in determining the pH

sensitivity of species?

Currently the existing metabolic models are not accurate enough to give exact

answers on the species level, however when doing a large scale survey on many

species, the relevant qualities may provide a good signal in aggregate, and lead to

new insights.

5.2.4. Investigating relationships of specific communities of

bacteria and their host

In Chapter 4, we began to investigate the gut bacterial community within mammals

and especially in humans. There are many such bacterial communities in nature that

operate within the vicinity of a given host. Some are related to the research of

diseases and pathogens, and others are related to parasitic or symbiotic relationships.

An additional example, which I believe is very important, is that of specific

communities of bacteria that can act as fertilizers for plants, or that can act as

insecticides and help plants survive attacks from pathogens or insects.

5.3. Summary

The appearance of semi-automatic tools for generation of metabolic models, and the

subsequent availability of models for thousands of sequenced prokaryotes, have

opened a new research subfield within the field of metabolic modeling.

Comparing between species using GSSMs and building C-GSSMs shows a strong

potential in answering questions regarding species behaviors and in assisting us in

better utilizing prokaryotes in bioengineering applications. This dissertation has only

begun to address some of the key questions arising in this new field.

The current results as shown in this dissertation are promising; however it is clear

that the methods listed here are only partial and limited, mainly due to the fact that they

focus only on the metabolic aspect of the relationship between species, ignoring

regulation at the single species level, and the cell specialization at the community

level. It is also expected that when modeling communities, the assumption of 'steady

state' built into Flux Balance Analysis will become more limiting than it was when

researching a single cell.

I expect that in the future, regulation will be introduced into C-GSSMs at all levels,

and that a more dynamic metabolic modeling approach (cellular automata) will be

more commonly used.

The opening of a new research field is always exciting, and I consider myself blessed to

live in such exciting times.

Appendix 1. Supplementary data for Chapter 2

A.1.1 Supplementary Figures

Appendix 1::Supplementary Figure S1. Cooperation and competition levels of the ecological groups at different levels of

competition and resource overlap.

Bars represent standard error (bars are not shown for sample sizes <2; for very small values of standard error bars are shown in

red).

Appendix 1::Supplementary Figure S2: The frequency of resource overlap values between ecologically associated (black) and

non-associated (white) species pairs.

As shown, ecologically associated pairs differ in the pattern of distribution of their resource overlap values. This further

supports a non-random distribution of metabolic-demand similarity between ecologically-associated and non-associated pairs

whereas the peak observed for moderate values (~0.5) arises mainly from the contribution of co-occurring pairs.

A.1.2 Supplementary Tables

Appendix 1::Supplementary Table S1. Description of model species and selected properties.

The table is sorted according to the fraction of winning events. Species seed ids are as in [25]. Fractions of regulatory genes were retrieved from [139]. General environmental complexity estimates (1- obligatory symbionts; 2- specialized; 3- aquatic; 4-

facultative host-associated; 5- multiple; 6- terrestrial species) were obtained from [140]. Minimal doubling time information was

retrieved from [78].

Species' name | Species' seed id | Maximal Biomass Production Rate (MBR) | Fraction of winning events | Fraction of regulatory genes | Estimate of environmental diversity | Minimal doubling time

Dehalococcoides

ethenogenes 195 Core243164_3 4.3 0 19

Thiomicrospira

crunogena XCL-2 Core39765_1 4 0 1

Bartonella

bacilliformis

KC583 Core360095_3 3.8 0

Mycoplasma

genitalium G-37 Core243273_1 1.2 0 0 1 12

Anaplasma

marginale str. St.

Maries Core234826_3 4.7 0 21.6

Buchnera

aphidicola str. APS

(Acyrthosiphon

pisum) Core107806_1 3.5 0 0 1

Treponema

pallidum subsp.

pallidum str.

Nichols Core243276_1 0.9 0

Borrelia

burgdorferi B31 Core224326_1 3.6 0 4 4

Bifidobacterium

longum NCC2705 Core206672_1 11.4 0.1 0.1 4 1.51


Wolbachia sp.

endosymbiont of

Drosophila

melanogaster Core163164_1 13 0.1

Coxiella burnetii

RSA 493 Core227377_1 8.8 0.1 5 8

Ehrlichia

ruminantium str.

Gardel Core302409_3 14 0.1

Rickettsia

prowazekii str.

Madrid E Core272947_1 10.3 0.1

Gluconobacter

oxydans 621H Core290633_1 8.1 0.1 0.94

Blochmannia

floridanus Core203907_1 11.7 0.1 1 36

Thiomicrospira

denitrificans

ATCC 33889 Core326298_3 7.6 0.1

Tropheryma

whipplei str. Twist Core203267_1 21.5 0.1

Aquifex aeolicus

VF5 Core224324_1 8 0.1 0 2 1.8

Helicobacter pylori

26695 Core85962_1 23.3 0.2 0 4 2.4

Streptococcus

pneumoniae R6 Core171101_1 31.3 0.2 0 4

Streptococcus

pneumoniae

TIGR4 Core170187_1 31.3 0.2 0 4 0.5

Thermoanaerobact

er sp. X514 Core399726_4 34.5 0.2

Carboxydothermus

hydrogenoformans

Z-2901 Core246194_3 24 0.2 2

Onion yellows

phytoplasma OY-

M Core262768_1 20.3 0.2


Thermotoga

maritima MSB8 Core243274_1 21 0.2 0 2 1.2

Mycoplasma

pulmonis UAB

CTIP Core272635_1 22 0.2 0 1 1.5

Streptococcus

thermophilus

CNRZ1066 Core299768_3 29.1 0.2

Ureaplasma

parvum serovar 3

ATCC 700970 Core273119_1 16.1 0.2

Legionella

pneumophila

subsp.

pneumophila str.

Philadelphia 1 Core272624_3 24.9 0.2 0 3.3

Idiomarina

loihiensis L2TR Core283942_3 26.9 0.2

Xylella fastidiosa

9a5c Core160492_1 16.5 0.3 0 1 5.13

Campylobacter

jejuni subsp. jejuni

84-25 Core360110_3 26.6 0.3

Neisseria

gonorrhoeae FA

1090 Core242231_4 25 0.3 0.58

Campylobacter


NCTC 11168 Core192222_1 26.6 0.3 5 1.5

Chlamydia

trachomatis

D/UW-3/CX Core272561_1 23.2 0.3 1 24

Haemophilus

influenzae Rd

KW20 Core71421_1 41.2 0.3

Leifsonia xyli

subsp. xyli str.

CTCB07 Core281090_3 46 0.3 5


Symbiobacterium

thermophilum IAM

14863 Core292459_1 33.8 0.3 4.2

Desulfovibrio

desulfuricans G20 Core207559_3 24.6 0.3 0

Elusimicrobium

minutum Pei191 Core445932_3 30.6 0.3

Chlamydophila

pneumoniae AR39 Core115711_7 22.9 0.3 0 1

Campylobacter


CF93-6 Core360111_3 26.6 0.3

Bdellovibrio

bacteriovorus

HD100 Core264462_1 21.9 0.3 1.4

Zymomonas

mobilis subsp.

mobilis ZM4 Core264203_3 21.8 0.3 2

Pseudomonas

putida KT2440 Core160488_1 78.8 0.4 0.1 5 1.1

Bordetella

pertussis Tohama I Core257313_1 46.9 0.4 0.1 1 3.8

Lactococcus lactis

subsp. lactis Il1403 Core272623_1 63.8 0.4 0.1 5 0.7

Lactobacillus

plantarum WCFS1 Core220668_1 80.7 0.4 0.1 4 1.6

Kineococcus

radiotolerans

SRS30216 Core266940_1 48.7 0.4

Bacteroides fragilis

YCH46 Core295405_3 28 0.4 0.63

Clostridium tetani

E88 Core212717_1 56.4 0.4 0 5 0.5

Cytophaga

hutchinsonii ATCC

33406 Core269798_12 48 0.5 0

Salinibacter ruber

DSM 13855 Core309807_5 20.7 0.5 14


Clostridium

acetobutylicum

ATCC 824 Core272562_1 83.8 0.5 0.1 5 0.58

Mannheimia

succiniciproducens

MBEL55E Core221988_1 87.4 0.5 0.6

Anaeromyxobacter

dehalogenans 2CP-C Core290397_13 72.9 0.5 9.2

Francisella

tularensis subsp.

tularensis Schu 4 Core177416_3 57.3 0.5 3

Caulobacter

crescentus CB15 Core190650_1 36.3 0.5 0.1 3 1.5

Nitrosococcus

oceani ATCC

19707 Core323261_3 44.7 0.5

Listeria

monocytogenes

J0161 Core393130_3 55.5 0.5

Nitrosomonas

europaea ATCC

19718 Core228410_1 41.3 0.5 0 5 18.5

Francisella

tularensis subsp.

novicida U112 Core401614_5 59.8 0.5 3

Frankia sp. Ccl3 Core106370_11 71.8 0.5

Neisseria

meningitidis MC58 Core122586_1 30 0.5 0 4

Acinetobacter sp.

ADP1 Core62977_3 92 0.5

Listeria

monocytogenes

FSL J1-194 Core393117_3 55.5 0.5

Nocardia farcinica

IFM 10152 Core247156_1 53.2 0.6 3

Magnetospirillum

magneticum AMB-1 Core342108_5 46.3 0.6 0

Staphylococcus

aureus subsp.

aureus N315 Core158879_1 94.8 0.7 0 4 0.4

Methylococcus

capsulatus str. Bath Core243233_4 55.5 0.7 1.87

Acinetobacter

baumannii ATCC

17978 Core400667_4 73.8 0.7

Streptomyces

coelicolor A3(2) Core100226_1 94.3 0.7 0.1 5 2.2

Corynebacterium

glutamicum ATCC

13032 Core196627_4 64.5 0.7 5 1.2

Flavobacterium

johnsoniae UW101 Core376686_6 41.4 0.7

Thiobacillus

denitrificans

ATCC 25259 Core292415_3 61.5 0.7

Leptospira

interrogans serovar

Copenhageni str.

Fiocruz L1-130 Core267671_1 64.1 0.7

Listeria innocua

Clip11262 Core272626_1 94 0.7 0.1 5 0.6

Pseudomonas

fluorescens PfO-1 Core205922_3 104.9 0.7

Pseudoalteromonas

haloplanktis

TAC125 Core326442_4 48.8 0.7 0.5

Rhizobium

leguminosarum bv.

viciae 3841 Core216596_1 105.3 0.8

Staphylococcus

aureus subsp.

aureus COL Core93062_4 123.9 0.8


Rhodopseudomona

s palustris CGA009 Core258594_1 69.3 0.8 9

Methylobacillus

flagellatus KT Core265072_7 76 0.8 2

Agrobacterium

tumefaciens str.

C58 Core176299_3 91.7 0.8 0.1 5

Rubrobacter

xylanophilus DSM

9941 Core266117_6 70.7 0.8 3.85

Brucella melitensis

16M Core224914_1 72.8 0.8 0 4 2

Vibrio cholerae

O395 Core345073_6 127.3 0.8 0.2

Burkholderia

pseudomallei

K96243 Core272560_3 99.5 0.8 1

Ralstonia

solanacearum

GMI1000 Core267608_1 135.8 0.8 0.1 5 4

Vibrio vulnificus

YJ016 Seed196600_1 128.1 0.8 0

Staphylococcus

aureus subsp.

aureus NCTC 8325 Core93061_3 94.8 0.8

Yersinia pestis

Pestoides F Core386656_4 104.2 0.8

Clostridium

beijerinckii

NCIMB 8052 Core290402_34 120.3 0.8

Pseudomonas

putida GB-1 Core76869_3 104.7 0.8

Vibrio

parahaemolyticus

RIMD 2210633 Core223926_1 118.7 0.8 0 4 0.2

Yersinia pestis

CO92 Core214092_1 103.5 0.8 0.1 5 1.25


Mycobacterium

tuberculosis

H37Rv Core83332_1 69.4 0.8 0 4 19

Polaromonas sp.

JS666 Core296591_1 70.4 0.8

Staphylococcus

aureus subsp.

aureus Mu50 Core158878_1 94.8 0.8 0 4

Bradyrhizobium

japonicum USDA

110 Core224911_1 107.2 0.9 0.1 4 20

Shewanella

frigidimarina

NCIMB 400 Seed318167_10 92.5 0.9

Bacillus subtilis

subsp. subtilis str.

168 Opt224308_1 211.7 0.9 0.1 6 0.43

Escherichia coli

W3110 Core316407_3 247.1 0.9 4

Pseudomonas

aeruginosa PAO1 Core208964_1 158.4 0.9 0.1 5

Sinorhizobium

meliloti 1021 Core266834_1 202.8 0.9 0.1 5 1.5

Burkholderia

cepacia R1808 Core269482_1 150.6 0.9

Klebsiella

pneumoniae MGH

78578 Core272620_3 205.4 0.9

Bacillus anthracis

str. 'Ames

Ancestor' Core261594_1 167.7 0.9

Listeria

monocytogenes

EGD-e Core169963_1 87.9 0.9 0.1 5 1

Shigella flexneri

2a str. 2457T Core198215_1 183.2 0.9 4

Shigella

dysenteriae M131649 Core216598_1 145.6 0.9

Bacillus anthracis

str. Ames Core198094_1 171.2 0.9 0.1 0.5

Photobacterium

profundum SS9 Core298386_1 156.1 1 2.5

Photorhabdus

luminescens subsp.

laumondii TTO1 Core243265_1 116.5 1 0.1 0.5

Vibrio cholerae O1

biovar eltor str.

N16961 Core243277_1 134.5 1 0 4 0.2

Salmonella

typhimurium LT2 Core99287_1 229 1 0.1 4 0.4

Escherichia coli

K12 Core83333_1 250.2 1 0.1 4 0.35

Shewanella

oneidensis MR-1 Seed211586_1 91.3 1 0 5 0.66

Appendix 1::Supplementary Table S6. The list of EnvO niches used in the analysis and the number of assigned samples.

Envo ID Niche description Number of samples

ENVO:00000063 water body 541

ENVO:00001998 soil 489

ENVO:00002007 sediment 276

ENVO:00002006 water 258

ENVO:00002044 sludge 113

ENVO:01000009 biotic mesoscopic physical object 98

ENVO:00000023 stream 93

ENVO:00002002 food 66

ENVO:00002264 waste 58

ENVO:00000076 mine 52

ENVO:00002031 anthropogenic habitat 52

ENVO:00000176 elevation 48

ENVO:00002009 terrestrial habitat 48

ENVO:00001995 rock 40


ENVO:01000001 mud 37

ENVO:00002985 oil 36

ENVO:00000043 wetland 34

ENVO:00000131 glacial feature 25

ENVO:02000019 bodily fluid 23

ENVO:00000104 undersea feature 21

ENVO:00000013 cave system 20

ENVO:00000479 mouth 18

ENVO:00002204 contamination feature 18

ENVO:00002170 compost 17

ENVO:00000094 volcanic feature 14

ENVO:00000303 coast 13

ENVO:00000073 building 12

ENVO:00000291 drainage basin 12

ENVO:00000026 well 12

ENVO:00002005 air 10

ENVO:00000077 agricultural feature 10

ENVO:00003030 silage 9

ENVO:00002008 dust 8

ENVO:00003869 straw 8

ENVO:00000463 harbor 7

ENVO:00000182 plateau 6

ENVO:00000309 depression 6

ENVO:00000395 channel 5

ENVO:00000130 reef 5

ENVO:00002982 clay 5

ENVO:00010505 aerosol 3

ENVO:00000091 beach 3

ENVO:01000010 abiotic mesoscopic physical object 3

ENVO:00000097 desert 3

ENVO:00002272 waste treatment plant 3

ENVO:00000475 inlet 3

ENVO:00002226 borehole 3

ENVO:00002040 wood 2

ENVO:00000086 plain 2

ENVO:00000304 shore 2


ENVO:00000175 karst 2

ENVO:00002000 slope 2

ENVO:00000049 volcanic hydrographic feature 2

ENVO:00000062 populated place 1

ENVO:00002039 bone 1

ENVO:00000474 cut 1

ENVO:00000562 park 1

ENVO:00005738 foam 1

ENVO:00000098 island 1

A.1.3 Supplementary Methods

A.1.3.1 Computing the Maximal Biomass Production Rate (MBR) of species

Constraint-Based Modeling (CBM) was used in order to simulate co-growth in two-

species systems, where species are represented by genome-scale metabolic models.

Briefly, in these models, a stoichiometric matrix (S) is used to encode the

information about the topology and mass balance in a metabolic network, including

the complete set of enzymatic and transport reactions in the system and its biomass

reaction. Reactions are inferred from genome annotations and specialized prediction

tools. Given a metabolic model, Constraint-Based Modeling (CBM) provides a

solution space in terms of predicted fluxes that is consistent with the constraints set

up by the model. Flux balance analysis (FBA)[7] is a CBM method that further

constrains the solution space by solving a linear problem of maximizing or

minimizing a biomass production rate objective function[141, 142]. The biomass

production rate describes the rate of production of a set of metabolites required for

cellular growth, where a higher biomass flux corresponds with a faster growth rate

of the organism[143].
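As a compact restatement of the FBA problem described above, the following is a sketch in standard linear-programming notation. The symbols c, v, lb and ub do not appear in the original text and are introduced here only for illustration; S is the stoichiometric matrix defined above.

```latex
\begin{aligned}
\max_{v} \quad & c^{\top} v \\
\text{subject to} \quad & S\,v = 0, \\
& lb_j \le v_j \le ub_j \quad \text{for every reaction } j .
\end{aligned}
```

Here v is the vector of reaction fluxes, c is a vector that selects the biomass reaction, and the equality S v = 0 expresses the steady-state mass balance.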

Here, 160 metabolic models were retrieved from The Seed's metabolic models

section (http://seed-viewer.theseed.org/seedviewer.cgi?page=ModelViewer)[25]. The

models are automatically constructed by a pipeline that starts with a complete


genome sequence as an input and integrates numerous technologies such as genome

annotation, reaction network annotation and assembly, determination of reaction

reversibility, and model optimization to fit experimental data. For each species we

calculate its maximal biomass production rate (MBR) by assuming that all exchange

reactions can be potentially fully active (which is equivalent to assuming a rich

medium). The upper and lower bounds of exchange and non-exchange reactions are

conventionally set as follows:

For irreversible reactions:

Exchange reactions:

0 ≤ Vi,ex ≤ Vi,Max_ex (Vi,Max_ex = 1000)

Non-exchange reactions:

0 ≤ Vi ≤ Vi,Max (Vi,Max = 1000)

For reversible reactions:

Exchange reactions:

Vi,Min_ex ≤ Vi,ex ≤ Vi,Max_ex (Vi,Min_ex = -50; Vi,Max_ex = 1000)

Non-exchange reactions:

Vi,Min ≤ Vi ≤ Vi,Max (Vi,Min = -1000; Vi,Max = 1000)
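The bound convention listed above can be summarized in a few lines of code. This is a minimal sketch only; the function name `flux_bounds` and its two boolean arguments are illustrative assumptions and are not part of the SEED pipeline or of the solver setup used in this work.

```python
def flux_bounds(is_exchange: bool, is_reversible: bool) -> tuple[float, float]:
    """Return (lower, upper) flux bounds following the convention listed above."""
    upper = 1000.0
    if not is_reversible:
        # Irreversible reactions carry only non-negative flux.
        return (0.0, upper)
    # Reversible exchange reactions allow uptake down to -50,
    # reversible non-exchange reactions run down to -1000.
    lower = -50.0 if is_exchange else -1000.0
    return (lower, upper)
```

For example, a reversible exchange reaction receives the bounds (-50, 1000), so uptake (negative flux) is capped at 50 units.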

Simulations were run with the "ILOG CPLEX" solver on the "Condor"
platform[144]. After filtering out different strains of the same species and
models that either had no biomass reaction defined or whose biomass reaction
could not be activated, we were left with a final set of 118 models (see
Supplementary Table S1). In addition to the 118 bacterial models, a metabolic model


for H. walsbyi (archaea) was constructed by using the SEED tool. This model was

only used for growth simulations with Salinibacter ruber.

A.1.3.2 Generation of a multi-species system metabolic model

Our approach for generating multi-species models follows the definition employed

by[47]. Briefly, we converted the model of each organism into a compartment in a

multi-species system. For two species A and B this system consists of:

[CA]=cytoplasm compartment of species A; [CB]=cytoplasm compartment of species

B; and [EAB]=Extra-cellular compartment of species A and B. CA and CB include all

non-exchange and transport reactions of the corresponding species. EAB includes the

union of the exchange reactions of A and B. The objective function of the multi-

species system was defined as the sum of the biomass reactions of the member

organisms. This method is used in our setup for simulating pairwise growth but is

applicable for any number of organisms. Applying the multi-species system analysis

to all possible pairwise combinations of our 118 metabolic models, we examined

6903 unique pairs whose growth can be simulated under a range of environments.
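The compartment construction and the pair count can be illustrated with a toy sketch. The dictionary layout, the function name `join_models` and the reaction identifiers below are hypothetical and chosen only for illustration; real models would carry stoichiometry and bounds as well.

```python
from itertools import combinations


def join_models(model_a, model_b):
    """Combine two toy single-species models into one two-species system.

    Each model is a dict with 'internal' (non-exchange and transport
    reactions) and 'exchange' (exchange reactions), both sets of ids.
    """
    return {
        "CA": set(model_a["internal"]),   # cytoplasm compartment of species A
        "CB": set(model_b["internal"]),   # cytoplasm compartment of species B
        # Shared extracellular compartment: union of both exchange sets.
        "EAB": set(model_a["exchange"]) | set(model_b["exchange"]),
    }


# All unordered pairs among 118 models.
n_models = 118
n_pairs = len(list(combinations(range(n_models), 2)))

# Toy example with made-up reaction ids.
toy_a = {"internal": {"A_r1", "A_r2"}, "exchange": {"EX_glc", "EX_o2"}}
toy_b = {"internal": {"B_r1"}, "exchange": {"EX_glc", "EX_nh4"}}
system = join_models(toy_a, toy_b)
```

The pair count equals 118 x 117 / 2, which matches the 6903 unique pairs stated above.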

A.1.3.3 Computing a Competition-inducing Medium (COMPM) for single species and multi-species systems

For a single species model A, COMPM is defined as the ranges of fluxes of the

exchange reactions (VCOMPM,A) that support its maximal biomass production rate (MBR),

when all exchange metabolites are provided at the minimal required amount. The

latter is found by Flux Variability Analysis (FVA)[7], where Vi, Min_FVA then denotes

the lower limit (maximal flux of metabolites into a compartment) of a given reaction.

Following our definitions, an increase in (the negative) Vi, Min_FVA will effectively

limit the flux of metabolites into the compartment and prevent species A from

reaching its MBR. Notably, for the large majority of exchange reactions (70%-80%)

Vi, Min_FVA = Vi, Max_FVA. To relate COMPM environments to real ecological


conditions we verified that species inhabiting similar environments tend to have

similar metabolic profiles, as previously demonstrated in[64]. As documented in

many laboratory experiments, typical limiting factors in COMPM environments

include oxygen, glucose and nitrogen sources (Supplementary Note 4). Finally,

computational simulations predicting the effect of removing chosen
metabolites on species growth were experimentally tested, supporting the ability of
the models to identify growth-limiting factors (Supplementary Note 7). The full
description of VCOMPM, A is provided in Supplementary Table S10.

A similar computation is done for species B and then, in the multi-species system of

A and B, we define:

VCOMPM, AB = VCOMPM, A ∪ VCOMPM, B

Vi, Min_FVA,AB = Min(Vi, Min_FVA,A , Vi, Min_FVA,B)

so that the lower bound of each reaction is set according to the lower FVA limit,

considering the species involved. By definition, at individual growth, COMPMAB

allows A and B to reach their MBR. However, at co-growth, any resource overlap
will prevent species A and B from simultaneously reaching their MBR, and will reveal potential
sources of competition. The full description of VCOMPM, AB is provided in
Supplementary Table S11.
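The rule for combining two single-species COMPM definitions can be sketched as follows. The function name `merge_compm`, the dictionary representation and the numeric values are assumptions made for illustration; only the union of exchange reactions and the minimum over the lower FVA limits come from the definition above.

```python
def merge_compm(lower_a, lower_b):
    """Combine per-species COMPM lower bounds into the two-species COMPM.

    Each argument maps an exchange reaction id to its lower FVA limit
    (a negative number denotes uptake). Reactions present in only one
    species keep that species' limit.
    """
    merged = {}
    for rxn in set(lower_a) | set(lower_b):       # union of exchange reactions
        limits = [m[rxn] for m in (lower_a, lower_b) if rxn in m]
        merged[rxn] = min(limits)                 # the more negative limit wins
    return merged


# Toy values, not taken from the supplementary tables.
compm_a = {"EX_glc": -10.0, "EX_o2": -20.0}
compm_b = {"EX_glc": -4.0, "EX_nh4": -3.0}
compm_ab = merge_compm(compm_a, compm_b)
```

In this toy case both species take up glucose, and the shared lower bound of -10 suffices for either species alone but not for both at once, which is how the overlap becomes a source of competition.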

A.1.3.4 Computing a Cooperation-inducing Medium (COOPM)

A cooperation-inducing medium (COOPM) for a multi-species system is defined

here as a set of metabolites that allows the system to obtain a positive growth rate

(above a certain predetermined threshold, which may yet be far from optimal), and

such that the removal of any metabolite from the set would force the system to have

no such solution. A feasible solution in this context is defined as one achieving at


least 10% of the joint MBR obtained when grown on a rich medium (COMPM). The

use of other MBR thresholds is examined at Supplementary Note 5. COOPM is

calculated using mixed integer linear programming as in[145, 146], as described

below:

As a first step we start with VCOMPM, AB, a set of exchange reactions flux ranges as

defined above. We then solve a minimization problem which uses, in addition to the

usual FBA constraints: (i) a constraint on minimal growth rates, VBM-COOPM ≥ 0.1 x

VBM-COMPM where VBM-COOPM and VBM-COMPM are the Biomass Production rates on

COOPM and COMPM, respectively; and (ii) a constraint expressing whether or not an
exchange metabolite i is consumed: Vi, COOPM, AB + Vi,min θi ≥ Vi,min, where Vi, COOPM, AB
is the flux through exchange reaction i, and Vi, COOPM, AB ≤ 0 when
metabolite i is consumed (negative flux). Here, the binary variable θi attains a
value of 1 if metabolite i is not consumed (Vi, COOPM, AB ≥ 0) by any of the organisms,
and 0 otherwise.
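To see why constraint (ii) behaves as an on/off switch, it helps to write out the two cases of the binary variable. The short derivation below is a sketch that assumes the lower bound Vi,min of an exchange reaction is negative (uptake allowed), as in the COMPM definition.

```latex
\begin{aligned}
\theta_i = 1:\quad & V_{i,COOPM,AB} + V_{i,\min} \ge V_{i,\min}
  \;\Longrightarrow\; V_{i,COOPM,AB} \ge 0
  \quad \text{(no uptake of metabolite } i\text{)} \\
\theta_i = 0:\quad & V_{i,COOPM,AB} \ge V_{i,\min}
  \quad \text{(uptake allowed down to the lower bound)}
\end{aligned}
```

Maximizing the number of variables set to 1 therefore maximizes the number of metabolites that are not taken up.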

Identifying a minimal set of metabolites in a medium then amounts to maximizing

the sum of the θi variables over all metabolites in Vi, COMPM, AB . Overall, the

optimization problem can be expressed as follows:

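\[
\begin{aligned}
\max \quad & \sum_{i \in V_{COMPM,AB}} \theta_i \\
\text{subject to:} \quad & S \cdot V = 0 \\
& V_{j,\min} \le V_j \le V_{j,\max} \quad \forall\, V_j \in V \\
& V_{BM\text{-}COOPM} \ge V_{BM\text{-}COMPM} / 10 \\
& V_{i,COOPM,AB} + V_{i,\min}\,\theta_i \ge V_{i,\min} \\
& \theta_i \in \{0,1\}, \quad i \in V_{COMPM,AB}
\end{aligned}
\]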


The bounds on the active exchange reactions are set to their COMPM value. The full

description of VMM, AB is provided at Supplementary Table S12.
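The idea of a smallest growth-supporting metabolite set can be illustrated without a MILP solver. The sketch below swaps in an exhaustive search over subsets, which is not the mixed integer formulation used here and is practical only for tiny examples. The function `smallest_supporting_medium`, the predicate `toy_growth` and the metabolite names are hypothetical; in a real setting the predicate would be an FBA run checking the 10% growth threshold.

```python
from itertools import combinations


def smallest_supporting_medium(candidates, supports_growth):
    """Return the smallest subset of `candidates` accepted by `supports_growth`.

    Subsets are tried by increasing size, so the first hit has minimal
    cardinality. Returns None if no subset is accepted.
    """
    ordered = sorted(candidates)
    for size in range(len(ordered) + 1):
        for subset in combinations(ordered, size):
            if supports_growth(set(subset)):
                return set(subset)
    return None


def toy_growth(medium):
    # Hypothetical rule standing in for the FBA feasibility check:
    # growth needs nh4 plus either glc, or dha together with o2.
    return {"glc", "nh4"} <= medium or {"dha", "nh4", "o2"} <= medium


minimal = smallest_supporting_medium({"glc", "nh4", "dha", "o2"}, toy_growth)
```

Removing either metabolite from the returned set makes the toy predicate fail, which mirrors the definition of COOPM given above.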

A.1.3.5 Experimental and computational co-growth analysis

Co-growth experiments were conducted for all pairwise combinations of
five species, all non-pathogenic and capable of growing in IMM. The

species and their seed models are the following: Listeria innocua Clip11262



subtilis str. 168 (Opt224308_1).

Growth experiments for each individual and pairwise combinations were conducted

in IMM defined medium[147], in 96-well plates at 30°C with continuous shaking,

using the Biotek ELX808IU-PC microplate reader. Optical density was measured

every 15 minutes at a wavelength of 595nm.

A simulated medium was designed to match the defined medium with minimal

modifications allowing co-growth (Table SM-1). For L. innocua (strain Clip11262)

and A. tumefaciens (strain C58), we both predict and observe a neutral interaction in

the given media. That is, the Sum of Individual Growth (SIG) approximately equals

the total Co-Growth in a multi-species system (CG). To simulate a negative shift

(SIG/CG > SIG/CGneutral_medium), co-growth simulations were conducted when adding

all one- and two-compound combinations of exchange metabolites to the simulated

IMM (considering all exchange metabolites of the given species). To simulate a

positive shift (SIG/CG < SIG/CGneutral_medium), co-growth simulations were
conducted after subtracting from the simulated IMM all combinations of one or two of
the compounds pertaining to it. Co-growth simulations were conducted across all

subtraction/addition combinations; subsequently we chose the media inducing the

most prominent shifts. A table describing co-growth patterns across all reductive

combinations is provided at Supplementary Table S13. Based on the selected


predictions, the experimental media were modified by adding thymidine and xylose

(for a negative shift) and by the subtraction of thiamine and glucose (positive shift).

The growth experiments for the additional 9 bacterial pairs are shown in Supplementary

Note 1. Growth experiments in additional selected shifted media are described at

Supplementary Note 7.
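The enumeration of modified media described above (adding or subtracting every one- and two-compound combination) can be sketched as follows. The function names and the three-compound base medium are illustrative assumptions; the actual simulated IMM is listed in Table SM-1.

```python
from itertools import combinations


def one_and_two_compound_sets(compounds):
    """All subsets of size one or two, in a reproducible order."""
    ordered = sorted(compounds)
    return [set(c) for k in (1, 2) for c in combinations(ordered, k)]


def candidate_media(base_medium, exchange_metabolites):
    """Return (media with additions, media with subtractions)."""
    base = set(base_medium)
    addable = set(exchange_metabolites) - base
    added = [base | extra for extra in one_and_two_compound_sets(addable)]
    removed = [base - gone for gone in one_and_two_compound_sets(base)]
    return added, removed


# Toy base medium and exchange pool, far smaller than the real IMM.
base = {"glucose", "thiamine", "oxygen"}
pool = base | {"xylose", "thymidine"}
added, removed = candidate_media(base, pool)
```

Each candidate medium would then be passed to the co-growth simulation, and the media producing the most prominent shift in SIG/CG would be selected.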

Appendix 1::Table SM-1. IMM defined medium and its in silico representation.

Modifications of IMM were done using the same algorithm used for selecting a minimal media (Supplementary Methods), aiming to find the minimal set of metabolites which are necessary to support co-growth. The same in silico media were used for

all pairwise combinations.

Metabolite In vitro medium In silico medium

Thiamin + +

D-Methionine + +

Magnesium + +

L-Valine + +

L-Isoleucine + +

L-Leucine + +

L-Histidine + +

Calcium + +

D-Glucose-6-phosphate + +

Potassium + +

Citrate + +

L-Arginine + +

L-Tryptophan + +

L-Phenylalanine + +

Biotin + +

Riboflavin + +

Adenine + +

Pyridoxal + +

Nicotinamide_D-ribonucleotide + +

L-Glutamine + +

L-Cysteine + +

Lipoic acid + -

para-aminobenzoic acid + -


Oxygen + +

Cytosine - +

Zinc - +

Cobalt - +

Fe2+ - +

Chloride - +

Sulfate - +

Copper2 - +

Manganese - +

Spermidine - +

gly-asn-L - +

sn-Glycerol-3-phosphate - +

octadecanoate - +

A.1.3.6 Finding close cooperative loops in real and random networks of give-take interactions and in real and randomly drawn communities

Starting from the network of 'give-take' interactions we derived two sub-networks:

the ecologically-associated sub-network including the edges between ecologically-

associated species, and the non ecologically-associated sub-network including the

edges between the non associated species. The original network is provided at

Supplementary Table S2, indicating the type of ecological association corresponding

to each edge. The original network and the sub-networks of ecologically associated

and non-associated species are composed of 80, 66, and 80 nodes and 3160, 648 and

2512 edges, respectively. For each of the two sub-networks, the number of loops

was compared to the number found in 1000 random networks. Random networks

were generated by shuffling edges while retaining the number of nodes and the degree of each node. Notably,

in the network describing interactions between niche-associated pairs the number of

loops is significantly higher than random (t-test, P < 0.001), unlike in the network

describing non ecological-associations. Real communities were derived from the


ecological distribution data where a community represents the set of species detected

in a given sample (as listed in Supplementary Table S7). The rate of occurrence of

close cooperative cycles was recorded (1) across all true samples (194 appearances)
and (2) across 1000 data sets generated through random shuffling of the original

samples data while maintaining the same sample size distribution and the same rank

of species' appearances as in the original data.
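A degree-preserving randomization of a directed give-take network, together with a simple loop count, might look like the sketch below. It is not the code used for this analysis. The edge-swap scheme is a common technique swapped in here, and treating a "loop" as a directed three-node cycle is an assumption, since the loop length is not specified in this section. Node labels are assumed to be integers, and the toy edge list is made up.

```python
import random
from collections import Counter


def degree_preserving_shuffle(edges, n_swaps, rng):
    """Randomize a directed edge list by swapping edge targets.

    Swapping (a->b, c->d) into (a->d, c->b) keeps every node's in-degree
    and out-degree. Swaps creating self-loops or duplicates are skipped.
    """
    edges = list(edges)
    present = set(edges)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if a == d or c == b:
            continue
        if (a, d) in present or (c, b) in present:
            continue
        present -= {(a, b), (c, d)}
        present |= {(a, d), (c, b)}
        edges[i], edges[j] = (a, d), (c, b)
    return edges


def count_three_cycles(edges):
    """Count directed cycles a->b->c->a, each counted once."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, set()).add(b)
    count = 0
    for a in succ:
        for b in succ[a]:
            for c in succ.get(b, ()):
                # Requiring a to be the smallest node avoids counting rotations.
                if a < b and a < c and b != c and a in succ.get(c, ()):
                    count += 1
    return count


def degrees(edges):
    return Counter(a for a, _ in edges), Counter(b for _, b in edges)


toy_edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 0)]
shuffled = degree_preserving_shuffle(toy_edges, 100, random.Random(0))
```

Repeating the shuffle many times and counting loops in each randomized network gives the null distribution against which the real count is compared.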

A.1.4 Supplementary Notes

A.1.4.1 Supplementary Note 1: Experimental and computational co-growth analyses for 10 bacterial pairs in interaction-specific media.

Individual and co-growth experiments were conducted for five bacterial species, all

non-pathogenic and capable of growing in IMM and their 10 corresponding pairwise

combinations. Growth experiments for each individual and pairwise combination

were conducted in three media: IMM – a chemically defined minimal medium[147]

(termed primary medium), a "negative" medium designed to induce a negative shift

towards increased competition (by adding thymidine and xylose) and a "positive"

medium designed to induce a positive shift towards less competition (by subtracting

thiamine and glucose). The "negative" and "positive" media were designed as

described in the Supplementary Methods section, representing the most generic

media for the induction of a shift in the pattern of co-growth across most growth

combinations (Supplementary Table S13). The observed shift in the co-growth

pattern (in comparison to co-growth in the primary medium) was successfully

predicted in 65% of the experiments, with precision 0.75 and recall 0.8 (Table SN1-1).
The species and their seed models are the following: Listeria innocua Clip11262



subtilis str. 168 (Opt224308_1).


Appendix 1::Table SN1-1. Predicted and observed co-growth shifts.

For predicted and observed co-growth combinations we compared the ratio between the Sum of the Individual Growths (SIG) and the co-growth (CG) across the three media (primary, negative, and positive). The SIG/CG ratio in the negative and positive

media is compared to the ratio in the primary media where negative and positive shifts refer to an increase or decrease in this ratio, respectively. Colored columns represent a predicted directional shift in the corresponding interaction-designed media. Red

indicates a predicted negative shift in the negative media and green indicates a predicted positive shift in the positive media.

Observations: Table entries marked with '√' and 'X' represent corresponding or non-corresponding shifts as observed in laboratory co-growth experiments. Colored '√' symbols represent TP predictions; non-colored '√' symbols represent TN

predictions; colored 'X' symbols represent FP predictions; non-colored 'X' symbols represent FN predictions. For observations, the SIG/CG

ratio was calculated according to OD values recorded in logarithmic growth and the corresponding growth rates, as described in Table SN1-3. Predicted and observed SIG/CG values are shown in Table SN1-2. Growth curves are presented at Figure SN1-1.

Positive shift | Negative shift | Species-pair
√ | √ | Agrobacterium tumefaciens-Listeria innocua
√ | √ | Agrobacterium tumefaciens-Escherichia coli
√ | √ | Agrobacterium tumefaciens-Pseudomonas aeruginosa
X | √ | Agrobacterium tumefaciens-Bacillus subtilis
√ | X | Listeria innocua-Escherichia coli
X | X | Listeria innocua-Pseudomonas aeruginosa
X | X | Listeria innocua-Bacillus subtilis
√ | √ | Escherichia coli-Pseudomonas aeruginosa
X | √ | Escherichia coli-Bacillus subtilis
√ | √ | Pseudomonas aeruginosa-Bacillus subtilis
6/10 | 7/10 | True predictions
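The reported accuracy, precision and recall can be checked against the SIG/CG values of Table SN1-2. The sketch below makes two assumptions that are not stated explicitly in the text: that the observed shifts are judged from the "growth ratio" columns, and that a shift means a strict increase (negative medium) or strict decrease (positive medium) of the ratio relative to the primary medium. Under these assumptions the counts agree with the figures quoted above.

```python
# (PRM, NM, PM) SIG/CG ratios copied from Table SN1-2.
predicted = {
    "A-L": (1.06, 1.12, 0.0), "A-E": (1.25, 1.38, 1.04),
    "A-P": (1.1, 1.12, 0.98), "A-B": (1.34, 1.43, 0.77),
    "L-E": (1.21, 1.21, 0.9), "L-P": (0.91, 0.91, 0.7),
    "L-B": (1.22, 1.20, 0.78), "E-P": (1.43, 1.36, 1.4),
    "E-B": (1.49, 1.56, 1.28), "P-B": (1.17, 1.18, 1.06),
}
observed = {  # experimental "growth ratio" columns
    "A-L": (0.91, 2.09, 0.64), "A-E": (1.54, 2.69, 1.49),
    "A-P": (1.65, 2.47, 1.41), "A-B": (2.66, 4.27, 2.85),
    "L-E": (1.5, 1.64, 0.9), "L-P": (1.0, 1.05, 1.0),
    "L-B": (1.03, 1.34, 1.36), "E-P": (1.38, 0.87, 1.27),
    "E-B": (1.54, 1.78, 2.05), "P-B": (2.18, 2.36, 1.61),
}


def shifts(prm, nm, pm):
    """Negative shift: ratio rises in NM. Positive shift: ratio falls in PM."""
    return {"negative": nm > prm, "positive": pm < prm}


counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
for pair in predicted:
    pred, obs = shifts(*predicted[pair]), shifts(*observed[pair])
    for kind in ("negative", "positive"):
        if pred[kind] and obs[kind]:
            counts["TP"] += 1
        elif pred[kind]:
            counts["FP"] += 1
        elif obs[kind]:
            counts["FN"] += 1
        else:
            counts["TN"] += 1

total = sum(counts.values())
accuracy = (counts["TP"] + counts["TN"]) / total
precision = counts["TP"] / (counts["TP"] + counts["FP"])
recall = counts["TP"] / (counts["TP"] + counts["FN"])
```

With these inputs the 20 comparisons split into 12 TP, 4 FP, 3 FN and 1 TN.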


Appendix 1::Table SN1-2 Calculated values for predicted and observed co-growth shifts.

Values show the SIG/CG ratio (SIG: Sum of the Individual Growth; CG: Co Growth). PRM, NM, PM: Primary, Negative and Positive Medium, respectively. L: Listeria innocua; A: Agrobacterium tumefaciens ; E: Escherichia coli; P: Pseudomonas

aeruginosa; B: Bacillus subtilis.

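Column groups, left to right: computational predictions (PRM, NM, PM); experimental observations, growth rate ratio‡ (PRM, NM, PM); experimental observations, growth ratio† (PRM, NM, PM).

Pair PRM NM PM PRM NM PM PRM NM PM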

A-L 1.06 1.12 0 0.73 2.08 0.5 0.91 2.09 0.64

A-E 1.25 1.38 1.04 1.71 2.39 0.89 1.54 2.69 1.49

A-P 1.1 1.12 0.98 1.26 1.5 1.15 1.65 2.47 1.41

A-B 1.34 1.43 0.77 1.75 2.42 4.15 2.66 4.27 2.85

L-E 1.21 1.21 0.9 1.56 1.41 0.53 1.5 1.64 0.9

L-P 0.91 0.91 0.7 1.31 1.37 0.62 1.0 1.05 1.0

L-B 1.22 1.20 0.78 1.1 1.18 1.11 1.03 1.34 1.36

E-P 1.43 1.36 1.4 1.1 1.03 0.65 1.38 0.87 1.27

E-B 1.49 1.56 1.28 1.2 1.59 1.63 1.54 1.78 2.05

P-B 1.17 1.18 1.06 2.43 2.47 1.33 2.18 2.36 1.61

‡ Growth rate ratio was calculated by comparing ΔOD/Δtime ratio in SIG and CG

during exponential growth. Exponential growth was determined for each experiment

independently as shown in Figure SN1-1. For SIG, ΔOD was calculated as sum ΔOD

of both species.

† Growth ratio was calculated by comparing the OD in SIG and CG at a constant

time point (half time of the experiment). For SIG, OD was calculated as sum OD of

both species at the selected time point. OD values at time0 (the beginning of the

experiments) were subtracted.
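The two ratios defined in the footnotes above can be computed from optical density series as sketched below. The function names, the index-based selection of the exponential window and the toy OD readings are assumptions for illustration; in the experiments the exponential phase was selected manually for each curve.

```python
def growth_rate(od, times, start, end):
    """Slope of OD over time (delta OD / delta time) between two indices."""
    return (od[end] - od[start]) / (times[end] - times[start])


def rate_ratio(od_a, od_b, od_co, times, start, end):
    """SIG/CG from growth rates; the SIG rate is the sum of both species."""
    sig = growth_rate(od_a, times, start, end) + growth_rate(od_b, times, start, end)
    return sig / growth_rate(od_co, times, start, end)


def growth_ratio(od_a, od_b, od_co, index):
    """SIG/CG from OD at one time point, after subtracting the OD at time 0."""
    sig = (od_a[index] - od_a[0]) + (od_b[index] - od_b[0])
    return sig / (od_co[index] - od_co[0])


# Toy readings every 15 minutes (values are made up).
times = [0, 15, 30, 45, 60]
od_a = [0.05, 0.10, 0.20, 0.40, 0.50]
od_b = [0.05, 0.08, 0.15, 0.30, 0.35]
od_co = [0.05, 0.12, 0.25, 0.50, 0.55]

half_time = len(times) // 2
ratio_at_half_time = growth_ratio(od_a, od_b, od_co, half_time)
ratio_of_rates = rate_ratio(od_a, od_b, od_co, times, 1, 3)
```

A ratio above 1 means the two species grow less together than the sum of their individual growth, which is read here as competition.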


Appendix 1::Figure SN1-1. Growth curves of individual and pair-wise combinations across different media.

Growth combinations are ordered as in Tables SN1-2 and SN1-3. The title of each graph indicates the species combination and

the medium (that is LA,PRM indicates Listeria innocua- Agrobacterium tumefaciens in primary medium). Abbreviations are as

in Table SN1-3. Gold and orange lines represent the individual growth of the first and second pair members, respectively (that

is, in LA combination L. innocua is shown in gold and A. tumefaciens is shown in orange). Blue line represents co-growth. Red

line represents the sum of individual growth at the manually selected exponential growth phase (OD values of first species at

time0 were subtracted). Only successful predictions are shown (√ at Table SN1-2).


A.1.4.2 Supplementary Note 2: using systematic data sources for estimating the ecological relevance of win-lose predictions

For each pair of species we looked at the outcome of the competition and defined

species as winners or losers according to their growth rate (the faster species is the

winner as in Figure 2a in the main text). For each species we calculated its fraction of

winning events across all its co-growth experiments. The list of species' competition

values is provided at Supplementary Table S1. Top "winners" include ecologically

diverse fast growers such as Escherichia coli, Salmonella typhimurium, Vibrio

cholerae and Pseudomonas aeruginosa. Species with a low mean competition score

include slow-growing pathogens such as Mycoplasma genitalium and Borrelia
burgdorferi and obligatory symbionts such as Buchnera aphidicola.

Maximal growth rate information is available for 66 species in the data, retrieved

from a manual survey of the scientific literature[78]. The matrix in Figure 2b

contains all these species for which doubling time information is available, sorted

according to their doubling time. To study the statistical significance of the win-lose

division in the experimentally-driven matrix we compared the strength of green-red

division (win-lose) in the original matrix to the red-green division in 1000 random

matrices. To produce such random matrices, the order of species was randomly

permuted 1000 times and for each corresponding matrix we counted how many

winners were mapped to the upper triangle ("green" side). In the original matrix we

observe 1256 true classifications (the experimentally faster is the winner), which is

higher than the number of true classifications observed in 998 random matrices (T-

test, P value 0.002).
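The permutation procedure described above can be sketched on a toy example. The data structures (`order` as a list sorted from fastest to slowest, `winner` as a mapping from a species pair to the predicted winner) and the six made-up species are assumptions; the empirical P value is taken here as the fraction of permuted orders that reach at least the observed number of true classifications.

```python
import random
from itertools import combinations


def count_true_classifications(order, winner):
    """Count pairs in which the experimentally faster species is the predicted winner.

    `order` lists species from shortest to longest doubling time.
    """
    hits = 0
    for i, j in combinations(range(len(order)), 2):
        faster, slower = order[i], order[j]
        if winner[frozenset((faster, slower))] == faster:
            hits += 1
    return hits


def permutation_p_value(order, winner, n_perm, rng):
    """Return the observed count and the empirical P value."""
    observed = count_true_classifications(order, winner)
    at_least = 0
    for _ in range(n_perm):
        shuffled = order[:]
        rng.shuffle(shuffled)          # random order of species
        if count_true_classifications(shuffled, winner) >= observed:
            at_least += 1
    return observed, at_least / n_perm


# Toy data: the faster species always wins, so every pair is a true classification.
species = ["s1", "s2", "s3", "s4", "s5", "s6"]
winner = {frozenset((a, b)): a for a, b in combinations(species, 2)}
observed_hits, p_value = permutation_p_value(species, winner, 1000, random.Random(0))
```

In the analysis above the same logic yields 2 of 1000 random matrices reaching the observed count, that is, a P value of 0.002.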

To systematically study the biological relevance of the competition score we looked

at the correlation between competition values and environmental diversity,


considering two independent measures. Fractions of regulatory genes were taken
from[139], describing the fraction of transcription factors out of the total number of
genes in the genome, an indicator of environmental variability[140]. General
environmental complexity estimates were also obtained from[148], where the natural
environments of bacterial species were categorized based on the NCBI classification
for bacterial lifestyle[148] and ranked according to the complexity of each category (1-

obligatory symbionts; 2- specialized; 3- aquatic; 4- facultative host-associated; 5-

multiple; 6- terrestrial species). We observe a significant correlation between

environmental diversity and winning potential, considering both absolute and partial

mean competition score. Spearman correlation coefficients between the
competition values and the measures of environmental diversity and growth are:

Mean competition score versus the fraction of regulatory genes: 0.6 (P value 9e-6)

Mean competition score versus NCBI estimate for environmental diversity: 0.46 (P

value 2e-3)

Mean competition score versus experimentally recorded minimal doubling time: -0.34 (P value 5e-3)

Experimental growth rates and lifestyle annotations are provided at Supplementary

Table S1.
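For readers who wish to reproduce such rank correlations, a self-contained Spearman coefficient can be written as below. This is a generic textbook sketch (Pearson correlation of average ranks), not the statistical software used in this work, and it returns no P value. The five example rows are taken from Supplementary Table S1 (fraction of winning events and minimal doubling time for E. coli K12, S. typhimurium LT2, B. subtilis 168, M. tuberculosis H37Rv and M. genitalium G-37), so the result is not the coefficient reported for the full data set.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


winning_fraction = [1, 1, 0.9, 0.8, 0]
doubling_time = [0.35, 0.4, 0.43, 19, 12]
rho = spearman(winning_fraction, doubling_time)
```

The sign of `rho` on this small subset is negative, in line with the direction of the correlation reported above for doubling time.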

A.1.4.3 Supplementary Note 3: Simulating co-growth of Salinibacter ruber and Haloquadratum walsbyi.

We studied the effect of the media on the type of interaction between Salinibacter

ruber and Haloquadratum walsbyi – two halophilic species that co-exist in salterns.

We chose to focus on these species as a synergistic interaction between them was

http://nar.oxfordjournals.org/cgi/content/full/gkq118/DC1


documented, where it was suggested that the improved growth of H. walsbyi can be

explained by the uptake of dihydroxyacetone (DHA) produced by S. ruber[58]. We

computationally studied the interaction between the species in a poor medium, with

and without DHA. Starting from competition-inducing medium (COMPM),

reduction was done using the algorithm used for computing cooperation-inducing

media (Methods, main text and Supplementary Methods) where we looked for the set

of metabolites that allows a feasible solution for each of the species individually

(achieving at least 10% of the corresponding growth in COMPM). In our

simulations, co-growth of both species in a rich medium (COMPM, Methods and

Supplementary Methods) revealed no cooperative interaction (PCMS > 0, Table

SN3-1). Looking at the content of the metabolites in the media we observed that

DHA, externally provided to the system, is consumed by the multi-species system

and hence the contribution of its transfer between the species to their growth

potential is concealed. The dependence between H. walsbyi and S. ruber for DHA

supply is revealed when reducing the medium. As suggested in the experimental

studies, we observed that the growth of H. walsbyi becomes possible only when

adding DHA into the medium or by adding S. ruber to the community (Table SN3-1).

Appendix 1::Table SN3-1. Interactions between Salinibacter ruber and Haloquadratum walsbyi across different media.

Media | Presence of DHA in the media | Co-growth | Individual growth of H. walsbyi | Individual growth of S. ruber | PCMS | Cooperative interaction
COMPM | + | 69.1 | 60.2 | 20.1 | 0.57 | -
Reduced media (+ DHA) | + | 27.3 | 11.7 | 8.3 | -0.88 | +
Reduced media -DHA | - | 27.3 | 0 | 8.3 | NA | +


A.1.4.4 Supplementary Note 4: Relating the designed

media to true ecological conditions

Throughout most of this analysis, simulations are conducted in computationally-

derived, designed media. In order to examine the ecological relevance of the

designed media we first tested whether ecologically related species exhibit similarity

in their media (VCOMPM,A, Supplementary Methods), as can be expected from the

demonstrated similarity in the metabolic pathways of co-occurring species[64].

Indeed, we observe that the resource overlap (Methods) between ecologically

associated species is significantly higher than the resource overlap between

non-ecologically related species (P value 1.5e-8 in a one-sided Wilcoxon test; median

values for resource overlap are 0.41 and 0.46, respectively). We then characterized

the rate of occurrence of different metabolites across species-specific

computationally-designed environments, as well as identified typical growth-limiting

factors. Notably, the 10 most frequent compounds (listed in Table SN4-1) include

essential inorganic compounds such as metals and salts. In contrast, species show high

diversity in their carbon sources (Table SN4-1).

In order to identify species-specific limiting factors we looked at the typical flux of

each metabolite (that is, the mean Vi, Min_FVA across all species, Supplementary

Methods), where a low typical flux indicates that a compound is a limiting factor.

Typical limiting factors include oxygen, glucose and nitrogen sources, in

correspondence with experimental knowledge. In contrast, the widely distributed

inorganic compounds in Table SN4-1 are all consumed at typically low levels (all

show mean Vi, Min_FVA > -1 in comparison to mean Vi, Min_FVA < -20 for the highly

consumed metabolites in the left column, Table SN4-2).
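
As a rough illustration of the screen described above, the sketch below averages the minimal exchange flux (Vi, Min_FVA; negative values denote uptake) of each metabolite across species and flags metabolites whose typical flux falls below a cutoff. The species names, flux values, the cutoff of -20 (borrowed from the comparison in the text) and the choice to average only over species that carry the metabolite are all assumptions made for this example.

```python
from statistics import mean

# Hypothetical minimal exchange fluxes (V_i,Min_FVA) per species; negative = uptake
min_fva = {
    "species_A": {"Oxygen": -50.0, "D-Glucose": -30.0, "Zinc": -0.2},
    "species_B": {"Oxygen": -40.0, "D-Glucose": -20.0, "Zinc": -0.4},
}

def typical_flux(min_fva_by_species):
    """Mean V_i,Min_FVA of each metabolite across the species that list it."""
    collected = {}
    for fluxes in min_fva_by_species.values():
        for metabolite, flux in fluxes.items():
            collected.setdefault(metabolite, []).append(flux)
    return {m: mean(v) for m, v in collected.items()}

def limiting_factors(min_fva_by_species, cutoff=-20.0):
    """Metabolites whose typical flux lies below the assumed cutoff."""
    typical = typical_flux(min_fva_by_species)
    return sorted(m for m, flux in typical.items() if flux < cutoff)

limiting = limiting_factors(min_fva)
```

In this toy input, oxygen and glucose are flagged, while zinc, which is taken up at a typical flux above -1, is not.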


Finally, we studied the distribution of metabolites in the pair-specific, rich (VCOMPM,

AB) and poor (VCOOPM, AB) environments (Methods and Supplementary Methods).

Metabolites which are frequent at both pair-wise media are inorganic essential

compounds (Table SN4-2). Notably, metabolites which are typical of the rich media

but are missing from the poor, cooperation-inducing media are typically derivatives of

amino acids, representing a set of metabolic products that can be produced by one of

the species and then transferred to its pair member[61].

Appendix 1::Table SN4-1. Characterization of species-specific metabolic computationally-designed environments

10 most frequent metabolites across the 118 species-specific environments | 10 most rare metabolites across the 118 species-specific environments | 10 metabolites with the highest* mean flux at optimal conditions (typical limiting factors)
Copper2 | beta-Methylglucoside_C7H14O6 | Oxygen
Sulfate | D-Glucosamine_C6H14NO5 | H+
Fe3 | Decanoic_acid_C10H19O2 | D-Glucose
Magnesium | D-O-Phosphoserine_C3H7NO6P | L-Glutamate
Zinc | (R,R)-Tartaric_acid_C4H4O6 | sn-Glycerol_3-phosphate
Manganese | Propanoate_C3H5O2 | NH3
Cobalt | Isocitrate_C6H5O7 | Fumarate
Potassium | (R,R)-Butane-2,3-diol_C4H10O2 | D-Fructose
Calcium | Nicotinamide_C6H6N2O | Nitrate
Fe2 | beta-Methylglucoside_C7H14O6 | L-Serine

The full list of species-specific environments is provided at Supplementary Table S10.

* Highest absolute value

Appendix 1::Table SN4-2. Characterization of pair-specific metabolic environments

10 most frequent metabolites across both minimal and rich pairwise environments* | 10 frequent metabolites in rich media that are absent in poor, cooperation inducing, media**
Magnesium_Mg | ala-L-glu-L_C8H13N2O5
Sulfate | gly-glu-L_C7H11N2O5
Chloride_Cl | Ala-Gln_C8H15N3O4
Potassium_K | H+_H
Calcium_Ca | ala-L-asp-L_C7H11N2O5
Fe2+_Fe | Ala-Leu_C9H18N2O3
Manganese_Mn | Sodium_Na
Cobalt_Co | Gly-Leu_C8H16N2O3
Copper2_Cu | gly-pro-L_C7H12N2O3
Zinc_Zn | L-alanylglycine_C5H10N2O3

The full lists of pair-specific rich and poor environments are provided at Supplementary Tables S10 and S11, respectively.

* Found across all environments

** calculated by subtracting the frequency of metabolite at poor media from its frequency in rich media

A.1.4.5 Supplementary Note 5: The use of various

thresholds for determining a feasible growth solution in

minimal, cooperation-inducing, media

A Cooperation-inducing Medium (COOPM) is defined here as a set of metabolites that

allows a feasible solution with positive growth rate, such that the removal of any

metabolite from the set would make such a solution infeasible. We examined several

growth requirement thresholds, ranging from 10% of the BPR found in the

competition-inducing (COMPM), rich media (the minimal media reported in the

main text) to 100% (as in the original COMPM environment, main text). All reduced

media reveal cooperative solutions (Table SN5-1). In all solutions, ecologically

associated pairs of species, and in particular mutually exclusive pairs exhibit higher

level of cooperation in comparison to non-associated pairs. The same trends were

observed when a minimal, cooperation-inducing, medium was calculated as the

intersection of the exchange reactions in COMPM.
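
The minimality property in the COOPM definition (no metabolite can be removed without losing feasibility) can be sketched with a simple greedy reduction. This is an illustrative sketch, not the algorithm of the Supplementary Methods: the `grows` oracle is a hypothetical stand-in for a model-based feasibility test (for example, reaching the required fraction of the growth in COMPM), and the single pass is minimal only under the assumption that adding metabolites never makes a feasible medium infeasible.

```python
def reduce_medium(rich_medium, grows):
    """Greedily drop metabolites while `grows(medium)` remains True.

    Assuming feasibility is monotone (supersets of a feasible medium are
    feasible), removing any single metabolite from the result is infeasible.
    """
    medium = list(rich_medium)
    for metabolite in list(rich_medium):
        trial = [m for m in medium if m != metabolite]
        if grows(trial):
            medium = trial  # metabolite was dispensable, keep it removed
    return medium

def toy_grows(medium):
    """Hypothetical feasibility oracle: one carbon source plus NH3 and sulfate."""
    has_carbon = "D-Glucose" in medium or "D-Fructose" in medium
    return has_carbon and "NH3" in medium and "Sulfate" in medium

rich = ["D-Glucose", "D-Fructose", "NH3", "Sulfate", "L-Serine"]
minimal = reduce_medium(rich, toy_grows)
```

The reduced set depends on the order in which metabolites are tried, so different orders can yield different, equally valid minimal media.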

In a multi-species system, we defined a symmetrical interaction as one in which both

A and B are "givers" (and "takers"), that is, both species improve their growth in

comparison to individual growth. Notably, this definition is permissive, as A and B

can show variability in the extent to which their growth is improved. Despite the

permissiveness of the definition, the majority of species show asymmetrical

cooperative directionality with a single giver (Table SN5-1). We explored the

symmetrical interaction on different growth media. Growth media were determined

by setting thresholds on the feasible solution which required achieving at least X% of

the corresponding growth in COMPM. The thresholds were set on both the biomass

production of the multi-species system (as in Table SN5-1) and on the contained

organisms (i.e. requiring that each compartment/organism will have a biomass

production rate higher than a threshold, in comparison to its growth in COMPM).

This indeed raised the number of symmetrical events (Table SN5-2), though it

reduced the ecological signal, where similar fractions of cooperative events are

observed for the group of niche-associated and non-associated pairs, testifying

against the ecological relevance of enhancing the propensity of symmetrical

solutions via such means.
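
The distinction between symmetrical and unidirectional cooperation used in this note can be written down as a small classifier. The sketch below is an assumption-laden illustration: the function name, the tolerance and the example growth values are hypothetical, and the "taker" is taken to be the species whose growth in the pair exceeds its individual growth.

```python
def classify_interaction(alone_a, alone_b, together_a, together_b, tol=1e-9):
    """Classify a pair by which species grows better in co-growth than alone."""
    a_gains = together_a > alone_a + tol
    b_gains = together_b > alone_b + tol
    if a_gains and b_gains:
        return "symmetrical"          # both species are givers and takers
    if a_gains:
        return "unidirectional: B gives, A takes"
    if b_gains:
        return "unidirectional: A gives, B takes"
    return "no cooperation"

# Hypothetical growth values: A cannot grow alone but grows next to B
example = classify_interaction(alone_a=0.0, alone_b=8.3, together_a=5.0, together_b=8.3)
```

Because the definition only asks whether growth improves, not by how much, a pair is called symmetrical even when one species benefits far more than the other.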


Appendix 1::Table SN5-1. Frequency of symmetrical interaction events under minimal growth media with different thresholds for biomass production of the system.

Columns 4 to 7 give the fraction of cooperative events within different ecological groups.

Media | %BPR (out of BPR in COMPM) | Total number of cooperative events (fraction of unidirectional events¤) §N=3160 | Non-associated N=2512 | Niche-associated‡ N=536 | Co-occurring N=84 | Mutually-exclusive N=28
Rich Media (COMPM) | 100% | 0 | 0 | 0 | 0 | 0
Reduced media | 75% | 1814(0.65) | 0.52 | 0.77 | 0.85 | 0.93
Reduced media | 50% | 1466(0.88) | 0.4 | 0.69 | 0.71 | 0.86
Reduced media | 25% | 1279(0.93) | 0.36 | 0.58 | 0.6 | 0.79
Reduced media | 10% | 1293(0.94) | 0.37 | 0.57 | 0.51 | 0.71
Reduced media | Intersection† | 656(0.96) | 0.16 | 0.38 | 0.35 | 0.36

¤ Unidirectional events refer to cooperative interactions where only one of the pair members is a giver and the other is a taker.

§N represents all possible combinations in a specific group

‡ Niche associated pairs do not include co-occurring and mutually exclusive pairs

†Intersection medium for a pair of species is calculated as the intersection of uptake reactions from their individual COMPMs


Appendix 1::Table SN5-2. Frequency of symmetrical interaction events under minimal growth media with different thresholds for biomass production of the system and the compartments in the system.

Columns 3 to 6 give the fraction of cooperative events within different ecological groups.

%BPR (out of BPR in COMPM) | Total number of cooperative events (fraction of unidirectional events) N=3160 | Non-associated N=2512 | Niche-associated N=536 | Co-occurring N=84 | Mutually-exclusive N=28
10%‡ (10%†) | 1630(0.37) | 0.52 | 0.51 | 0.44 | 0.71
25%‡ (25%†) | 1768(0.40) | 0.54 | 0.61 | 0.61 | 0.79
50%‡ (50%†) | 2325(0.51) | 0.72 | 0.8 | 0.82 | 0.89
75%‡ (75%†) | 1156(0.52) | 0.4 | 0.25 | 0.26 | 0.18

‡ the threshold for the feasible solution of the multi-species system

† the threshold for the feasible solution in each compartment in the multi-species system


A.1.4.6 Supplementary Note 6: Frequency of directional

give-take relationships across bacterial families (top 10

combinations).

Appendix 1::Table SN6-1. Frequency of inter-family give-take interactions

The table describes the frequency of inter-family give-take interactions, considering the total number of pairwise inter-class combinations. In order to have groups of similar size, some groups (e.g., Actinobacteria) describe the phylum level classification.

Giving family | Receiving family | Total number of inter-family interactions | Total number of directional give-take inter-family interactions | Fraction of directional give-take inter-family interactions
Lactobacillales | Alpha/others proteobacteria | 24 | 13 | 0.54
Clostridia | Bacillales | 36 | 20 | 0.56
Clostridia | Betaproteobacteria | 40 | 24 | 0.6
Clostridia | Hyperthermophilic bacteria | 8 | 5 | 0.63
Spirochete | Alpha/others | | |
Clostridia | Deltaproteobacteria | 12 | 8 | 0.67
Clostridia | Bacteroides | 12 | 9 | 0.75
Clostridia | Actinobacteria | 32 | 26 | 0.81
Clostridia | Alpha/others | | |
Clostridia | Epsilonproteobacteria | 12 | 11 | 0.92


A.1.4.7 Supplementary Note 7: Experimental and

computational growth analyses of Listeria innocua and

Agrobacterium tumefaciens across pre-designed media

In order to explore the predictive power of our co-growth simulations in changing

environments, we first identified the limiting factors of Listeria innocua and

Agrobacterium tumefaciens at their simulated IMM (i.e. where the metabolites are

consumed at the maximal threshold determined, Vi, Min_FVA =-50, Supplementary

Methods). As can be expected from the neutral interactions between the two

organisms (predicted and observed, Supplementary Methods), most of their limiting

factors at the simulated-IMM do not overlap (Table SN7-1). The predicted limiting

factors of L. innocua are glutamine, glucose and cysteine. The predicted limiting

factors of A. tumefaciens are isoleucine, glutamine and histidine. Simulations were

then conducted while decreasing the level of these metabolites (that is increasing Vi,

Min_FVA) at different thresholds until their full removal (Vi, Min_FVA=0). For the four

amino acids studied, decreasing the corresponding fluxes slowed down the growth

rate of the relevant species (cysteine and glutamine for L. innocua and histidine,

glutamine and isoleucine for A. tumefaciens), but had only a minor effect on the pattern of

co-growth (Table SN7-1). The predictions for the growth and co-growth patterns

following the removal of cysteine and histidine, predicted to have the most

significant effect on growth (Table SN7-1), were further tested experimentally.

Laboratory observations indicate that the effect of metabolite removal on both

growth and co-growth patterns can be fully predicted: the growth of A. tumefaciens is

affected by the removal of histidine but not cysteine and the growth of L. innocua is

affected by the removal of cysteine but not histidine (Table SN7-2). In both cases,

co-growth pattern remains relatively similar to the pattern observed in the original

media (Table SN7-2).


The computational predictions indicate that decreasing the level of glucose is likely

to affect both individual and co-growth patterns (Table SN7-1). At individual

growth, decreasing the level of glucose is likely to affect only L. innocua, slowing

down its growth, but at co-growth it is predicted to increase the inter-species

competition (possibly due to the resource overlap induced by the shortage in

glucose). The full removal of glucose is predicted to prevent the growth of L.

innocua (again, with no effect on A. tumefaciens), where co-growth is predicted to

induce a modest level of cooperation (Table SN7-2). Reassuringly, experimental

tests support the computational predictions where the removal of glucose from the

media prevents the growth of L. innocua but has no effect on A. tumefaciens. As

predicted, decreasing the level of glucose increases the competition at co-growth.

However, at full removal we do not observe the predicted mild cooperation.

Overall, in most growth experiments (7/8) predictions and observations correlate

(Table SN7-2). When looking at co-growth predictions, the most significant growth

ratio change occurs at the partial and full removal of glucose. In agreement with the

predictions, the partial elimination of glucose induced the most drastic elevation in

growth rate ratio, where, as predicted, a weaker effect is observed for the removal of

amino acids. With the exception of a single experimental measure (co-

growth pattern following full removal of glucose), we observe an overall agreement

between predictions and observations. Hence, overall, this set of experiments

supports the ability of the metabolic models to predict the growth pattern of species

at varying environments.


Appendix 1::Table SN7-1. Computational predictions for the effect of reducing and removing computationally-predicted

limiting factors from IMM media.

L, A – predictions for the individual growth of Listeria innocua and Agrobacterium tumefaciens, respectively across the

media tested; LA – co-growth prediction. Values in red indicate a change >±0.05 in growth and growth ratio in

comparison to values predicted for the original IMM.

Reduced metabolite

(Vi, Min_FVA =-10*)

Full removal of the metabolite

(Vi, Min_FVA = 0)

L A A Growth

ratio

(L+A)/L

A

L A A Growth

ratio

(L+A)/L

A

Growth at the

original

IMM**

36

52

83

1.06

36 2 3 1.06

Isoleucine 6 8

9

1.06 36 7 8 1.05

Histidine† 6 9 7 1.1 36 0

6

1

Glutamine 2

8 6 1.05 30 6 3 1.04

Cysteine† 1 2 7 1.08 29

2

5

1.08

Glucose 1 2

5

1.12

0

2 1

0.85

*Similar behavior is observed for additional Vi, Min_FVA (-50 < Vi, Min_FVA < 0).

** For all metabolites in the table Vi, Min_FVA =-50.

† Full reduction of histidine and cysteine has the most drastic effect on the growth

predictions of Agrobacterium tumefaciens and Listeria innocua, respectively.


Appendix 1::Table SN7-2. Predicted and observed growth and co-growth shifts.

Each cell's color represents the computationally-predicted growth shift in the designed media: red indicates reduced growth (growth predictions section) and reduced growth ratio (growth ratio section); grey represents no growth reduction (growth

predictions section) and no growth ratio change (growth ratio section; black color in the corresponding cells at Table SN7-1); dark green represents elevated ratio (red color in the corresponding cells at Table SN7-1). Observations: Table entries marked

with '√' and 'X' represent corresponding or non-corresponding shifts as observed in laboratory co-growth experiments. Growth

shift is defined as a change of >±0.25 in growth and growth ratio in comparison to values detected at the original IMM. The corresponding experimental results are provided at Table 3. L - Listeria innocua; A - Agrobacterium tumefaciens.

Medium | Growth predictions: L | Growth predictions: A | Growth ratio (L+A)/LA
Histidine (Vi, Min_FVA = 0) | √ | √ | √
Cysteine (Vi, Min_FVA = 0) | √ | √ | √
Glucose (Vi, Min_FVA = -10) | X | √ | √
Glucose (Vi, Min_FVA = 0) | √ | √ | X

Appendix 1::Table SN7-3. Observed growth and co-growth shifts. Values indicate the maximal OD in the experiments.

Experiments were conducted as described in the Supplementary Methods section and in Supplementary Note 1.

Medium | L | A | LA | (L+A)/LA
Growth at the original IMM** | 0.2 | 0.19 | 0.5 | 0.8
Histidine (Vi, Min_FVA = 0) | 0.43 | 0.13 | 0.59 | 0.95
Cysteine (Vi, Min_FVA = 0) | 0.02 | 0.22 | 0.28 | 0.86
Glucose (Vi, Min_FVA = -10) | 0.21 | 0.21 | 0.3 | 1.4
Glucose (Vi, Min_FVA = 0) | 0.01 | 0.19 | 0.15 | 1.25

** For all metabolites at the table Vi, Min_FVA =-50.



A.2.1 List of Abbreviations

GEM: genome-scale metabolic model

FBA: flux balance analysis

SUMEX: maximization of the sum of metabolic exchange fluxes (with the

convention that outward fluxes are positive)

PMAX: maximization of proton exchange (from inside to outside the cell)

GR: growth rate

ds66: a dataset of growth rates for 66 organisms in rich media

ds57: a dataset of 57 growth rates of E. coli with varying knockouts and media

ds18: a dataset of growth rates for 6 organisms grown in 3 media

ds24: a dataset of 24 growth rates of E. coli grown in 24 media

A.2.2 Supplementary results

A.2.2.1 Sensitivity analysis of SUMEX and Biomass

SUMEX does not assume known uptake rates. This is an important strength of the

metric, because only in cases where uptake rates of key compounds are known, can

traditional methods (most notably, FBA using a biomass objective) predict growth

rate due to the rate-yield relationship (Growth rate = Substrate uptake rate * Yield)

(this is at least true in substrate limited conditions). However, in optimizing any

objective function in a GEM (including SUMEX), it is necessary to set bounds on the

uptake reactions (or at least on some reactions) in order to gain computationally

feasible solutions. We chose to set standard bounds on uptakes of all compounds in a

given medium at a value of -50 units (see Supplementary Methods for full


characterization of the bounds). We set the same standard bounds for all metrics we

tried, unless otherwise noted. In order to test how dependent SUMEX is on these

bounds, we did a sensitivity analysis across the 3 datasets, testing both SUMEX and

biomass. Briefly, we altered each uptake bound across all models in a given dataset

by a random amount between either ±10% or ±50% (uniformly distributed) of its

standard value, and then re-assessed the correlation of the metric against growth rates

for that dataset (see Fig. S1).
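
The perturbation step of this sensitivity analysis can be sketched as follows. This is a minimal illustration under stated assumptions: the reaction identifiers are hypothetical, only the uptake-bound variation is shown (secretion bounds were varied differently, as described in the caption of Fig. S1), and re-solving the models and re-computing the correlation after each perturbation is not included.

```python
import random

STANDARD_UPTAKE_BOUND = -50.0  # standard uptake bound stated in the text

def perturb_bounds(bounds, fraction, rng):
    """Scale each bound by a factor drawn uniformly from [1 - fraction, 1 + fraction]."""
    return {
        name: value * rng.uniform(1.0 - fraction, 1.0 + fraction)
        for name, value in bounds.items()
    }

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical exchange reaction identifiers for the open metabolites of a medium
uptake_bounds = {"EX_glc": STANDARD_UPTAKE_BOUND, "EX_o2": STANDARD_UPTAKE_BOUND}

varied_10 = perturb_bounds(uptake_bounds, 0.10, rng)  # each bound in [-55, -45]
varied_50 = perturb_bounds(uptake_bounds, 0.50, rng)  # each bound in [-75, -25]
```

Repeating the draw many times and recording the correlation of each metric with growth rate gives the spread of rho values that the robustness comparison is based on.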

We found the correlation of SUMEX with growth rate to be highly robust to changes

in the uptake bounds, and indeed to be significantly more robust than biomass on two

of the three datasets given the same random distributions of uptakes (P=2e-31 and

P=2e-4 in F-tests on ds18 and ds57, respectively, at 50% variation; there was no

distinguishable difference in ds66 – see Table S1). In the rich media conditions of

ds66, the correlation of SUMEX vs. GR varied less than 10% even with 100%

variance in uptake bounds. For completeness, we repeated the same test on the

secretion bounds and achieved similar results (see Fig. S1 & Table S1).


[Figure S1 plot area: scatter plots for ds18, ds66 and ds57. Columns, by variance in bounds: uptake bounds (open) at ±10% (-45 to -55) and ±50% (-25 to -75); secretion bounds (all) at 10% (900 to 1000) and 50% (500 to 1000). x-axis: Spearman's rho, SUMEX vs. GR; y-axis: Spearman's rho, Biomass vs. GR.]

Appendix 2::Figure S1. Sensitivity analysis of GEM bounds. The Spearman’s rhos (2-tailed) of growth rate versus

both SUMEX (x-axis) and max Biomass (y-axis) are shown for 3 bacterial datasets (ds18, ds66, and ds57), when uptake

bounds of all open metabolites (i.e., metabolites that are allowed to be taken up in a given medium) are randomly varied

by ±10% (1st column) or ±50% (2nd column) of the standard bound (which is -50 for all allowed uptakes), and when

secretion bounds of all exchanged metabolites are randomly varied by 10% (3rd column) or 50% (4th column) of the

standard secretion bound (+1000). Sumex displays significant robustness to changes in bounds. The green line in each

plot has a slope of 1.


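Statistic | Amount bounds varied* | Uptake bounds (open): ds18 | Uptake bounds (open): ds66 | Uptake bounds (open): ds57 | Secretion bounds (all): ds18 | Secretion bounds (all): ds66 | Secretion bounds (all): ds57
RSD, Spearman's rho of SUMEX vs. GR | 10% | 0.7% | 0.3% | 16.8% | 0.0% | 0.2% | 5.1%
RSD, Spearman's rho of SUMEX vs. GR | 50% | 1.5% | 1.3% | 35.5% | 1.1% | 0.7% | 7.3%
RSD, Spearman's rho of Biomass vs. GR | 10% | 7.4% | 0.4% | 53.4% | 3.0% | 0.0% | 62.9%
RSD, Spearman's rho of Biomass vs. GR | 50% | 28.7% | 1.8% | 628.9% | 2.8% | 0.4% | 55.8%
p-val in F-test that variance of Spearman's rho vs. growth rate is lower for SUMEX than Biomass | 50% | 1.9E-31 | 0.1 | 1.2E-04 | 0.8 | 1.0 | 1.6E-07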

Appendix 2::Table S1. Statistics of GEM sensitivity analysis. Summary statistics are presented from Fig. S1. The top

four rows show the Relative Standard Deviation, RSD = abs((std(rho)/mean(rho)))*100, of SUMEX or Biomass versus

GR across random variations in model uptake bounds or variations in secretion bounds (as labeled). Cases in which

RSD is less than 10% of the variation in bounds are highlighted grey. The bottom row shows the significance (p-val) of

an F-test that the correlation of SUMEX versus growth rate varies less across 50% variations in model bounds than the

correlation of Biomass versus growth rate. The F-test shows high significance for uptake bounds in ds18 and ds57, and

secretion bounds in ds57. *As in Fig. S1, uptake bounds were varied to ±(%) while secretion bounds were varied

between the standard value and –(%).
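
The RSD statistic defined in the caption above, RSD = abs((std(rho)/mean(rho)))*100, can be computed as in the following sketch. The use of the sample standard deviation (N-1 denominator) is an assumption, since the caption does not state which estimator was used, and the rho values are hypothetical.

```python
from statistics import mean, stdev

def relative_standard_deviation(rhos):
    """RSD in percent: |std(rho) / mean(rho)| * 100 (sample std, an assumption)."""
    return abs(stdev(rhos) / mean(rhos)) * 100

# Hypothetical Spearman's rho values from repeated random bound perturbations
rhos = [0.5, 0.6, 0.7]
rsd = relative_standard_deviation(rhos)
```

A mean rho close to zero inflates the RSD even when the absolute spread is small, which is worth keeping in mind when reading very large RSD values.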

A.2.2.2 Expanded analysis of obligate fermenters and

respirers in ds66

As noted in the main text, we split ds66 into two groups: obligate fermenters and

organisms that respire (see Table S4 for the breakdown). We found that SUMEX is

predictive of growth rate for the respirers, but not for the obligate fermenters. Of

note, although SUMEX does not significantly correlate with growth rate for the 9

obligate fermenters, its correlation is also not significantly less significant than that of

randomly chosen sets of 9 organisms from ds66. Unlike SUMEX, biomass yield

does show a significant correlation versus GR for the 9 obligate fermenters (rho =

0.66, p = 0.03 in 1-sided Spearman test), although the significance is also not

significantly above that expected if we choose 9 organisms at random. This suggests

that while biomass is a poor predictor of growth rate in respirers, it may be

appropriate for predicting the GR of obligate fermenters. Among the set of 9

obligate fermenters, there was one organism, Lactobacillus plantarum, for which

evidence has been found for respiration when the organism is provided exogenous

heme and menaquinone [84]. Therefore, it is possible that L. plantarum should be

re-categorized as a respirer. Removing L. plantarum from the fermenter set and


calculating Spearman correlations on the remaining 8 organisms resulted in

significance for both biomass and SUMEX versus GR (see Fig. 3d). Due to these

considerations, a larger dataset of obligate fermenters will be required in order to

allow more definite statements about the application of SUMEX or biomass to

predicting their growth (none of the other datasets treated in this chapter include

obligate fermenters).

Interestingly, SUMEX and PMAX significantly under-predict the growth rates of

obligate fermenters compared with respirers (all fermenter datapoints lie below the

trendline in Fig 3a-b). This suggests that, since growth of fermenters relies on

mechanisms independent from their ability to produce a strong proton gradient, a

proton gradient-dependent predictor (such as SUMEX) under-represents their

capability for fast growth.

A.2.2.3 The relationship between flux and molecular

weight in SUMEX

In order to check if maximizing SUMEX indeed causes uptake of high molecular

weight compounds and the output of low molecular weight compounds, we

calculated the correlation between the molecular weights of exchanged metabolites

(with nonzero fluxes) and their average exchange fluxes (as determined by flux

variability analysis) when calculating SUMEX for E. coli on rich medium, as well as

for all exchanged metabolites in all models across the ds18 dataset. We achieved

strong negative correlations between molecular weight and outward exchange flux in

both cases (ρ=-0.73, P = 4e-23 and ρ=-0.56, P = 1e-34 for the two analyses),

confirming our hypothesis.

A.2.2.4 Ranging of biomass% lower bound

Cellular growth involves an intrinsic tradeoff between growth rate and biomass yield.

In calculating SUMEX, we enforce a small flux (5% of the maximum possible in the

given condition) through the biomass yield reaction, since some yield is necessary to


sustain growth. In order to more fully understand the relationship of SUMEX with

growth yield, we varied this lower bound on biomass yield between 0 and 100% of

the maximum (i.e., the maximum biomass yield computed in the model on a given

media) in all of the datasets dealt with in this chapter.

In the bacterial datasets, we found the correlation of SUMEX with growth rate to be

typically robust to changes in the yield, except for when biomass approaches 100%,

at which point the correlation drops off in several datasets (see Fig. S2). This

suggests that the correlation of SUMEX with growth rate is robust to changes in

yields in the model, at least within physiological ranges 16,24. In contrast, the

dropoff near 100% biomass yield poses an interesting parallel with the results of [80]

(see Fig. S2). Flux variability analysis of ds18 confirmed that flux variability in

maximal SUMEX decreases as the lower bound on biomass yield increases from

70% to 100% (see Fig. S3).

Interestingly, in the NCI60 cell lines, the correlation of SUMEX with growth rate

was more sensitive than in the bacterial datasets to the lower bound on biomass yield,

and dropped off sharply as biomass yield lower limit increased (see Fig. S2). This

suggests that cancer cells, which are not as evolutionarily well-tuned as bacteria for

efficient growth, might have to sacrifice more of their biomass yield than bacteria to

attain maximal growth rates.

Also intriguingly, we found a peak in the correlation between SUMEX and growth

rate in ds66, ds18 and ds24 (peaks were at 90%, 55%, and 80% max biomass for the

three datasets; see Fig. S2). These peaks in correlation with growth rate suggest that

certain percentages of maximum biomass yields may be dominant across the

different conditions in each dataset.


[Figure S2 plot area: five panels, (a) ds66, (b) ds18, (c) ds57, (d) ds24 and (e) NCI60 cancer cell lines. x-axis: Biomass LB (%max), 0 to 100; y-axis: ρ, SUMEX vs. GR.]

Appendix 2:Figure S2: Effect of biomass lower bound on SUMEX. The correlation of SUMEX versus growth rate as

lower bound (LB) on biomass is varied. (a) ds66, (b) ds18, (c) ds57, (d) ds24 (E. coli grown on 24 carbon sources, from

[35]), and (e) NCI60 cancer cell lines. ds24 was calculated with the iAF1260 E. coli model.

A.2.2.5 Network flexibility in SUMEX as biomass lower

bound approaches 100%

In order to assess the flexibility of the metabolic networks as biomass approaches

100%, we did a flux variability analysis (FVA) of all reactions in ds18 under optimal


SUMEX conditions with biomass set to equal 100%, 90%, 80%, and 70% of its max,

and then we assessed the change in flux variability (∆FV) of the flux range of each

reaction across the 18 conditions (6 organisms x 3 media). The ∆FV metric was

calculated for each reaction/condition as the slope (change) of the flux range when

the biomass value increases from 70% to 100% of its max. A positive ∆FV means

that, as biomass increases between 70% and 100%, the range within which fluxes of

a given reaction can vary increases, and the magnitude of ∆FV indicates the strength of

the increase/decrease. Fig. S3 shows the results: it is clear that FV decreases as

biomass trends towards 100%.
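
One way to read the ∆FV metric is as the slope of a straight-line fit of the flux range (maximum minus minimum flux from FVA) against the biomass lower bound. The sketch below uses an ordinary least-squares slope over the four biomass levels named above; the fitting method and the flux-range values are assumptions made for illustration, since the text only describes ∆FV as a slope.

```python
def delta_fv(biomass_levels, flux_ranges):
    """Least-squares slope of flux range (max - min flux) versus biomass level."""
    n = len(biomass_levels)
    mean_x = sum(biomass_levels) / n
    mean_y = sum(flux_ranges) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(biomass_levels, flux_ranges))
    sxx = sum((x - mean_x) ** 2 for x in biomass_levels)
    return sxy / sxx

levels = [70, 80, 90, 100]     # biomass lower bound, % of maximum
ranges = [12.0, 9.0, 6.0, 3.0]  # hypothetical FVA flux ranges of one reaction
slope = delta_fv(levels, ranges)  # negative: variability shrinks toward 100%
```

A negative slope for a reaction corresponds to the "FV decreases as biomass -> 100%" group in Fig. S3, and a positive slope to the "FV increases" group.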


[Figure S3 plot area: bar chart of the number of reactions (# reactions, axis from 0 to 500) per pathway, in two series: "FV increases as biomass -> 100%" and "FV decreases as biomass -> 100%". Pathways shown: exchange; Porphyrin and chlorophyll metabolism; Reductive carboxylate cycle (CO2 fixation); Streptomycin biosynthesis; Fructose and mannose metabolism; Tryptophan metabolism; Fatty acid metabolism; Sulfur metabolism; Glycerophospholipid metabolism; One carbon pool by folate; Valine, leucine and isoleucine degradation; Arginine and proline metabolism; Thiamine metabolism; Lysine biosynthesis; Valine, leucine and isoleucine biosynthesis; Pantothenate and CoA biosynthesis; Glycerolipid metabolism; Butanoate metabolism; Cysteine metabolism; Glycolysis / Gluconeogenesis; Methionine metabolism; Glyoxylate and dicarboxylate metabolism; Aminosugars metabolism; Folate biosynthesis; Glutathione metabolism; Propanoate metabolism; Nitrogen metabolism; Citrate cycle (TCA cycle); Pentose phosphate pathway; Glutamate metabolism; Pyruvate metabolism; Glycine, serine and threonine metabolism; Nicotinate and nicotinamide metabolism; Carbon fixation in photosynthetic organisms; Starch and sucrose metabolism; Urea cycle and metabolism of amino groups; Pyrimidine metabolism; Purine metabolism; none.]

Appendix 2::Figure S3: Flux variability in SUMEX solution as function of biomass lower bound. FVA was performed on

the optimal solution space of SUMEX at lower bounds of biomass between 100% and 70%. Reactions whose flux

variability increased or decreased more than a set cutoff are binned into pathways and plotted. Overall, this shows a

general decrease in flux variability as biomass approaches 100%.


1-sided Spearman test:

Metric | rho | P
FBAwMC | 0.26 | 1.1E-01
PMAX | 0.38 | 3.6E-02
Biomass | 0.40 | 2.9E-02
MOMENT | 0.47 | 1.0E-02
SUMEX | 0.47 | 1.2E-02

Appendix 2::Table S2: Analysis of ds24. We obtained growth rates of E. coli in batch culture under 24 minimal media

conditions from [35]. For this analysis, we used the iAF1260 E. coli model, in order to be consistent with the other

metrics. SUMEX was computed over the 23 media for which the carbon source was present extracellularly in the

standard iAF1260 model (the excluded metabolite was glucosamine). Values listed are for SUMEX with the standard

5% lower bound of biomass. Also listed is the maximization of extracellular proton production (PMAX), which displays

significance, but below that of the top 3 metrics.

A.2.2.6 Summing exchange fluxes in the optimal biomass

solution space predicts growth rate

We were interested in doing an independent validation of SUMEX, based on

changing uptake bounds in the model. Our logic is as follows:

In the bacterial datasets, we did not have detailed measurements of uptake and

secretion fluxes. However, because of the property that biomass yield corresponds to

growth rate at steady state if uptake rates are exactly known (from mass

conservation: $\mu = \frac{1}{m_{biomass}} \sum_i m_i v_i$, where μ = growth rate and $m_i$ and $v_i$ are the mass

and flux of each exchanged component, i), we hypothesized that if we tune the

uptake rates to increase the correlation of biomass with growth rate, we would bias

towards realistic uptake rates and be able to independently validate SUMEX by

summing, but not maximizing, exchange fluxes.

Restated, we summed extrapolated uptake and secretion rates without doing a

maximization, rather than computing the maximum achievable sum (as in SUMEX).

In practice, this meant sampling for in silico media variants that gave significant

(P≤0.05) positive correlations between maximal biomass and growth rate, and in


these cases, checking also the correlation to growth rate of the sum of mean

allowable exchange fluxes (SUMofEX) that support optimal biomass. Because any

individual flux vector supporting maximal biomass is not unique, we calculated

SUMofEX by summing the means of the flux variability ranges (computed by FVA)

of each exchange component, as calculated within the biomass solution space.
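The SUMofEX calculation described above can be illustrated with a minimal Python sketch. It assumes that the FVA ranges have already been computed elsewhere (no solver is called here) and that the "mean" of a range is its midpoint; the function name, reaction IDs and numbers are hypothetical.

```python
def sum_of_exchange_means(fva_ranges):
    """Sum the midpoints of FVA ranges over all exchange reactions.

    fva_ranges maps an exchange reaction ID to its (min_flux, max_flux)
    range, as computed by FVA within the optimal biomass solution space.
    """
    total = 0.0
    for reaction_id, (lo, hi) in fva_ranges.items():
        if lo > hi:
            raise ValueError(f"invalid FVA range for {reaction_id}: {lo} > {hi}")
        total += (lo + hi) / 2.0  # midpoint of the allowed range (assumption)
    return total


# Hypothetical FVA output: uptake is negative, secretion is positive.
example_ranges = {
    "EX_glucose": (-50.0, -30.0),  # midpoint -40
    "EX_acetate": (0.0, 20.0),     # midpoint  10
    "EX_co2": (10.0, 30.0),        # midpoint  20
}
sumofex = sum_of_exchange_means(example_ranges)
```

Unlike SUMEX, nothing is maximized here: the value is read off the ranges that already support optimal biomass.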

We did this for ds18, ds66, and ds57, and achieved significant correlations between

SUMofEX (calculated within the maximal biomass yield solution space) and growth

rate in all three datasets, and indeed generally stronger correlations than the biomass

objective achieved on the same data, as shown by points being to the right of the

green line in Fig. S4. This analysis again independently validated the correlation of

the sum of exchange fluxes with growth rates.

[Figure S4 plots. Panels: (a) ds18, (b) ds66, (c) ds57. x-axis: ρ, sum of extrapolated exchanges supporting max biomass (i.e., SUMofEX) vs. GR. y-axis: ρ, max biomass vs. GR. Both axes run from -1 to 1.]

Appendix 2::Figure S4: Extrapolating bounds for biomass. Allowed uptake bounds were randomly varied via uniform

distribution in the range [-50, 0] across all models for (A) ds18, (B) ds66, and (C) ds57. The variation was done such that

for a single iteration, the uptake lower bound of a compound C1 was fixed to the same randomly determined value for

all conditions in which C1 could be taken up, while the uptake of a compound C2 would take a different randomly

determined uptake than C1 across all models, etc. Maximum biomass yield was computed for each model. Then, the

sum of exchange fluxes (SUMofEX) was computed as the sum of the means of the flux variability ranges (calculated by

FVA) of all exchange reactions, under the condition of optimal biomass (i.e., SUMEX wasn’t maximized, but rather it

was summed from the means of allowed exchange fluxes that support optimal Biomass). The plots show only media

variants in which biomass correlated significantly (P≤0.05) to growth rate, as we conjectured that these points would

give the most accurate uptake bounds. Each dot represents the correlation coefficient (Spearman ρ) of growth rate vs.

SUMofEX (x-axis) or optimal biomass (y-axis) for a single variant of the medium uptake bounds. Dots show points for

which SUMofEX correlates significantly with growth rate, and red crosses show points for which SUMofEX does not

correlate significantly. The green lines have a slope of 1, so points to the right of the lines denote variants of the uptake

bounds for which SUMofEX correlated better than biomass yield with growth rate.
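The sampling scheme in the caption, in which one random bound is drawn per compound and shared across all models, might look as follows in Python. This is only a sketch under stated assumptions: the function names, model names and compound names are hypothetical, and the standard library generator stands in for whatever sampler was actually used.

```python
import random


def sample_uptake_bounds(compounds, rng, low=-50.0, high=0.0):
    """Draw one uptake lower bound per compound, uniformly in [low, high]."""
    return {c: rng.uniform(low, high) for c in sorted(compounds)}


def apply_bounds(model_exchange_ids, sampled):
    """Give a model the shared draw for every compound it can take up."""
    return {c: sampled[c] for c in model_exchange_ids if c in sampled}


rng = random.Random(0)  # fixed seed so one iteration is reproducible
models = {
    "model_A": ["glucose", "citrate"],
    "model_B": ["glucose", "xylose"],
}
all_compounds = {c for ids in models.values() for c in ids}
draw = sample_uptake_bounds(all_compounds, rng)  # one iteration
bounds = {name: apply_bounds(ids, draw) for name, ids in models.items()}
```

Within a single iteration a compound such as glucose receives the same bound in every model that can take it up, while different compounds receive independent draws.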


A.2.2.7 Gene expression of pathways contributing to SUMEX

We obtained gene expression data from [85], in which E. coli was grown on minimal

media supplemented with 6 different carbon sources. We excluded the acetate-supplemented

medium since SEED's automatically generated model is unable to use it as a carbon source,

indicating an evident flaw in the model. Thus, we were left

with data concerning 5 different growth media.

For each medium, we quantified the beneficial or deleterious effect each gene had

with respect to the realization of each of the two hypothetical cellular objectives

(SUMEX vs. biomass yield). This was done by TOX, a variant on the EDGE method

recently developed in our lab (Wagner et al., manuscript accepted for publication at

PNAS).

We considered only genes that are non-essential for biomass production in the E. coli

reconstruction, since SUMEX requires, by definition, nominal biomass yield.

TOX is defined as follows: Given a non-essential gene g, a pre-defined medium m,

and a hypothetical cellular objective f (always biomass yield or SUMEX in this

chapter):

1. Calculate the maximal value of f following KO of gene g, in medium m. Denote it $f^{KO}$. KO of g is implemented by constraining all of its associated reactions to carry a zero flux.

2. Calculate the maximal value of f, given that gene g is active, on medium m. Denote it $f^{UP}$. Activation of g is implemented by constraining all of its associated reactions to carry at least an ε flux (in absolute value), and at least one of them to carry exactly an ε flux (in absolute value), maximizing f under each of these sets of constraints, and then choosing the smallest f obtained as $f^{UP}$. We used ε = 0.1.

3. Return $TOX(g, m, f) := f^{UP} - f^{KO}$.

$TOX(g, m, f) > 0$ signifies that g contributes towards the realization of f on medium m, whereas $TOX(g, m, f) < 0$ signifies that g has a deleterious effect on f under the same conditions.
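The three steps of TOX can be sketched in Python as below. This is an illustration, not the implementation used in the chapter: `maximize` is a hypothetical callback that would wrap an LP solver, the constraint encoding is invented for the example, and for simplicity only the forward direction is considered (the "in absolute value" clause is ignored).

```python
def tox(maximize, gene_reactions, eps=0.1):
    """Return f_UP - f_KO for one gene, medium and objective.

    maximize(constraints) must return the maximal objective value under
    constraints of the form {reaction: ("fixed" | "at_least", value)}.
    """
    # Step 1: knock-out, all reactions of the gene carry zero flux.
    f_ko = maximize({r: ("fixed", 0.0) for r in gene_reactions})

    # Step 2: activation, one constraint set per reaction pinned at eps.
    candidates = []
    for pinned in gene_reactions:
        constraints = {r: ("at_least", eps) for r in gene_reactions}
        constraints[pinned] = ("fixed", eps)
        candidates.append(maximize(constraints))
    f_up = min(candidates)  # the smallest optimum is taken as f_UP

    # Step 3: the score.
    return f_up - f_ko


def toy_maximize(constraints):
    """Stand-in optimizer: each unit of forced flux costs one unit of f."""
    forced = sum(value for _kind, value in constraints.values())
    return 10.0 - forced


score = tox(toy_maximize, ["rxn1", "rxn2"])
```

In this toy setting forcing flux through the gene's reactions lowers the objective, so the score is negative and the gene would be classed as deleterious.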

We next calculated a one-sided P-value for the Wilcoxon rank-sum test to determine

whether genes with positive TOX scores have significantly higher expression levels

than those with negative scores, which would serve as a confirmation that the a

priori objective f is predictive of actual gene expression. With SUMEX as the

hypothetical objective, significant scores were obtained in all media, save one that

obtained borderline significance (Table S3). With biomass yield as the objective

function, none of the media showed any significance.

We then verified that metabolic pathways predicted to be active by SUMEX are

indeed expressed. We say that “an objective f predicts gene g to be active on medium

m” if $TOX(g, m, f) > 0$. Taking the set of highly-expressed genes as the set of active

genes on each medium, we find that the predictions of SUMEX outperform those of

biomass yield, both in terms of precision and of recall (Table S3).

Highly-expressed genes on a given medium are defined as ones whose expression

has a z-score greater than or equal to 1

0 .8 on that medium. We verified that our

results were insensitive to the choice of ε and expression z-score threshold within

reasonable bounds.
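As a concrete reading of the precision and recall comparison, the following Python sketch treats genes with a positive TOX score as predicted active and genes whose expression z-score reaches a threshold as truly active. The gene names, scores and the threshold value passed in are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def precision_recall(tox_scores, z_scores, z_threshold):
    """Compare predicted-active genes (TOX > 0) with highly-expressed genes."""
    predicted = {g for g, s in tox_scores.items() if s > 0}
    actual = {g for g, z in z_scores.items() if z >= z_threshold}
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical scores for four genes on one medium.
tox_scores = {"g1": 0.5, "g2": -0.2, "g3": 1.1, "g4": 0.3}
z_scores = {"g1": 1.4, "g2": 1.2, "g3": 0.1, "g4": 2.0}
precision, recall = precision_recall(tox_scores, z_scores, z_threshold=1.0)
```

Here three genes are predicted active and three are highly expressed, with two in common, so both precision and recall equal 2/3.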


              p-val:              precision:          recall:
C-source:     Biomass   SUMEX     Biomass   SUMEX     Biomass   SUMEX
Glucose       0.71      0.01      0.00      0.17      0.00      0.59
Glycerol      0.72      0.06      0.08      0.20      0.01      0.63
Succinate     0.84      0.05      0.13      0.20      0.03      0.63
L-Alanine     0.59      0.02      0.08      0.22      0.01      0.68
L-Proline     0.30      0.04      0.08      0.19      0.01      0.62
Average:      0.63      0.04      0.08      0.19      0.01      0.63

Appendix 2::Table S3: Association of global gene expression with SUMEX. Significance of correlation between (first)

the medium-dependent contribution of each gene to the realization of each of the two hypothetical cellular objectives

(biomass yield or SUMEX) and (second) measured expression of the gene on minimal media supplemented with one of

the specified carbon sources. The leftmost table specifies the p-values obtained for a 1-sided rank-sum test for the

alternative that non-essential genes that are beneficial with respect to the cellular objective (either SUMEX or biomass

yield) have higher expression than those deleterious towards that objective. The two other tables report the precision

and recall values when predicting highly-active genes (whose expression has a score of at least 1

0 .8

by picking genes

that were beneficial towards either SUMEX or biomass yield. Method from Wagner et al., personal communication.

A.2.2.8 Correlation of SUMEX and Biomass:

We calculated the correlation of SUMEX and Biomass (2-sided Spearman test), and

found that they correlate highly significantly on ds66, weakly on ds57, and

insignificantly on ds18 (see Fig. S5).

[Figure S5 plots: SUMEX (y-axis) vs. Biomass (x-axis). (a) ds66: Spearman ρ = 0.90, P < 1E-7. (b) ds57: Spearman ρ = 0.41, P = 1.2E-3. (c) ds18: Spearman ρ = 0.36, P = 0.14.]

Appendix 2::Figure S5: Correlation of Biomass with SUMEX. Plots of SUMEX versus Biomass are presented for (a)

ds66, (b) ds57, and (c) ds18, along with Spearman tests.


A.2.3 Supplementary Methods:

A.2.3.1 Models

Unless otherwise noted, analyses were done on genome-scale metabolic

reconstructions (GEMs) as obtained from SEED [25], at http://seed-viewer.theseed.org/.

The 66 organisms in ds66 were chosen because (1) their GEMs

were available from SEED and published in [25], and (2) their optimal doubling

times were available from [78]. For analysis of ds24, the iAF1260 E. coli model was

used, and the NCI60 cancer cell line analysis used custom-made models based on the

generic human model (described below). Table S4 lists the names of the ds66

models and organisms.

A.2.3.2 General methods

Linear Programming (LP) and Quadratic Programming (QP) calculations were done

using IBM Cplex software on an Intel based machine running Linux. The Spearman

correlation calculations and other analyses were done using either Matlab software or

Java.

Optimizations were run in in silico environments consistent with the known media,

where all exchange metabolites for a given species were available at a fixed rate of -50.0.

In the case of ds66, the environment was 'rich', so we allowed uptake flux in

all exchange reactions for all organisms. Other constraints are described in the

following section.

By convention, exchange fluxes denoting entrance of a metabolite into the cell

(uptake) are negative valued, while exchanges denoting exit of a metabolite from the

cell (output / secretion) are positive valued. Therefore, maximizing the total

exchange flux (i.e. the SUMEX metric) would denote maximizing the output at the

expense of the input (output exchanges – input exchanges).
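The sign convention can be checked on a small example. The Python sketch below splits a hypothetical flux vector into secretion and uptake and confirms that the plain sum of exchange fluxes equals output minus input; the reaction IDs and flux values are invented for illustration.

```python
def exchange_summary(fluxes, exchange_ids):
    """Return (secretion, uptake, total) for the given exchange reactions.

    Secretion and uptake are reported as non-negative magnitudes; total is
    the signed sum of the exchange fluxes.
    """
    secretion = sum(fluxes[r] for r in exchange_ids if fluxes[r] > 0)
    uptake = -sum(fluxes[r] for r in exchange_ids if fluxes[r] < 0)
    total = sum(fluxes[r] for r in exchange_ids)
    return secretion, uptake, total


# Hypothetical flux vector: negative = uptake, positive = secretion.
fluxes = {"EX_glucose": -10.0, "EX_acetate": 6.0, "EX_co2": 8.0, "R_internal": 3.0}
secretion, uptake, total = exchange_summary(
    fluxes, ["EX_glucose", "EX_acetate", "EX_co2"]
)
```

With these numbers secretion is 14, uptake is 10, and the signed sum is 4, which is what a SUMEX-style objective would be maximizing.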




A.2.3.3 Reaction constraints and optimal environment setting

Unless stated otherwise, we used the following constraints on the reaction fluxes, and in the definition of rich media:

For irreversible reactions:

Exchange reactions: 0 ≤ Vi,ex ≤ Vi,Max_ex (Vi,Max_ex = 1000)

Other reactions: 0 ≤ Vi ≤ Vi,Max (Vi,Max = 1000)

For reversible reactions:

Exchange reactions: Vi,Min_ex ≤ Vi,ex ≤ Vi,Max_ex (Vi,Min_ex = -50, Vi,Max_ex = 1000)

Other reactions: Vi,Min ≤ Vi ≤ Vi,Max (Vi,Min = -1000, Vi,Max = 1000)
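The default bounds listed above can be captured in a small helper. The Python sketch below is illustrative only; the function name and argument names are hypothetical, and the numeric defaults are the values stated in this section.

```python
def default_bounds(is_exchange, is_reversible, uptake_limit=-50.0, max_flux=1000.0):
    """Return (lower, upper) flux bounds under the default constraints."""
    if not is_reversible:
        # Irreversible reactions, exchange or not, cannot run backwards.
        return (0.0, max_flux)
    # Reversible exchange reactions may take up at most |uptake_limit|;
    # other reversible reactions are symmetric around zero.
    lower = uptake_limit if is_exchange else -max_flux
    return (lower, max_flux)


reversible_exchange = default_bounds(is_exchange=True, is_reversible=True)
reversible_internal = default_bounds(is_exchange=False, is_reversible=True)
irreversible_exchange = default_bounds(is_exchange=True, is_reversible=False)
```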

A.2.3.4 Building NCI60 cancer cell models

Our method to reconstruct the NCI60 cancer cell lines (based on the yet unpublished

methods in Yizhak et al, personal communication) required several key inputs: (a)

the generic human model [20], (b) gene expression data for each cancer cell line

from [87], and (c) growth rate measurements (Note: the growth rates were used only

to determine which genes should be used in constraining the models, in order to


obtain models that were as physiologically relevant as possible; they were not used to

determine the weight on the bounds, etc.). The algorithm then reconstructs a specific

metabolic model for each cell line by modifying the upper bounds of reactions in

accordance with the expression of the individual gene microarray values.

Specifically, the model reconstruction process is as follows:

(1) Decompose reversible reactions into unidirectional forward and backward

reactions.

(2) Evaluate the correlation between the expression of each reaction in the

network and the measured growth rate. The expression of a reaction is

defined as the mean over the expression of the enzymes catalyzing it.

(3) Modify upper bounds on reactions demonstrating significant correlation

to the growth rate (after correcting for multiple hypothesis using FDR) in

a manner that is linearly related to expression value.

We were able to produce feasible models for 60 cell lines using this procedure, and

these 60 were used for the analyses presented in the chapter. The NCI60 models thus

described were optimized for either biomass yield or SUMEX to obtain the results in

Fig. 3.
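Steps (1) and (2) of the reconstruction can be illustrated with a short Python sketch. It assumes that a reversible reaction has bounds with lower < 0 ≤ upper, and all identifiers (the `_f` and `_b` suffixes, gene names) are hypothetical. Step (3) is left out because the exact linear mapping from expression to upper bound is not specified here.

```python
def split_reversible(reactions):
    """Decompose reversible reactions into forward and backward parts.

    reactions maps a reaction ID to (lower, upper) bounds. A reaction with
    a negative lower bound is replaced by two non-negative reactions.
    """
    result = {}
    for rid, (lo, hi) in reactions.items():
        if lo < 0:
            result[rid + "_f"] = (0.0, max(hi, 0.0))  # forward direction
            result[rid + "_b"] = (0.0, -lo)           # backward direction
        else:
            result[rid] = (lo, hi)
    return result


def reaction_expression(enzyme_genes, gene_expression):
    """Mean expression over the measured enzymes catalyzing a reaction."""
    values = [gene_expression[g] for g in enzyme_genes if g in gene_expression]
    return sum(values) / len(values) if values else None


split = split_reversible({"R1": (-1000.0, 1000.0), "R2": (0.0, 1000.0)})
expr = reaction_expression(["geneA", "geneB"], {"geneA": 2.0, "geneB": 4.0})
```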

A.2.3.5 Computation of metrics

Following is an explanation of the exact way we calculated each of the metrics listed

in Fig. 1A:

Sum of exchange fluxes (SUMEX):

The sum of exchange fluxes (SUMEX) follows this procedure:

1. In addition to standard uptake constraints (see previous sections), we set a

lower limit on biomass yield at 5% of its maximum, as determined by FBA

on the given medium.

2. We search within this space for the max achievable exchange flux (secretion

– uptake; calculated as the sum of exchange fluxes). SUMEX is the optimal

value.


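This can be represented mathematically as:

$$
\begin{aligned}
\max_{V} \quad & \sum_{i=1}^{n} V_{exchange,i} \\
\text{subject to:} \quad & S \cdot V = 0 \\
& V_{biomass} \ge V_{biomass}^{min} = 0.05 \, V_{biomass}^{max} \\
& v_{j,min} \le v_j \le v_{j,max}, \quad \forall v_j \in V
\end{aligned}
$$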

Where S is the stoichiometric matrix of metabolites and V is the vector of reactions

that together define the metabolic model. SV = 0 defines the steady state of the

metabolic model, and the limits on Vi are as defined in the reactions constraints

section. Vbiomass is the flux through the biomass reaction, and Vmaxbiomass is the

maximal achievable biomass yield, as determined through maximization of the

biomass objective function (see next section). Because all exchange fluxes by

definition point outwards (i.e., positive flux denotes secretion), the sum of exchanges

intrinsically minimizes metabolic uptake and maximizes metabolic secretion in a

single optimization. In practice we exclude Vbiomass from the Vexchange vector when

calculating SUMEX, but adding Vbiomass back in has no significant effect on the

solution (since Vbiomass is typically very small in the SUMEX solution). The process

is illustrated schematically in Fig. S6.
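To show how the two-step procedure translates into solver inputs, the Python sketch below builds the SUMEX objective vector and the modified biomass bound from a list of reactions. It does not solve anything: the output would be handed to an LP solver of choice together with the stoichiometric matrix. The function name, reaction IDs and the value of `max_biomass` are hypothetical.

```python
def build_sumex_problem(reaction_ids, exchange_ids, biomass_id, bounds,
                        max_biomass, fraction=0.05):
    """Return (objective, bounds) for the SUMEX maximization.

    objective has coefficient 1 for every exchange reaction (biomass
    excluded) and 0 elsewhere; the biomass lower bound is raised to
    fraction * max_biomass, where max_biomass comes from a prior FBA run.
    """
    exchange = set(exchange_ids) - {biomass_id}
    objective = [1.0 if r in exchange else 0.0 for r in reaction_ids]
    new_bounds = dict(bounds)
    lo, hi = new_bounds[biomass_id]
    new_bounds[biomass_id] = (max(lo, fraction * max_biomass), hi)
    return objective, new_bounds


reaction_ids = ["EX_glucose", "EX_acetate", "R_internal", "biomass"]
bounds = {
    "EX_glucose": (-50.0, 1000.0),
    "EX_acetate": (-50.0, 1000.0),
    "R_internal": (-1000.0, 1000.0),
    "biomass": (0.0, 1000.0),
}
objective, sumex_bounds = build_sumex_problem(
    reaction_ids, ["EX_glucose", "EX_acetate"], "biomass", bounds, max_biomass=2.0
)
```

Because uptake fluxes are negative, a single maximization of this objective pushes secretion up and uptake towards zero at the same time.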

[Figure S6 schematic: cell diagrams labeled "Anabolism" and "Catabolism + H+ gradient", with example sums of exchange (SUMEX) of 2 - 1 = 1 and 1 - 2 = -1, and a growth rate scale running from low to high.]

Appendix 2::Figure S6: Schematic of SUMEX. The summing of molar fluxes through exchange reactions, i.e., the

quantity maximized in SUMEX, is illustrated. A high SUMEX value is achieved by high output fluxes and low input

fluxes. This is achieved mathematically through a single optimization, due to the sign convention that all exchange

fluxes by default point outwards.

Maximal biomass objective function

This is the standard method for determining maximal biomass yield in a given

environment using GEMs. We have taken the biomass function defined by the

automatic metabolic models generator [25] and we calculated its value when each of

the organisms was grown in its given media.

The objective function solved was:

$$
\begin{aligned}
\max_{V} \quad & V_{biomass} \\
\text{subject to:} \quad & S \cdot V = 0 \\
& v_{j,min} \le v_j \le v_{j,max}, \quad \forall v_j \in V
\end{aligned}
$$

Where S is the stoichiometric matrix of metabolites and V is the vector of reactions

that together define the metabolic model. SV = 0 defines the steady state of the

metabolic model, and the limits on Vi are as defined in the reactions constraints

section. This metric has been described extensively elsewhere (e.g., [5, 149]).

Codon usage bias:


This metric was described in [78]. Codon usage biases for the 66 organisms of

interest for our study were kindly provided by Vieira-Silva and Rocha.

Uptake exchange reactions count

This topological metric provides a simple sum of the number of uptake exchange

reactions in a model (i.e., exchange reactions through which flux can enter the

organism).

All exchange reactions count

This topological metric provides a simple sum of the total number of exchange

reactions of the organism.

Maximize biomass with all critical uptake metabolites limited

This metric assesses the maximal biomass achievable under a limited uptake

environment. For this analysis, all critical uptake reactions had their fluxes limited to

-10.0 (negative indicating entrance into the cell, by standard convention). 'Critical'

uptake reactions are those whose metabolites are fully consumed when the organism

is grown in an optimal environment. Other than the change in constraints, the

maximization was identical to the maximization of biomass metric.

Minimize molar carbon consumption per biomass unit

This metric is predicated on the hypothesis that evolution has driven selection for the

most efficient usage of carbon in production of biomass. It is based on a metric from

[77], except instead of 'glucose' we minimize molar carbon uptake, as our models

are grown in complex media.

We calculated this objective function in 2 steps:

Step 1)

Calculate the maximum biomass of the organisms when grown in a given media (see

biomass objective description for details).


Step 2)

Calculate the maximum of the sum of fluxes through exchange reactions that contain carbon and

that are able to carry flux, while fixing the biomass flux at its maximum value.

**Note: Because of the sign conventions on fluxes, when maximizing the flux of

uptake exchange reaction we are actually minimizing the uptake of the specific

exchange metabolite represented by this reaction, as uptake fluxes have a negative

value in our models.

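The linear program solved by the second step is:

$$
\begin{aligned}
\max_{V} \quad & \sum_{i \in V_{sc}} V_i \\
\text{subject to:} \quad & V_{biomass} = V_{biomass}^{max} \\
& S \cdot V = 0 \\
& v_{j,min} \le v_j \le v_{j,max}, \quad \forall v_j \in V
\end{aligned}
$$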

Where Vsc is the group of uptake exchange reactions that are able to carry flux and

that contain carbon in their exchange metabolite.
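One way to pick out the carbon-containing exchange reactions is to count carbon atoms in each metabolite's chemical formula. The Python sketch below does this with a regular expression; it assumes simple formulas without parentheses or charges, and the reaction IDs are hypothetical (the formulas for xylose, deoxythymidine and glucose-6-phosphate are those listed in Table S6).

```python
import re


def element_count(formula, element):
    """Count atoms of one element in a simple formula such as 'C6H12O9P'."""
    count = 0
    # Each match is an element symbol followed by an optional number.
    for symbol, digits in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if symbol == element:
            count += int(digits) if digits else 1
    return count


def carbon_exchanges(exchange_formulas):
    """Return the exchange reactions whose metabolite contains carbon."""
    return [rid for rid, formula in exchange_formulas.items()
            if element_count(formula, "C") > 0]


exchange_formulas = {
    "EX_xylose": "C5H10O5",
    "EX_deoxythymidine": "C10H14N2O5",
    "EX_g6p": "C6H12O9P",
    "EX_cobalt": "Co",   # cobalt, not carbon plus oxygen
    "EX_phosphate": "HO4P",
}
carbon_set = carbon_exchanges(exchange_formulas)
```

The two-letter symbol handling matters here: without it, cobalt ("Co") would be miscounted as a carbon-containing metabolite.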

Reactions count

Here we took the total count of reactions in the model, with the idea that a larger

metabolism might correlate with a faster growth rate.

Maximize sum of network flux

Here we determined the sum of fluxes in the network, as an indicator of the general

activity level of the metabolic network. We computed as follows:

$$
\begin{aligned}
\max_{V} \quad & \sum_{j} V_j \\
\text{subject to:} \quad & S \cdot V = 0 \\
& v_{j,min} \le v_j \le v_{j,max}, \quad \forall v_j \in V
\end{aligned}
$$

Maximal biomass per squared flux unit

This method assesses the ability of a GEM to produce biomass while minimizing

enzyme usage, as measured through the following formula:

$$
\max \frac{V_{biomass}}{\sum_i V_i^2}
$$

Step 1)

Maximum biomass was calculated in an optimal environment.

Step 2)

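Fixing biomass to its maximum value, we minimized the sum of squared fluxes over all reactions of the organism:

$$
\begin{aligned}
\min_{V} \quad & \sum_{i \in \text{all reactions}} V_i^2 \\
\text{subject to:} \quad & V_{biomass} = V_{biomass}^{max} \\
& S \cdot V = 0 \\
& v_{j,min} \le v_j \le v_{j,max}, \quad \forall v_j \in V
\end{aligned}
$$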
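Once the quadratic program has returned a flux vector, the metric itself is a simple ratio. The Python sketch below evaluates it for a hypothetical flux vector; it assumes the sum of squares runs over every reaction in the vector, including the biomass reaction, and all names and values are invented.

```python
def biomass_per_squared_flux(fluxes, biomass_id):
    """Return V_biomass divided by the sum of squared fluxes."""
    squared_sum = sum(v * v for v in fluxes.values())
    if squared_sum == 0.0:
        raise ValueError("all fluxes are zero; the ratio is undefined")
    return fluxes[biomass_id] / squared_sum


# Hypothetical minimal-norm flux vector at maximal biomass.
fluxes = {"EX_glucose": -10.0, "R_internal": 10.0, "biomass": 2.0}
metric = biomass_per_squared_flux(fluxes, "biomass")
```

Here the sum of squares is 100 + 100 + 4 = 204, so the metric is 2 / 204.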


Maximize biomass under limited phosphate molar uptake

For this metric and a number of others, we assessed maximal biomass under limited

nutrient uptake conditions. In this metric, we limited phosphate uptake. Specifically,

we solved the following optimization problem:

$$
\begin{aligned}
\max_{V} \quad & V_{biomass} \\
\text{subject to:} \quad & \sum_{p \,\in\, \text{seed exchange reactions containing phosphate}} v_p \cdot Phosphate\_count(v_p) \ge -10.0 \\
& S \cdot V = 0 \\
& v_{j,min} \le v_j \le v_{j,max}, \quad \forall v_j \in V
\end{aligned}
$$

Where Phosphate_count (Vp) is the molar count of phosphate in the uptake exchange

reactions that are able to carry flux.

We limited the total molar amount of phosphate to a value of -10.0, as we observed

that providing higher levels did not limit growth, while reducing the limit was too

limiting for some organisms, leading to minimal growth and reduced correlation.
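The limiting constraint can be read as a weighted sum of exchange fluxes, with each flux weighted by the number of phosphate groups in its metabolite. The Python sketch below evaluates that sum for a given flux vector and checks it against the limit; the reaction IDs, counts and fluxes are hypothetical, and the same helper would apply to the nitrogen and carbon variants with their own limits.

```python
def molar_element_flux(fluxes, element_counts):
    """Sum of flux * element count over the listed exchange reactions.

    Negative values mean net uptake of the element.
    """
    return sum(fluxes.get(r, 0.0) * n for r, n in element_counts.items())


def within_uptake_limit(fluxes, element_counts, limit):
    """True if net element uptake does not exceed |limit| (limit <= 0)."""
    return molar_element_flux(fluxes, element_counts) >= limit


# Hypothetical phosphate-containing exchanges, one phosphate group each.
phosphate_counts = {"EX_phosphate": 1, "EX_g6p": 1}
ok_fluxes = {"EX_phosphate": -4.0, "EX_g6p": -3.0}
too_much = {"EX_phosphate": -8.0, "EX_g6p": -3.0}
allowed = within_uptake_limit(ok_fluxes, phosphate_counts, limit=-10.0)
rejected = within_uptake_limit(too_much, phosphate_counts, limit=-10.0)
```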

Maximize biomass under limited nitrogen molar uptake

This metric is the same as the phosphate limitation metric, except nitrogen is limited

instead. We limited the total molar amount of nitrogen to a value of -100.0 for the

rationale described for the phosphate limitation metric.

Maximize biomass under limited carbon molar uptake

This metric is the same as the phosphate limitation metric, except carbon is limited

instead. We limited the total molar amount of carbon to a value of -1000.0 for the

rationale described for the phosphate limitation metric.

Maximize ATP maintenance (i.e., hydrolysis) reaction


This metric assesses the maximal molar amount of ATP that can be charged from

ADP in the cell, given a set of inputs as media. ATP production, which is a measure

of efficiency of energy production, is often considered as an alternative metric to

biomass in genome-scale models. Production of more energy from a fixed set of

cellular uptakes would thus logically be associated with stronger or faster growth.

The rationale behind this metric is that evolution drives maximal energetic efficiency.

As none of the models contained an ATP maintenance (i.e., hydrolysis) reaction we

added that reaction:

ATP + H2O -> ADP + H + Phosphate.

The linear problem computed is to maximize this ATP hydrolysis (also called 'ATP maintenance') reaction:

$$
\begin{aligned}
\max_{V} \quad & V_{atp} \\
\text{subject to:} \quad & S \cdot V = 0 \\
& v_{j,min} \le v_j \le v_{j,max}, \quad \forall v_j \in V
\end{aligned}
$$

(Where Vatp is the ATP maintenance reaction).

Maximal ATP maintenance (i.e., hydrolysis) per squared flux unit

This method is based on a hypothesis that cells operate to maximize ATP

maintenance yield (i.e., the total amount of ATP that can be charged in a given

environment) while minimizing enzyme usage. The total metric can be stated as

follows:

$$
\max \frac{V_{atp}}{\sum_i V_i^2}
$$


Step 1)

Calculate maximum ATP maintenance flux as described under 'Maximize ATP

maintenance reaction.'

Step 2)

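Calculate the minimum sum of squared fluxes over all reactions of the organism while fixing the ATP maintenance flux at its maximum, using the following optimization:

$$
\begin{aligned}
\min_{V} \quad & \sum_{i \in V_{all\ reaction}} V_i^2 \\
\text{subject to:} \quad & V_{atp} = V_{atp}^{max} \\
& S \cdot V = 0 \\
& v_{j,min} \le v_j \le v_{j,max}, \quad \forall v_j \in V
\end{aligned}
$$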

where Vall reaction is the set of all the reactions in the metabolic model of the organism.

A.2.3.6 Growth experiments of 6 organisms on 3 defined IMM media (ds18)

To validate SUMEX, we performed in vitro experiments to measure the growth rates

of a number of organisms (listed in Table S5) in multiple environments. Growth

experiments were conducted in 96-well plates at 30°C, with continuous shaking,

using a Biotek ELX808IU-PC microplate reader. Optical density was measured

every 15 minutes at a wavelength of 595 nm. Growth rates were determined during

early to mid exponential growth phase by taking the slope of a linear fit through the

natural log of the data.
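The growth rate estimate described above, the slope of a linear fit through the natural log of the optical density, can be sketched in Python with an ordinary least-squares fit. The time unit (hours), the synthetic readings and the function name are assumptions made for the example; selecting the exponential-phase window is left to the caller.

```python
import math


def growth_rate(times, ods):
    """Least-squares slope of ln(OD) against time."""
    if len(times) != len(ods) or len(times) < 2:
        raise ValueError("need at least two paired time/OD readings")
    logs = [math.log(od) for od in ods]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, logs))
    return sxy / sxx


# Synthetic exponential-phase readings every 15 minutes (time in hours).
times = [0.0, 0.25, 0.5, 0.75, 1.0]
ods = [0.05 * math.exp(0.4 * t) for t in times]
mu = growth_rate(times, ods)
```

For perfectly exponential data the fitted slope recovers the rate used to generate the readings, here 0.4 per hour.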

Using models taken from SEED [25], we calculated various growth metrics (see Fig.

1A) in in silico environments mirroring the environments from the in vitro

experiments. Table S6 contains the environment [147] used in vitro (and in silico)

and the changes done to it in the different experiments.


A.2.4 Tables

Appendix 2::Table S4: Description of ds66. Description of the 66 organisms that were used in the article, including

categorization into respirers and obligate fermenters (and the sources used to determine those categories). Biomass and

doubling times are for growth in an optimal environment. (The doubling times are from [78]).

The table will be provided upon request by e-mail: [email protected].

Organism:                                      Medium:   Growth Rate:
Agrobacterium tumefaciens str. c58             IMM       0.09
Bacillus subtilis subsp. subtilis str. 168_4   IMM       0.32
Escherichia coli W3110                         IMM       0.17
Listeria innocua Clip11262                     IMM       0.09
Pseudomonas aeruginosa PAO1                    IMM       0.48
Serratia marcescens                            IMM       0.45
Agrobacterium tumefaciens str. c58             IMM-gt    0.05
Bacillus subtilis subsp. subtilis str. 168_4   IMM-gt    0.13
Escherichia coli W3110                         IMM-gt    0.04
Listeria innocua Clip11262                     IMM-gt    0.00
Pseudomonas aeruginosa PAO1                    IMM-gt    0.58
Serratia marcescens                            IMM-gt    0.25
Agrobacterium tumefaciens str. c58             IMMxt     0.23
Bacillus subtilis subsp. subtilis str. 168_4   IMMxt     0.32
Escherichia coli W3110                         IMMxt     0.15
Listeria innocua Clip11262                     IMMxt     0.21
Pseudomonas aeruginosa PAO1                    IMMxt     0.36
Serratia marcescens                            IMMxt     0.40

Appendix 2::Table S5: In vitro growth experiments (i.e., ds18). This table provides a list of in vitro growth experiments performed in our lab for validation of SUMEX. The table lists the species and the environments used. Simulations for Serratia

marcescens were done using an in silico model of S. odorifera 4Rx14.796


Metabolite In vitro medium In silico medium

Thiamin + +

D-Methionine + +

Magnesium + +

L-Valine + +

L-Isoleucine + +

L-Leucine + +

L-Histidine + +

Calcium + +

D-Glucose-6-phosphate + +

Potassium + +

Citrate + +

L-Arginine + +

L-Tryptophan + +

L-Phenylalanine + +

Biotin + +

Riboflavin + +

Adenine + +

Pyridoxal + +

Nicotinamide_D-ribonucleotide + +

L-Glutamine + +

L-Cysteine + +

Lipoic acid + -

para-aminobenzoic acid + -

Oxygen + +

Cytosine - +

Zinc - +

Cobalt - +

Fe2+ - +

Chloride - +

Sulfate - +

Copper2 - +

Manganese - +

Spermidine - +

gly-asn-L - +

sn-Glycerol-3-phosphate - +

Octadecanoate - +

Additions done for the enlarged IMM environment (IMMxt):


Xylose C5H10O5

Deoxythymidine C10H14N2O5

Removals done for the reduced IMM environment (IMM-gt):

Thiamin C12H17N4OS

D-Glucose_6-phosphate C6H12O9P

Appendix 2::Table S6: IMM defined medium.

This table provides the IMM defined medium [147] and its in silico representation. IMM was also modified to generate two alternate media.



A.3.1 Figures


Appendix 3::Supplementary Figure 1: KEGG Glycans. a-e: Examples of different types of glycans found in the KEGG

Glycans database. (a) A textual representation of the KCF graph for glycan G00010. (b) A visual representation of the

KCF graph for glycan G00010. (c) A regular glycan. (d) A Linear repeating glycan. (e) A non linear repeating glycan. (f)

Reconstruction of GlyDe reactions. The GlyDe algorithm receives a glycan graph structure and an EC number as input

and generates the appropriate glycan products, while considering the following rules: Glycosidic linkages hydrolyzed,

Endo vs. exo acting enzyme, Degree of polymerization preference, Reducing vs. non-reducing end preference, Contained

sub-glycan, Glycan Released. In the example depicted in the figure, these fields had the following values respectively:

Glc a1-2 Glc, exo, 10+, non-reducing, unknown, TAU00015 (the glycan ID of glucose).


Appendix 3::Supplementary Figure 2: Glycan Degradation of the gut microbiota reference genomes. (a) A cross-

validation process was performed to see that GlyDe reaction products are enriched with existing glycans rather than

hypothetical ones. The Venn diagram depicts a significant overlap between the products created by GlyDe (green) and

the original glycans in the KEGG Glycan database (purple). (b) Principal Coordinates Analysis (PCoA) of the glycan

degradation profiles of the species colored according to their respective phyla. (c) The bar chart depicts the median

GlyDe score of the species belonging to each genus and colored according to their respective phyla. (d) A heatmap

denoting the average relative glycan degradation efficiency of the different bacterial genera in the study. Each entry was

calculated based on the average sum of GlyDe scores per genus for a specific degree of polymerization (DP) category

(e.g. Disaccharides) and normalized by the overall sum of GlyDe scores of all the genera for the same DP category.


[Supplementary Figure 3, panels a and b. Panel a: x-axis PCo1 = 46.07% (ticks from -200 to 400); legend: Carnivores, Herbivores, Omnivores.]

Appendix 3::Supplementary Figure 3: The connection between glycan degradation and diet. (a) The Muegge et al.

dataset. Principal Coordinate Analysis of the GlyDe profiles of all the samples. The first principal coordinate shows a

gradient running from Herbivores (red) to Omnivores (green) and Carnivores (blue). (b & c) The Yatsunenko

et al. dataset. (b) The box plots represent the variation in GlyDe profiles of the samples over the first principal

coordinate, separated into bins according to the age of the host. (c) Box plots showing the differences between the

Animal- to Plant-specific GlyDe score ratios of adults in Malawi, Venezuela and the United States of America.

A.3.2 Tables

Due to their size, the supplementary tables will be supplied upon request by e-mail:

[email protected]

Appendix 3::Supplementary Table 1: CAZyme degradation rules. A description of the manually curated GlyDe rules

that describe each CAZyme (EC number) by the following fields: Enzyme Name, KEGG Reactions, Glycosidic

Linkages Hydrolyzed, Contained Sub-glycan, Glycan Released, Endo vs. Exo, DP preference, Terminal Side Preference,

Enzymatic Reaction, and Comments.

Appendix 3::Supplementary Table 2: Comparison between KEGG and non-KEGG GlyDe scores. A comparison of all

GlyDe scores (column 3) to whether they are degraded in KEGG or not (column 4). The first column indicates the SEED

ID of the HMP taxon and the second column indicates the Glycan ID in KEGG.

Appendix 3::Supplementary Table 3: The GlyDe outputs for all the HMP taxa. The GlyDe outputs for all the HMP taxa

including their NCBI Taxon IDs, scientific names, number of unique CAZymes, number of total CAZymes, GlyDe

scores, number of glycans degraded according to KEGG, number of glycans degraded according to GlyDe, the Animal-,

Plant- and Bacteria-specific GlyDe scores, and GlyDe scores for Disaccharides, Oligosaccharides, Short Polysaccharides

and Long Polysaccharides.

[Supplementary Figure 3, panel c. y-axis: Animal/Plants GlyDe Ratio (ticks from 0 to 1.8); groups: United States of America, Venezuela, Malawi.]

Appendix 3::Supplementary Table 4: The GlyDe scores of 8 Human Milk Oligosaccharides (HMOs) available in KEGG. MSMFLNnH (G02805), LNFP-I (G00535), LNFP-II (G00623), LNFP-III (G00557), MFLNH-I (G02935), MFLNH-III

(G00680), MFpLNH-IV (G01926), IFLNH-I (G12225). KEGG glycans were manually classified as HMOs based on

structural similarity to glycans found in Zivkovic et al. [150].

Appendix 3::Supplementary Table 5: The CAZyme table. The values in the table represent the efficiency in which

different CAZymes (rows) break different glycans (columns). A detailed explanation is given in Figure 1b and the

methods section.

Appendix 3::Supplementary Table 6: A detailed account of the glycans used throughout this analysis. The table

includes: Glycan ID, Name, Composition, Degree of Polymerization, Class and Biological Origin.

Appendix 3::Supplementary Table 7: The full list of monosaccharides and other basic chemical entities used as nodes in

the graphs representing glycan structures in KEGG and incorporated into our system. Glycans which contain unknown

nodes were filtered out. Other columns indicate the internal ID used by our platform, the KEGG Compound number

and the SEED cpd number.

Appendix 3::Supplementary Table 8: An OTU table representing HMP taxa (columns) found in HMP samples (rows)

according to 16S rRNA sequence similarity and reconstructed using QIIME (Methods).

Appendix 3::Supplementary Table 9: The HMP bacterial reference genomes glycan degradation (GlyDe) matrix. Each

entry in the matrix represents the GlyDe score of a glycan (column) by a bacterial taxon (row). The IDs of the bacterial

taxa are taken from SEED [112].

Appendix 3::Supplementary Table 10: The Muegge et al. samples CAZymes matrix. Each entry in the matrix represents

the abundance of a CAZyme (column) by a sample (row).

Appendix 3::Supplementary Table 11: The Yatsunenko et al. samples CAZymes matrix. Each entry in the matrix

represents the abundance of a CAZyme (column) by a sample (row).

Appendix 3::Supplementary Table 12: The Muegge et al. samples GlyDe matrix. Each entry in the matrix represents

the GlyDe score of a glycan (column) by a sample (row).

Appendix 3::Supplementary Table 13: The Yatsunenko et al. samples GlyDe matrix. Each entry in the matrix

represents the GlyDe score of a glycan (column) by a sample (row).

Appendix 3::Supplementary Table 14: The Muegge et al. samples GlyDe output report. The report includes the sample ID, number

of unique CAZymes, the number of total CAZymes, the Total GlyDe scores, the number of glycans degraded according

to KEGG, the number of glycans degraded according to GlyDe, the Plant-, Animal- and Bacteria-specific GlyDe scores,

and GlyDe scores for Disaccharides, Oligosaccharides, and short and long Polysaccharides.

Appendix 3::Supplementary Table 15: The Yatsunenko et al. samples GlyDe output report. The fields are similar to

Supplementary Table 14.

Appendix 3::Supplementary Table 16: Host diet predictions for 18 human samples taken from Muegge et al. An SVM

classifier was trained based on the GlyDe profiles of herbivore and carnivore animals and used to classify the human

samples.

Appendix 3::Supplementary Table 17: Taxa with highly predictable abundance. The table shows the standard error in

predicted abundance of the HMP bacterial taxa in the two abundance-derived clusters.

Bibliography

1. Kell, D.B., Systems biology, metabolic modelling and metabolomics in drug

discovery and development. Drug Discov Today, 2006. 11(23-24): p. 1085-

92.

2. Hucka, M., et al., The systems biology markup language (SBML): a medium

for representation and exchange of biochemical network models.

Bioinformatics, 2003. 19(4): p. 524-31.

3. Erdi, P. and J. Toth, Mathematical models of chemical reactions: theory

and applications of deterministic and stochastic models. 1989, Princeton,

N.J.: Princeton University Press. xxiv, 259 p.

4. Price, N.D., J.L. Reed, and B.O. Palsson, Genome-scale models of microbial

cells: evaluating the consequences of constraints. Nat Rev Microbiol, 2004.

2(11): p. 886-97.

5. Orth, J.D., I. Thiele, and B.O. Palsson, What is flux balance analysis? Nat

Biotechnol, 2010. 28(3): p. 245-8.

6. Kauffman, K.J., P. Prakash, and J.S. Edwards, Advances in flux balance

analysis. Curr Opin Biotechnol, 2003. 14(5): p. 491-6.

7. Mahadevan, R. and C.H. Schilling, The effects of alternate optimal solutions

in constraint-based genome-scale metabolic models. Metab Eng, 2003. 5(4):

p. 264-76.

8. Almaas, E., Z.N. Oltvai, and A.L. Barabasi, The activity reaction core and

plasticity of metabolic networks. PLoS Comput Biol, 2005. 1(7): p. e68.

9. Burgard, A.P., et al., Flux coupling analysis of genome-scale metabolic

network reconstructions. Genome Res, 2004. 14(2): p. 301-12.

10. Reed, J.L. and B.O. Palsson, Genome-scale in silico models of E. coli have

multiple equivalent phenotypic states: assessment of correlated reaction

subsets that comprise network states. Genome Res, 2004. 14(9): p. 1797-805.

11. Segre, D., D. Vitkup, and G.M. Church, Analysis of optimality in natural and

perturbed metabolic networks. Proc Natl Acad Sci U S A, 2002. 99(23): p.

15112-7.

12. Shlomi, T., O. Berkman, and E. Ruppin, Regulatory on/off minimization of

metabolic flux changes after genetic perturbations. Proc Natl Acad Sci U S

A, 2005. 102(21): p. 7695-700.

13. Ashburner, M., et al., Gene ontology: tool for the unification of biology. The

Gene Ontology Consortium. Nat Genet, 2000. 25(1): p. 25-9.

14. Kanehisa, M., et al., The KEGG resource for deciphering the genome.

Nucleic Acids Res, 2004. 32(Database issue): p. D277-80.

15. Karp, P.D., et al., The MetaCyc Database. Nucleic Acids Res, 2002. 30(1): p.

59-61.

16. Aziz, R.K., et al., SEED servers: high-performance access to the SEED

genomes, annotations, and metabolic models. PLoS One, 2012. 7(10): p.

e48053.

17. Thiele, I. and B.O. Palsson, A protocol for generating a high-quality genome-

scale metabolic reconstruction. Nat Protoc, 2010. 5(1): p. 93-121.

18. Feist, A.M., et al., A genome-scale metabolic reconstruction for Escherichia

coli K-12 MG1655 that accounts for 1260 ORFs and thermodynamic

information. Mol Syst Biol, 2007. 3: p. 121.

19. Forster, J., et al., Genome-scale reconstruction of the Saccharomyces

cerevisiae metabolic network. Genome Res, 2003. 13(2): p. 244-53.

20. Duarte, N.C., et al., Global reconstruction of the human metabolic network

based on genomic and bibliomic data. Proc Natl Acad Sci U S A, 2007.

104(6): p. 1777-82.

21. McCloskey, D., B.O. Palsson, and A.M. Feist, Basic and applied uses of

genome-scale metabolic network reconstructions of Escherichia coli. Mol

Syst Biol, 2013. 9: p. 661.

22. Shlomi, T., M.N. Cabili, and E. Ruppin, Predicting metabolic biomarkers of

human inborn errors of metabolism. Mol Syst Biol, 2009. 5: p. 263.

23. Jerby, L., T. Shlomi, and E. Ruppin, Computational reconstruction of tissue-

specific metabolic models: application to human liver metabolism. Mol Syst

Biol, 2010. 6: p. 401.

24. Zhuang, K., et al., The design of long-term effective uranium bioremediation

strategy using a community metabolic model. Biotechnol Bioeng, 2012.

109(10): p. 2475-83.

25. Henry, C.S., et al., High-throughput generation, optimization and analysis of

genome-scale metabolic models. Nat Biotechnol, 2010. 28(9): p. 977-82.

26. Maglott, D., et al., Entrez Gene: gene-centered information at NCBI. Nucleic

Acids Res, 2011. 39(Database issue): p. D52-7.

27. Stolyar, S., et al., Metabolic modeling of a mutualistic microbial community.

Mol Syst Biol, 2007. 3: p. 92.

28. Wintermute, E.H. and P.A. Silver, Emergent cooperation in microbial

metabolism. Mol Syst Biol, 2010. 6: p. 407.

29. Monod, J., The growth of bacterial cultures. Annu Rev Microbiol, 1949. 3: p.

371-394.

30. Chao, L., B.R. Levin, and F.M. Stewart, Complex Community in a Simple

Habitat - Experimental-Study with Bacteria and Phage. Ecology, 1977.

58(2): p. 369-378.

31. Whiteley, M., et al., Effects of community composition and growth rate on

aquifer biofilm bacteria and their susceptibility to betadine disinfection.

Environ Microbiol, 2001. 3(1): p. 43-52.

32. Maynard Smith, J., Optimization Theory in Evolution. Annual Review of

Ecology and Systematics, 1978. 9: p. 31-56.

33. Yim, H., et al., Metabolic engineering of Escherichia coli for direct

production of 1,4-butanediol. Nat Chem Biol, 2011. 7(7): p. 445-52.

34. Fong, S.S. and B.O. Palsson, Metabolic gene-deletion strains of Escherichia

coli evolve to computationally predicted growth phenotypes. Nat Genet,

2004. 36(10): p. 1056-8.

35. Adadi, R., et al., Prediction of microbial growth rate versus biomass yield by

a metabolic network with kinetic parameters. PLoS Comput Biol, 2012. 8(7):

p. e1002575.

36. Freilich, S., et al., Competitive and cooperative metabolic interactions in

bacterial communities. Nat Commun, 2011. 2: p. 589.

37. Gause, G.F., Experimental studies on the struggle for existence. Journal of

Experimental Biology, 1932. 9: p. 389-402.

38. Darwin, C., On the origin of species by means of natural selection, or, The

preservation of favoured races in the struggle for life. Dover giant thrift ed.

2006, Mineola, NY: Dover Publications.

39. Boer, P.J.d., The present status of the competitive exclusion principle. Trends

in Ecology & Evolution, 1986. 1(1): p. 25-28.

40. Marx, C.J., Microbiology. Getting in touch with your friends. Science, 2009.

324(5931): p. 1150-1.

41. Fuhrman, J.A., Microbial community structure and its functional

implications. Nature, 2009. 459(7244): p. 193-9.

42. Lotem, A., M.A. Fishman, and L. Stone, Evolution of cooperation between

individuals. Nature, 1999. 400(6741): p. 226-7.

43. Hibbing, M.E., et al., Bacterial competition: surviving and thriving in the

microbial jungle. Nat Rev Microbiol, 2010. 8(1): p. 15-25.

44. Gross, K., Positive interactions among competitors can produce species-rich

communities. Ecol Lett, 2008. 11(9): p. 929-36.

45. Diamond, J.M. and M.E. Gilpin, Examination of the “null” model of Connor

and Simberloff for species co-occurrences on islands. Oecologia, 1982. 52(1):

p. 64-74.

46. Connor, E.F. and D. Simberloff, The assembly of species communities: Chance or

competition? Ecology, 1979. 60: p. 1132-1140.

47. Klitgord, N. and D. Segre, Environments that induce synthetic microbial

ecosystems. PLoS Comput Biol, 2010. 6(11): p. e1001002.

48. Ebenhoh, O. and T. Handorf, Functional classification of genome-scale

metabolic networks. EURASIP J Bioinform Syst Biol, 2009: p. 570456.

49. Freilich, S., et al., Decoupling Environment-Dependent and Independent

Genetic Robustness across Bacterial Species. PLoS Comput Biol, 2010. 6(2):

p. e1000690.

50. Freilich, S., et al., Metabolic-network-driven analysis of bacterial ecological

strategies. Genome Biol, 2009. 10(6): p. R61.

51. Gerhardson, B., Biological substitutes for pesticides. Trends Biotechnol,

2002. 20(8): p. 338-43.

52. Mani, R., et al., Defining genetic interaction. Proc Natl Acad Sci U S A,

2008. 105(9): p. 3461-6.

53. Klitgord, N. and D. Segre, Ecosystems biology of microbial metabolism. Curr

Opin Biotechnol, 2011: p. 541-546.

54. MacArthur, R., Species packing and competitive equilibrium for many

species. Theor Popul Biol, 1970. 1(1): p. 1-11.

55. Tilman, D., Resource competition between planktonic algae: experimental

and theoretical approach. Ecology, 1977. 58: p. 338–348.

56. Cherif, M. and M. Loreau, Stoichiometric constraints on resource use,

competitive interactions, and elemental cycling in microbial decomposers.

Am Nat, 2007. 169(6): p. 709-24.

57. Orr, H.A., Fitness and its role in evolutionary genetics. Nat Rev Genet, 2009.

10(8): p. 531-9.

58. Elevi Bardavid, R. and A. Oren, Dihydroxyacetone metabolism in

Salinibacter ruber and in Haloquadratum walsbyi. Extremophiles, 2008.

12(1): p. 125-31.

59. Mowery, D.C., J.E. Oxley, and B.S. Silverman, Technological overlap and interfirm

cooperation: implications for the resource-based view of the firm. Research Policy,

1998. 27(5): p. 507-523.

60. Schink, B., Synergistic interactions in the microbial world. Antonie Van

Leeuwenhoek, 2002. 81(1-4): p. 257-61.

61. Wintermute, E.H. and P.A. Silver, Dynamics in the mixed microbial

concourse. Genes Dev, 2010. 24(23): p. 2603-14.

62. Labrenz, M. and J.F. Banfield, Sulfate-reducing bacteria-dominated biofilms

that precipitate ZnS in a subsurface circumneutral-pH mine drainage system.

Microb Ecol, 2004. 47(3): p. 205-17.

63. Kato, S., et al., Stable coexistence of five bacterial strains as a cellulose-

degrading community. Appl Environ Microbiol, 2005. 71(11): p. 7099-106.

64. Chaffron, S., et al., A global network of coexisting microbes from

environmental and whole-genome sequence data. Genome Res, 2010.

65. Bell, T., et al., The contribution of species richness and composition to

bacterial services. Nature, 2005. 436(7054): p. 1157-60.

66. Gomes, N.C., et al., Effects of the inoculant strain Pseudomonas putida

KT2442 (pNF142) and of naphthalene contamination on the soil bacterial

community. FEMS Microbiol Ecol, 2005. 54(1): p. 21-33.

67. Hansen, S.K., et al., Evolution of species interactions in a biofilm community.

Nature, 2007. 445(7127): p. 533-6.

68. Wandersman, C. and P. Delepelaire, Bacterial iron sources: from

siderophores to hemophores. Annu Rev Microbiol, 2004. 58: p. 611-47.

69. Ley, R.E., et al., Microbial ecology: human gut microbes associated with

obesity. Nature, 2006. 444(7122): p. 1022-3.

70. Hirschman, L., et al., Habitat-Lite: a GSC case study based on free text terms

for environmental metadata. OMICS, 2008. 12(2): p. 129-36.

71. Ogata, H., et al., KEGG: Kyoto Encyclopedia of Genes and Genomes.

Nucleic Acids Res, 1999. 27(1): p. 29-34.

72. Altschul, S.F., et al., Gapped BLAST and PSI-BLAST: a new generation of

protein database search programs. Nucleic Acids Res, 1997. 25(17): p.

3389-402.

73. Oberhardt, M.A., B.O. Palsson, and J.A. Papin, Applications of genome-scale

metabolic reconstructions. Mol Syst Biol, 2009. 5: p. 320.

74. Oh, Y.K., et al., Genome-scale reconstruction of metabolic network in

Bacillus subtilis based on high-throughput phenotyping and gene essentiality

data. J Biol Chem, 2007.

75. Schuster, S., T. Pfeiffer, and D.A. Fell, Is maximization of molar yield in

metabolic networks favoured by evolution? J Theor Biol, 2008. 252(3): p.

497-504.

76. Knorr, A.L., R. Jain, and R. Srivastava, Bayesian-based selection of

metabolic objective functions. Bioinformatics, 2006.

77. Schuetz, R., L. Kuepfer, and U. Sauer, Systematic evaluation of objective

functions for predicting intracellular fluxes in Escherichia coli. Mol Syst

Biol, 2007. 3: p. 119.

78. Vieira-Silva, S. and E.P. Rocha, The systemic imprint of growth and its uses

in ecological (meta)genomics. PLoS Genet, 2010. 6(1): p. e1000808.

79. Groeneveld, P., A.H. Stouthamer, and H.V. Westerhoff, Super life--how and

why 'cell selection' leads to the fastest-growing eukaryote. FEBS J, 2009.

276(1): p. 254-70.

80. Schuetz, R., et al., Multidimensional optimality of microbial metabolism.

Science, 2012. 336(6081): p. 601-4.

81. Vazquez, A., et al., Impact of the solvent capacity constraint on E. coli

metabolism. BMC Syst Biol, 2008. 2: p. 7.

82. Soga, N., et al., Kinetic equivalence of transmembrane pH and electrical

potential differences in ATP synthesis. J Biol Chem, 2012. 287(12): p. 9633-

9.

83. Fischer, S. and P. Graber, Comparison of DeltapH- and Deltaφ-driven

ATP synthesis catalyzed by the H(+)-ATPases from Escherichia coli

or chloroplasts reconstituted into liposomes. FEBS Lett, 1999. 457(3): p.

327-32.

84. Brooijmans, R., et al., Heme and menaquinone induced electron transport in

lactic acid bacteria. Microb Cell Fact, 2009. 8: p. 28.

85. Liu, M., et al., Global transcriptional programs reveal a carbon source

foraging strategy by Escherichia coli. J Biol Chem, 2005. 280(16): p. 15921-

7.

86. Discovery_Sciences. Available from:

http://dtp.nci.nih.gov/docs/misc/common_files/cell_list.html.

87. Lee, J.K., et al., A strategy for predicting the chemosensitivity of human

cancers and its application to drug discovery. Proc Natl Acad Sci U S A,

2007. 104(32): p. 13086-91.

88. Lozupone, C., et al., Diversity, stability and resilience of the human gut

microbiota. Nature, 2012. 489(7415): p. 220-230.

89. O'Keefe, S.J., Nutrition and colonic health: the critical role of the

microbiota. Curr Opin Gastroenterol, 2008. 24(1): p. 51-8.

90. Goodman, A. and J. Gordon, Our unindicted coconspirators: human

metabolism from a microbial perspective. Cell metabolism, 2010. 12(2): p.

111-116.

91. Holmes, E., et al., Gut microbiota composition and activity in relation to host

metabolic phenotype and disease risk. Cell metabolism, 2012. 16(5): p. 559-

564.

92. Tremaroli, V. and F. Bäckhed, Functional interactions between the gut

microbiota and host metabolism. Nature, 2012. 489(7415): p. 242-249.

93. Nicholson, J., et al., Host-gut microbiota metabolic interactions. Science

(New York, N.Y.), 2012. 336(6086): p. 1262-1267.

94. Clemente, J., et al., The impact of the gut microbiota on human health: an

integrative view. Cell, 2012. 148(6): p. 1258-1270.

95. Holmes, E., et al., Therapeutic modulation of microbiota-host metabolic

interactions. Science translational medicine, 2012. 4(137).

96. Dirk, G., et al., Bioinformatics for the Human Microbiome Project. PLoS

Computational Biology, 2012. 8.

97. Macfarlane, G. and S. Macfarlane, Models for intestinal fermentation:

association between food components, delivery systems, bioavailability and

functional interactions in the gut. Current opinion in biotechnology, 2007.

18(2): p. 156-162.

98. Van den Abbeele, P., et al., Microbial community development in a dynamic

gut model is reproducible, colon region specific, and selective for

Bacteroidetes and Clostridium cluster IX. Applied and environmental

microbiology, 2010. 76(15): p. 5237-5246.

99. Elia, M. and J. Cummings, Physiological aspects of energy metabolism and

gastrointestinal effects of carbohydrates. European journal of clinical

nutrition, 2007. 61 Suppl 1: p. 74.

100. Koropatkin, N., E. Cameron, and E. Martens, How glycan metabolism shapes

the human gut microbiota. Nature reviews. Microbiology, 2012. 10(5): p.

323-335.

101. Mahowald, M., et al., Characterizing a model human gut microbiota

composed of members of its two dominant bacterial phyla. Proceedings of the

National Academy of Sciences of the United States of America, 2009.

106(14): p. 5859-5864.

102. Cantarel, B., V. Lombard, and B. Henrissat, Complex carbohydrate

utilization by the healthy human microbiome. PloS one, 2012. 7(6).

103. Flint, H., et al., Interactions and competition within the microbial community

of the human colon: links between diet and health. Environmental

microbiology, 2007. 9(5): p. 1101-1111.

104. Barcenilla, A., et al., Phylogenetic relationships of butyrate-producing

bacteria from the human gut. Applied and environmental microbiology,

2000. 66(4): p. 1654-1661.

105. Macfarlane, G. and S. Macfarlane, Fermentation in the human large

intestine: its physiologic consequences and the potential contribution of

prebiotics. Journal of clinical gastroenterology, 2011. 45 Suppl: p. 7.

106. Willem, F.B., et al., Prebiotic and Other Health-Related Effects of Cereal-

Derived Arabinoxylans, Arabinoxylan-Oligosaccharides, and

Xylooligosaccharides. Critical Reviews in Food Science and Nutrition, 2011.

51.

107. Sonnenburg, J. and M. Fischbach, Community health care: therapeutic

opportunities in the human microbiome. Science translational medicine,

2011. 3(78).

108. Borenstein, E., Computational systems biology and in silico modeling of the

human microbiome. Briefings in bioinformatics, 2012. 13(6): p. 769-780.

109. Hashimoto, K., et al., KEGG as a glycome informatics resource.

Glycobiology, 2006. 16(5).

110. Altschul, S.F., et al., Basic local alignment search tool. J Mol Biol, 1990.

215(3): p. 403-10.

111. Cantarel, B., et al., The Carbohydrate-Active EnZymes database (CAZy): an

expert resource for Glycogenomics. Nucleic acids research, 2009.

37(Database issue): p. 8.

112. Aziz, R., et al., SEED servers: high-performance access to the SEED

genomes, annotations, and metabolic models. PloS one, 2012. 7(10).

113. Peterson, J., et al., The NIH Human Microbiome Project. Genome Research,

2009. 19.

114. Marcobal, A., et al., Bacteroides in the infant gut consume milk

oligosaccharides via mucus-utilization pathways. Cell host & microbe, 2011.

10(5): p. 507-514.

115. Sonnenburg, J.L., Glycan Foraging in Vivo by an Intestine-Adapted Bacterial

Symbiont. Science, 2005. 307.

116. Sonnenburg, E., et al., Specificity of polysaccharide use in intestinal

bacteroides species determines diet-induced microbiota alterations. Cell,

2010. 141(7): p. 1241-1252.

117. Marcobal, A. and J. Sonnenburg, Human milk oligosaccharide consumption

by intestinal microbiota. Clinical microbiology and infection : the official

publication of the European Society of Clinical Microbiology and Infectious

Diseases, 2012. 18 Suppl 4: p. 12-15.

118. Muegge, B.D., et al., Diet Drives Convergence in Gut Microbiome Functions

Across Mammalian Phylogeny and Within Humans. Science, 2011. 332.

119. Muegge, B., et al., Diet drives convergence in gut microbiome functions

across mammalian phylogeny and within humans. Science (New York, N.Y.),

2011. 332(6032): p. 970-974.

120. Ley, R.E., et al., Evolution of Mammals and Their Gut Microbes. Science,

2008. 320.

121. Yatsunenko, T., et al., Human gut microbiome viewed across age and

geography. Nature, 2012. 486(7402): p. 222-227.

122. Judith, R.K. and D.W. Gary, The gut microbiota, environment and diseases of

modern society. Gut Microbes, 2012. 3.

123. Wu, G., et al., Linking long-term dietary patterns with gut microbial

enterotypes. Science (New York, N.Y.), 2011. 334(6052): p. 105-108.

124. Filippo, C.D., et al., Impact of diet in shaping gut microbiota revealed by a

comparative study in children from Europe and rural Africa. Proceedings of

the National Academy of Sciences, 2010. 107.

125. Ng, K.M., et al., Microbiota-liberated host sugars facilitate post-antibiotic

expansion of enteric pathogens. Nature, 2013.

126. FAOSTAT, F., Agriculture Organization of the United Nations. Statistical

Database, 2009.

127. Langille, M.G., et al., Predictive functional profiling of microbial

communities using 16S rRNA marker gene sequences. Nat Biotechnol, 2013.

128. Henry, C., et al., High-throughput generation, optimization and analysis of

genome-scale metabolic models. Nature biotechnology, 2010. 28(9): p. 977-

982.

129. Freilich, S., et al., Competitive and cooperative metabolic interactions in

bacterial communities. Nature communications, 2011. 2: p. 589.

130. Thiele, I., A. Heinken, and R. Fleming, A systems biology approach to

studying the role of microbes in human health. Current opinion in

biotechnology, 2013. 24(1): p. 4-12.

131. Schellenberger, J., et al., Quantitative prediction of cellular metabolism with

constraint-based models: the COBRA Toolbox v2.0. Nature protocols, 2011.

6(9): p. 1290-1307.

132. Doubet, S. and P. Albersheim, CarbBank. Glycobiology, 1992. 2(6): p. 505.

133. Markowitz, V., et al., IMG/M-HMP: a metagenome comparative analysis

system for the Human Microbiome Project. PloS one, 2012. 7(7).

134. Caporaso, J., et al., QIIME allows analysis of high-throughput community

sequencing data. Nature …, 2010.

135. Glass, E.M., et al., Using the metagenomics RAST server (MG-RAST) for

analyzing shotgun metagenomes. Cold Spring Harb Protoc, 2010. 2010(1): p.

pdb prot5368.

136. Hall, M., et al., The WEKA data mining software: an update. ACM SIGKDD

Explorations Newsletter, 2009. 11(1): p. 10-18.

137. Mahadevan, R., J.S. Edwards, and F.J. Doyle, 3rd, Dynamic flux balance

analysis of diauxic growth in Escherichia coli. Biophys J, 2002. 83(3): p.

1331-40.

138. Wolfram, S., A new kind of science. 2002, Champaign, IL: Wolfram Media.

xiv, 1197 p.

139. Madan Babu, M., S.A. Teichmann, and L. Aravind, Evolutionary dynamics of

prokaryotic transcriptional regulatory networks. J Mol Biol, 2006. 358(2): p.

614-33.

140. Parter, M., N. Kashtan, and U. Alon, Environmental variability and

modularity of bacterial metabolic networks. BMC Evol Biol, 2007. 7: p. 169.

141. Edwards, J.S., R. Ramakrishna, and B.O. Palsson, Characterizing the

metabolic phenotype: a phenotype phase plane analysis. Biotechnol Bioeng,

2002. 77(1): p. 27-36.

142. Varma, A., B.W. Boesch, and B.O. Palsson, Biochemical production

capabilities of Escherichia coli. Biotechnol Bioeng, 1993. 42(1): p. 59-73.

143. Varma, A. and B.O. Palsson, Stoichiometric flux balance models

quantitatively predict growth and metabolic by-product secretion in wild-type

Escherichia coli W3110. Appl Environ Microbiol, 1994. 60(10): p. 3724-31.

144. Thain, D., T. Tannenbaum, and M. Livny, Distributed computing in practice: the Condor

experience. Concurrency - Practice and Experience, 2005. 17(2-4): p. 323-

356.

145. Burgard, A.P., S. Vaidyaraman, and C.D. Maranas, Minimal reaction sets for

Escherichia coli metabolism under different growth requirements and uptake

environments. Biotechnol Prog, 2001. 17(5): p. 791-7.

146. Suthers, P.F., et al., A genome-scale metabolic reconstruction of Mycoplasma

genitalium, iPS189. PLoS Comput Biol, 2009. 5(2): p. e1000285.

147. Phan-Thanh, L. and T. Gormon, A chemically defined minimal medium for

the optimal culture of Listeria. Int J Food Microbiol, 1997. 35(1): p. 91-5.

148. Maglott, D., et al., Entrez Gene: gene-centered information at NCBI. Nucleic

Acids Res, 2005. 33(Database issue): p. D54-8.

149. Oberhardt, M.A., A.K. Chavali, and J.A. Papin, Flux balance analysis:

interrogating genome-scale metabolic networks. Methods Mol Biol, 2009.

500: p. 61-80.

150. Zivkovic, A., et al., Human milk glycobiome and its impact on the infant

gastrointestinal microbiota. Proceedings of the National Academy of

Sciences of the United States of America, 2011. 108 Suppl 1: p. 4653-4658.

The Raymond and Beverly Sackler Faculty of Exact Sciences

The Blavatnik School of Computer Science

Metabolic modeling of a bacterial community reveals new insights

about the relationships within the community and within the organisms themselves

Thesis submitted for the degree of "Doctor of Philosophy"

by

Raphy Zarecki

This work was carried out under the supervision of

Professor Eytan Ruppin

Submitted to the Senate of Tel Aviv University

November 2013

Abstract

Genome-scale metabolic models (GSMMs) have long proven their value in predicting the behavior of organisms from genomic knowledge alone. Despite the many applications in which metabolic models are used, the pace of building metabolic models has not kept up with the pace of genome sequencing. This fact, together with the lack of uniform identifiers for describing metabolites and reactions in manually built metabolic models, has so far restricted computational metabolic research mainly to the study of organisms as isolated entities. However, it is known that prokaryotes in general, and bacteria in particular, live in diverse communities, and that the interactions between community members strongly affect the functionality of both the individuals and the community. While currently available metabolic models are suitable for predicting the behavior of homogeneous populations consisting of a single type of organism, most systems in nature require metabolic modeling of a larger number of organisms in order to understand the mechanism by which they operate.

Semi-automatic generators of metabolic models, which have appeared recently, together with thousands of sequenced prokaryotes, now make it possible to build models of bacterial communities. Although the models produced by these generators do not match the quality of some manually built models, their existence allows us to answer questions that could not be answered before. The basic task of comparing a large number of organisms on the basis of their metabolic models, which was almost impossible in the era of manually built models, has now become simple, and building a model of a bacterial community of any size has become feasible.

The research presented in this thesis harnesses the metabolic models produced by these generators to address several basic questions related to modeling bacterial communities. The work includes the research and development of tools for predicting metabolic interactions between different types of bacteria in different growth environments, a study of general properties of prokaryotes that can be revealed only by comparing the metabolic behavior of a large number of organisms, and a study of the interaction between bacterial communities and the organisms that host them. This work demonstrates the enormous potential of metabolic modeling of bacterial communities for advancing the basic understanding of biological processes and for designing biotechnological systems. Specifically, the work includes: the development of a computational tool for modeling bacterial communities based on the metabolic models of their members, in order to predict competitive and cooperative relationships between members in different growth environments, together with experimental and ecological validation of the tool; a new and computationally simple method for predicting the growth rate of bacteria based on their metabolic models; and a new algorithm for predicting the pattern of glycan degradation in bacterial communities residing in the digestive systems of mammals.

General Summary

Background

For a long time, computers were used in biology mainly to process biological data that had been collected and analyzed by biologists. The emergence of the interdisciplinary science of systems biology, which examines biological systems as a whole and tries to decipher their basic principles in order to understand and predict their behavior, significantly changed the relationship between computer scientists and biologists. Within systems biology, computer scientists working together with biologists developed complex computational models that attempted to simulate biological phenomena at different levels of detail, and biologists were asked to validate the predictions these models produced. The construction of computational models, which is a holistic approach, is what distinguishes systems biology from the more conventional approaches of molecular biology, which were more reductionist and accounted for most of the biological research carried out in the second half of the last century [1].

Metabolic network analysis

This work belongs to the field of systems biology and focuses specifically on modeling metabolic networks in prokaryotes by building models that describe the entire metabolic network of the organism under study. The method for building metabolic models has been well defined for some time, and a machine-readable language for representing the models, called SBML [2], has been developed for it. Tools have also been developed for measuring the fluxes of metabolic reactions and the concentrations of metabolites, proteins and genes, which together are used to validate the models developed [1].

Understanding metabolic processes in living systems has great economic potential for industry in general and for biomedicine and bioremediation in particular. In industry, metabolic models are used for engineering food and fuels and for producing amino acids (as food supplements). In medicine, metabolism plays a central role in a large number of diseases. Today diabetes and obesity, both metabolism-related diseases, are a significant cause of mortality and morbidity.

Biological background

The term cellular metabolism refers to the set of chemical transformations from substrates to products that take place inside the cell. Most chemical reactions in the cell are carried out by specialized proteins called enzymes. Together, the reactions in the cell form a complex metabolic network containing thousands of reactions.

Mathematical modeling of cellular metabolism

Mathematical modeling of metabolism can be done in different ways.

Kinetic models

Kinetic models are usually represented by a set of differential equations that compute the time derivative of metabolite concentrations given the rates of the chemical reactions being modeled. These models require knowledge of many parameters concerning the concentrations and rates of the modeled reactions, and are suitable mainly for analyzing a small number of reactions.
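As a minimal illustration of the kinetic approach described above, the following Python sketch integrates a single hypothetical reaction A -> B with mass-action kinetics. The rate constant, the initial concentrations, the forward Euler scheme and the function name are all assumptions chosen for illustration; they are not taken from this work.

```python
import math


def simulate_first_order(a0, b0, k, dt, steps):
    """Integrate d[A]/dt = -k[A], d[B]/dt = +k[A] with forward Euler.

    Returns a list of (time, [A], [B]) tuples, one per time point.
    """
    a, b = a0, b0
    trajectory = [(0.0, a, b)]
    for step in range(1, steps + 1):
        rate = k * a  # mass-action rate of the reaction A -> B
        a -= rate * dt  # A is consumed
        b += rate * dt  # B is produced in a 1:1 ratio
        trajectory.append((step * dt, a, b))
    return trajectory


if __name__ == "__main__":
    # Hypothetical parameters: [A]0 = 1, [B]0 = 0, k = 0.5, dt = 0.1
    for t, a, b in simulate_first_order(1.0, 0.0, 0.5, 0.1, 50)[::10]:
        exact = math.exp(-0.5 * t)  # analytic solution for [A]
        print(f"t={t:.1f}  [A]={a:.4f} (exact {exact:.4f})  [B]={b:.4f}")
```

Even this one-reaction model needs a rate constant and initial concentrations, which hints at why the parameter requirements become prohibitive for networks with thousands of reactions.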

Constraint-based models (constraint-based modeling, CBM)

When modeling metabolic networks that contain a large number of reactions, kinetic models become impractical and a constraint-based approach is required. The premise of CBM is that biological cells are subject to physical constraints that limit their behavior [3]. By enforcing these physical constraints we can identify, in a large metabolic network, which solutions are feasible and which are not. CBM is now widely and successfully used to predict gene essentiality and the rates at which cells take up and secrete metabolites. In CBM the metabolic model is represented by a matrix:

$S_{m \times n} \in \mathbb{R}^{m \times n}$

where n is the number of reactions in the model and m is the number of metabolites. Column j of the matrix represents the linear equation of reaction j (for example, a reaction that consumes metabolite A and produces metabolite B at a 1:1 ratio contains -1 and +1, respectively, in the rows representing these metabolites). The entry $S_{i,j}$ is the stoichiometric coefficient of metabolite i in reaction j.
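The matrix representation described above can be sketched in a few lines of Python. The helper name `build_stoichiometric_matrix` and the dictionary-based reaction format are hypothetical conventions chosen for this illustration; the example reproduces the A -> B reaction from the text.

```python
def build_stoichiometric_matrix(metabolites, reactions):
    """Build S with one row per metabolite (m rows) and one column per reaction (n columns).

    `reactions` is a list of dicts mapping a metabolite name to its
    stoichiometric coefficient: negative if consumed, positive if produced.
    """
    row_of = {name: i for i, name in enumerate(metabolites)}
    S = [[0.0] * len(reactions) for _ in metabolites]
    for j, stoichiometry in enumerate(reactions):
        for name, coefficient in stoichiometry.items():
            S[row_of[name]][j] = coefficient  # entry S[i][j]
    return S


# The reaction from the text: consumes A and produces B at a 1:1 ratio.
metabolites = ["A", "B"]
reactions = [{"A": -1, "B": +1}]
S = build_stoichiometric_matrix(metabolites, reactions)
print(S)  # column 0 holds -1 in row A and +1 in row B
```

A list of lists is used only to keep the sketch dependency-free; genome-scale models have thousands of mostly zero entries and would normally be stored in a sparse matrix format.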

ניתן לחזות ערכים , באמצעות תיכנות לינארי אשר מופעל על הייצוג המטריציוני של המודל המטבולי

כעת אילוצים ביולוגיים . אפשריים של שטפים העוברים בראקציות המגדירות את הרשת המטבולית

5

5

בין האילוצים הקיימים . אשר שמים על המטריצה מאפשרים לצמצם את מרחב הפתרונות האפשריים

: ניתן למצא

הינו V הינה המטריצה ו S כאשר S · V = 0אילוץ המונע הצטברות של מטבוליטים בתוך התא -

וקטור השטפים של הראקציות במטריצה

דינמים המכתיבים כיוון לראקציות השונות וערכי מקסימום לשטפים -אילוצים תרמו-

מדיה מינימלית לגידול האורגניזם -
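The constraints listed above can be illustrated with a small feasibility check in Python. This sketch does not solve a linear program; it only tests whether a candidate flux vector satisfies the steady-state condition and the flux bounds. The three-reaction toy network, the bound values and the function names are assumptions made for illustration.

```python
def mat_vec(S, v):
    """Multiply matrix S (list of rows) by the flux vector v."""
    return [sum(s_ij * v_j for s_ij, v_j in zip(row, v)) for row in S]


def is_feasible(S, v, lower, upper, tol=1e-9):
    """Check the steady-state constraint S.v = 0 and the bounds lower <= v <= upper."""
    steady_state = all(abs(x) <= tol for x in mat_vec(S, v))
    within_bounds = all(
        lo - tol <= flux <= hi + tol for flux, lo, hi in zip(v, lower, upper)
    )
    return steady_state and within_bounds


# Toy network (hypothetical): R1: -> A (uptake), R2: A -> B, R3: B -> (export)
S = [
    [1, -1, 0],  # metabolite A
    [0, 1, -1],  # metabolite B
]
lower = [0, 0, 0]        # all reactions irreversible (directionality constraint)
upper = [10, 1000, 1000]  # uptake limited to 10 by the growth medium

print(is_feasible(S, [5, 5, 5], lower, upper))     # balanced and within bounds
print(is_feasible(S, [5, 3, 3], lower, upper))     # A would accumulate
print(is_feasible(S, [20, 20, 20], lower, upper))  # exceeds the uptake bound
```

In this linear chain the steady-state condition forces all three fluxes to be equal, so the largest feasible export flux is set by the tightest upper bound, here the uptake limit of 10. For realistic networks the optimum is found with a linear programming solver rather than by inspection.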

The process of building metabolic models

The construction of a metabolic model draws on several sources of information: genomic, biochemical and physiological. The process involves defining biochemical functions that represent reactions (the GO database contains more than 20,000 of these [4]) and assigning them to enzymes (proteins) encoded by genes. The gene-protein-reaction (GPR) triplet is the building block of metabolic models, and the set of triplets representing the metabolic genes of an organism forms the core of its metabolic model. Well-known GPR databases include KEGG [5], MetaCyc [6] and The Seed [7]. A well-defined iterative protocol exists for building metabolic models [8]. Until a few years ago, metabolic models were built only manually, a lengthy task that took about 1-2 years.

Metabolic models of bacteria

Prokaryotes are unicellular organisms that lack a nucleus; they consist mainly of the bacteria and the archaea. They play a central role in the global ecosystem. They were the first organisms to arise, and even today their biomass exceeds that of plants and animals combined. The human body contains ten times more bacterial cells than cells of its own. Bacteria are critical for nutrient cycling. Although some bacteria are pathogenic, most can be useful in many applications, from producing food and fuels to helping remove toxins and degrade harmful substances. Bacteria live in large and diverse communities. Understanding the relationships among the bacteria that make up these communities, and between them and the organism that hosts them (for example, the gut bacteria in the mammalian digestive system), is critical for understanding their biology and that of the host and for finding effective ways to build bacterial communities for dedicated purposes.

From a research perspective, prokaryotes are relatively simple compared with eukaryotes, and they have therefore been studied in depth at both the phenotype and the genotype level. The broad knowledge accumulated about them led to the development of a metabolic model generator for them, on which this work is based. The leading generator, called Model SEED [9], is a Web-based application that builds models automatically. It implements many of the steps required to build a metabolic model [8]. With it, a metabolic model of a bacterium can now be produced in less than a day.

Large-scale metabolic modeling of bacteria

Most research in the field of metabolic models today is done at the level of the single organism. Now, with the appearance of model generators for prokaryotes and the great interest in metagenomic information, the time seems right to start studying the behavior of bacterial communities with the help of metabolic models. This work did exactly that.

Metabolic modeling of prokaryotic communities

As noted, most research on prokaryotic metabolic models today is done at the level of the single organism, and it assumes that the different species live as isolated entities that do not communicate with other entities. In reality, prokaryotes live in dense and diverse communities whose members interact strongly. These relationships, together with the interaction with the environment in which the community operates, affect the functionality, survival and capabilities of the community as a whole. The community itself can be viewed as a multicellular entity with capabilities and goals of its own.

While models of specific organisms are sufficient for predicting how the organisms behave in pure culture, predictions about communities require other models, ones that can simulate the interactions among the different organisms and between the organisms and the environment. Several basic attempts to build such models have been made in the past [10, 11], but that research dealt mainly with specific pairs of organisms, and no large-scale study of the interactions among different types of prokaryotes was carried out. The work in Chapter 2 presents a computational platform and methods for building community models of any size, in which the building blocks of the community models are the automatic models built by the SEED metabolic model generator.
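
One plausible way to compose single-species models into a community model is sketched below. The summary above does not spell out the construction, so the shared-environment scheme, the `merge_models` helper and all species, reaction and metabolite names are assumptions made for illustration.

```python
def merge_models(models, shared):
    """Combine species models into one community reaction set.

    models: {species: {reaction_id: {metabolite: coefficient}}}
    shared: set of metabolite names that live in the common environment.
    Reactions and internal metabolites get a species prefix; shared
    metabolites keep their name, so species can exchange them.
    """
    community = {}
    for species, reactions in models.items():
        for rid, coeffs in reactions.items():
            community[f"{species}__{rid}"] = {
                (met if met in shared else f"{species}__{met}"): coeff
                for met, coeff in coeffs.items()
            }
    return community


# Hypothetical toy models: sp1 secretes acetate, sp2 can take it up.
models = {
    "sp1": {
        "uptake": {"glc_e": -1, "glc": 1},
        "secrete": {"ac": -1, "ac_e": 1},
    },
    "sp2": {
        "uptake": {"ac_e": -1, "ac": 1},
    },
}

community = merge_models(models, shared={"glc_e", "ac_e"})
for rid, coeffs in community.items():
    print(rid, coeffs)
```

A uniform naming convention for metabolites across the species models is what makes this kind of merge possible: the shared metabolite `ac_e` links the two species only because both models call it by the same name.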


Summary of the chapters in this work

Competition and cooperation in bacterial communities

Based on the paper:

Competitive and cooperative metabolic interactions in bacterial communities

Shiri Freilich, Raphy Zarecki, Omer Eilam, Ella Shtifman Segal, Christopher S. Henry, Martin Kupiec, Uri Gophna, Roded Sharan & Eytan Ruppin

In this paper the first two authors contributed equally.

The paper was published in Nature Communications on 13 December 2011 [12].

Prokaryotes are known to live and thrive in dense and diverse communities, and the relationships among community members, as well as the interaction of the prokaryotes with the external environment, substantially affect the capabilities of the community. To this day it is hard to predict whether a given bacterium can coexist or cooperate metabolically with other bacteria in the same community. This makes it very difficult to design bacterial communities artificially for biotechnological and medical purposes.

In Chapter 2 we focus on the metabolic relationships between different species of bacteria. We propose a computational framework that relies exclusively on the metabolic dimension, builds a metabolic model of a bacterial community, and uses this model to predict whether different bacteria can cooperate or whether they compete with each other in different growth environments. The method is generic and can model any number of different bacteria living together.

To validate the tool's predictions, we computationally examined the relationships between pairs among 118 bacteria whose models were built by a metabolic model generator [9] and whose quality was measured and published. The results were compared against ecological metagenomic data. They showed a strong correlation between the ability of bacteria to cooperate with one another and the presence of those bacteria in the same ecological environments. The tool was also used to predict the outcome of competition between known bacteria in a given growth environment, and to predict growth environments in which particular bacteria cooperate and others in which they compete. Laboratory experiments were then performed to verify the results. These experiments showed a high, though not perfect, agreement between the predicted and the experimental results.

The methods and the tool that was developed have limitations that reduce their accuracy. The tool uses only the metabolic dimension of the community and ignores regulation, which can affect the metabolic function of the bacterium. It also ignores strategies that many microorganisms have developed to win the competition with other microorganisms, such as toxins. The tool relies on metabolic models produced by a model generator. These models are still not as accurate as some of the manually developed models, and this too reduces the accuracy of the predictions. Even with all the limitations noted above, we see a clear signal in the results that the system predicts.

The quality of the tool's results is expected to improve with the ongoing improvements in the quality of the metabolic models produced by model generators, and with better knowledge of the abiotic structure of the different growth environments. We expect the tool to provide an important foundation for designing dedicated bacterial communities: for biotechnological purposes such as producing materials; for medical purposes, where bacteria could be found that compete against pathogenic bacteria and make it hard for them to thrive; and for detoxification in different ecological environments. We see community engineering as a partial alternative to genetic engineering at the level of the single bacterium, which is widely used today and which has many limitations when one tries to make many changes to the genomic structure of a given bacterium.


Predicting the maximal entropy of oxygen-consuming microorganisms in a given environment provides a good measure of their growth rate


Maximal Sum of metabolic exchange fluxes outperforms biomass yield as a predictor of growth rate of microorganisms

Raphy Zarecki, Matthew A. Oberhardt, Keren Yizhak, Allon Wagner, Ella Shtifman Segal, Shiri Freilich, Christopher S. Henry, Uri Gophna and Eytan Ruppin


The paper was submitted for publication to PLoS Computational Biology and was also presented at the conference:

Predicting cell metabolism and phenotypes (CA, USA 4-6/3/2013)

The paper is an example of using a large number of metabolic models to identify hidden properties of organisms.

In Chapter 3 we try to explain and predict the growth rate of organisms on the basis of a comparison between their metabolic models. Understanding the principles that determine the growth rate of cells is a very important issue with many implications in medicine and biotechnology. We argue that prokaryotes exploit a large part of the entropic potential stored in the food they take up for growth, and that computing the one makes it possible to predict the other. As measures for predicting growth rate we present formulas based on the laws of thermodynamics, and a simpler formula called SUMEX (SUM of Exchanges), which maximizes the difference between the fluxes of the metabolites that the organism secretes and those of the metabolites it takes up.

The significant advantage of these formulas is that they do not use empirical data (most of which are unknown) on the flux rates of the reactions, and the simpler formula also does not need data on the Gibbs free energy values of the different metabolites.
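
The quantity that SUMEX maximizes, as described above, can be evaluated for a single flux distribution with the sketch below. The sign convention (positive for secretion, negative for uptake), the metabolite names and the flux values are all assumptions made for the example. The sketch only scores one given set of exchange fluxes; maximizing the score over all feasible flux distributions would require a linear programming solver and is not shown.

```python
def sumex_score(exchange_fluxes):
    """Total secreted flux minus total taken-up flux.

    Assumed convention: positive values are secretion, negative values are uptake.
    """
    secreted = sum(v for v in exchange_fluxes.values() if v > 0)
    taken_up = sum(-v for v in exchange_fluxes.values() if v < 0)
    return secreted - taken_up


# Hypothetical exchange fluxes for an aerobic organism (arbitrary units).
fluxes = {
    "glucose": -10.0,
    "oxygen": -15.0,
    "co2": 20.0,
    "water": 18.0,
    "acetate": 4.0,
}

print(sumex_score(fluxes))  # 17.0
```

Note that the score needs only the exchange fluxes, not measured internal reaction rates or Gibbs free energy values, which matches the advantage stated in the text.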


From an analysis of biological data on the growth rate of a large number of bacteria and cancer cells, and from experiments we performed in the laboratory, we showed that the method we propose outperformed all the computational methods based on metabolic models that are in use today. These include the measure that computes the flux through the organism's biomass reaction, which serves as the leading measure for predicting growth rate even though it computes the growth yield rather than the growth rate of the organism [13]. The results of our experiments and tests show the importance of reactions that exploit the proton gradient in enabling growth in aerobic organisms.

Beyond the importance of this research for understanding biological processes, its results could be used in biotechnological applications, for example to determine the optimal growth environment for bacteria intended to produce desired materials. In medicine, they could be used to recommend a diet that significantly reduces the growth rate of pathogens and even of cancer cells.


Glycan metabolism in the mammalian gut bacterial community predicts the gut bacterial species and reveals community adaptations related to the mammals' main diet


Glycan metabolism of the mammalian gut microbiota predicts bacterial species abundance and reveals diet-specific adaptations

Omer Eilam, Raphy Zarecki, Matthew Oberhardt, Martin Kupiec, Uri Gophna & Eytan Ruppin


The paper was submitted for publication to Nature Methods and was also presented at the conference:

Exploring human host-microbiome interactions in health and disease

(8-10.5.2012 Cambridge UK)

The paper is an example of using metabolic models to analyze the bacterial community in the mammalian digestive tract.

Glycans are the main nutritional source of the bacteria in the mammalian digestive tract, so understanding the glycan metabolism of these bacteria is critical for studying this community. Until now, metabolic models of bacteria did not account for glycans. Chapter 4 of this work presents a novel and complex computational framework that inserts glycan-degrading reactions into the bacterial metabolic models and analyzes the capacity to utilize glycans in mammalian digestive tracts, based on metagenomic information taken from the digestive tracts of the hosts of those communities. The framework is an important extension of the existing metabolic model generators. It predicts, for the first time, the ability to digest thousands of glycans for the hundreds of bacteria that were mapped by the Human Microbiome Project (HMP) and for which a metabolic model exists, and it provides a new metabolic view of this community.

Using the framework, we show that the ability of a bacterium to degrade polysaccharide glycans is strongly correlated with its abundance, which indicates that the ability to degrade these glycans is a selective advantage for those bacteria. We use glycan-degradation capability to explain the differences between the bacterial communities found in herbivores and in carnivores. With the framework we built a classifier that identified the mammal's diet type with high accuracy from the glycans degraded by the bacterial community in its digestive tract, and did so far better than classifiers that used only the metagenomic information on that community. When we applied the same classifier to samples of bacterial communities from the digestive tracts of American citizens, we found that most of these communities favor degrading glycans of animal origin, in contrast to bacterial communities from the digestive tracts of Venezuelan citizens, which showed a preference for degrading glycans of plant origin.

The computational platform presented here opens the door to making probiotic and prebiotic predictions that can affect our health.


Analysis of the results

The core of the "systems biology" approach is that biological systems are interconnected, and that studying each component separately is not enough to understand the overall complexity of the system; a holistic approach is needed to understand the biology of the system. Metabolic models at the cell level (GSSM) are an implementation of the holistic approach represented by systems biology at the level of the single cell. Metabolic models have proven themselves as a tool that predicts the behavior of organisms in many biotechnological and medical processes and in detoxification systems. The great effort that was needed to build metabolic models of bacteria by hand, together with the absence of a uniform standard for the names of metabolites and reactions in the resulting models, restricted research with metabolic models to the analysis of single bacteria acting alone or in pure populations. Yet prokaryotes are known to live and thrive in dense and diverse communities, and the relationships among community members and between them and the environment greatly affect the activity of the community as a whole and of the individual organisms in particular. An infrastructure was therefore needed for modeling the activity of communities of organisms, in order to move up a level beyond the single cell.

The appearance of metabolic model generators for prokaryotes, which provided metabolic models for thousands of organisms while keeping a uniform naming convention for the metabolites and reactions that make up the models, made possible the required climb up the "holistic ladder" from the level of the single genome (Genome Scale Metabolic Model: GSSM) to the level of the community (Community genome-scale metabolic model: C-GSSM). Although automatically generated models are not yet of the quality of some manually generated models, they make it possible to ask questions at the community level that could not be asked before. The main contribution of this work was to provide computational answers to some of the questions that the very existence of these prokaryotic models made it possible to ask.

Answering different types of questions using metabolic models of thousands of prokaryotes

In this work I tried to answer examples of different types of questions related to bacterial communities, questions that can now be answered thanks to the large number of prokaryotic metabolic models that exist.

The first group of questions dealt with relationships of competition and cooperation between different species of bacteria in different growth environments. Chapter 2 presents a large-scale study (the largest to date) of the interactions between different pairs of bacteria.


The second group of questions dealt with comparing the metabolic models of different prokaryotes and with the properties that can be learned from this comparison. In Chapter 3 we found a way to predict the growth rate of prokaryotes in a given growth environment based on thermodynamic principles. The method proved far more accurate than previous methods that used metabolic models to predict growth rate. It gave results similar to those obtained from kinetic methods, which required more knowledge about the parameters of the growth environment and were far more complicated to compute.

The third group of questions dealt with the relationships between bacterial communities and their host organism. In Chapter 4 we dealt with the relationships between the bacterial community in the mammalian digestive tract and its host.


Future directions

This work contains several pioneering studies, made possible by the availability of the thousands of prokaryotic metabolic models that the metabolic model generators produced. The work has only scratched the surface of the space of questions that these models make it possible to answer. Chapters 2-4 present suggestions for extensions within the specific research questions they tried to answer, but in this section I will try to present several additional topics that I think are worth investigating on the basis of the research done in this work and of the existing metabolic models.

New methods for simulating communities

In this work we concentrated on modeling communities at steady state in a chemostat. There are additional methods that use metabolic models, which today are used only for modeling single organisms and which can be extended to model communities. One of these methods drops the requirement of modeling at steady state; it is called dynamic flux balance analysis (dynamic FBA) and is described in [14]. This method is an application of algorithms from the cellular automata family [15]. It introduces the time component and the changes in the environment into the metabolic computation. The main problem with this approach is the large number of possible parameters that may affect the activity of the simulated system; even so, it requires far fewer parameters than are needed to build a full dynamic simulation of communities. In my opinion the "chemostat" assumption is not reasonable when working with complex environments such as communities, and work with algorithms from the cellular automata family is therefore required.
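
The idea of adding time and a changing environment to the metabolic computation can be sketched as a simple time-stepping loop. This is a minimal sketch under strong assumptions: the per-step FBA linear program is replaced by a fixed-yield stand-in, the integration is explicit Euler, and the function name and all parameter values are hypothetical.

```python
def simulate_dfba(biomass, substrate, steps, dt, max_uptake, yield_coeff):
    """Explicit-Euler time stepping in the spirit of dynamic FBA.

    At each step the uptake rate is limited by what is left in the medium,
    a growth rate is derived from it (a fixed-yield stand-in for solving the
    FBA linear program), and biomass and substrate are then updated.
    """
    trajectory = [(0.0, biomass, substrate)]
    for step in range(1, steps + 1):
        if biomass > 0:
            # Uptake cannot exceed the kinetic cap or deplete the medium within dt.
            uptake = min(max_uptake, substrate / (biomass * dt))
        else:
            uptake = 0.0
        growth_rate = yield_coeff * uptake  # stand-in for an FBA solve
        substrate = max(substrate - uptake * biomass * dt, 0.0)
        biomass = biomass + growth_rate * biomass * dt
        trajectory.append((step * dt, biomass, substrate))
    return trajectory


# Hypothetical parameters: batch culture with a single limiting substrate.
trajectory = simulate_dfba(
    biomass=0.1, substrate=10.0, steps=200, dt=0.1, max_uptake=10.0, yield_coeff=0.05
)
time, final_biomass, final_substrate = trajectory[-1]
print(round(final_biomass, 6), round(final_substrate, 6))
```

Unlike the steady-state setting, the environment here changes between steps: growth stops once the substrate is exhausted. Extending the loop to a community would mean solving one model per species at each step against the same shared medium.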

Building dedicated communities

The ability to design optimal structures of bacterial communities for different purposes opens many opportunities in the biotechnology of material production, in toxin degradation, and in creating communities that encourage beneficial bacteria to thrive and to compete against pathogens.

Using bacterial communities for these purposes can be an alternative to the complex operations of removing genes from and inserting genes into single bacteria that are performed today as part of genetic engineering.

The methods and tools presented in this work make it possible to design dedicated bacterial communities for required purposes.


Examples of prokaryotic properties that can be learned from the joint analysis of their metabolic models

In Chapter 3 of this work we examined the growth rate of prokaryotes by means of their metabolic models. Many more questions can be investigated with the same models. Here are several examples of such questions:

Is there diffusion of metabolic processes along the phylogenetic tree?

Is there a correlation between the phylogenetic distance of prokaryotes and their growth environment?

Are there metabolic components that play a role in determining the pH sensitivity of prokaryotes?

Today the existing metabolic models are not accurate enough to give a precise answer at the level of the single organism, but when a large number of models is analyzed, some properties can yield a clear cumulative signal and lead to new insights and new discoveries, as happened in Chapter 3.

Investigating the relationships between specific bacterial communities and their host

In Chapter 4 we began to investigate the community of bacteria found in the digestive tract of mammals in general and of humans in particular. There are many bacterial communities in nature that operate within a host; some are associated with diseases and pathogens, and others with parasitic and symbiotic relationships.

Another example that I believe is important is the design of bacterial communities that can serve as fertilizers or as pesticides for plants. The importance of research in this direction needs no explanation at a time when the world population is growing rapidly and there are doubts about the ability to meet this population's food demands.

Summary

The appearance of metabolic model generators for prokaryotes, and with them of thousands of metabolic models for different prokaryotes, opened a new subfield within the research field of metabolic models. Comparing the metabolic models of different organisms and building models of prokaryotic communities (C-GSSMs) shows high potential for answering questions about the behavior of prokaryotes in nature and for designing communities for dedicated purposes, as the results of this work show, even though it has only scratched the surface of this subfield.

The results presented in this work are very encouraging, but it is clear that the results and methods presented are partial and limited, mainly because they concentrate only on the metabolic aspect of the relationships between organisms while ignoring regulation at the level of the single cell and specialization at the level of the community. It is also expected that the "steady state" assumption, which was fundamental to this work, will be dropped as more dynamic models come into use in community modeling.

Opening a new sub-branch of research is always fascinating, and I consider myself fortunate to be a part of it.


Bibliography for the Hebrew abstract

1. Kell, D.B., Systems biology, metabolic modelling and metabolomics in drug discovery and development. Drug Discov Today, 2006. 11(23-24): p. 1085-92.

2. Hucka, M., et al., The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics, 2003. 19(4): p. 524-31.

3. Price, N.D., J.L. Reed, and B.O. Palsson, Genome-scale models of microbial cells: evaluating the consequences of constraints. Nat Rev Microbiol, 2004. 2(11): p. 886-97.

4. Ashburner, M., et al., Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 2000. 25(1): p. 25-9.

5. Kanehisa, M., et al., The KEGG resource for deciphering the genome. Nucleic Acids Res, 2004. 32(Database issue): p. D277-80.

6. Karp, P.D., et al., The MetaCyc Database. Nucleic Acids Res, 2002. 30(1): p. 59-61.

7. Aziz, R.K., et al., SEED servers: high-performance access to the SEED genomes, annotations, and metabolic models. PLoS One, 2012. 7(10): p. e48053.

8. Thiele, I. and B.O. Palsson, A protocol for generating a high-quality genome-scale metabolic reconstruction. Nat Protoc, 2010. 5(1): p. 93-121.

9. Henry, C.S., et al., High-throughput generation, optimization and analysis of genome-scale metabolic models. Nat Biotechnol, 2010. 28(9): p. 977-82.

10. Stolyar, S., et al., Metabolic modeling of a mutualistic microbial community. Mol Syst Biol, 2007. 3: p. 92.

11. Wintermute, E.H. and P.A. Silver, Emergent cooperation in microbial metabolism. Mol Syst Biol, 2010. 6: p. 407.

12. Freilich, S., et al., Competitive and cooperative metabolic interactions in bacterial communities. Nat Commun, 2011. 2: p. 589.

13. Adadi, R., et al., Prediction of microbial growth rate versus biomass yield by a metabolic network with kinetic parameters. PLoS Comput Biol, 2012. 8(7): p. e1002575.

14. Mahadevan, R., J.S. Edwards, and F.J. Doyle, 3rd, Dynamic flux balance analysis of diauxic growth in Escherichia coli. Biophys J, 2002. 83(3): p. 1331-40.

15. Wolfram, S., A new kind of science. 2002, Champaign, IL: Wolfram Media. xiv, 1197 p.
