The Raymond and Beverly Sackler Faculty of Exact Sciences
The Blavatnik School of Computer Science
Cross species modeling of bacterial metabolism
reveals new insights about their intra-
community and inter-host interactions
Thesis submitted for the degree of Doctor of Philosophy
by
Raphy Zarecki
This work was carried out under the supervision of
Professor Eytan Ruppin
Submitted to the Senate of Tel Aviv University
November 2013
This work is dedicated to the pursuit of understanding
the interplay between living species.
Acknowledgements
I thank my advisor Professor Eytan Ruppin for his help, endless patience and
friendship.
I thank my Mother for her endless love.
I thank my family, and especially my wife, for their support in this adventure.
I thank all those people who shared this scientific road with me, specifically Shiri
Freilich, Matthew A. Oberhardt and Omer Eilam.
I thank my luck for living in exciting times.
This work was supported by the Edmond J. Safra Program in Tel-Aviv University.
This work was also supported by grants from the Israeli Ministry of Science and
Technology and the James McDonnell Foundation.
Abstract
Genome-scale metabolic models have proven their value in predicting organism
phenotypes from genotypes. But despite their numerous potential applications,
efforts to develop new models have failed to keep pace with genome sequencing.
This, together with the lack of standard nomenclatures for metabolites and reactions
in manually curated models, has restricted the focus of most research to date to
isolated, non-interacting species. However, it is well known that prokaryotes live and
thrive in dense communities, and the interactions of community members with each
other as well as with the environment determine much of the functionality,
adaptability, and capabilities of the whole group. While individual genome-scale
models are adequate to predict the behavior of cells in pure cultures, most natural
systems on earth require modeling of metabolic interactivity between species in order
to capture the most relevant biology.
The appearance of semi-automatic tools for the generation of metabolic models, and
the subsequent availability of models for thousands of sequenced prokaryotes, have
recently opened up the possibility of tackling this issue. Although automatically
generated models are still not of equivalent quality to manually curated ones, they do
open new opportunities in the types of questions we may ask. Importantly, these
models solve the problem of differing metabolite and reaction nomenclature between
models -- a major technical hurdle in the past -- and thus enable seamless comparison
and modeling at the community level.
The research presented in this dissertation harnesses the newly emerging automatic
models of bacterial species to address some of the fundamental challenges involved
in modeling bacterial communities: The work includes a study of interactions
between different microbial species, an extraction of prevalent features across
species that are hidden when each species is examined alone, and a study of
interactions between bacterial communities and their hosts. In sum, this work
demonstrates the potential of multi-species metabolic modeling in advancing our
understanding of these issues, at the level of basic science and at the level of its
potential bio-engineering ramifications. Specifically, the major results include a
large-scale, ecologically based validation of a method for predicting
cooperation/competition relationships between members of bacterial communities from
their metabolic models; a novel, easy-to-calculate method for predicting bacterial
growth rate from metabolic models; and a novel algorithm for predicting the
glycan metabolism of the mammalian gut microbiota.
Contents
1. INTRODUCTION
1.1. Systems biology
1.2. Metabolic networks analysis
1.2.1. Biological background
1.2.2. Mathematical metabolic modeling
1.2.3. Construction of a metabolic model
1.2.4. Potential application of metabolic modeling
1.3. Focusing on modeling bacterial metabolism
1.3.1. Semi automatic tools for metabolic model generation
1.3.2. Large scale analysis of a collection of prokaryote metabolic models
1.3.3. Metabolic simulating of prokaryotes communities
1.3.4. Finding hidden qualities from large scale analysis of metabolic models
2. COMPETITIVE AND COOPERATIVE METABOLIC INTERACTIONS IN BACTERIAL COMMUNITIES
2.1. Introduction
2.2. Results
2.2.1. In silico and in vivo description of co-growth patterns
2.2.2. Systematic predictions of the competitive potential
2.2.3. Systematic predictions of the cooperative potential
2.2.4. Patterns of interactions across ecological samples
2.3. Discussion
2.4. Methods
2.4.5. Calculating the resource overlap within a species pair
3. MAXIMAL SUM OF METABOLIC EXCHANGE FLUXES OUTPERFORMS BIOMASS YIELD AS A PREDICTOR OF GROWTH RATE OF MICROORGANISMS
3.1. Introduction
3.2. Results and Discussion
3.3. Conclusions
3.4. Materials and Methods
3.4.1. Models
3.4.2. Implementation of growth rate predictors
3.4.3. Building NCI60 cancer cell models
3.4.4. Growth experiments of 6 organisms on 3 defined IMM media (ds18)
4. GLYCAN DEGRADATION (GLYDE) ANALYSIS PREDICTS MAMMALIAN GUT MICROBIOTA ABUNDANCE AND HOST DIET-SPECIFIC ADAPTATIONS
4.1. Introduction
4.2. Results
4.2.1. The construction of the Glycan Degradation (GlyDe) pipeline
4.2.2. The usage of the GlyDe pipeline
4.2.3. Validating the GlyDe pipeline
4.2.4. Characterization of glycan degradation patterns across the major gut bacterial phyla
4.2.5. Glycan degradation patterns can be used to predict bacterial abundance
4.2.6. Glycan degradation profiles of mammalian species are associated with their diet
4.3. Discussion
4.4. Methods
4.4.1. Data Retrieval
4.4.2. Construction of the CAZyme table (a key step in the GlyDe pipeline)
4.4.3. Data Analysis
5. DISCUSSION
5.1. Answering new types of questions with the large number of available bacterial metabolic models
5.1.1. Metabolic Interaction within Bacterial communities
5.1.2. Extracting cell qualities from a large scale metabolic analysis across a large number of species
5.1.3. Investigating the relations between gut bacterial communities and their hosts
5.2. Future directions
5.2.1. New methods for simulating communities
5.2.2. Specialized bacterial communities
5.2.3. Examples of organisms' traits that can be extracted using the large number of prokaryotes metabolic models
5.2.4. Investigating relationships of specific communities of bacteria and their host
5.3. Summary
APPENDIX 1. SUPPLEMENTARY DATA FOR CHAPTER 2
A.1.1 Supplementary Figures
A.1.2 Supplementary Tables
A.1.3 Supplementary Methods
A.1.3.1 Computing the Maximal Biomass Production Rate (MBR) of species
A.1.3.2 Generation of a multi-species system metabolic model
A.1.3.3 Computing a Competition inducing Medium (COMPM) for single species and multi-species systems
A.1.3.4 Computing a Cooperation-inducing Medium (COOPM)
A.1.3.5 Experimental and computational co-growth analysis
A.1.3.6 Finding close cooperative loops in real and random networks of give-take interactions and in real and randomly drawn communities
A.1.4 Supplementary Notes
A.1.4.1 Supplementary Note 1: Experimental and computational co-growth analyses for 10 bacterial pairs in interaction-specific media.
A.1.4.2 Supplementary Note 2: using systematic data sources for estimating the ecological relevance of win-lose predictions
A.1.4.3 Supplementary Note 3: Simulating co-growth of Salinibacter ruber and Haloquadratum walsbyi.
A.1.4.4 Supplementary Note 4: Relating the designed media to true ecological conditions
A.1.4.5 Supplementary Note 5: The use of various thresholds for determining a feasible growth solution in minimal, cooperation-inducing, media
A.1.4.6 Supplementary Note 6: Frequency of directional give-take relationships across bacterial families (top 10 combinations).
A.1.4.7 Supplementary Note 7: Experimental and computational growth analyses of Listeria innocua and Agrobacterium tumefaciens across pre-designed media
APPENDIX 2. SUPPLEMENTARY DATA FOR CHAPTER 3
A.2.1 List of Abbreviations
A.2.2 Supplementary results
A.2.2.1 Sensitivity analysis of SUMEX and Biomass
A.2.2.2 Expanded analysis of obligate fermenters and respirers in ds66
A.2.2.3 The relationship between flux and molecular weight in SUMEX
A.2.2.4 Ranging of biomass% lower bound
A.2.2.5 Network flexibility in SUMEX as biomass lower bound approaches 100%
A.2.2.6 Summing exchange fluxes in the optimal biomass solution space predicts growth rate
A.2.2.7 Gene Expression of pathways contributing to SUMEX
A.2.2.8 Correlation of SUMEX and Biomass
A.2.3 Supplementary Methods
A.2.3.1 Models
A.2.3.2 General methods
A.2.3.3 Reactions constraints and optimal environment setting
A.2.3.4 Building NCI60 cancer cell models
A.2.3.5 Computation of metrics
A.2.3.6 Growth experiments of 6 organisms on 3 defined IMM media (ds18)
A.2.4 Tables
APPENDIX 3. SUPPLEMENTARY DATA FOR CHAPTER 4
A.3.1 Figures
A.3.2 Tables
List of Tables
Chapter 4::Table 1: Mammalian host diet prediction by GlyDe profiles.
Appendix 1::Supplementary Table S1. Description of model species and selected properties.
Appendix 1::Supplementary Table S6. The list of EnvO niches used in the analysis and the number of assigned samples.
Appendix 1::Table SM-1. IMM defined medium and its in silico representation.
Appendix 1::Table SN1-1. Predicted and observed co-growth shifts.
Appendix 1::Table SN1-2. Calculated values for predicted and observed co-growth shifts.
Appendix 1::Table SN3-1. Interactions between Salinibacter ruber and Haloquadratum walsbyi across different media.
Appendix 1::Table SN4-1. Characterization of species-specific metabolic computationally-designed environments
Appendix 1::Table SN4-2. Characterization of pair-specific metabolic environments
Appendix 1::Table SN5-1. Frequency of symmetrical interaction events under minimal growth media with different thresholds for biomass production of the system.
Appendix 1::Table SN5-2. Frequency of symmetrical interaction events under minimal growth media with different thresholds for biomass production of the system and the compartments in the system.
Appendix 1::Table SN6-1. Frequency of inter-family give-take interactions
Appendix 1::Table SN7-1. Computational predictions for the effect of reducing and removing computationally-predicted limiting factors from IMM media.
Appendix 1::Table SN7-2. Predicted and observed growth and co-growth shifts.
Appendix 1::Table SN7-3. Observed growth and co-growth shifts. Values indicate the maximal OD in the experiments.
Appendix 2::Table S1. Statistics of GEM sensitivity analysis.
Appendix 2::Table S2: Analysis of ds24.
Appendix 2::Table S4: Description of ds66.
Appendix 2::Table S5: in vitro growth experiments (i.e., ds18).
Appendix 2::Table S6: IMM defined medium.
Appendix 3::Supplementary Table 1: CAZymes degradation rules.
Appendix 3::Supplementary Table 2: Comparison between KEGG and non KEGG GlyDe Scores.
Appendix 3::Supplementary Table 3: The GlyDe outputs for all the HMP taxa.
Appendix 3::Supplementary Table 4: The GlyDe scores of 8 Human Milk Oligosaccharides (HMOs) available in KEGG.
Appendix 3::Supplementary Table 5: The CAZyme table.
Appendix 3::Supplementary Table 6: A detailed account of the glycans used throughout this analysis.
Appendix 3::Supplementary Table 7: The full list of monosaccharides and other basic chemical entities used as nodes in the graphs representing glycan structures in KEGG and incorporated into our system.
Appendix 3::Supplementary Table 8: An OTU table representing HMP taxa.
Appendix 3::Supplementary Table 9: The HMP bacterial reference genomes glycan degradation (GlyDe) matrix.
Appendix 3::Supplementary Table 10: The Muegge et al. samples CAZymes matrix.
Appendix 3::Supplementary Table 11: The Yatsunenko et al. samples CAZymes matrix.
Appendix 3::Supplementary Table 12: The Muegge et al. samples GlyDe matrix.
Appendix 3::Supplementary Table 13: The Yatsunenko et al. samples GlyDe matrix.
Appendix 3::Supplementary Table 14: The Muegge et al. samples GlyDe output report.
Appendix 3::Supplementary Table 15: The Yatsunenko et al. samples GlyDe output report.
Appendix 3::Supplementary Table 16: Host diet predictions for 18 human samples taken from Muegge et al.
Appendix 3::Supplementary Table 17: Taxa with highly predictable abundance.
List of figures
Chapter 2::Figure 1: Metabolic modeling in a multi-species system.
Chapter 2::Figure 2: Metabolic modeling of pairwise growth on a competition-inducing media (COMPM).
Chapter 2::Figure 3: Distribution of competition and cooperation values.
Chapter 2::Figure 4: Predicted competitive and cooperative interactions across different ecological groups.
Chapter 3::Figure 1: Correlation of different metrics to growth rate.
Chapter 3::Figure 2: Component-wise analysis of SUMEX
Chapter 3::Figure 3: Prediction of growth in Respirers vs. Fermenters in ds66
Chapter 3::Figure 4: NCI60 cancer cell line growth rates predicted by SUMEX.
Chapter 4::Figure 1: The Glycan Degradation (GlyDe) platform.
Chapter 4::Figure 2: Glycan Degradation of the gut microbiota reference genomes.
Chapter 4::Figure 3: The connection between glycan degradation and diet.
Appendix 1::Supplementary Figure S1. Cooperation and competition levels of the ecological groups at different levels of competition and resource overlap.
Appendix 1::Supplementary Figure S2: The frequency of resource overlap values between ecologically associated (black) and non-associated (white) species pairs.
Appendix 1::Figure NS1-1. Growth curves of individual and pair-wise combinations across different media.
Appendix 2::Figure S1. Sensitivity analysis of GEM bounds.
Appendix 2::Figure S2: Effect of biomass lower bound on SUMEX.
Appendix 2::Figure S3: Flux variability in SUMEX solution as function of biomass lower bound.
Appendix 2::Figure S4: Extrapolating bounds for biomass.
Appendix 2::Table S3: Association of global gene expression with SUMEX.
Appendix 2::Figure S5: Correlation of Biomass with SUMEX.
Appendix 2::Figure S6: Schematic of SUMEX.
Appendix 3::Supplementary Figure 1: KEGG Glycans.
Appendix 3::Supplementary Figure 2: Glycan Degradation of the gut microbiota reference genomes.
Appendix 3::Supplementary Figure 3: The connection between glycan degradation and diet.
Chapter 1
1. Introduction
This chapter presents general background and reviews previous work related to the
studies described in this thesis. In particular, it includes an overview of systems
biology approaches for modeling of metabolic and protein networks. Further detailed
introductions precede each study in the following chapters.
1.1. Systems biology
For a very long time, the use of computers in biology was mostly dedicated to
processing collected biological data, which was then returned to biologists for
interpretation.
The introduction of the concept of 'Systems Biology' drastically changed the
relationships between computer scientists and biologists. In this paradigm,
sophisticated computational models are used to simulate biological phenomena in
great detail, and biologists are then asked to validate these models. Systems biology
thus involves an iterative interplay between high-throughput and high-content wet-lab
experiments, technology development, theory, and computational modeling. The
involvement of computational modeling in the process sets systems biology apart
from the more traditional and more reductionist approaches of molecular biology,
which dominated biological study for the second half of the last century [1].
1.2. Metabolic networks analysis
This dissertation belongs to the field of Systems Biology. More specifically, it
focuses on the modeling of prokaryote metabolic networks. Using the Systems
Biology approach, models describing the entire metabolic network of an organism of
interest are built. This approach is well defined, and stable, computer-readable
formats for representing metabolic and biochemical models, such as SBML [2], exist.
There are also tools and methods to measure concentrations of a broad
range of metabolites (metabolomics), as well as to collect other large-scale 'omics-type
data (e.g., proteomics, transcriptomics, fluxomics), which enable us to validate
some of our predictions [1].
The understanding of metabolic processes within living cells is of great potential
economic importance, and holds industrial, biomedical, and bio-remediation
potential. In industry, metabolic processes are relevant to the production of foods,
fuel, antibiotics, and amino acids (as food supplements). In the context of healthcare,
metabolism plays a central role in many human diseases, especially with the
emergence of metabolic diseases such as diabetes and obesity as top sources of
morbidity and mortality.
1.2.1. Biological background
Cellular metabolism refers to the set of chemical transformations of substances that
take place within a cell. Most of the chemical reactions within the cell are catalyzed
by specific proteins called enzymes. These reactions typically convert several
metabolites, called reactants (or substrates), into several other product metabolites.
These collected reactions form a highly complex metabolic network.
1.2.2. Mathematical metabolic modeling
Mathematical modeling of metabolism can be presented in different ways.
Kinetic models
The kinetic approach is focused on the description of stationary states and time
courses[3]. Kinetic models are commonly formulated as a set of differential
equations that compute the time derivative of metabolite concentrations depending
on reaction rates (which, in turn, depend on the concentrations of some of the
metabolites and enzymes). The major limitation of such kinetic models is that
reaction rate equations contain many parameters whose values are unknown and are
hard to measure experimentally. This is why the applicability of kinetic models is still
limited to relatively small-scale systems.
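As a rough illustration of the kinetic formulation described above, the following sketch simulates a hypothetical two-step pathway (A to B to C) in which each step follows Michaelis-Menten kinetics. The pathway, the parameter names and all numerical values are assumptions made for this example only; they are not taken from this work. Even this toy system needs four kinetic parameters, which hints at why the approach scales poorly to genome-scale networks.

```python
# Toy kinetic model of a hypothetical pathway A -> B -> C.
# All parameter values are invented for illustration.

def michaelis_menten(substrate, vmax, km):
    """Reaction rate as a function of substrate concentration."""
    return vmax * substrate / (km + substrate)

def derivatives(conc, params):
    """Time derivatives of the concentrations [A, B, C]."""
    a, b, _c = conc
    r1 = michaelis_menten(a, params["vmax1"], params["km1"])  # A -> B
    r2 = michaelis_menten(b, params["vmax2"], params["km2"])  # B -> C
    return [-r1, r1 - r2, r2]

def simulate(conc0, params, dt=0.01, steps=5000):
    """Integrate the ODE system with the explicit Euler method."""
    conc = list(conc0)
    for _ in range(steps):
        d = derivatives(conc, params)
        conc = [x + dt * dx for x, dx in zip(conc, d)]
    return conc

params = {"vmax1": 1.0, "km1": 0.5, "vmax2": 0.8, "km2": 0.3}
final = simulate([10.0, 0.0, 0.0], params)
print(final)  # concentrations of A, B and C at the end of the simulation
```

Explicit Euler integration is used only to keep the sketch dependency-free; a real study would normally rely on a dedicated ODE solver.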
Constraint Based Modeling
An alternative approach to kinetic modeling, when handling large scale metabolic
networks, is called Constraint Based Modeling (CBM). This approach is the one used
in this work. CBM is based on the observation that cells are subject to various
physical constraints that limit their behavior [4]. By enforcing these constraints on
the space of possible metabolic behaviors, it is possible to determine which
metabolic states are valid and which are not in a large-scale model. It is also possible
to select states according to defined criteria and to find what constraints they imply
on the model. CBM has been shown to successfully predict gene essentiality, growth
rate, nutrient uptake rates and product secretion rates in a variety of studies [4].
In CBM, the metabolic network is represented as a matrix:
$S_{m \times n} \in \mathbb{R}^{m \times n}$,
in which n is the number of reactions in the model and m is the number of
metabolites that participate in the reactions of the model. Each column j represents
the linear equation representing reaction j (for example, the column for a reaction
consuming 2 molecules of metabolite A and producing 1 molecule of metabolite B would have
-2 and +1, respectively, in the matrix rows representing metabolites A and B).
Each cell $S_{i,j}$ represents the stoichiometric coefficient of metabolite i in reaction j.
Reactants have negative stoichiometric values and products have positive values.
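The matrix representation described above can be sketched in a few lines. The example below builds S for a two-reaction toy network containing the 2 A -> B reaction used as an example in the text; the reaction identifiers R1 and R2 and the second reaction are hypothetical additions for illustration.

```python
# Build a stoichiometric matrix S (metabolites x reactions) from reaction
# definitions. Reaction names and the second reaction are hypothetical.

reactions = {
    "R1": {"A": -2, "B": +1},  # 2 A -> B (the example from the text)
    "R2": {"B": -1, "C": +1},  # B -> C
}

metabolites = sorted({m for stoich in reactions.values() for m in stoich})
reaction_ids = list(reactions)

# S[i][j] is the stoichiometric coefficient of metabolite i in reaction j;
# metabolites that do not take part in a reaction get a coefficient of 0.
S = [[reactions[r].get(m, 0) for r in reaction_ids] for m in metabolites]

for m, row in zip(metabolites, S):
    print(m, row)
```

Reading the first column top to bottom gives -2 for A, +1 for B and 0 for C, matching the description of reactants as negative and products as positive entries.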
Using linear programming methods along with this network representation, we can
predict metabolic states as represented by the feasible flux distribution through all
reactions in the network. The constraints imposed on the matrix that represents the
metabolic model are:
Mass balance – We assume quasi-steady state, i.e., that there is neither
accumulation nor depletion of metabolites within the metabolic network
(exchanged metabolites are allowed to deplete or accumulate outside of the
system). This means that the production rate of each metabolite is equal to its
consumption rate. This balance is formulated mathematically based on a
stoichiometric matrix. The mass balance constraint is enforced by the equation:
$S \cdot V = 0$, in which $V$ is the set of all metabolic reaction fluxes that fall within a
sub-space of $\mathbb{R}^n$. As noted above, the model allows for uptake and secretion of
metabolites in and out of the model. This is done via exchange reactions that are
added to the model for this purpose, and which, unlike reactions within the
model, do not need to adhere to mass balance, as they represent ultimate external
sinks or sources.
Thermodynamic limitations - The model supports the directionality of reactions.
Directionality of many biochemical reactions is limited based on thermodynamic
constraints. For these reactions flux can only go in one direction, while for other
reactions flux can go from reactants to products or vice versa.
Flux rates - For some reactions, the maximum possible flux rate can be estimated
based on cell physiology data. These constraints are imposed by setting upper
and lower bounds on the rate of specific reactions. We represent these limitations
this way: $\forall v \in V:\ \alpha \le v \le \beta$, where V is the set of all reaction fluxes, and $\alpha$ and $\beta$ are
the lower and upper bounds of the corresponding reaction flux.
Growth media – In order to predict the behavior of an organism under given
growth conditions, the model allows the definition of available external
metabolites existing in the organism's growth environment. This is achieved by
constraining the fluxes through exchange reactions, which represent the
availability of extra-cellular metabolites within the growth medium.
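The constraints listed above can be illustrated with a short feasibility check. This is a sketch under stated assumptions: the toy network, the bound values and the function name `is_feasible` are hypothetical and chosen only to show how mass balance, directionality, flux bounds and the growth medium enter the model.

```python
import numpy as np

# Hypothetical toy network: columns are EX_A, R1 (2 A -> B), EX_B.
S = np.array([[1.0, -2.0,  0.0],   # metabolite A
              [0.0,  1.0, -1.0]])  # metabolite B

# Illustrative flux bounds (alpha, beta) per reaction.
# A lower bound of 0 makes a reaction irreversible (thermodynamic limitation);
# the upper bound on EX_A encodes how much A the growth medium supplies.
lower = np.array([0.0, 0.0, 0.0])
upper = np.array([10.0, 1000.0, 1000.0])

def is_feasible(S, v, lower, upper, tol=1e-9):
    """True if v satisfies mass balance (S v = 0) and all flux bounds."""
    mass_balanced = np.allclose(S @ v, 0.0, atol=tol)
    within_bounds = np.all(v >= lower - tol) and np.all(v <= upper + tol)
    return bool(mass_balanced and within_bounds)

balanced = is_feasible(S, np.array([10.0, 5.0, 5.0]), lower, upper)
accumulating = is_feasible(S, np.array([10.0, 4.0, 4.0]), lower, upper)
over_medium = is_feasible(S, np.array([20.0, 10.0, 10.0]), lower, upper)
print(balanced, accumulating, over_medium)
```

The second vector fails because A would accumulate inside the network; the third is mass balanced but takes up more A than the assumed medium provides.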
Together, these constraints define a convex solution space of feasible flux
distributions for the matrix S. Such convex solution spaces are commonly analyzed
via linear programming (LP) optimization methods. Examples include:
Flux Balance Analysis (FBA) [5], which searches for an optimal flux distribution
via a linear programming optimization. FBA assumes that the metabolism of an
organism is optimized for maximization of a certain objective function. The most
commonly used objective function for micro-organisms is biomass production[6].
Biomass production is calculated by adding a new pseudo-reaction representing
the production of essential biomass compounds from known metabolites acting
as reactants. The stoichiometric values for this reaction (Vbiomass) are based on
experimentally derived proportions of metabolic precursors to the parts of a cell
(e.g., lipid, sugar, amino acid), which can be measured as proportions of the dry
weight of a pure culture. Aside from biomass optimization, FBA can also be
used to examine production of other compounds of interest, such as ATP or
important external metabolites (or, indeed, any metabolite in the system). The
basic mathematical formulation of FBA is:
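$$
\begin{aligned}
& \min/\max \quad \text{Objective\_function} \\
& \text{subject to:} \\
& \qquad S \cdot V = 0 \\
& \qquad \forall v_j \in V:\quad v_{j,\min} \le v_j \le v_{j,\max}
\end{aligned}
$$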
where S is the stoichiometric matrix representing the metabolic model at hand, V is
the vector of reaction fluxes, and $v_{j,\min}$ and $v_{j,\max}$ represent the lower and upper
bounds on the possible flux of reaction j.
It should be noted that in many cases the optimal flux distribution is not unique;
characterizing the range of alternative solutions requires flux variability analysis
(described below).
Flux Variability Analysis (FVA), which searches for the range of alternative flux
solutions for a given set of constraints [7]. This method finds the minimal and
maximal possible fluxes for a given set of reactions, and from these infers the size
of the flux solution space under the given set of constraints.
Sampling methods. Randomly sampling the solution space for a given set of
constraints may reveal patterns in the distribution of allowed solutions [8].
Flux coupling, which can identify dependencies between sets of fluxes in the
solution space (such as 'always occur together' or 'are totally
independent/uncoupled') [9, 10].
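The first two methods in this list can be sketched with a generic LP solver. The example below uses `scipy.optimize.linprog` on a hypothetical four-reaction network with two parallel routes from A to B, so that the FBA optimum is not unique and FVA reports a non-trivial range. The network, bounds and function names are assumptions made for illustration; they are not the implementation used in this thesis.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy network (not from the thesis).
# Columns: EX_A (-> A), R1 (A -> B), R2 (A -> B), BIOMASS (B ->).
S = np.array([[1.0, -1.0, -1.0,  0.0],   # metabolite A
              [0.0,  1.0,  1.0, -1.0]])  # metabolite B
bounds = [(0.0, 10.0), (0.0, 1000.0), (0.0, 1000.0), (0.0, 1000.0)]
BIOMASS = 3

def optimize_flux(S, bounds, index, maximize=True):
    """Optimal value of flux `index` subject to S v = 0 and the bounds."""
    c = np.zeros(S.shape[1])
    c[index] = -1.0 if maximize else 1.0   # linprog always minimizes
    res = linprog(c, A_eq=S, b_eq=np.zeros(S.shape[0]),
                  bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return float(res.x[index])

def fba(S, bounds, objective_index):
    """Flux Balance Analysis: maximal flux through the objective reaction."""
    return optimize_flux(S, bounds, objective_index, maximize=True)

def fva(S, bounds, objective_index, reactions, slack=1e-9):
    """Flux Variability Analysis: (min, max) of each listed flux while the
    objective is held at its FBA optimum (up to a small numerical slack)."""
    optimum = fba(S, bounds, objective_index)
    pinned = list(bounds)
    pinned[objective_index] = (optimum - slack, bounds[objective_index][1])
    return {j: (optimize_flux(S, pinned, j, maximize=False),
                optimize_flux(S, pinned, j, maximize=True))
            for j in reactions}

growth = fba(S, bounds, BIOMASS)
ranges = fva(S, bounds, BIOMASS, [1, 2])
print(growth)
print(ranges)
```

Because R1 and R2 are interchangeable in this toy network, the same optimal biomass flux can be reached with any split between them, which is exactly the non-uniqueness that FVA is meant to expose.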
In addition to LP, other optimization methods are often used in metabolic
modeling. Quadratic Programming (QP) was used, for example, in the minimization
of metabolic adjustment (MOMA) method, which predicts the flux distribution
following a perturbation in the metabolic model [11]. Mixed Integer Linear Programming (MILP)
was used, for example, in an algorithm for the regulatory on/off
minimization of metabolic flux changes after genetic perturbations (ROOM) [12].
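For orientation, the quadratic objective of MOMA is commonly written as below, using the same S, V and flux bounds as in the FBA formulation. The symbols w (a reference wild-type flux vector) and K (the set of reactions disabled by the perturbation) are notation introduced here for illustration only.

```latex
\begin{aligned}
\min_{V} \quad & \sum_{j=1}^{n} (v_j - w_j)^2 \\
\text{subject to} \quad & S \cdot V = 0, \\
& v_{j,\min} \le v_j \le v_{j,\max} \quad \forall v_j \in V, \\
& v_k = 0 \quad \forall k \in K .
\end{aligned}
```

ROOM keeps the same constraints but replaces the squared distance with a count of fluxes that change significantly, which requires binary variables and hence a MILP.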
1.2.3. Construction of a metabolic model
The construction of a large-scale metabolic network model is based on various
biological data sources, including genomic, biochemical, and physiological data. It
involves the definition of a set of biological functions, termed ontology, and the
association of the gene products (enzymes) with ontology terms. The most
comprehensive and commonly used ontology is the Gene Ontology (GO), consisting
of over 20,000 terms and numerous associated gene products [13]. Some genes
encode proteins called enzymes. Biochemical reactions are catalyzed by enzymes,
and together a triplet of Gene-Protein-Reaction (GPR) is annotated. These GPR sets
are the core of metabolic models. Well-known repositories for GPRs are KEGG [14],
MetaCyc [15] and 'The Seed' [16].
Model construction follows a well-defined protocol [17] comprising a series of
iterations in which a model's predictions are experimentally tested and the results
are then used to improve it.
Manual construction of metabolic models
Up until a few years ago, genome-scale metabolic models were only constructed
manually. This was a labor-intensive process that required considerable time (~1-2 years
of work by a few people) to produce a working model. Many of the manually curated
large-scale models were constructed in Bernhard Palsson's lab at UCSD
(http://sbrg.ucsd.edu/Downloads), or in labs of his former students. These include
large-scale models for the bacterium E. coli [18], the yeast S. cerevisiae [19], and the
first large-scale human metabolic network model [20]. As of now there are fewer than
200 manually constructed models.
1.2.4. Potential application of metabolic modeling
Having a predictive model is essential for engineering new products. Metabolic
modeling has already proved its predictive capabilities in gene deletion and addition,
gene over- and under-expression, prediction of phenotypes based on changes of
media, and prediction of growth media that are optimal for different phenotypic
goals [21]. Many health-related applications belong to the bio-markers family, where
the models can predict high/low concentrations of certain metabolites when certain
diseases occur [22, 23]. An example of a bio-remediation application, in which metabolic
modeling helped in removing uranium from a contaminated lake, can be found in [24].
1.3. Focusing on modeling bacterial metabolism
Prokaryotes are organisms whose cells lack a membrane-bound nucleus, including
both bacteria and archaea.
Prokaryotes have a fundamental role in the world's ecosystem. They were the first
form of life found on earth and are present in nearly every habitat on the planet.
There are approximately $5 \times 10^{30}$ prokaryotes on Earth, forming a biomass that far
exceeds that of all plants and animals together. A healthy human harbors
approximately ten times as many bacterial cells as his own cells. Bacteria are vital in
recycling nutrients, and while some bacteria are pathogenic, others can be exploited
for a wide range of applications, from food and fuel production to clinical uses and
bioremediation. Elucidating the way these species interact with their surroundings and
their neighbors is crucial for our understanding of their biology and ours.
From the research point of view, prokaryotes are relatively simple compared to
eukaryotes, and thus they are ideal for the pilot studies done in this thesis.
Extensive research has been done on many prokaryotes at the phenotypic and the
genotypic levels, and the relative simplicity of prokaryotes has, importantly, led to the
development of semi-automated tools for the construction of metabolic models for
them.
1.3.1. Semi automatic tools for metabolic model generation
Genome-scale metabolic models have proven to be important resources for
predicting organism phenotypes from genotypes. However, despite their numerous
applications, efforts to develop new models have failed to keep pace with genome
sequencing. To address this, new tools have been developed to aid with the creation
of metabolic models, automating the parts that could be automated. The leading tool
is 'Model Seed' (hereafter referred to as SEED) [25]. This is a web-based resource for
high-throughput generation, optimization, and analysis of draft prokaryotic genome-scale
metabolic models (available at http://www.theseed.org/models/). SEED
integrates existing methodologies and introduces new techniques to automate many
of the steps of the metabolic reconstruction process[17], enabling generation of
functioning draft models from assembled genome sequences in less than 48 hours. A
validation of 22 SEED-generated draft models was done against available gene
essentiality and Biolog data, with average model accuracy determined to be 66%
before optimization and 87% after optimization. The following chapters are based on
the models generated by SEED.
The major steps of the automatic model reconstruction in prokaryotes
Automatic model reconstruction for a given organism requires the following
information:
The genome sequence and its partition into genes
A GPR (Gene-Protein-Reaction) mapping repository for all known genes
A database of all known reactions, with their full stoichiometry and chemical
formulas
For higher species and better mapping we also need enzyme localization information,
i.e., the compartment in which each enzyme operates. In prokaryotes, SEED
assumes that all enzymes reside in the cytosol.
The steps in automatic reconstruction of a model include:
Mapping as many genes as possible from the genome to genes that have a GPR
mapping, and thus creating a 'core' set of GPR-mapped reactions.
Adding a biomass reaction containing all biomass pre-requisites as reactants.
Performing a Gap-Filling process in which a minimal set of reactions outside of
the 'core' (i.e., reactions that are not mapped to the organism's genes) is added to
the model in order to ensure that the organism can 'grow' (i.e., can have flux in
its biomass reaction while adhering to the model's constraints, including the
steady state assumption). The selection of Gap-filling reactions is usually done
by solving a MILP problem that tries to minimize the number of added reactions
while preferentially adding reactions belonging to established pathways enriched
with 'core' reactions, and especially minimizing the addition of exchange
reactions. As a technical alternative, some Gap-Filling algorithms also aim to
maximize the number of core reactions that can carry flux.
The output of such a process is a working model ('working' being defined by the
model's basic functioning and its ability to carry flux through its biomass reaction)
that should later be validated and corrected.
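The Gap-Filling step can be illustrated on a toy network. The sketch below is not the MILP formulation described above: as a deliberately simple stand-in, it enumerates candidate subsets in order of increasing size and uses an LP feasibility check (via `scipy.optimize.linprog`) to find the smallest subset that allows biomass flux. This brute-force search is exponential in the number of candidates and ignores the pathway-based preferences mentioned above. The network, indices and function names are hypothetical.

```python
from itertools import combinations

import numpy as np
from scipy.optimize import linprog

def can_grow(S, bounds, biomass_index, min_growth=1e-6):
    """True if the biomass reaction can carry flux at steady state."""
    c = np.zeros(S.shape[1])
    c[biomass_index] = -1.0  # linprog minimizes, so negate to maximize
    res = linprog(c, A_eq=S, b_eq=np.zeros(S.shape[0]),
                  bounds=bounds, method="highs")
    return bool(res.success and -res.fun > min_growth)

def gap_fill(S, bounds, biomass_index, candidates):
    """Smallest set of candidate reactions (column indices) that enables
    growth. Unselected candidates are switched off by fixing their flux to 0."""
    for size in range(len(candidates) + 1):
        for chosen in combinations(candidates, size):
            trial = list(bounds)
            for j in candidates:
                if j not in chosen:
                    trial[j] = (0.0, 0.0)
            if can_grow(S, trial, biomass_index):
                return list(chosen)
    return None  # no combination of candidates restores growth

# Hypothetical toy model. Columns: EX_A (-> A), CORE (A -> B), BIOMASS (C ->),
# G1 (B -> C, candidate), G2 (B ->, candidate that does not help growth).
S = np.array([[1.0, -1.0,  0.0,  0.0,  0.0],   # metabolite A
              [0.0,  1.0,  0.0, -1.0, -1.0],   # metabolite B
              [0.0,  0.0, -1.0,  1.0,  0.0]])  # metabolite C
bounds = [(0.0, 10.0)] + [(0.0, 1000.0)] * 4

added = gap_fill(S, bounds, biomass_index=2, candidates=[3, 4])
print(added)
```

In this toy case the 'core' alone cannot produce C, so the biomass reaction is blocked until the single candidate G1 is added.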
Following this process, 2500 metabolic models, representing nearly every sequenced
prokaryote in NCBI (and spanning both bacteria and archaea) [26], were built via
SEED.
1.3.2. Large scale analysis of a collection of prokaryote
metabolic models
Most of the work that is currently done with metabolic models is still performed at
the level of a single species. However, with the newly emerged resource just
described and with recent advances, interest, and focus on metagenomics, it is timely
to begin exploring multiple-organism analyses using these models. This analysis
includes a comparison of the models and extraction of common features as shown in
Chapters 2, 3, and 4, the construction of more complex structures such as
communities, and an analysis of the interactions between species.
1.3.3. Metabolic simulation of prokaryotic communities
Most of the research using metabolic modeling has focused on single cell models,
and has assumed that each species is an individual entity, isolated from any
interaction with others. However, prokaryotes most typically live and thrive in dense
communities in which the interactions of community members with each other and with
the environment determine the functionality, adaptability and capabilities of the
group as a whole. While individual models are adequate to predict the behavior of
cells in pure cultures, simulation of any realistic community requires consideration of
possible metabolic interactions between different species. Although there have been
several previous attempts to model particular consortia [27, 28], the work done was
focused on specific pairs and did not include a global analysis of the interaction
between species. The work done in Chapter 2 suggests a computational platform,
protocols, and resources for creating and analyzing models of communities in an
easy and systematic way. For that purpose, we have used the models curated using
the SEED semi-automatic workflow as the building blocks of our communities.
1.3.4. Finding hidden qualities from large scale analysis of
metabolic models
In Chapter 3, we aim to explain and predict the phenotypic feature of growth rate
using metabolic modeling across different species. Growth rate has long been
considered one of the most valuable phenotypes that can be measured in cells [29].
Aside from being highly accessible and informative in laboratory cultures, maximal
growth rate is often a prime determinant of cellular fitness [30, 31], and predicting
phenotypes that underlie fitness is key to both understanding and manipulating life
[32-34]. Despite this, current methods for predicting microbial fitness typically
focus on yields [e.g., predictions of biomass yield using GEnome-scale metabolic
Models (GEMs)] or notably require many empirical kinetic constants or substrate
uptake rates, which render these methods ineffective in cases where fitness derives
most directly from growth rate [34, 35]. In Chapter 3 we present a new method for
predicting cellular growth rate, termed SUMEX, which does not require any
empirical variables apart from a metabolic network (i.e., a GEM) and the growth
medium. SUMEX is calculated by maximizing the SUM of molar EXchange fluxes
(hence SUMEX) in a genome-scale metabolic model. SUMEX successfully predicts
the growth rate of microbes across species, environments, and genetic conditions,
outperforming traditional cellular objectives (most notably, the convention assuming
biomass maximization). The success of SUMEX suggests that the ability of a cell to
catabolize substrates and produce a strong proton gradient enables fast cell growth.
Easily applicable heuristics for predicting growth rate, such as what we demonstrate
with SUMEX, may contribute to numerous medical and biotechnological goals,
ranging from the engineering of faster-growing industrial strains and the modeling of
mixed ecological communities to the inhibition of cancer growth.
1.3.5. Using metabolic modeling for the analysis of Gut
bacterial communities
In Chapter 4, we use metabolic modeling in order to predict bacterial species
abundance and diet-specific adaptations in the mammalian gut microbiota, by
analyzing gut bacterial glycan metabolism. Glycans form the primary nutritional
source of microbes in the mammalian gut. Understanding the metabolism of
glycans by the microbiota is therefore a key target of microbiome research. In
Chapter 4 we present a novel computational pipeline for modeling Glycan
Degradation, providing a broad view of the usage of these compounds on genome
and metagenome scales. Our platform predicts, for the first time, the usage patterns
of thousands of glycans by all the sequenced individual gut bacteria deposited in the
Human Microbiome Project (HMP) database, giving a new metabolic view of the gut
community. Using our new platform we show that the ability of a bacterial species to
degrade polysaccharides is highly correlated with its abundance, suggesting a
potential selective advantage for primary glycan degraders. We further demonstrate
that differences in community composition carry functional importance, i.e., that the
microbiota of herbivores and carnivores have stronger affinities to plant- and animal-
derived glycans, respectively. We show that our platform can be used to train an
extremely accurate classifier to predict the diet type (plant vs. animal) of a host based
on its glycan degradation profile, going markedly beyond a classification based on
enzymatic content alone. Applying our classifier to microbiota samples from US
residents, we show they mostly favor animal-derived glycans, while those of
individuals from Malawi and Venezuela shift towards plant-derived glycans in
adulthood. Our platform opens the door for a systematic prediction of microbiota-
specific dietary patterns.
Chapter 2
2. Competitive and cooperative
metabolic interactions in bacterial
communities
Based on an article with the same title by the authors:
Shiri Freilich, Raphy Zarecki, Omer Eilam, Ella Shtifman Segal, Christopher S.
Henry, Martin Kupiec, Uri Gophna, Roded Sharan & Eytan Ruppin
The first two authors contributed equally to this article.
Published in: Nature Communications, Dec 13, 2011 [36]
2.1. Introduction
A fundamental question in ecology is how different species can co-exist in nature.
Darwin's famous documentation of the nutritional divergence within a family of
finches resulted in the principle of competitive exclusion, which asserts that co-
existence is made possible through divergence and the subsequent reduction in
resource overlap[37, 38]. However, the observed phenotypic similarity between co-
occurring species has led to renewed questioning about the role of competitive
interactions in shaping communities; it has been suggested that the carrying capacity
of many environments is sufficient to allow the co-existence of closely related
species[39]. In addition to competition, growing evidence supports the prevalence of
cooperative interactions between organisms [40-43]. Yet, despite their prevalence,
the consequences of cooperative interactions for species diversity are still poorly
understood[44].
The analysis of species' co-occurrence data has long been used by ecologists to
discern the forces that dictate community structure[45, 46]. Yet, to date empirical
records of species' distribution have been highly fragmented, and a systematic
approach for estimating the corresponding levels of inter-species competitive and
cooperative interactions has been lacking. Within bacterial communities, competitive
(where two species consume shared resources) and cooperative (where the
metabolites produced by one species are consumed by another and, potentially, vice
versa) interactions are to a large extent derived from metabolism. Stoichiometric-
based metabolic models were recently shown to provide accurate predictions for the
patterns of metabolic interactions in bacterial two-species systems[27, 28, 47],
making these approaches a useful tool for exploring ecological concepts[43]. Beyond
focusing on a few well-defined case studies, stoichiometric Constraint-Based
Modeling (CBM) was already used for the systematic design of cooperation-
supporting media for all pair-wise combinations formed between seven
microorganisms represented by genome-scale metabolic models[47]. Yet, the
relative scarcity of such manually curated models has precluded
larger scale explorations. Moreover, the ecological significance of these interactions
has not been examined on a large scale. The publication of an automatic high-throughput
reconstruction pipeline has yielded more than 100 genome-scale
metabolic bacterial models spanning 13 bacterial divisions [25]. This development,
complemented by the accumulation of meta-genomics data from environmental
surveys, has provided a golden opportunity to perform systematic inter-species in
silico studies on an ecological scale.
Previous large-scale computational studies of microbial ecology and metabolism
relied solely on network representations of enzymes and reactions [48-50] (rather
than representation by an operative stoichiometric-based metabolic model) and such
studies lacked the tools for systematically describing pair-wise interactions in a
media-dependent manner. Presented in this chapter are the results of the first
integrative computational and ecological study that aims to provide a global-scale
description of bacterial metabolic interactions between geographically co-occurring,
mutually exclusive, and randomly-distributed species pairs. To this end, a conceptual
computational framework for characterizing the levels of metabolic competitive and
cooperative interactions between pairs of species was defined. Subsequently, the
distribution patterns of species, as derived from environmental samples, were
explored in order to relate their ecological co-occurrences to the types
of interactions inferred.
The ability to predict the type of relationships between community members can
help in bio-remediation tasks [24]. It can also help in fighting pathogens, either by
adding their non-pathogenic competitors or by finding a medium that favors their
non-pathogenic neighbours. This can be used in human medicine and in plant pesticides [51].
2.2. Results
2.2.1. In silico and in vivo description of co-growth patterns
Starting from a collection of 118 genome-scale metabolic models of bacteria that
were automatically generated and published [25], CBM was used systematically to
compute the biomass production rate for each of the individual species and their
corresponding 6903 pair-wise combinations (Methods). Analogously to
the computation of genetic interactions[52], it was assumed that there are three types
of potential interactions (Chapter2::Figure 1): negative, where two species consume
shared resources (competition); positive, where the metabolites produced by one
species are consumed by another, and potentially vice versa (representing mutualism,
commensalism or parasitism -- that is positive/positive, positive/neutral or
positive/negative interactions), hence producing a synergistic co-growth benefit; and
neutral, where co-growth has no net effect (Chapter2::Figure 1). As in genetic
interactions, the extent and type of interactions occurring between two species can be
described by comparing the total biomass production rate in the pairwise system to
the sum of the corresponding individual rates recorded in their individual growth.
Chapter2::Figure 1: Metabolic modeling in a multi-species system.
The scheme on the left is an illustrative example of potential interaction types occurring between
species in a pairwise system. No interaction is expected when species A and species B use non-
overlapping resources of the corresponding environment; Negative interaction/Competition: decrease
in the overall growth is expected when species A and species B share the same resources; Positive
interaction/Cooperation: increase in the overall growth is expected when the products of one species
are the substrates of the second species. On the right: co-growth experiments of Listeria innocua and
Agrobacterium tumefaciens in three interaction-specific (no interaction, competition and cooperation),
computationally pre-designed media. Species were grown in a defined medium modified for Listeria
growth (Methods & supplementary section A.1.3.5). Computational predictions for the experiments:
no interactions: SIG(83.1)≈CG(87.5); competition: SIG(109.2)>CG(97.3); cooperation:
SIG(0.0)<CG(19.0). SIG: Sum of Individual Growth; CG: Co-Growth. SIG(0.0) means no growth in the
given media. OD represents optical density, which is used as a measure of growth rate.
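The comparison between the sum of individual growth (SIG) and co-growth (CG) can be sketched as a small classification function, applied here to the predicted values quoted in the caption of Chapter2::Figure 1. The relative tolerance `rel_tol` that separates a neutral outcome from a shift is a hypothetical threshold chosen for illustration; the criterion actually used is described in the Methods.

```python
from math import comb

def classify_interaction(sig, cg, rel_tol=0.1):
    """Compare Sum of Individual Growth (sig) with Co-Growth (cg).

    rel_tol is a hypothetical neutrality threshold, not a value taken
    from the thesis."""
    scale = max(sig, cg)
    if scale == 0:
        return "neutral"  # neither setting supports growth
    diff = (cg - sig) / scale
    if diff > rel_tol:
        return "cooperation"
    if diff < -rel_tol:
        return "competition"
    return "neutral"

# Predicted (SIG, CG) values from the caption of Chapter2::Figure 1.
original_medium = classify_interaction(83.1, 87.5)
negative_medium = classify_interaction(109.2, 97.3)
positive_medium = classify_interaction(0.0, 19.0)
print(original_medium, negative_medium, positive_medium)

# Number of unordered pairs that can be formed from 118 species.
n_pairs = comb(118, 2)
print(n_pairs)
```

With the assumed 10% tolerance, the three media are classified as neutral, competition and cooperation respectively, in line with the labels in the caption.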
Naturally, interactions between a pair of species are expected to vary significantly
depending on the given growth environment. Consequently, for a given pair of
species different media types were designed, and as expected different types of
interactions were revealed. The predictive power of our simulation in inducing shifts
from neutral to negative and positive interactions was experimentally tested for 10
bacterial pairs, representing all possible pair-wise combinations between five species
capable of growing in the same defined media (IMM, Methods). For all
combinations, we simulated co-growth in the original defined media as well as in a
range of modified media formed by the addition and subtraction of specific nutrient
combinations, leading to the selection of two media compositions that induce
maximal negative and positive shifts, respectively (Methods). Laboratory co-growth
experiments were then conducted for all species pairs across the three designed
media (original, negative and positive) where positive and negative shifts were
correctly predicted in 65% of the experiments (precision 0.75, recall 0.8,
Supplementary Note 1). The observed and predicted interactions between Listeria
innocua and Agrobacterium tumefaciens, demonstrating a close to neutral interaction
in the original defined media, are shown in Chapter2::Figure 1. As evident, shifts
from neutral to negative and positive interactions between the two species are
successfully induced in the designed media, testifying to the model's predictive
ability. Notably, one should bear in mind that our experiments only cover a small
subset of all potential pairwise interactions. Yet, our experiments, together with a
growing number of studies, testify to the ability of metabolic-driven
computational approaches to describe the metabolic interaction between two species
[27, 28, 47, 53].
2.2.2. Systematic predictions of the competitive potential
Since interactions are condition specific, and because nutrient concentrations in
specific natural niches are mostly unknown and subject to significant variations, we
subsequently aimed to design simulated media that, for each given pair of
species, can efficiently uncover their potential capacity to compete or cooperate. To
design a medium that maximizes potential competitive interactions, we adopted the
traditional view of competition as a situation with a high level of resource requirement
overlap. This approach precludes resource sharing [54-56] and yielded
6903 pair-specific in-silico minimal optimal media, termed Competition-inducing
Media (COMPM, Methods). For each pair, COMPM includes the minimal set of
metabolites, provided at their minimal quantity, yet still allowing each species to
individually grow at its maximal possible growth yield, leading to the full
consumption of external resources (Methods). Thus, when resources overlap, this
medium will uncover potential competition.
For each pair of species placed in its respective competition-inducing medium, the
win-lose relationship was predicted by comparing the individual
biomass production (growth yield) rates within the pair-wise system. Winners (faster
species in the pair-wise system) tend to be species with higher potential biomass
production rates (the latter determined in a single-species system, Figure 2A), in
accordance with the notion that faster species out-grow their competitors[57].
Looking at the identity of the frequent winners in-silico, we observed a
clear correspondence between computed predictions and ecological data,
where winners include fast growing, ecologically versatile species such as
Escherichia coli, Salmonella typhimurium, Vibrio cholerae and Pseudomonas
aeruginosa (in accordance with earlier observations[50]). Similarly, in-silico losers
include slow growing specialists such as Mycoplasma genitalium and Buchnera
aphidicola. The identification of winners as species with higher individual growth
rates is also maintained when considering the experimentally recorded doubling
times (Figure 2B). In correspondence with the ecological observation that the faster
growing species are the ones exploiting the shared resources[57], Figure 2C shows
that the in-silico faster species tend to grow closer to their full capacity than the slow
growers.
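The win-lose call and the 'fraction of full capacity' measure described above can be sketched as follows. The winner is taken to be the species with the higher predicted growth in the pairwise system, as stated in the text; the tolerance used to label near-ties as inconclusive is a hypothetical choice made here, since the exact criterion is given in the Methods. Function names and input values are illustrative only.

```python
def competition_outcome(growth_a, growth_b, rel_tol=0.05):
    """Winner = species with the higher predicted growth in the pairwise system.

    rel_tol is a hypothetical threshold below which the outcome is
    reported as inconclusive."""
    top = max(growth_a, growth_b)
    if top == 0 or abs(growth_a - growth_b) / top <= rel_tol:
        return "inconclusive"
    return "A wins" if growth_a > growth_b else "B wins"

def capacity_ratio(growth_in_pair, growth_alone):
    """Growth of a species in the pairwise system relative to its growth
    when grown alone (the quantity plotted in Figure 2C)."""
    return growth_in_pair / growth_alone

# Illustrative numbers only, not results from the thesis.
clear_case = competition_outcome(8.0, 2.0)
near_tie = competition_outcome(5.0, 5.1)
winner_capacity = capacity_ratio(8.0, 10.0)
print(clear_case, near_tie, winner_capacity)
```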
Chapter2::Figure 2: Metabolic modeling of pairwise growth on a competition-inducing media (COMPM).
The matrices describe the outcome of competition between all species pairs. Rows and columns represent species (sorted
differently in each matrix) where each cell shows the win/lose outcome of the column species following co-growth with the row
species. (A, B) Green, red and blue represent win/lose/inconclusive outcome predictions, respectively. Briefly, the winner is
defined as the species with the higher predicted growth in a two species system (see Methods). (A) Species in rows and
columns are sorted according to their computed biomass production rates. Winner-loser relationships were determined for more
than 90% of the pair combinations. (B) Species in rows and columns are sorted according to their experimentally measured
doubling times (retrieved as described in Supplementary Note 2). The predicted win-lose division in B is found to be
significantly more distinct than in permuted matrices (P value 0.002, Supplementary Note 2). (C) The ratio of the biomass
production rate of each species in the pairwise system relative to its biomass production rate when grown alone (Methods).
Cells are sorted as in A. The full list of species (including the computed and measured doubling times) is provided in
Supplementary Table 1. Growth rates (computed) of each species across all pairwise combinations are provided in
Supplementary Table 2.
Going beyond win-lose predictions, a Potential Competition Score (PCMS, Methods)
was designed to quantify the level of competition predicted among the species in the
tested collection, by comparing their individual and combined biomass production
rates across simulated Competition-inducing Media (COMPM). A PCMS value of 0
represents no competition and PCMS of 1 indicates maximal competition, while
negative PCMS values denote cooperation and synergistic co-growth. 98% of the
PCMS values are positive (competitive) with a mean PCMS of 0.77 (Figure 3A). As
expected, it was observed that PCMS values strongly correlate with the degree of in-
silico resource overlap, the latter determined by the level of intersection between the
minimal media sufficient for maximal growth rate of the two species (Figure 3B and
Methods).
2.2.3. Systematic predictions of the cooperative potential
Due to the rich nature of the competition-inducing media, which is likely to conceal
inter-species metabolite transfer and cooperation[28], only very few positive
interactions (negative PCMS values) are revealed (Figure 3A). For example, the
documented cooperative interaction between the two halophilic species Salinibacter
ruber and Haloquadratum walsbyi [58] is only revealed in a simulation setting when
reducing their in-silico growth medium, inducing the reported dependence of H.
walsbyi on S. ruber for the supply of dihydroxyacetone (DHA) (Supplementary Note
3). Thus, for each pair of species an in-silico minimal medium was designed to
support a predetermined small level of growth of both species together, termed a
Cooperation-inducing Medium (COOPM) (Methods), taking a similar approach as
in[47]. Potential Cooperation Scores (PCPS) are then computed according to the
ratio between the sum of individual growth rates and the co-growth rate, where
positive values indicate cooperation and negative values indicate competition
(Methods). Whereas in rich in-silico media almost none of the pairs exhibit positive
interactions (negative PCMS), about 35% of the pairs show a cooperative potential
(positive PCPS) in the in-silico cooperation-inducing media (with scores > 0.05,
Figure 3C).
Unlike the monotonic association between similarity in media requirements and the
competitive potential described above (Figure 3B), resource overlap and cooperative
potential demonstrate an inverted-U relationship (Figure 3B), where a moderate
level of similarity in the required resources maximizes the potential for collaboration,
and the cooperative potential declines at higher levels of resource overlap. This is
likely to stem from the increasing competition for available resources, combined with
the scarcity of differing resources that can be shared. Typically, cooperation-inducing
media lack amino acids (Supplementary Note 4), enhancing the need to exchange
these metabolites, which were suggested to be transferred between species in
mutualistic interactions [28]. A moderate association between competition and
cooperation was observed for intermediate levels of competition (Figure
3D).
Chapter2::Figure 3: Distribution of competition and cooperation values.
(A) The distribution of predicted potential competition scores (PCMS) across the 6903 non-redundant species pairs grown in
competition-inducing environments (COMPM). (B) The relation between resource overlap and competition (white) and
cooperation (black) scores. Resource overlap and competition: Spearman rank correlation 0.4, P value < 2.2e-16. Resource
overlap and cooperation: correlation coefficient for a second order polynomial regression 0.3, P value < 2.2e-16. IS (the
extreme right bars) indicates Intra-Species interaction (competition and cooperation values recorded when a species is paired
with itself). (C) The distribution of predicted potential cooperation scores (PCPS) across the 6903 non-redundant species pairs
grown in cooperation-inducing media (COOPM). (D) The relation between cooperation and competition levels: The Spearman
correlation between competition and cooperation is significant but very low (0.04, P value 8e-4). When limiting to intermediate
competition values of 0.1<PCMS<0.8 this correlation is more substantial but still quite moderate (0.2, P value < 2.2e-16). The
computed PCMS, PCPS and resource overlap are provided at Supplementary Table 3, 4 and 5, respectively.
Interestingly, an inverted-U relationship between resource overlap and cooperation
has been reported in economic models describing the likelihood of forming an
inter-firm alliance versus the corresponding degree of technological overlap. As suggested
here for bacterial communities, such economic models suggest that although some
degree of technological overlap is necessary to support a successful alliance, at some
point such overlap yields diminishing and perhaps even negative returns[59].
Notably, a cooperative potential denotes an overall gain at the pair-wise, system
level, though at the species level we can observe a benefit either for both species
(mutualism) or to only one of them. Examining the gain of each species in a pair-
wise system, it was observed that the large majority of in-silico cooperative
interactions are unidirectional, i.e., there is a single species that benefits from the
interaction, where the other species is not affected (commensalism, Methods).
Similar results were obtained when using alternative approaches for modeling
cooperation (Supplementary Note 5). This is in agreement with a recent
investigation of computationally predicted pair-wise interactions between seven
microbial species across a wide range of environments[47], and with numerous
experimental observations of syntrophic interactions[40, 58, 60, 61]. As displayed
in Supplementary Note 6, one can observe a high tendency of Clostridia species to
be involved in cooperative interactions as the giving side. Indeed, Clostridia are
known to be involved in the fermentative digestion of cellulose and lignin leading to
the subsequent release of easily degradable carbohydrates to other community
members[62, 63].
2.2.4. Patterns of interactions across ecological samples
Since the benefits to the giver in predicted unidirectional cooperative interactions are
not obvious, their relevance for species' co-existence may be questioned. To directly
relate the computational predictions to patterns of species co-existence, 16S data
from environmental surveys across 2801 samples belonging to 59 different
ecological niches[64] was used. Two categories of ecologically-associated pairs
were defined: pair members that show a similar distribution pattern across the 59
ecological categories are termed niche-associated (648 pairs versus 2512 non niche-
associated pairs); some of the niche-associated pair-members further show a similar
distribution pattern across the 2801 individual samples composing the different
ecological categories (niches) and are termed co-occurring pairs (84 pairs, Methods).
Competition scores recorded for ecologically-associated, and in particular co-
occurring, species are significantly higher than those of non-associated pair members
(Figure 4A). This is in agreement with the dominant ecological perception of high
level of competition between neighboring species making use of the same
resources[39, 64]. We also observed a significantly higher rate of cooperative
give-take interactions between ecologically-associated species (at both niche and sample
level) than between non-associated species (Figure 4B). This observation is retained when
compared at different levels of competition and resource overlap (Supplementary
Figure 1), in line with existing ecological theory[44].
Chapter2::Figure 4: Predicted competitive and cooperative interactions across different ecological groups.
The level of predicted interaction potential was calculated across randomly distributed and ecologically associated species pairs.
Three categories of ecological associations were considered (Methods): association at the level of ecological niche; association
at the level of the sample (co-occurring pairs) and an antagonistic pattern of distribution at the sample level (mutually exclusive
pairs). (A) Competition scores. (B) Cooperation scores. The difference between ecologically associated and non-associated
groups for both competition and cooperation is highly significant (P value < 2.2 e-16, one sided Kolmogorov-Smirnov test). (C)
The mean cumulative number of loops across 1000 reconstructions of the networks versus the number of species in a network
of give-take interactions, for networks of ecologically associated versus networks of non-associated species (Methods). (D)
Parent similarity, calculated as the fraction of common givers. Bars in (A, B, D) represent standard deviations. The ecological
association between species pairs is provided in Supplementary Table 6.
Although the accumulation of some end-product metabolites can be toxic, the
advantage for the giver species remains obscure. To explore the role of cooperative
interactions at the level of the community, we constructed the inter-species network
of predicted directional (give-take) interactions (Methods); within this network
motifs of closed cooperative loops were identified, e.g., A gives to B; B gives to C; C
gives to A (Methods). The occurrence of these closed cooperative loops across
natural communities (the 2801 samples described above) was compared to their
occurrence across randomly generated communities preserving the original size and
rank of species' distribution. Remarkably, the frequency of loops predicted in natural
communities (194) is an order of magnitude higher than in randomly drawn samples
(maximum 95 in 1000 random data sets, mean 10, Supplementary Note 7). Thus,
cooperative interactions in nature are likely to be beneficial, forming cooperative
cycles. Furthermore, there is a rapid increase in the number of cooperative loops as
more species are added in, in particular for ecologically-associated species (Figure
4C). This may suggest an explanation for the observed rise in the population size
when the species' diversity increases[44, 65].
A closer examination of the cooperative loops found in natural communities sheds
light on how cooperation and competition are intricately intertwined: An illustrative
case is that of Pseudomonas putida and Nocardia farcinica, each forming an
analogous loop with Streptomyces coelicolor and Bacillus anthracis in two distinct
natural samples. As can be expected from their equivalent location inside the loop,
the literature suggests that P. putida exhibits a similar role to Nocardia species in the
degradation of oil contamination, where the synthetic introduction of P. putida
suppresses the enrichment of indigenous degraders such as Nocardia species[66]. To
systematically explore the consequences of analogous network-positioning for
species co-existence we defined a third group of ecologically associated pairs:
mutually exclusive species, referring to pairs of species whose level of co-existence
across samples is lower than expected by chance despite the fact they inhabit similar
niches (28 pairs, Methods). Figure 4 reveals some interesting trends: We observe that
mutually exclusive pairs exhibit high similarity in their network positioning
(competing for common givers, Figure 4D) as well as high levels of resource
competition (Figure 4A), providing systematic evidence for the association of
exclusion and competition[37]. Notably, co-occurrence and mutual-exclusion
relations may be interchangeable, and the choice between these contradictory fates is
determined by the carrying capacity of their environment[39]. Accordingly, similar
levels of competition are observed between mutually exclusive and co-occurring
pairs (Figures 4A, 4D). Strikingly, the highest level of cooperative interactions is
recorded between mutually exclusive pairs (Figure 4B, and Supplementary Note 6).
This may suggest that under true, natural conditions, cooperative potential,
describing the propensity of a species pair to be involved in a unidirectional give-
take interaction, might be obscured by competition even to the level of exclusion of
one of the pair members. Such is the case with Pseudomonas putida and
Acinetobacter sp., two highly competing species which were also predicted to have
cooperative potential; when these species were grown experimentally in a deprived
environment with benzyl alcohol as the sole carbon source, the benzoate excreted by
Acinetobacter sp. was used by P. putida, which subsequently suppressed the growth
of Acinetobacter [67].
2.3. Discussion
To date it has been difficult to predict which bacteria can stably co-exist, let alone
cooperate metabolically, making the artificial design of beneficial microbial
consortia extremely difficult. Here, we suggest a generic approach for the systematic
description of inter-species interactions, making use of recently available data. Our
approach is obviously not without limitations. First, it is solely aimed at the
metabolic dimension while putting aside regulation as well as the numerous
strategies that microorganisms have evolved to augment the acquisition of resources.
Antimicrobial production, motility and predation can tip the competitive balance,
resulting in outcomes that significantly differ from those predicted by simulations
restricted to passive nutrient consumption[43]. Moreover, several mechanisms for
nutrient sequestration function directly to actively restrict or remove a nutrient from
one organism and supply it to another[68]. Second, the analysis lacks information on
the true metabolic composition of the environments considered and hence focuses on
predicting the overall potential inter-species interactions, rather than providing a
direct account of their actual in-vivo communications in one specific environment.
Finally, although the automatic reconstruction procedure results in a significant
increase in the number of genome scale metabolic models, and although such models
have been proven useful in the prediction of a variety of phenotypes, they are
typically less accurate than manually curated models[25]. Yet, despite these
significant limitations, our generic approach succeeds in delineating clear differences
in the interaction patterns of ecologically associated and randomly distributed species
pairs, and in relating them to fundamental ecological principles in a systematic
fashion. With the increasing efforts to provide
an a-biotic description of different environments, together with the expected rapid
rise in the number of metabolic models as well as the improvement in their quality,
a community-level metabolic modeling framework such
as the one laid down here provides a computational basis for many exciting future
applications. These include the artificial design of 'expert' communities for
bioremediation, where currently the selection of community species is done by
intelligent guesswork. Similarly our work may be applied to the rational design of
probiotic administration, as well as to the identification of species that may
metabolically out-compete pathogenic species. The ability to design and test novel
interactions, and to study existing ones, means that microbial experiments can be
used to complement and extend classical plant and animal ecology, in which many of
the principles of biological interactions were first described[61].
2.4. Methods
2.4.1. Metabolic simulations
118 operative metabolic models were retrieved from The Seed's metabolic models
section (http://seed-viewer.theseed.org/seedviewer.cgi?page=ModelViewer)[25]. The
models are automatically constructed by a pipeline that starts with a complete
genome sequence as an input and integrates numerous technologies such as genome
annotation, reaction network annotation and assembly, determination of reaction
reversibility, and model optimization to fit experimental data. The list of species and
their corresponding identifiers in the SEED database is provided in Supplementary
Table 1. Briefly, in these models, a stoichiometric matrix (S) is used to encode the
information about the topology and mass balance in a metabolic network, including
the complete set of enzymatic and transport reactions in the system and its biomass
reaction. Our approach for generating multi-species models follows the definition
employed by[27]. We converted the model of each organism into a compartment in a
multi-species system. Applying the multi-species system analysis to all possible
pairwise combinations we examined 6903 unique pairs whose growth can be
simulated under a range of environments. For a single species model A, the
competition-inducing medium (COMPM) is defined as the ranges of fluxes of the
exchange reactions that support its maximal biomass rate (MBR), when all
exchange metabolites are provided at the minimal required amount. For the
multi-species system of A and B, COMPM_AB allows A and B to reach their MBR at individual
growth. However, at co-growth, any resource overlap will prevent species A and B
from simultaneously reaching their MBR, and reveal potential sources of
competition. A cooperation-inducing medium (COOPM) for a multi-species system
is defined as a set of metabolites that allows the system to obtain a small positive
growth rate (above a certain predetermined threshold, which may yet be far from
optimal), and such that the removal of any metabolite from the set would force the
system to have no such solution. A feasible solution in this context is defined as one
achieving at least 10% of the joint MBR obtained when grown on a rich medium
(COMPM). The process of model selection, calculation of maximal biomass
production rate (MBR), construction of pair-wise systems and computation of pair-
specific environments (COMPM and COOPM) are fully described at Supplementary
Note 8. To relate the computed environments to real ecological conditions we
verified that species inhabiting similar environments tend to have similar metabolic
profiles, as previously demonstrated in[64]. As documented in many laboratory
experiments, typical limiting factors in COMPM environments include oxygen,
glucose and nitrogen sources (Supplementary Note 4). Finally, computational
simulations providing predictions for the effect of removal of chosen metabolites on
species growth were experimentally tested, supporting the ability of the models to
identify growth limiting factors (Supplementary Note 9).
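The defining property of a COOPM (growth is supported, and removing any single metabolite abolishes it) can be illustrated with a greedy reduction. This is only a minimal sketch, not necessarily the procedure of Supplementary Note 8: the function name `minimal_medium`, the oracle `can_grow` and the toy metabolite names are hypothetical, and the oracle stands in for a feasibility check of the pair-wise model at the chosen growth threshold. The sketch assumes that adding metabolites never prevents growth.

```python
from typing import Callable, FrozenSet, Iterable, Set


def minimal_medium(
    metabolites: Iterable[str],
    can_grow: Callable[[FrozenSet[str]], bool],
) -> Set[str]:
    """Greedily shrink a medium until no single metabolite can be removed."""
    medium = set(metabolites)
    if not can_grow(frozenset(medium)):
        raise ValueError("the starting medium does not support growth")
    for met in sorted(medium):  # fixed order keeps the result reproducible
        trial = medium - {met}
        if can_grow(frozenset(trial)):
            medium = trial  # met is dispensable, so drop it for good
    return medium


# Hypothetical toy oracle: growth needs glucose, ammonia and either one of
# two interchangeable sulfur sources.
def toy_can_grow(medium: FrozenSet[str]) -> bool:
    return {"glucose", "ammonia"} <= medium and bool({"sulfate", "cysteine"} & medium)


rich = ["glucose", "ammonia", "sulfate", "cysteine", "thiamine"]
coopm = minimal_medium(rich, toy_can_grow)
print(sorted(coopm))
```

Different removal orders can yield different minimal sets (here either sulfur source could be kept), so a minimal medium of this kind is generally not unique.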
2.4.2. Experimental and computational co-growth analysis
Co-growth experiments were conducted between all co-growth combinations formed
between five species, all non-pathogenic and capable of growing in IMM. The
species and their SEED models are the following: Listeria innocua Clip11262
(Core272626_1), Agrobacterium tumefaciens str. C58 (Core176299_3), Escherichia
coli K12 (Core83333_1), Pseudomonas aeruginosa PAO1 (Core208964_1), Bacillus
subtilis str. 168 (Opt224308_1). Experimental procedure and media selection are fully
described at Supplementary Note 1.
2.4.3. Determining win-lose and give-take relationships
In a multi-species system the CBM solver aims to maximize the total growth
potential of all species, and provides a range of potential solutions for the
contribution of each compartment to the total growth. We define:
Equation 1:
$$
V_{BM,AB} = \max_{v} \sum_{m \in \{A,B\}} V_{BM,m}
$$

$$
\text{subject to:} \quad S \cdot V = 0, \qquad v_i \in V, \qquad v_{i,\min} \le v_i \le v_{i,\max}
$$
where VBM,m is the maximal biomass production rate in a system m, corresponding to
species A and B.
We define A as a winner when the lowest value of its predicted maximal growth is
higher than the highest value predicted for species B.
Equation 2:
$$
A_{\mathrm{winner}} \iff V_{\min,BM,COMPM,\mathrm{compartment\_A}} > V_{\max,BM,COMPM,\mathrm{compartment\_B}}
$$
where V_{min,BM,COMPM,compartment_A} and V_{max,BM,COMPM,compartment_B} (the lowest and
highest biomass production rates of organisms A and B, respectively, in the multi-species
system) are calculated by running FVA for the multi-species system while fixing V_{BM,AB} to
its maximal value on the given medium.
To determine give-take relationships in a multi species system of species A and B we
look at the individual benefit of each species/compartment when the species are
grown together. We define A as a "Taker" if its maximal growth in the multi-species
system is higher than its individual maximal growth in a minimal medium. In this
case we call B a "Giver".
Equation 3:
$$
A_{\mathrm{taker}},\ B_{\mathrm{giver}} \iff V_{\max,BM,COOPM,\mathrm{compartment\_A}} > V_{\max,BM,COOPM,A}
$$
It is of course possible within the same system that species A and B are both "givers"
and "takers" (a symmetrical interaction). Overall, we observe that 94% of the
interactions are unidirectional (commensalism) (Supplementary Note 6), i.e., the
interaction does not affect the growth of the giver (neutral for the giver).
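The win-lose and give-take rules of Equations 2 and 3 reduce to simple comparisons once the biomass flux ranges are known. The following is a minimal sketch under stated assumptions: the names `FluxRange`, `winner` and `give_take`, the returned labels and the numerical tolerance `tol` are illustrative choices, not part of the original analysis, and the flux values are made up.

```python
from typing import NamedTuple


class FluxRange(NamedTuple):
    """Lowest and highest biomass flux of one compartment (e.g. from FVA)."""
    low: float
    high: float


def winner(range_a: FluxRange, range_b: FluxRange) -> str:
    """Equation 2: a species wins if its lowest rate exceeds the other's highest."""
    if range_a.low > range_b.high:
        return "A"
    if range_b.low > range_a.high:
        return "B"
    return "none"


def give_take(co_a: float, solo_a: float, co_b: float, solo_b: float,
              tol: float = 1e-9) -> str:
    """Equation 3: a species is a taker if it grows faster in co-growth than alone.

    co_x   -- maximal growth of x inside the two-species system (COOPM)
    solo_x -- maximal growth of x alone on the same medium
    tol    -- assumed numerical tolerance, not taken from the thesis
    """
    a_takes = co_a > solo_a + tol
    b_takes = co_b > solo_b + tol
    if a_takes and b_takes:
        return "symmetrical"
    if a_takes:
        return "A takes, B gives"
    if b_takes:
        return "B takes, A gives"
    return "none"


print(winner(FluxRange(0.6, 0.8), FluxRange(0.1, 0.3)))
print(give_take(co_a=0.5, solo_a=0.2, co_b=0.3, solo_b=0.3))
```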
The directional network of give-take interactions is provided at Supplementary Table
11. Within this network we looked for closed cooperative loops (that is, A->B, B->C,
C->A) up to a size of 4 species. We compared the number of loops found with the
number obtained when taking random pairs. The number of random pairs matched
exactly the number of cooperating pairs found by our method to predict cooperation.
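The loop search described above can be sketched as a depth-limited enumeration of directed cycles. This is an illustrative sketch, not the original implementation: the function name `closed_loops`, the species labels and the choice to report only loops of at least three species (`min_size=3`, matching the A->B, B->C, C->A example) are assumptions.

```python
from typing import Dict, Iterable, List, Set, Tuple


def closed_loops(
    edges: Iterable[Tuple[str, str]],
    max_size: int = 4,
    min_size: int = 3,
) -> List[Tuple[str, ...]]:
    """List directed give->take cycles with min_size..max_size species.

    Each cycle is reported once, starting from its alphabetically
    smallest member.
    """
    graph: Dict[str, Set[str]] = {}
    for giver, taker in edges:
        graph.setdefault(giver, set()).add(taker)
        graph.setdefault(taker, set())
    loops: List[Tuple[str, ...]] = []

    def extend(path: List[str]) -> None:
        for nxt in sorted(graph[path[-1]]):
            if nxt == path[0]:
                if len(path) >= min_size:
                    loops.append(tuple(path))  # the path closes into a loop
            elif nxt > path[0] and nxt not in path and len(path) < max_size:
                extend(path + [nxt])

    for start in sorted(graph):
        extend([start])
    return loops


# Hypothetical give->take network over four species.
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D"), ("D", "A"), ("B", "A")]
print(closed_loops(edges))
```

Running the same function on networks built from randomly drawn pairs gives the null counts against which the observed number of loops can be compared.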
2.4.4. Determining the level of competition and cooperation
Potential Competition Scores (PCMS) are calculated as:
Equation 4:
$$
PCMS_{A,B} = 1 - \frac{V_{BM,COMPM,AB} - \max(V_{BM,COMPM,A},\, V_{BM,COMPM,B})}{V_{BM,COMPM,A} + V_{BM,COMPM,B} - \max(V_{BM,COMPM,A},\, V_{BM,COMPM,B})}
$$
Potential Cooperation Scores (PCPS) are calculated as:
Equation 5:
$$
PCPS_{A,B} = 1 - \frac{V_{BM,COOPM,A} + V_{BM,COOPM,B}}{V_{BM,COOPM,AB}}
$$
where VBM,x,y is the flux through the biomass reaction of species y in medium x.
Computed PCMS and PCPS values are provided at Supplementary Table 3 and
Supplementary Table 4, respectively.
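Both scores can be evaluated directly from three biomass fluxes. The sketch below assumes the formulas read PCMS = 1 - (V_AB - max(V_A, V_B)) / (V_A + V_B - max(V_A, V_B)) and PCPS = 1 - (V_A + V_B) / V_AB; the function names and the flux values are illustrative, not data from this work.

```python
def pcms(v_ab: float, v_a: float, v_b: float) -> float:
    """Potential competition score from biomass fluxes on COMPM.

    v_ab -- joint biomass flux of the pair in co-growth
    v_a, v_b -- biomass flux of each species grown alone
    Undefined (division by zero) if the slower species does not grow alone.
    """
    best_single = max(v_a, v_b)
    return 1.0 - (v_ab - best_single) / (v_a + v_b - best_single)


def pcps(v_ab: float, v_a: float, v_b: float) -> float:
    """Potential cooperation score from biomass fluxes on COOPM."""
    return 1.0 - (v_a + v_b) / v_ab


# No resource overlap: co-growth equals the sum of individual rates.
print(pcms(v_ab=1.5, v_a=1.0, v_b=0.5))
# Full competition: co-growth is no better than the fitter species alone.
print(pcms(v_ab=1.0, v_a=1.0, v_b=0.5))
# Cooperation: the pair grows twice as fast as the sum of the individuals.
print(pcps(v_ab=0.4, v_a=0.1, v_b=0.1))
```

Under this reading PCMS runs from 0 (no competition) to 1 (full competition) and becomes negative when co-growth exceeds the sum of individual rates, while a positive PCPS indicates cooperation.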
2.4.5. Calculating the resource overlap within a species pair
The resource overlap (RO) between a pair of species is calculated as the ratio
between the intersection and union sizes (Jaccard index) of the set of uptake
reactions included in their individual competition-inducing media (COMPM).
Equation 6:
$$
RO_{A,B} = \frac{|COMPM_A \cap COMPM_B|}{|COMPM_A \cup COMPM_B|}
$$
Computed RO values are provided at Supplementary Table 5.
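As a small illustration of the Jaccard index used for RO, the sketch below treats each COMPM as a set of uptake reaction identifiers. The function name `resource_overlap`, the convention of returning 0 for two empty sets and the example identifiers are assumptions made for the illustration.

```python
from typing import Iterable


def resource_overlap(compm_a: Iterable[str], compm_b: Iterable[str]) -> float:
    """Jaccard index of two sets of uptake reactions."""
    a, b = set(compm_a), set(compm_b)
    union = a | b
    if not union:
        return 0.0  # assumed convention for two empty media
    return len(a & b) / len(union)


# Hypothetical uptake sets: 2 shared reactions out of 5 distinct ones.
species_a = {"EX_glucose", "EX_oxygen", "EX_ammonia", "EX_phosphate"}
species_b = {"EX_glucose", "EX_oxygen", "EX_sulfate"}
print(resource_overlap(species_a, species_b))
```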
2.4.6. Collection of ecological distribution data
Data of Operational Taxonomic Units (OTUs) distribution in environmental samples
were retrieved from[64], using their 97% identity threshold for sequence clustering.
Each sequence is mapped to a "sampling event" defined as the unique concatenation
of the three annotation fields "author" + "title" + "isolation_source". For example, 51
sampling events are mapped to the publication ' Microbial ecology: human gut
microbes associated with obesity' [69]. These sampling events refer to 15 different
individuals, each under a different diet, at different time points (considering the
beginning of the experiment). Notably, in samples from host-associated
metagenomic studies, an "isolation source" is individual specific, as in[69].
"isolation_source" fields are further mapped to an Environment Ontology
(EnvO)[70] (e.g., "agricultural soil" and "Rocky Mountain alpine soil" are mapped to
the term "soil"). Overall 2662 samples are mapped to 183 EnvO categories, termed
"niches". Since many niches contained only a few samples, we strived to group
similar niches together to obtain a better signal. Using the hierarchical clustering in
EnvO we automatically mapped samples from lower order niches to higher order
ones. This process continued iteratively until reaching a barrier of predefined niches
with no biological significance. Using this approach, the ultimate set contained 59
ecological niches (Supplementary Table 12). Full length 16S rRNA sequences
corresponding to the 118 species with metabolic models that were used throughout
the analysis were manually retrieved from the Kyoto Encyclopedia of Genes and
Genomes (KEGG)[71]. BLAST[72] was then used to map the models to OTUs,
requiring 97% sequence identity and 95% alignment overlap, considering the length
of the query sequence. In case of multiple matches for a given OTU, we map it to the
model represented by the highest-ranking sequence, thus resulting in a one-to-many
mapping between models and OTUs. That is, each OTU can only be mapped to a
single model, but a model can be mapped to many OTUs. Overall, 80 models were
identified across the environmental samples. Supplementary Table 13 lists the
samples tested, their mapping to niches and the detected array of species.
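The mapping rule described above (filter hits by identity and overlap, then keep one model per OTU) can be sketched as follows. The `Hit` record and its field names are hypothetical, and ranking competing hits by bit score is an assumption, since the text only states that the highest-ranking sequence is kept.

```python
from typing import Dict, Iterable, NamedTuple


class Hit(NamedTuple):
    model: str       # query: 16S sequence of a species with a metabolic model
    otu: str         # subject: OTU identifier
    identity: float  # percent sequence identity
    coverage: float  # percent of the query length covered by the alignment
    bitscore: float  # assumed ranking criterion


def map_otus_to_models(
    hits: Iterable[Hit],
    min_identity: float = 97.0,
    min_coverage: float = 95.0,
) -> Dict[str, str]:
    """Assign each OTU to at most one model; a model may receive many OTUs."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.identity < min_identity or hit.coverage < min_coverage:
            continue  # fails the identity or overlap requirement
        current = best.get(hit.otu)
        if current is None or hit.bitscore > current.bitscore:
            best[hit.otu] = hit
    return {otu: hit.model for otu, hit in best.items()}


# Made-up hits for illustration only.
hits = [
    Hit("model_1", "otu_10", 99.1, 98.0, 2600.0),
    Hit("model_2", "otu_10", 97.5, 96.0, 2400.0),  # loses otu_10 to model_1
    Hit("model_1", "otu_11", 98.0, 99.0, 2500.0),
    Hit("model_3", "otu_12", 96.0, 99.0, 2300.0),  # below 97% identity
]
print(map_otus_to_models(hits))
```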
2.4.7. Determining ecological association between species
To identify ecologically associated species we examined the distribution pattern of
the 80 OTU-mapped models across the 59 niches[70]. The probability that two
species co-occur together at a rate higher than chance expectation was determined by
calculating a cumulative hypergeometric P-value. Significance cut-off was
determined by setting a False Discovery Rate threshold of 10%.
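The cumulative hypergeometric P-value for co-occurrence can be written out explicitly. The sketch below is one plausible reading of the test, with hypothetical argument names: out of `n_total` niches (or samples), species A is present in `n_a`, species B in `n_b`, and both in `n_both`; the P-value is the probability of at least `n_both` joint occurrences if B were placed at random. The False Discovery Rate step is not shown.

```python
from math import comb


def cooccurrence_pvalue(n_total: int, n_a: int, n_b: int, n_both: int) -> float:
    """Upper-tail hypergeometric probability P(X >= n_both)."""
    upper = min(n_a, n_b)
    tail = sum(
        comb(n_a, k) * comb(n_total - n_a, n_b - k)
        for k in range(n_both, upper + 1)
    )
    return tail / comb(n_total, n_b)


# Illustrative counts: 59 niches, A in 10 of them, B in 8, both in 6.
p = cooccurrence_pvalue(n_total=59, n_a=10, n_b=8, n_both=6)
print(p)
```

A lower-tail sum, P(X <= n_both), would give the analogous test for species that co-exist less often than expected by chance.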
Similarly, we looked at the pattern of species' distribution across the 2662 samples.
We identified 111 non-redundant combinations of co-occurring species and 39 non-
redundant combinations of mutually exclusive species from the pairs – that is species
for which the co-existence in samples is higher or lower than expected by chance,
respectively. As can be expected, the large majority of co-occurring pairs is observed
between ecologically-associated species (84/111). Less trivial is the identification of
a significant part of the mutually exclusive species pairs (28/39) as ecologically
associated species, implying that the pattern of distribution of species in nature is far
from being random. Only niche-associated pairs are further analyzed as co-occurring
(84) or mutually-exclusive (28) combinations. The ecological association types
determined for the species pairs tested are provided at Supplementary Table 6. The
distribution of resource overlap values between ecologically associated and non-
associated pairs is shown at Supplementary Figure 2 demonstrating that ecologically
associated pairs differ in the pattern of distribution of their resource overlap values,
supporting both the observed high level of competitive and cooperative interactions.
The identification of closed cooperative loops in real and random networks of
give-take interactions and in real and randomly drawn communities is fully described at
Supplementary Notes.
Chapter 3
3. Maximal Sum of metabolic exchange fluxes outperforms biomass yield as a predictor of growth rate of microorganisms
Based on an article with the same title by the authors:
Raphy Zarecki, Matthew A. Oberhardt, Keren Yizhak, Allon Wagner, Ella Shtifman
Segal, Shiri Freilich, Christopher S. Henry, Uri Gophna and Eytan Ruppin
In this article the first 2 authors had equal contribution.
The article was submitted to Genome Biology (currently under review), and was
presented in the conference on predicting cell metabolism and phenotypes in CA
USA (4-6/3/2013)
3.1. Introduction
In the data-rich landscape of present-day biology, large-scale network-based models
are being increasingly tapped to make sense of the deluge of available data. Towards
this end, genome-scale metabolic models (GEMs) have proven highly successful
[73]. Incorporating gene-protein-reaction associations and stoichiometric reaction
detail for the majority of known metabolic genes in an organism, GEMs have
achieved high accuracies in predicting essentiality of gene knockouts (~90%),
growth phenotypes on a variety of substrates (~90%) [34], growth yields, and
metabolic fluxes [74]. These predictions typically rely on an assumption that single-
celled organisms are optimized to maximize yield (for example: dry weight of
biomass per unit of glucose consumed), following deep-rooted theories about
evolutionary tuning towards optimal fitness [32], but as has been shown previously,
maximization of molar yield is by no means a universal principle [75].
Metabolic phenotypes in GEMs are typically computed by a linear optimization
method termed Flux Balance Analysis (FBA), in which a biomass objective is
optimized while various network-defined constraints are upheld. Non-biomass
objectives have also been tried, with varying powers of prediction [11, 12, 76, 77],
but these objective functions are common in that they link metabolic models to
growth yield or to a global flux distribution, rather than predicting growth rate.
Growth yield (units of [g biomass produced]/[g substrate consumed]) is different
from growth rate (units of 1/[hour]), although they are related by the substrate uptake
rates of an organism growing at steady state (for growth on a single carbon source,
for example, Growth rate = Substrate uptake rate * Yield). Prediction of yield using
GEMs applies most rigorously to highly defined conditions such as in a chemostat in
which one nutrient is limiting, and it is unclear how broadly applicable the
'maximization of yield' principle actually is [75]. In many conditions (including
standard laboratory batch growth, growth of cancer cells displaying the Warburg
effect, and competition of organisms for certain environmental niches), cells do not
necessarily maximize their yield, yet their growth rate cannot be predicted without
empirical data (e.g., substrate uptake rates). There is currently no framework for
predicting cellular growth rates akin to the GEM-based methods available for
predicting growth yields, which does not require extensive additional kinetic
parameters. It would therefore be of significant value if a predictor of growth rate
could be determined using genome-scale properties of GEMs that do not necessitate
the arduous measurement of substrate uptake rates. In a large number of conditions,
especially in competitive niches, growth rate is a better measure for fitness than
yield, so the ability to predict growth rates could significantly increase the utility of
GEMs.
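The relation between growth rate and yield quoted in this section can be made concrete with a short worked example. The symbols (mu for growth rate, q_S for specific substrate uptake rate, Y_{X/S} for yield) and the numbers are illustrative assumptions, not values from this work.

```latex
\mu \;[\mathrm{h^{-1}}]
  = q_S \left[\frac{\mathrm{g\ substrate}}{\mathrm{g\ biomass}\cdot\mathrm{h}}\right]
    \times
    Y_{X/S} \left[\frac{\mathrm{g\ biomass}}{\mathrm{g\ substrate}}\right],
\qquad
\text{e.g. } 1.8 \times 0.5 = 0.9\ \mathrm{h^{-1}}
```

Two organisms with the same yield of 0.5 but uptake rates of 1.8 and 0.6 would thus grow at 0.9 and 0.3 per hour, which is why a yield prediction alone does not determine the growth rate.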
3.2. Results and Discussion
In this study we explore novel large-scale methods to predict growth rates from
GEMs grown on rich or defined media, and in some cases with gene knockouts. We
focus on environments in which cells are expected to be optimizing their growth rate,
such as maximal listed growth rates for species in rich media, or careful growth rate
measurements of isogenic cultures in early exponential phase of batch growth. Our
approach was inspired by an article by Vieira-Silva and Rocha [78], which
investigated a number of bioinformatics-based measures for predicting the maximal
growth rate across species. Vieira-Silva and Rocha collected from the literature the
maximal growth rates in rich medium of over two hundred bacterial species, and then
searched for a genomic measure that correlated best with these data. The genomic
property of codon usage bias yielded their most promising correlation, but this
property is not dependent on the growth medium, so it will fail when assessing
growth rate of a species across media or other conditions. Furthermore, in cases of
different cells of the same organism, such as human cancer cells, the cells share the
same codons, and thus codon bias cannot be used to predict specific growth rate.
Analogously to Vieira-Silva and Rocha, we explore a new class of metabolic
objectives, related to maximizing the total metabolic secretion of a cell, which
predict growth rate directly from GEMs. We focus on exchange fluxes because they
are the missing link between growth yield (which can be calculated relative to uptake
rates by a GEM, e.g., in [34]) and growth rate, and because there is an observed
strong positive correlation between cellular surface-to-volume ratio and growth rate,
as well as additional evidence suggesting that cell surface metabolism exerts most of
the control of a cell over growth rate [79]. The exemplar of predictors we tested is a
novel method called “SUMEX,” which predicts growth rates of cells under different
media conditions without requiring substrate uptake rates, kinetic constants, or any
other empirical parameters. SUMEX is computed by maximizing the total molar
output exchange minus input exchange of metabolites (which, given the sign
convention in GEMs that all exchange reactions point outwards, is calculated as the
'maximal SUM of EXchange fluxes'), while setting a nominal lower bound on
biomass production in order to ensure that some flux runs through
biomass-producing pathways (see Fig. S6). SUMEX represents a simple heuristic for
maximizing catabolic activity of a cell, focusing exclusively on exchange reactions,
and still ensuring a nominal production of biomass (we discuss a sensitivity analysis
of this and other necessary bounds later in the chapter, and in the supplement).
The SUMEX formulation is:
$$
\max \sum_{i=1}^{n} V_{\mathrm{exchange}_i}
$$

$$
\text{subject to:} \quad V_{\mathrm{biomass}} \ge V_{\mathrm{biomass}}^{\min}, \qquad v_j \in V, \qquad S \cdot V = 0, \qquad v_{j,\min} \le v_j \le v_{j,\max}
$$
It is explained in greater detail in the methods part of the supplementary data.
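To make the objective and its constraints tangible, the sketch below evaluates the SUMEX value of a given flux distribution and checks it against the steady-state, bound and minimal-biomass constraints. It does not solve the optimization itself, which would require a linear programming solver; the toy two-metabolite network, all reaction names and the function names `sumex_value` and `satisfies_constraints` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def sumex_value(fluxes: Dict[str, float], exchange_ids: Sequence[str]) -> float:
    """Sum of exchange fluxes; secretion is positive, uptake is negative."""
    return sum(fluxes[rxn] for rxn in exchange_ids)


def satisfies_constraints(
    stoich: List[List[float]],
    reactions: Sequence[str],
    fluxes: Dict[str, float],
    bounds: Dict[str, Tuple[float, float]],
    biomass_id: str,
    biomass_min: float,
    tol: float = 1e-9,
) -> bool:
    """Check S.v = 0, the flux bounds and the lower bound on biomass."""
    for row in stoich:  # one row per metabolite
        if abs(sum(c * fluxes[r] for c, r in zip(row, reactions))) > tol:
            return False
    for rxn in reactions:
        low, high = bounds[rxn]
        if not (low - tol <= fluxes[rxn] <= high + tol):
            return False
    return fluxes[biomass_id] >= biomass_min - tol


# Toy network: glucose (G) is taken up, then either catabolized to two
# molecules of a secreted product (P) or converted into biomass.
reactions = ["EX_glc", "R_cat", "BIOMASS", "EX_prod"]
stoich = [
    [-1.0, -1.0, -1.0, 0.0],  # G: exchange points outwards, so uptake is negative
    [0.0, 2.0, 0.0, -1.0],    # P: produced by R_cat, secreted by EX_prod
]
bounds = {"EX_glc": (-10.0, 0.0), "R_cat": (0.0, 1000.0),
          "BIOMASS": (0.0, 1000.0), "EX_prod": (0.0, 1000.0)}
exchanges = ["EX_glc", "EX_prod"]

catabolic = {"EX_glc": -10.0, "R_cat": 9.0, "BIOMASS": 1.0, "EX_prod": 18.0}
balanced = {"EX_glc": -10.0, "R_cat": 5.0, "BIOMASS": 5.0, "EX_prod": 10.0}

for name, v in [("catabolic", catabolic), ("balanced", balanced)]:
    ok = satisfies_constraints(stoich, reactions, v, bounds, "BIOMASS", 1.0)
    print(name, ok, sumex_value(v, exchanges))
```

Both candidate distributions satisfy the constraints, but the one that routes more flux through catabolism and secretion has the larger sum of exchange fluxes, which is the quantity SUMEX maximizes.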
To test SUMEX and other methods, we collected two datasets of measured cellular
growth rates from the literature: the previously mentioned Vieira-Silva and Rocha
dataset of maximal growth rates on rich media reported for 66 organisms (ds66)
[78], and growth rates in early exponential phase of batch growth of 57 Escherichia
coli wild type (WT) and knockout (KO) strains evolved for growth on a number of
minimal media (ds57) [34]. We generated a third dataset in the lab, by measuring
growth rates in vitro in the early exponential phase of batch cultures of 6 organisms
on 3 defined media (ds18). Using automatically generated models from SEED [25],
we then computed various growth-rate predictors for each of the models and
conditions in these three datasets (ds66, ds57, and ds18). We compared SUMEX (as
the exemplar of exchange-based metrics we had experimented with) against several
metrics presented in a previous experimental study in E. coli of the optimal
objectives of GEMs for predicting metabolic flux distributions [77]. Strikingly,
SUMEX outperformed every previous metric in all three datasets in predicting
growth rates with only one exception in one dataset (codon usage bias from [78]
correlated better than SUMEX with growth rates in ds66, but was non-predictive in
the other datasets as it inherently cannot account for changes in the medium or gene
knockouts). Overall, SUMEX was the only metric among those tested to
significantly correlate with growth rate across all three datasets (see Fig. 1d).
Chapter 3::Figure 1: Correlation of different metrics to growth rate. (A-C) Spearman correlations of SUMEX vs.
growth rate in three datasets. Colors in (B) represent media (green triangles, IMMxt; blue diamonds, IMM; red
squares, IMM-gt; see Table S6 for details). Colors in (C) represent strains. Trend-lines in (C) are shown for strains that
individually show significance (*P≤5e-2, **P≤5e-3). Correlation values for SUMEX and Biomass vs. growth rate are
listed below. (D) Significant (P-val≤5e-2) Spearman correlations (i.e., ρ values) across three bacterial datasets for all
tested metrics (non-significant correlations are not shown). Metrics are listed in descending order of the sum of ρ across
the three datasets. Vertical lines denote rhos for SUMEX.
Notably, the maximization of biomass yield, the aforementioned fitness metric used
in hundreds of GEM studies, failed to significantly predict growth rates in two out of
the three datasets (ds18 and ds57). This is despite previously noted strong
correlations between GEM-predicted biomass yields and growth rates in ds57 when
accounting for experimentally measured glucose uptake rates [34], which emphasizes
the difference between predicting rate and predicting yield. In contrast, biomass
yield was predictive of growth rate in ds66 (although not as predictive as SUMEX).
This suggests that in rich media and when looking across a large range of organisms,
both the growth rate and yield depend greatly on the capacity of an organism to take
up many substrates -- an observation supported by the strong correlation between
“count of uptake exchange reactions” and growth rate, as well as by the strong
observed correlation between SUMEX and biomass yield, in ds66 (see Fig. 1a and
Fig. S5). Despite this, SUMEX correlates significantly with growth rate in ds66
even when controlling for biomass yield (ρ=0.38, P=1.6e-3 in partial Spearman
correlation), showing that SUMEX provides information beyond that obtained from
maximizing biomass. Surprisingly, maximization of ATP hydrolysis correlated
poorly with growth rate, even though it has been previously shown to be predictive
of intracellular fluxes in E. coli [77, 80]. These results suggest that while biomass
and ATP hydrolysis are appropriate for measuring growth yield, they are not
necessarily suited to measure growth rate using GEMs. A full description of metrics
we tested is provided in the Supplement.
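For reference, a partial correlation that controls for a single variable is commonly written in the first-order form below. Here X stands for SUMEX, Y for growth rate and Z for biomass yield, and each ρ is a Spearman correlation computed on ranks. That this exact form was used for the value reported above is an assumption.

```latex
\[
\rho_{XY \cdot Z} \;=\;
\frac{\rho_{XY} - \rho_{XZ}\,\rho_{YZ}}
     {\sqrt{\left(1 - \rho_{XZ}^{2}\right)\left(1 - \rho_{YZ}^{2}\right)}}
\]
```

A value of ρ_{XY·Z} that remains significantly above zero indicates that X carries information about Y that is not already explained by Z.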
As previously mentioned, SUMEX requires no kinetic parameters, substrate uptake
rates, or other empirical values to predict growth rate. To further benchmark
SUMEX, we also tested it against previous methods for predicting growth rates that
do require empirical parameters. A few such methods, which include several
hundred kinetic constants or molecular crowding constraints, were introduced in
recent years for E. coli [35, 81]. We tested the ability of SUMEX to predict growth
rates reported in [35] for E. coli grown on 24 minimal media (henceforth: ds24), and
achieved equivalent results to the state of the art (for consistency with the previous
analyses, SUMEX was calculated for this dataset on the manually curated model,
iAF1260 [18]; SUMEX and MOMENT, the method described in [35] and achieving
the best previous result, each attained ρ=0.47 and P=0.02 in 2-sided Spearman tests;
see Table S2). Because SUMEX uses only the stoichiometry of metabolic reactions
but no empirical parameters, it has the clear advantage that it can be easily computed
across many species (if their metabolic models are available), as shown in the
analyses of ds66 and ds18.
To understand in more detail the mechanisms linking SUMEX to growth, we studied
the relative contributions of different exchanged compounds to SUMEX. We did
this by analyzing the effect of either leaving out or of individually optimizing the
flux of each individual exchange metabolite. We found that the compounds that
contribute most to SUMEX (those shown in Fig. 2) are H+ and several TCA-cycle
intermediates, in addition to CO2. CO2, the main product of cellular catabolism, was
necessarily released from the cell in nearly all conditions when SUMEX was
optimized (Fig. 2C).
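One way to write the two component-wise variants described above is sketched below. The notation is introduced here for illustration only: E is the set of exchange reactions, v_e is the flux through exchange reaction e, and c is the exchanged compound under study. All three problems are assumed to share the same model constraints.

```latex
\[
\text{SUMEX} = \max_{v} \sum_{e \in E} v_e ,
\qquad
\text{SUMEX}_{-c} = \max_{v} \sum_{e \in E \setminus \{c\}} v_e ,
\qquad
\text{MAX}_{c} = \max_{v} \; v_c
\]
```

The first variant leaves compound c out of the objective, and the second optimizes the exchange of c alone; each is then correlated with growth rate in the same way as SUMEX itself.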
Chapter 3::Figure 2: Component-wise analysis of SUMEX (A-B) Spearman correlations of SUMEX versus growth rate
(GR) across the 3 bacterial datasets when different exchange reactions are (A) removed from SUMEX or (B) optimized
individually. Horizontal lines and rightmost set of columns show SUMEX ρ values. The components presented are all of
those whose removal affected SUMEX ρ by >5% or that came within 5% of the SUMEX rho when maximized alone, for
any of the 3 datasets. (C) The difference between the percent of models (per dataset) that must uptake vs. that must
excrete a component in order to achieve maximal SUMEX.
[Figure 2 panels: (a) SUMEX after removal of key components (Spearman's rho vs. GR; ds18, ds66, ds57); (b) Optimization of individual key SUMEX components (Spearman's rho vs. GR; ds18, ds66, ds57); (c) Flux directionality of key SUMEX components (% conditions taking up minus secreting each compound; ds18, ds66, ds57).]
Interestingly, the removal of proton exchange from the SUMEX objective reduces
the correlation of SUMEX with growth rate more than removal of any other
component (it severely reduced the predictiveness of SUMEX in both ds18 and ds57
datasets – see Fig. 2A). Additionally, we found that maximizing the production of
protons alone is nearly as predictive as SUMEX across the three bacterial datasets
(see Fig. 2B). Protons are the smallest metabolites in the metabolic models and can
be readily produced from many different sources, and thus can account for a large
portion of the total SUMEX flux (as we confirmed by flux variability analysis [7]).
The strong correlation between maximal proton production and growth rate led us to
hypothesize that if a cell has abundant resources for producing free extracellular
protons, the strong resulting pH gradient may help drive ATP synthesis and gradient-driven
transport, increasing overall growth rate and thereby contributing to the
predictive power of SUMEX. It has been shown in E. coli and other species that
when flux ranges are below saturation, the rate of ATP synthesis relates
approximately linearly to the electrochemical gradient, which in respiring bacteria is
determined primarily by the proton (i.e., pH) gradient [82, 83]. Therefore, we would
expect the proton-related contribution to SUMEX to be more predictive in respirers
than in obligate fermenters, for which the production of ATP does not depend on the
membrane gradient.
To test the fermenters vs. respirers hypothesis, we categorized the organisms in ds66
into two groups: 9 obligate fermenters (ds66f) and 57 organisms that can respire
(ds66r). We found that the correlation of SUMEX with growth rate is stronger
among only the respirers than among all organisms in ds66 (see Fig. 3a), that
SUMEX is not significantly predictive of growth rate for obligate fermenters (also
Fig. 3a), and that these same trends also apply when we instead compare
maximization of proton production (PMAX) vs. growth rate (Fig. 3b). PMAX
correlates strongly with SUMEX in models of both respiring and obligate fermenting
organisms, despite the observation that neither is predictive of growth rate for
obligate fermenters (see Fig. 3c). This emphasizes the strong interdependency of
SUMEX and PMAX.
Chapter 3::Figure 3: Prediction of growth in Respirers vs. Fermenters in ds66 Maximization of (A) SUMEX or (B) H+
production is plotted against growth rate for ds66 organisms, categorized into obligate fermenters (blue diamonds) and
respirers (red circles) with trendlines shown. Rho and pvals are for 2-sided Spearman correlations. (C) Maximization
of proton gradient correlates strongly with SUMEX in both respirers and fermenters. (D) SUMEX and Biomass as
calculated on obligate fermenters are plotted vs. GR. Trendlines and Spearman correlations (1-sided) exclude L.
plantarum, which can respire in the presence of heme and menaquinone (L. plantarum is shown on the plot as an orange
asterisk (SUMEX) and a green “X” (Biomass)).
[Figure 3 panels: SUMEX vs. GR and max H+ production vs. GR (arbitrary units vs. max in vitro growth rate, 1/h); max H+ production vs. SUMEX (arbitrary units); Biomass and SUMEX vs. GR for fermenters (normalized SUMEX or biomass vs. in vitro GR, 1/h). Correlation insets (rho, p): respirers 0.97, 1.2e-6 and obligate fermenters 0.70, 0.04; respirers 0.60, 8e-7 and obligate fermenters 0.49, 0.18; respirers 0.53, 1.7e-5 and obligate fermenters 0.04, 0.92; excluding L. plantarum: SUMEX 0.73, 2.4e-2 and Biomass 0.92, 2.7e-3.]
When we remove a borderline case from the set of obligate fermenters (Lactobacillus
plantarum, which has been shown to respire if provided heme and menaquinone
[84]), both SUMEX and biomass maximization became predictive of fermenter
growth rates (Fig. 3d). Therefore, a larger dataset of growth rates of obligate
fermenters than currently at our disposal will be needed to unequivocally determine
whether SUMEX can be used to predict growth rates of obligate fermenters. See our
continued analysis in the Supplementary data.
SUMEX activates a variety of pathways not predicted by the typical biomass yield
objective. To test whether these pathways are reflected in global gene expression, we
analyzed the relationship between (a) the measured expression of a gene on 5
different growth media [85] and (b) the medium-dependent contribution of reactions
associated with the gene to a cellular objective, comparing between biomass yield
and SUMEX (a reaction's contribution was determined by measuring its effect on the
cell objective when forcing an incremental flux through it; see Supplement). We
found that genes that contribute positively to SUMEX have significantly higher
expression levels than genes detrimental to SUMEX across 4 of the 5 media (with
borderline significance for the 5th medium), whereas no significant association was
seen in the same analysis done using the biomass objective (p≤0.05 on 4 media and
p=0.06 on the fifth for SUMEX, vs. p≥0.30 on all media for biomass, in 1-sided
ranksum tests; see Supplement). Furthermore, SUMEX outperformed biomass yield
in the prediction of highly-active genes both in terms of precision and recall on all
media (avg. precision and recall were 0.19 and 0.63 for SUMEX, vs. 0.08 and 0.01
for biomass – see Table S3). This strongly suggests that many of the genes predicted
to be active by SUMEX, but which are detrimental or indifferent with respect to
biomass yield, are actually important for unaccounted-for cellular processes.
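The precision and recall figures quoted above follow the usual definitions, which the sketch below illustrates. The gene names and the two sets are hypothetical, and how a gene is classed as "highly active" is described in the Supplement and not reproduced here.

```python
def precision_recall(predicted, actual):
    """Precision and recall of a predicted gene set against a reference set."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    # Guard against empty sets so the function never divides by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical example: genes predicted active by an objective versus
# genes observed to be highly expressed on one medium.
predicted_active = {"geneA", "geneB", "geneC", "geneD"}
highly_expressed = {"geneB", "geneD", "geneE"}
precision, recall = precision_recall(predicted_active, highly_expressed)
```

Precision is the fraction of predicted genes that are truly highly expressed, and recall is the fraction of highly expressed genes that the objective recovers.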
As a final test of SUMEX, we wished to assess its ability to predict growth rates of
NCI60 human cancer cell lines, as cancer cells are expected to maximize growth as a
fitness objective. To do this, we produced GEMs for 60 NCI60 cancer cell lines
based on the full human metabolic model [20], by altering bounds of reactions based
on the expression of 222 genes that had significant correlation with growth (see
Supplement for details of how the models were built). As a basic validation of the
models, we checked the correlation of the biomass yield objective against published
NCI60 growth rates [86], and found that it indeed correlated highly significantly
(ρ=0.68, P=2.7e-9). Finally we checked the ability of SUMEX to predict these
growth rates, and found that it obtained even slightly higher correlations (ρ=0.74,
P=2.6e-11) (see Fig. 4), thus extending our results to cancer cell lines and
emphasizing the predictive power of SUMEX.
Chapter 3::Figure 4: NCI60 cancer cell line growth rates predicted by SUMEX. Maximization of (A) biomass yield and
(B) SUMEX both correlate highly significantly with growth rate. Spearman correlation values vs. growth rates are
overlaid on plots.
Limitations must be set on certain reaction bounds in a GEM in order to obtain
feasible solutions (we used standard flux bounds of -50 for all allowed uptakes in
SUMEX), which is a confounding factor in any attempt to produce parameter-less
metrics in GEMs. Therefore, in order to ensure that the results seen for SUMEX
were not simply due to the particular bounds we chose, we performed a sensitivity
analysis. This test revealed that the correlation of SUMEX with growth rate was
highly robust even up to 50% (or more) random variations imposed across all uptake
(or secretion) bounds; we furthermore found that biomass is significantly less robust
than SUMEX in 2 of the 3 datasets (see Fig. S1 and Table S1). In addition to these
bounds set on exchange reactions, we performed a sensitivity analysis on the nominal
lower bound set on biomass production within SUMEX, and found that the results
were relatively stable for large ranges of this bound (see Fig. S2).
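The bound perturbation used in such a sensitivity analysis can be sketched as follows. This is a minimal illustration, not the procedure used in this work: the reaction identifiers are hypothetical, and drawing the scaling factor from a uniform distribution is an assumption, since the text above specifies only the size of the random variation.

```python
import random


def perturb_bounds(bounds, max_fraction, rng):
    """Scale each bound by a random factor in [1 - max_fraction, 1 + max_fraction]."""
    return {
        reaction: bound * (1.0 + rng.uniform(-max_fraction, max_fraction))
        for reaction, bound in bounds.items()
    }


# Hypothetical uptake bounds, all set to the standard value of -50.
uptake_bounds = {"EX_glc": -50.0, "EX_o2": -50.0, "EX_nh4": -50.0}

rng = random.Random(0)  # fixed seed so that a trial can be reproduced
trial = perturb_bounds(uptake_bounds, 0.5, rng)
```

Each perturbed set of bounds would then be used to recompute the metric and its correlation with growth rate, and the spread of the resulting correlations indicates robustness.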
[Figure 4 panels: (a) Biomass vs. GR, NCI60 (biomass in arbitrary units vs. growth rate in 1/h; ρ = 0.68, P = 2.7e-9); (b) SUMEX vs. GR, NCI60 (SUMEX in arbitrary units vs. growth rate in 1/h; ρ = 0.74, P = 2.6e-11).]
3.3. Conclusions
SUMEX represents a maximization of cellular catabolic activity as a cellular
optimality principle, as outlined at the start of this chapter. Notably, despite its
simplicity, SUMEX predictions correlate significantly with growth rate on every
suitable dataset we were able to find in the literature, as well as on a set of growth rate data
we measured ourselves. Accurately predicting cell growth rate is critical for
understanding the dynamics of microorganism-dominated ecosystems, and may lead
to improved biotechnology applications and perhaps even to new functional insights
into cancer, as well as filling in a basic gap in our understanding of cellular function.
This exploration of a promising alternate objective for predicting cell growth rate
will hopefully stimulate further research in this area and lead to better predictive
models.
3.4. Materials and Methods
3.4.1. Models
Unless otherwise noted, analyses were done on genome-scale metabolic
reconstructions (GEMs) as obtained from SEED [25], at http://seed-viewer.theseed.org/.
The 66 organisms in ds66 were chosen because (1) their GEMs
were available from SEED and published in [25], and (2) their optimal doubling
times were available from [78]. For analysis of ds24, the iAF1260 E. coli model was
used, and the NCI60 cancer cell line analysis used custom-made models based on the
generic human model (see Supplement IIId for full details). Table S5 lists the names
of the ds66 models and organisms.
3.4.2. Implementation of growth rate predictors
Optimizations were run in in silico environments consistent with the known media,
in which all exchange metabolites for a given species were available at a fixed rate of
-50.0 (with output bounds of 1000). A sensitivity analysis was done to determine if
these bounds affected the performance of SUMEX, and SUMEX was found to be
robust to random changes in the bounds (and significantly more robust than biomass
yield optimization; see Fig. S1 and Table S2). In the case of ds66, the environment
was “rich”, so we allowed uptake flux in all exchange reactions present in each
organism.
By convention, exchange fluxes denoting entrance of a metabolite into the cell
(uptake) are negative valued, while exchanges denoting exit of a metabolite from the
cell (output / secretion) are positive valued. Therefore, maximizing the total
exchange flux (i.e. the SUMEX metric) would denote maximizing the output at the
expense of the input (output exchanges – input exchanges).
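The sign convention can be made concrete with a small sketch that evaluates the SUMEX sum for a given flux vector. It does not perform the optimization itself, and the reaction identifiers and flux values are hypothetical.

```python
def sumex_value(exchange_fluxes):
    """Sum of all exchange fluxes: secretion is positive, uptake is negative."""
    return sum(exchange_fluxes.values())


def split_by_direction(exchange_fluxes):
    """Return total secretion and total uptake as two non-negative numbers."""
    secreted = sum(v for v in exchange_fluxes.values() if v > 0)
    taken_up = -sum(v for v in exchange_fluxes.values() if v < 0)
    return secreted, taken_up


# Hypothetical flux vector: two substrates taken up, two products secreted.
fluxes = {"EX_glc": -10.0, "EX_o2": -20.0, "EX_co2": 25.0, "EX_h": 30.0}
secreted, taken_up = split_by_direction(fluxes)
```

With these values the summed exchange flux equals total secretion minus total uptake, which is the quantity that SUMEX maximizes.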
For simulation of maximal proton production (PMAX) (e.g., in Fig. 3 and for the
NCI60 models), we increased the upper bound on proton production to +inf in order
to avoid capping total protons produced. Manipulating this bound while running
SUMEX did not significantly affect SUMEX results (data not shown). More details
of the model constraints are provided in Supplementary methods.
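Under the constraints described in this section, SUMEX and PMAX can be written as linear programs of roughly the following form. This is a sketch under stated assumptions, not the exact formulation used in this work: S denotes the stoichiometric matrix, v the flux vector, E the set of exchange reactions available in the medium, and ε the nominal lower bound on biomass production, whose value is not given here. The steady-state constraint S v = 0 is the standard assumption of constraint-based modeling.

```latex
\[
\begin{aligned}
\text{SUMEX} \;=\; \max_{v}\quad & \sum_{e \in E} v_e \\
\text{subject to}\quad & S\,v = 0, \\
& -50 \le v_e \le 1000 \qquad \forall\, e \in E, \\
& v_{\text{biomass}} \ge \varepsilon .
\end{aligned}
\]
\[
\text{PMAX} \;=\; \max_{v}\; v_{\mathrm{H^{+}}}
\quad \text{under the same constraints, with the upper bound on } v_{\mathrm{H^{+}}} \text{ raised to } +\infty .
\]
```

Bounds on internal reactions are omitted for brevity and are taken from each model.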
3.4.3. Building NCI60 cancer cell models
Reconstructing the NCI60 cancer cell lines required several key inputs: (a) the
generic human model [20], (b) gene expression data for each cancer cell line from
[87], and (c) growth rate measurements (Note: the growth rates were used only to
determine which genes should be used in constraining the models, in order to obtain
models that were as physiologically relevant as possible; they were not used to
determine the actual bounds). Specific metabolic models were produced for each
cell line by modifying the upper bounds of reactions in accordance with the
microarray expression values of the individual genes. See Supplement IIId for more
details.
3.4.4. Growth experiments of 6 organisms on 3 defined IMM
media (ds18)
To validate SUMEX, we performed in vitro experiments to measure the growth rates
of a number of organisms (listed in Table S6) in multiple environments. Growth
experiments were conducted in 96-well plates at 30°C, with continuous shaking,
using a Biotek ELX808IU-PC microplate reader, on variants of IMM medium, as
detailed in Table S7. Optical density was measured every 15 minutes at a
wavelength of 595 nm. Growth rates were determined during early to mid-exponential
growth phase by taking the slope of a linear fit through the natural log of
the data.
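The growth-rate estimate described above amounts to an ordinary least-squares slope of ln(OD) against time. The sketch below illustrates the calculation on a synthetic, noise-free curve; the function name and the data are hypothetical, and selecting the exponential-phase window is left to the user.

```python
import math


def growth_rate(times_h, optical_densities):
    """Least-squares slope of ln(OD) versus time, in units of 1/h."""
    logs = [math.log(od) for od in optical_densities]
    n = len(times_h)
    mean_t = sum(times_h) / n
    mean_y = sum(logs) / n
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times_h, logs))
    sxx = sum((t - mean_t) ** 2 for t in times_h)
    return sxy / sxx


# Synthetic readings: one measurement every 15 minutes (0.25 h),
# generated from an exponential curve with a true rate of 0.6 per hour.
times = [0.25 * i for i in range(8)]
ods = [0.05 * math.exp(0.6 * t) for t in times]
rate = growth_rate(times, ods)
```

On real plate-reader data the fit should be restricted to readings within the early to mid-exponential phase, since lag and stationary phases flatten the slope.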
Chapter 4
4. Glycan Degradation (GlyDe)
analysis predicts mammalian gut
microbiota abundance and host diet-
specific adaptations
Based on an article with the same name by the authors:
Omer Eilam, Raphy Zarecki, Matthew Oberhardt, Martin Kupiec, Uri Gophna &
Eytan Ruppin
In this article the first 2 authors had equal contribution.
The article was submitted to Genome Research (currently under review) and has
been presented at the conference: Exploring human host-microbiome interactions in
health and disease (2012)
4.1. Introduction
The human gastrointestinal tract harbors an extensive array of commensal
microorganisms. Species composition is highly diverse both within and between
individuals [88] and the activities of these organisms affect the host through many
pathways, including the production of short chain fatty acids (SCFA) that regulate
epithelial cell growth and immune system development, displacement of potential
pathogens, detoxification of protein fermentation products, and gas production [89-93].
The beneficial or detrimental outcomes of these effects depend largely on the
community structure, environmental factors, diet and the genetic background of the
host [94, 95]. Large-scale metagenomic studies have uncovered associative
relationships between these factors, yet typically provide limited insights into the
underlying mechanisms [96]. Simplified in-vitro models aim to bridge this gap, but
the reliance of these models on a limited number of strains and on results from
defined culture media make them difficult to relate to the complexities of the actual
gut environment [97, 98].
While host tissues and other substrates of endogenous origin such as mucins are
continually being broken down and recycled by intestinal bacteria, the composition
and metabolic activities of gut bacterial communities are primarily determined by our
diet [89]. Since the efficiency of the digestive system is remarkably high, very few
simple metabolites escape digestion in the small intestine [99]. Therefore, complex
carbohydrates and their derivatives, collectively termed glycans, which are not
digested by the host's endogenous pathways higher in the gastrointestinal tract, are
the predominant nutrients for microbes in the colon [100, 101]. Modern Western
diets incorporate a large variety of food sources, resulting in a nutrient-rich colonic
environment that supports a tremendous diversity of species [102]. With recent in-
vitro studies showing that the breakdown of a given substrate can be highly species-
specific [103], the overall picture of the human gut is that of a diverse bacterial
community in which different microbial groups occupy distinct metabolic niches.
While some human colonic bacteria simply require acetate or branched chain fatty
acids [104], the detailed growth requirements for the majority of gut bacteria remain
unknown [103]. Characterizing these requirements will shed light on the different
metabolic niches organisms fill, and may enable the design of dietary interventions
that promote growth of particular beneficial microbes, an approach collectively
termed “prebiotics” [105]. While several glycans are currently marketed around the
world as prebiotics, few have been validated through high-quality human trials [105,
106]. Furthermore, dietary enrichment of a specific prebiotic compound may permit
preferential expansion of a microbial group that is well adapted to its use, but the
outcomes for the gut community as a whole can be unpredictable [107].
In this study we investigate the connections between diet and glycan metabolism of
the human gut microbiota. Whereas the study of the metabolic activity conducted by
gut microbiota is at the focal point of a wide range of computational studies [96,
108], current approaches have been highly limited in their ability to analyze glycan
degradation. We present a novel algorithm (termed GlyDe) and computational
pipeline for predicting the glycan degradation patterns of bacteria based on nearly
150 Carbohydrate Active enZymes (CAZymes) and nearly 10,000 glycan structures, and
apply it to a cohort of 203 microbial genomes. Given a
particular bacterial (meta-) genome, GlyDe can “reverse-engineer” the predicted
efficiency by which it degrades a variety of different glycans. These predictions
correlate with known KEGG reactions and expand upon the limited, previously
available glycan degradation data 100-fold. We determine that the microbiota of
herbivores and carnivores have stronger degradation affinities for plant-derived and
animal-derived glycans respectively, and that a Western diet in humans correlates
more strongly with meat-derived glycans than a non-Western diet. Finally, we show
that species-specific glycan degradation profiles are associated with, and can be used
to predict, the abundance of that species, making GlyDe a valuable tool for the future
rational design of novel prebiotics that deliberately manipulate the microbiome
based on nutrient availability.
4.2. Results
4.2.1. The construction of the Glycan Degradation (GlyDe)
pipeline
Although the exact biochemistry of glycan degradation is missing from all currently
available databases, considerable knowledge is embedded in the descriptions of
Carbohydrate-active Enzymes (CAZymes) that catalyze these degradation reactions,
and is typically represented by Enzyme Commission (EC) numbers. We leveraged
this knowledge to develop a new computational pipeline that uses enzymatic and
structural data sources to predict the degradation of every glycan in KEGG [109]
given a sequenced (meta-) genome. That is, given (meta-) genomic data as input,
GlyDe yields phenotypic (glycan degradation) data as output. The construction of the
pipeline comprises two steps: (1) The first step relies on a novel algorithm that we
developed which we term Glycan Degradation (GlyDe). The algorithm takes as input
a manually curated annotation of the rules that define, for each known CAZyme, its
capabilities for glycan degradation, and a graph representation of the
structure of all the glycans in KEGG, in which the nodes are the monosaccharides
and the edges are the glycosidic linkages (Chapter4::Figure 1a). We convert the
CAZyme annotations to computer-based rules dictating their mechanism for breaking
a given glycan into two sub-components. The manual curation of this critical step
was done with the help of experts with knowledge in the biochemistry of glycan
metabolism. GlyDe then executes these rules recursively on all the glycans to
generate 141,561 GlyDe reactions, each linking a specific enzyme to a glycan
substrate and its products (an example reaction is given in Figure 1a and a more
detailed explanation is provided in the Methods). (2) In the second step GlyDe
reactions are mapped back to CAZymes in order to produce a table where the rows
are CAZymes, the columns are glycans, and each entry contains a CAZyme score for
CAZyme i and glycan j, calculated as follows: if CAZyme i is unable to break glycan
j then the score is 0, otherwise the score is
\[ \text{CAZyme score}_{ij} = \frac{1}{g_i} \]
where g_i is the number of glycans that are broken by CAZyme i. The entire
construction process is summarized in Figure 1b. A CAZymes table which contains
all of the CAZyme scores can be found in Supplementary Table 5.
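The two construction steps can be sketched on a toy example as follows. This is an illustration under strong simplifying assumptions, not the GlyDe implementation: glycans are reduced to linear chains (real glycans are branched graphs), a rule simply states which linkage types an enzyme can cut, and the enzyme names E1 and E2 and their rules are hypothetical.

```python
from fractions import Fraction

# A toy linear glycan alternates residues and linkages, for example
# ("Glc", "a1-2", "Glc", "a1-2", "Glc"); odd positions hold the linkages.


def generate_reactions(glycan, rules, reactions=None, seen=None):
    """Recursively cut a glycan in two wherever a rule allows it.

    rules maps an enzyme name to the set of linkage types it can cleave.
    Each reaction is recorded as (enzyme, substrate, left part, right part).
    """
    if reactions is None:
        reactions = set()
    if seen is None:
        seen = set()
    if glycan in seen or len(glycan) == 1:  # already expanded, or a monosaccharide
        return reactions
    seen.add(glycan)
    for i in range(1, len(glycan), 2):
        for enzyme, linkages in rules.items():
            if glycan[i] in linkages:
                left, right = glycan[:i], glycan[i + 1:]
                reactions.add((enzyme, glycan, left, right))
                generate_reactions(left, rules, reactions, seen)
                generate_reactions(right, rules, reactions, seen)
    return reactions


def cazyme_scores(reactions):
    """Score 1/g_i for each glycan that enzyme i can break; absent pairs score 0."""
    substrates = {}
    for enzyme, glycan, _, _ in reactions:
        substrates.setdefault(enzyme, set()).add(glycan)
    return {
        (enzyme, glycan): Fraction(1, len(glycans))
        for enzyme, glycans in substrates.items()
        for glycan in glycans
    }


toy_rules = {"E1": {"a1-2"}, "E2": {"a1-4"}}
trisaccharide = ("Glc", "a1-2", "Glc", "a1-2", "Glc")
reactions = generate_reactions(trisaccharide, toy_rules)
scores = cazyme_scores(reactions)
```

In this toy case E1 breaks two distinct glycans (the trisaccharide and the disaccharide it yields), so each of its scores is 1/2, while E2 finds no matching linkage and receives no entries.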
4.2.2. The usage of the GlyDe pipeline
Microbial (meta-) genomes are annotated for CAZymes using BLAST [110] against
three reference databases: The Carbohydrate Active enZymes (CAZy) Database
[111], the Seed - RAST annotation [112], and KEGG [71] (Methods). Then,
CAZyme scores can be assigned to genes and a (meta-) genome-specific GlyDe
score calculated for each glycan as follows:
\[ \text{GlyDe score}_{ij} = \sum_{k \,:\, e_k \text{ degrades glycan } j} \frac{n_{ik}}{g_k} \]
where i indexes the (meta-) genome, e_k is an enzyme that can degrade glycan j, n_ik is the number of genes in the
(meta-) genome which translate to enzyme e_k, and g_k is the number of glycans broken
by enzyme e_k.
The GlyDe score represents the predicted efficiency with which the glycan can be
degraded by that (meta-) genome, taking into account how many CAZymes can
degrade the glycan, and decrementing the score of promiscuous enzymes with low
specificities (Methods). For example, an organism containing three enzymes that
degrade maltotetraose, each of them degrading also four other glycans, would have a
GlyDe score of 3/5 (see two examples in Figure 1c). The use of GlyDe is captured in
Figure 1d.
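The two worked examples (3/5 in the text and 7/12 in Figure 1c) can be checked with a short sketch of the score calculation. The data structures, enzyme names and placeholder glycan names are hypothetical, and exact fractions are used so that the results can be compared directly.

```python
from fractions import Fraction


def glyde_score(gene_counts, enzyme_glycans, glycan):
    """Sum n_k / g_k over the enzymes in the genome that degrade the glycan.

    gene_counts maps enzyme -> number of genes encoding it in the genome.
    enzyme_glycans maps enzyme -> set of glycans the enzyme can break.
    """
    total = Fraction(0)
    for enzyme, n_genes in gene_counts.items():
        glycans = enzyme_glycans.get(enzyme, set())
        if glycan in glycans:
            total += Fraction(n_genes, len(glycans))
    return total


# Three enzymes degrade maltotetraose, each also degrading four other glycans.
enzyme_glycans = {
    "E1": {"maltotetraose", "g1", "g2", "g3", "g4"},
    "E2": {"maltotetraose", "g5", "g6", "g7", "g8"},
    "E3": {"maltotetraose", "g9", "g10", "g11", "g12"},
}
first = glyde_score({"E1": 1, "E2": 1, "E3": 1}, enzyme_glycans, "maltotetraose")

# Two enzymes that degrade 3 and 4 glycans respectively, one gene copy each.
panel_c = {
    "E4": {"purple", "x", "y"},
    "E5": {"purple", "p", "q", "r"},
}
second = glyde_score({"E4": 1, "E5": 1}, panel_c, "purple")
```

The first call yields 3/5 and the second 1/3 + 1/4 = 7/12, matching the examples above; a genome with several gene copies of the same enzyme would raise the corresponding term proportionally.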
Chapter 4::Figure 1: The Glycan Degradation (GlyDe) platform. (a) A visual representation of the glycan degradation
reaction performed by EC 3.2.1.115 breaking down Kojitriose into Kojibiose and Glucose. (b) A schematic
representation of the construction of the computational pipeline. Information is taken from multiple databases and
analyzed as follows: Step 1 (left red arrow) using CAZyme information and the GlyDe algorithm, glycan degradation
reactions are reconstructed. Step 2 (right red arrow) a CAZyme table is constructed that represents the efficiency in
which different CAZymes break different glycans. (c) GlyDe score calculation. Top: The organism has one enzyme
(yellow pacman) dedicated to the degradation of one glycan (purple), therefore the GlyDe score for the purple glycan
equals 1. Bottom: The organism has two enzymes capable of degrading 3 and 4 glycans respectively, therefore the GlyDe
score for the purple glycan equals 7/12. (d) GlyDe utilization: (meta-) genomes are annotated for CAZymes using CAZy,
SEED and KEGG databases, and using the CAZyme table a GlyDe score can be calculated, reflecting the capacity of a
(meta-) genome to degrade a specific glycan. GlyDe: Glycan Degradation. CAZymes: Carbohydrate Active Enzymes.
4.2.3. Validating the GlyDe pipeline
To assess the biological relevance of GlyDe, we performed a cross-validation
procedure that examines its consistency in capturing known degradation reactions in
KEGG (Methods). We found that the products of GlyDe reactions were highly
enriched with known rather than hypothetical glycans (p-value = 10^-19 in a hypergeometric
test; see Supplementary Fig. 2a). As further validation, we compared the
predicted genome-specific GlyDe scores of each bacterial strain with the glycans
that, according to KEGG, that strain is able to break. Since the above information
from KEGG was not used to construct the set of GlyDe reactions, a circular
argument is avoided. We find a significantly higher mean GlyDe score for KEGG
glycans across all strains when comparing it to the mean GlyDe score of non-KEGG
glycans (Figure 2a). Notably, our analysis produced GlyDe scores for over 100x the
number of unique glycan degradation reactions that are currently reported in KEGG
(116,388 vs. 1374), highlighting the limited scope of glycan metabolism captured in
the KEGG database.
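An enrichment p-value of the kind reported in this section can be computed from the upper tail of the hypergeometric distribution, as in the sketch below. The argument names and the small numbers in the example are hypothetical; the actual counts used in the validation are described in the Methods.

```python
from math import comb


def hypergeom_upper_tail(population, successes, draws, observed):
    """P(X >= observed) when sampling `draws` items without replacement
    from `population` items, `successes` of which are of the class of interest."""
    total = comb(population, draws)
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return favourable / total


# Hypothetical numbers: 10 glycans of which 4 are "known"; 3 are drawn as
# reaction products and at least 2 of them turn out to be known.
p_value = hypergeom_upper_tail(population=10, successes=4, draws=3, observed=2)
```

A small p-value means that the products contain more known glycans than expected if products were drawn at random from the full glycan set.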
4.2.4. Characterization of glycan degradation patterns across
the major gut bacterial phyla
We first applied GlyDe to a cohort of 203 reference gut microbial genomes retrieved
from the Human Microbiome Project (HMP [113]). We used GlyDe to study the
extent to which different microbial phyla metabolize different glycans. Initially, we
examined whether phylogenetic clusters are reflected in glycan degradation patterns.
We therefore computed for each of the HMP genomes a GlyDe profile, i.e. a vector
of its GlyDe scores for all glycans (Methods). This species-specific GlyDe “signature”
describes the efficiency with which a given species can catabolize each of the ~10,000
reference glycans in our database, and hence provides an overall view of its glycan
utilization capabilities. We then mapped each species to its respective phylum and
performed Principal Coordinates Analysis (PCoA) on the Bray-Curtis dissimilarities
between the species GlyDe profiles (Methods). This yielded clusters of phyla that
were statistically distinct (Supplementary Fig. 2b, MANOVA test, Wilks'
Lambda<0.001). Still, there are apparent significant differences in the average glycan
degradation capacities of genera belonging to a given phylum (Supplementary Fig.
2c).
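The Bray-Curtis dissimilarity between two GlyDe profiles can be sketched as follows. The profiles shown are hypothetical three-glycan vectors rather than real data, and the ordination step (PCoA) that follows in the analysis is not reproduced here.

```python
def bray_curtis(profile_a, profile_b):
    """Bray-Curtis dissimilarity of two non-negative score vectors."""
    numerator = sum(abs(a - b) for a, b in zip(profile_a, profile_b))
    denominator = sum(a + b for a, b in zip(profile_a, profile_b))
    return numerator / denominator if denominator else 0.0


def dissimilarity_matrix(profiles):
    """All pairwise dissimilarities, keyed by (species, species)."""
    names = list(profiles)
    return {
        (m, n): bray_curtis(profiles[m], profiles[n])
        for m in names
        for n in names
    }


# Hypothetical GlyDe profiles over three glycans.
profiles = {
    "species_A": [1.0, 0.5, 0.0],
    "species_B": [0.5, 0.5, 0.5],
    "species_C": [0.0, 0.0, 2.0],
}
matrix = dissimilarity_matrix(profiles)
```

The measure is 0 for identical profiles and 1 for profiles that share no degradable glycans, which makes it suitable as input to an ordination of species by their degradation capacities.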
Inspecting individual phyla, Bacteroidetes display the highest GlyDe scores over a
range of glycan categories (Figure 2b), and all nineteen of the highest GlyDe score
ranking species belong to the Bacteroides genus, consistent with their known role as
primary glycan degraders in the gut [114-116]. Recent papers have shown that
glycans found in human milk, i.e. Human Milk Oligosaccharides (HMOs), are
utilized mainly by a few Bifidobacterium and several Bacteroides species [114, 117].
According to GlyDe, 21 out of 23 HMP genomes capable of degrading HMOs
belonged to Bacteroides species (the other degraders were Parabacteroides sp. D13
and Bifidobacterium bifidum).
We next analyzed the relative efficiency of the different bacterial genera in
degrading glycans of various degrees of polymerization (Supplementary Fig. 2e). We
found a large variability in the predicted GlyDe profiles of different genera within
each phylum. For example, within the Bacteroidetes phylum, species belonging to
the Bacteroides genus are predicted to be far better glycan degraders than those
belonging to the genera Parabacteroides or Prevotella. Similarly, while Firmicutes
are generally poor glycan degraders, Roseburia species are predicted to be highly
efficient in breaking down polysaccharides. Notably, Roseburia intestinalis has one
of the highest predicted GlyDe scores, an observation also supported by recent
literature [100]. Thus, it is important to assess the glycan degradation capacity of
individual taxa within the larger community.
4.2.5. Glycan degradation patterns can be used to predict
bacterial abundance
We studied the relationship between the glycan degradation scores of a given
bacterial taxon and its abundance in the gut. We matched the abundance of 16S
rRNA marker gene sequences from 325 human individual gut samples found in the
HMP database with the aforementioned 203 microbial reference genomes (Methods).
For each taxon we extracted 6 features characterizing its glycan degradation capacity
(Methods), including: Plant-specific, Animal-specific, Disaccharides,
Oligosaccharides, Short Polysaccharides and Long Polysaccharides. Each feature
represents the sum of GlyDe scores for the glycans that belong in the class. Based on
these features we built a linear regression model for the abundance of these taxa in
the samples. In order to apply the linear regression model we filtered out taxa that
were not detected in any sample and taxa that were highly varied (see Methods for
criteria), resulting in 48 predictable taxa for the analysis. This regression yielded a
correlation coefficient of 0.46 (Figure 2c), a score markedly higher than the
correlation achieved using a model based on CAZymes abundances in a genome
(r=0.11, Methods). We next built similar regression models independently for each
class of bacteria. Remarkably, the Clostridia class yielded the strongest model
(R=0.76, p=0.0001), while other classes with significantly predictive models
were Bacteroidia and Fusobacteria (Figure 2d). These results suggest that glycan
supplements can be tailored to control certain species abundances, especially those of
potentially pathogenic Clostridia.
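As an illustration only, the feature extraction described above (each feature being the sum of GlyDe scores of the glycans in one class) could be sketched as follows. The function name, the dictionary layout and the toy values are hypothetical and are not taken from the GlyDe implementation.

```python
# The six feature classes named in the text, in a fixed order.
FEATURES = (
    "Plant-specific", "Animal-specific", "Disaccharides",
    "Oligosaccharides", "Short Polysaccharides", "Long Polysaccharides",
)

def glyde_features(glyde_scores, glycan_classes):
    """Sum the GlyDe scores of one taxon per feature class.

    glyde_scores:   {glycan_id: GlyDe score} for a single taxon (assumed layout)
    glycan_classes: {glycan_id: set of feature classes the glycan belongs to}
    """
    totals = dict.fromkeys(FEATURES, 0.0)
    for glycan_id, score in glyde_scores.items():
        for feature in glycan_classes.get(glycan_id, ()):
            if feature in totals:  # ignore labels outside the six classes
                totals[feature] += score
    return [totals[name] for name in FEATURES]

# Hypothetical toy input, not data from the study.
scores = {"G1": 0.5, "G2": 1.5, "G3": 2.0}
classes = {
    "G1": {"Plant-specific", "Disaccharides"},
    "G2": {"Animal-specific", "Oligosaccharides"},
    "G3": {"Plant-specific", "Long Polysaccharides"},
}
features = glyde_features(scores, classes)
```

The resulting six-element vector is the kind of input a linear regression on taxon abundance would take, one vector per taxon.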
In an effort to include taxa that were initially omitted from the analysis above due to
their high variation, we first clustered the HMP samples according to their 16S
rRNA data into 2 main groups using KMeans (Methods), and recalculated the
average taxa abundance separately for each cluster. The same procedure for building
predictors of bacterial taxa abundance based on their genome-specific GlyDe
features was then applied, yielding 53 predictable taxa in the first cluster and 71
predictable taxa in the second cluster, with concomitant increases in the correlation
coefficients (0.51 and 0.57, respectively). Based on the two clusters we assembled a
list of 25 strains with highly predictable abundance (Methods), and with only one
exception, all the strains belong either to the Bacteroidia or Clostridia classes (see
Discussion). Notably, based on the regression formulas of all three models, the
degradation capacity of long polysaccharides had the highest effect on bacterial
abundance. It is therefore likely that because most long polysaccharides are not
digested by the human host prior to reaching the colon, the ability to degrade them
provides a significant selective advantage for gut microbes.
Chapter 4::Figure 2: Glycan Degradation of the gut microbiota reference genomes. (a) Distribution of species-specific GlyDe
scores (y axis) for all the glycans in KEGG. GlyDe scores with corresponding reactions in KEGG appear on the left while those
with no KEGG reactions appear on the right (Student's t=6.14, p<0.0001). (b) The bar plot compares the glycan degradation
potential of the different bacterial phyla for different glycan dietary categories. Each bar depicts the sum of GlyDe scores of
organisms belonging to their respective phylum. Red indicates glycans derived from animals, blue indicates glycans derived
from bacteria and green indicates glycans derived from plants. The purple bar represents the overall number of CAZymes. The
height of the colored bar represents the median while the error bars reflect the lower and upper quartiles. Asterisks denote
significant p-values when comparing Bacteroidetes to the other phyla. (c) The log-log scatter plot shows the average abundance
of 48 HMP strains within 325 human fecal samples (Y axis) and the linear regression-predicted abundance of each individual
strain (X axis). (Linear Regression correlation coefficient=0.46, p=0.0016). (d) A bar chart denoting the correlation value
(height of the bar) between actual and predicted abundance from linear regression models built for each class of bacteria (X
axis) based on its taxa's GlyDe features. The color of the bar reflects the number of species in the class. The feature extraction is
explained in the methods.
4.2.6. Glycan degradation profiles of mammalian species are
associated with their diet
Because diet is the prime determinant of colonic glycan composition and gut
microbiota vary according to general dietary patterns [118], we expected that glycan
degradation would systematically vary between the microbiota of different
mammalian hosts based on their diet. To test this, we analyzed variation in diet and
glycan degradation profiles across different mammalian species, using metagenomic
sequencing data from 57 fecal samples across 34 different species, including 18
human samples [119] (Methods). According to the host's diet, each sample is
characterized as being either herbivorous, carnivorous or omnivorous. To correct for
research biases arising from uneven annotations of CAZymes between species, we
normalized each sample by the total number of CAZymes in it before running PCoA
on the GlyDe profiles (Methods). The PCoA analysis revealed a clear spectrum of
samples over the first principal coordinate, from herbivores, through omnivores to
carnivores (Supplementary Fig. 3a). Using a subset of glycans that were categorized
into either plant-derived or animal-derived (Methods), we discovered a striking
relationship between the diet of a host organism and the glycans predicted to be
degraded by its gut microbiota: microbiota from herbivores tend to degrade plant-
derived glycans (p=0.04 compared to carnivores using Wilcoxon test, Figure 3a),
while microbiota from carnivores prefer animal-derived glycans (p=0.0001 compared
to herbivores using Wilcoxon test, Figure 3a). Interestingly, the degradation
efficiencies of omnivore and human gut microbiota place them as intermediates
between herbivores and carnivores (Figure 3a). To further explore where humans
stand with respect to the dietary spectrum, we trained an SVM classifier to
distinguish between herbivore and carnivore samples based on their inferred glycan
degradation profiles (Methods). The classifier predicted all but one sample correctly
in a leave-one-out cross validation (AUC=0.93, F=0.96), and notably outperformed a
classifier based only on the abundance of CAZymes found in each sample (which
had three misclassifications: AUC=0.71, F=0.84), confirming the added predictive
value of GlyDe. We next applied this classifier to the 11 available non-human
omnivore samples, and classified 6 and 5 of the samples as herbivores and
carnivores, respectively. These samples lack direct dietary labels; however,
a comparison of these classifications with the Fiber Index of these mammals [120]
shows that the predicted dietary regimes largely agree with it
(Table 1). We next applied the classifier to predict the dietary habits of the human
samples, which are unknown, resulting in 15 out of the 18 samples labeled as
carnivores. Thus, at least in the small population sample analyzed here, humans may
be closer to carnivores in some functional aspects of their gut microbiota.
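The leave-one-out procedure used above can be illustrated with a small sketch. Note that it swaps the SVM for a much simpler nearest-centroid classifier so that it runs with the standard library only; the two-dimensional profiles are hypothetical toy values, not GlyDe profiles from the dataset.

```python
def centroid(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict(train, sample):
    """Assign `sample` to the label whose training centroid is closest.

    This nearest-centroid rule is a stand-in for the SVM used in the text.
    """
    by_label = {}
    for profile, label in train:
        by_label.setdefault(label, []).append(profile)
    return min(by_label, key=lambda lab: sq_dist(centroid(by_label[lab]), sample))

def leave_one_out(data):
    """Hold out each sample once, train on the rest, count correct calls."""
    correct = 0
    for i, (profile, label) in enumerate(data):
        train = data[:i] + data[i + 1:]
        correct += predict(train, profile) == label
    return correct

# Hypothetical profiles: (plant-specific score, animal-specific score).
toy = [
    ([9.0, 1.0], "herbivore"), ([8.0, 2.0], "herbivore"), ([7.5, 1.5], "herbivore"),
    ([1.0, 9.0], "carnivore"), ([2.0, 8.0], "carnivore"), ([1.5, 7.0], "carnivore"),
]
n_correct = leave_one_out(toy)
```

Once validated in this way, a classifier trained on all herbivore and carnivore samples can be applied to unlabeled samples, as was done for the omnivore and human samples.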
Sample    Mammalian Species           SVM Diet Predicted   Fiber Index   Correspondence
4461343   Hamadryas baboon            Herbivore            50-500        ✓
4461344   Hamadryas baboon            Herbivore            50-500        ✓
4461347   North American black bear   Carnivore            0-50          ✓
4461348   Black lemur                 Carnivore            N/A           N/A
4461351   Goeldi's marmoset           Carnivore            0-50          ✓
4461353   Chimpanzee                  Herbivore            50-500        ✓
4461354   Chimpanzee                  Herbivore            50-500        ✓
4461374   Ring-tailed lemur           Herbivore            50-500        ✓
4461375   White-faced saki            Herbivore            0-50          ✗
4461376   Spectacled bear             Carnivore            50-500        ✗
4461378   Prevost's squirrel          Carnivore            0-50          ✓
Chapter 4::Table 1: Mammalian host diet prediction by GlyDe profiles. An SVM classifier was trained based on the
GlyDe profiles of herbivores and carnivores. A diet Fiber Index for these species was obtained from Ley et al. [120],
defining the percentages in each diet of acid-detergent fiber (ADF) and neutral-detergent fiber. A higher index suggests a
more plant-based diet. The last column displays the correspondence between the GlyDe predicted diet of the animal and
its Fiber Index, where a value of 0-50 corresponds to carnivores and a value of 50-500 corresponds to herbivores.
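The correspondence column of Table 1 follows a simple rule, which the sketch below reproduces from the table rows as transcribed here. The variable and function names are illustrative choices, and a missing Fiber Index is represented as None.

```python
# Fiber Index ranges and the diet they indicate, as stated in the Table 1 caption.
FIBER_INDEX_DIET = {"0-50": "Carnivore", "50-500": "Herbivore"}

# (sample, species, SVM-predicted diet, Fiber Index range), transcribed from Table 1.
TABLE_1 = [
    ("4461343", "Hamadryas baboon", "Herbivore", "50-500"),
    ("4461344", "Hamadryas baboon", "Herbivore", "50-500"),
    ("4461347", "North American black bear", "Carnivore", "0-50"),
    ("4461348", "Black lemur", "Carnivore", None),  # Fiber Index not available
    ("4461351", "Goeldi's marmoset", "Carnivore", "0-50"),
    ("4461353", "Chimpanzee", "Herbivore", "50-500"),
    ("4461354", "Chimpanzee", "Herbivore", "50-500"),
    ("4461374", "Ring-tailed lemur", "Herbivore", "50-500"),
    ("4461375", "White-faced saki", "Herbivore", "0-50"),
    ("4461376", "Spectacled bear", "Carnivore", "50-500"),
    ("4461378", "Prevost's squirrel", "Carnivore", "0-50"),
]

def correspondence(predicted_diet, fiber_index):
    """Return True/False for agreement, or None when no Fiber Index exists."""
    if fiber_index is None:
        return None
    return FIBER_INDEX_DIET[fiber_index] == predicted_diet

marks = [correspondence(diet, fi) for _, _, diet, fi in TABLE_1]
agree = marks.count(True)      # rows marked with a check in the table
disagree = marks.count(False)  # rows marked with a cross in the table
```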
We next explored whether humans who live in different geographies with markedly
different diets exhibit different GlyDe profiles. We analyzed the Yatsunenko et al.
dataset [121], which contains metagenomic sequences from fecal samples of 110
human individuals who live in Venezuela, Malawi and the USA. Malawian and
Venezuelan diets are dominated by plant-derived polysaccharides, while typical US
diets contain large quantities of meat [121]. As before, we ran GlyDe and performed
PCoA on all the GlyDe profiles. Because infants display large variability over the
first coordinate (Supplementary Fig. 3b), and have an unusual diet relative to adults,
we filtered out all samples from individuals younger than 2 years old. This led to a
clear separation over the first coordinate between low meat consumers (Malawians
and Venezuelans) and high meat consumers (Americans) (Figure 3b). The ratio of
animal to plant-specific GlyDe scores revealed significant differences between
samples from different countries of origin (ANOVA's F=6.56, p<0.005), with a
larger animal/plant ratio in USA (p<0.003) and Venezuela (p<0.03) compared to
Malawi. The ratio in USA was slightly but not significantly higher than in Venezuela
(Tukey-Kramer HSD, Supplementary Fig. 3c).
Chapter 4::Figure 3: The connection between glycan degradation and diet. (a) GlyDe profiling analysis of the Muegge
dataset. Bars showing the average sum of Plant-specific and Animal-specific GlyDe scores of the samples grouped
according to their hosts' diet and normalized by the number of CAZymes in each sample. A fourth group is created to
segregate humans from all other omnivores. The plant- and animal- specific GlyDe scores of herbivores and carnivores
are significantly different (p=0.04 and p=0.0001, respectively). (b) The Yatsunenko dataset. A scatter plot showing the
samples' projection on the first principal coordinate (axis label: PCo 1 = 56.69%), colored according to the country of origin
(Malawi, United States of America, Venezuela). Samples from individuals younger than 2 years old were omitted (see text).
4.3. Discussion
In this analysis we aimed to determine the association between diet and microbial
glycan metabolism in the gut. We detected diet-driven adaptations at both the level
of single species (Figure 2b) and of communities (Figure 3a). We found species
belonging to Bacteroidetes to be the most efficient degraders of animal-derived
glycans and human milk oligosaccharides. While this trend is apparent in both the
Bacteroides and Parabacteroides genera it is absent from Prevotella, another key
member of that phylum. Diets that are high in animal protein have been associated
with high levels of Bacteroides, whereas enrichment of Prevotella was associated
with diets rich in plant-derived carbohydrates and very low in animal protein [121-
123]. Given that many dietary animal glycans are derived from proteins (i.e.
glycoproteins and proteoglycans), we propose that the high capability of Bacteroides
and Parabacteroides to degrade animal glycans can explain why their abundance is
increased in Westerners [123, 124].
The plethora of novel glycans and their predicted glycan degradation efficiencies
supplied by our method may prove to be a highly important tool for designing
prebiotic interventions. As a striking example, a linear regression model based on
GlyDe-related features was capable of accurately predicting the abundance of
bacterial strains that displayed low inter-samples variance. Of the features in the
regression model, degradation of long polysaccharides was the most predictive, an
unsurprising result considering the importance of these glycans as the main carbon
and energy source for colonic bacteria. Finally, our results were improved
significantly when dividing the HMP samples into two clusters and re-analyzing each
cluster individually. This supports the notion that microbiome analysis should not be
general, but rather be based carefully on the background community structure.
Our GlyDe profiling revealed that the relative abundance of many taxa, especially
those of Clostridia, is significantly correlated with their ability to degrade glycans. It
was recently shown that Clostridium difficile and other pathogenic gut bacteria rely
on microbiota-liberated mucosal glycans during their expansion in the gut following
antibiotic treatment [125]. Thus, it may be possible to design prebiotics that help
increase the levels of beneficial Clostridia and prevent the expansion of pathogenic
strains. More generally, since the breakdown of a given substrate can be highly
species-specific [103], the prediction of bacteria's glycan degradation efficiencies
may prove to be an important tool for designing nutritional interventions to help alter
microbial communities.
In analyzing mammalian fecal samples data from Muegge et al. [118], we
demonstrated that differences in microbial community composition carry functional
importance -- namely, that the microbiota of herbivores and carnivores have stronger
affinities to plant- and animal-derived glycans, respectively. To the best of our
knowledge, this is the first time that a computational framework has been able to
provide such observations. The lack of large scale in-vitro glycan utilization assays
makes straightforward validation of many of our predictions difficult at present.
Nevertheless, our ability to train an accurate classifier to predict the diet of a host
based on its microbiota glycan degradation profile, and the correspondence between
the classifier's predictions and animal nutrition (Table 1), both provide strong
operative testimony to the veracity and utility of the GlyDe pipeline.
Although humans are generally thought of as omnivores, there is an ongoing debate
on the subject of our dietary history and adaptations. Tackling this question through
the lens of our microbiota, we used the aforementioned binary herbivore-carnivore
classifier in order to classify humans. Remarkably, the classifier predicted 15 out of
18 human subjects to be carnivores. This result is less surprising considering that all
of the human subjects were US residents, and that the US is the most meat-
consuming country per capita in the world [126]. In contrast to the US population,
the diets of individuals from Malawi and Venezuela mainly include plant-derived
polysaccharides (they consume, on average, 8.3 and 76.8 kilograms of meat per year,
as opposed to 120.2 in the US [126]). We consequently find a lower ratio of animal-
to plant- degradation efficiency in the microbiota of individuals from these countries
(Supplementary Fig. 4c). Notably, GlyDe does not predict a reduced efficiency of
plant degradation within the US population. Therefore, it seems that the capacity of
Western individuals to degrade glycans has not diminished over the course of
evolution, but merely shifted towards the direction of carnivores.
Taken together, these results further advance our understanding of human diet-
specific adaptations but, as always, conclusions must be drawn with caution. First,
extrapolating from animal data to humans is problematic because of countless
genetic and environmental factors. Secondly, the data we rely upon is often
incomplete. For instance, only 74 out of the 146 CAZymes in GlyDe were mapped to
at least one HMP genome (Supplementary Fig. 2c). Furthermore, 31 CAZymes were
not capable of breaking any glycan, either because some glycan structures are
missing from the database or because of inaccurate enzymatic annotation
(Supplementary Fig. 2c). Finally, the GlyDe platform does not take into account
many important factors such as enzyme transcription levels and downstream
biochemical pathways for glycan utilization. Nevertheless, GlyDe is the first
computational analysis framework that successfully enables one to directly model
how the microbiota can respond to dietary glycans from a mechanistic point of view.
We expect that future studies will integrate GlyDe into routine 16S rRNA analysis
(e.g. with the help of PICRUSt [127]), as well as incorporate GlyDe within the larger
framework of genome scale metabolic modeling (e.g. [108, 128-131]), further
advancing our understanding of human dietary needs and the design of novel
nutritional interventions.
4.4. Methods
4.4.1. Data Retrieval
Information about glycans and the enzymes that might break them is spread across
many databases and tools. In this section we list the sources of the data used later to
infer genome-based glycan degradation capacity.
Bacterial taxa
A catalog of 281 taxa was downloaded on 10/08/11 from The Human Microbiome
Project (HMP) website (http://www.hmpdacc.org/) using the following filters:
NCBI Superkingdom: Bacteria, HMP Isolation Body Site: Gastrointestinal Tract,
Project Status: Complete, NCBI Submission Status: annotation (and sequence)
public on NCBI site. The catalog contains the following annotation fields: HMP ID,
GOLD ID, Organism Name, Domain, NCBI Taxon ID, NCBI Superkingdom, NCBI
Phylum, NCBI Class, NCBI Order, NCBI Family, NCBI Genus, NCBI Species, All
Body Sites, All Body Subsites, Current Finishing Level, NCBI Project ID, Genbank
ID, Gene Count, Size (KB), GC Content, Greengenes ID, NCBI 16S ACCESSION,
Strain Repository ID, Oxygen Requirement, Cell Shape, Motility, Sporulation,
Temperature Range, Optimum Temperature, Gram Stain, and Type Strain.
Genome Annotations
All of the HMP taxa were searched against The Seed database
(http://pubseed.theseed.org/) using the key of NCBI Taxon ID as a cross-reference.
204 matches were detected and their RAST genome annotations were extracted using
the web services API.
Glycans
The entire KEGG Glycan database (http://www.genome.jp/kegg/glycan/) was
downloaded on 01/07/11. The database contains 10978 glycans. We used the
following annotation fields from the annotations: G number, Name, KCF file and
Class. An additional Biological Origin field was retrieved from an external source,
as described later.
The KCF file for each glycan describes a graphical representation of its 2D structure.
This representation takes into account the monomeric building blocks (nodes) and
the glycosidic linkages (edges) of the glycan. Textual and visual representations of
the KCF graph for glycan G00010 are given in Supplementary Figure 1a and 1b.
Glycan Filtering
The glycans database was subsequently filtered according to the following criteria:
Since the database contains more than 800 nodes denoting glycan-related "building
blocks", many of which are extremely rare, we chose to focus on a subset of 35
nodes corresponding to the most prominent sugar monosaccharides, prevalent
modifications, amino acids found in glycoproteins, and Ceramide found in
glycolipids. Therefore, we removed from the analysis all of the glycans which
contained nodes not part of this subset.
Similar to the nodes, an edge connecting two nodes in KEGG Glycan mostly has a
standard form denoting whether the sugar at the non-reducing side is in alpha or beta
conformation, as well as the number of the carbons participating in the glycosidic
linkage, e.g. "Glc a1-3 Glc" denoting Glucose alpha 1-3 Glucose. However, there are
some rare edges that have a different form. In order to maintain consistent reaction
rules we defined an edge to be legal if it has the common form of 'R z$-$ R' where:
R - is any node except for 'Thr' (Threonine), 'Ser' (Serine), 'Ser/Thr' (Serine or
Threonine), 'Asn' (Asparagine), 'S' (Sulfate) or 'P' (Phosphate),
z - is either 'a' or 'b'
$ - is any number.
Since threonine, serine, asparagine, sulfate and phosphate are not monosaccharides,
the glycosidic linkages they are involved in are not formed via a carbon atom and
therefore the edge description is different. In these cases the rule we used is 'R z$- R*',
where R is a regular node and R* is a non-monosaccharide node. All other edges
were marked as illegal and their containing glycans were omitted from the analysis.
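The two legal edge forms can be expressed as regular expressions. The sketch below assumes that edges are available as text in the form used in this section (e.g. "Glc a1-3 Glc") and that the non-monosaccharide node appears on the right-hand side, as in 'R z$- R*'; the function name and the second example string are hypothetical.

```python
import re

# Nodes that are not monosaccharides, as listed in the text.
NON_MONOSACCHARIDES = {"Thr", "Ser", "Ser/Thr", "Asn", "S", "P"}

# 'R z$-$ R': anomer letter followed by two carbon numbers.
SUGAR_EDGE = re.compile(r"^(\S+) ([ab])(\d+)-(\d+) (\S+)$")
# 'R z$- R*': the second carbon number is absent.
NON_SUGAR_EDGE = re.compile(r"^(\S+) ([ab])(\d+)- (\S+)$")

def is_legal_edge(edge):
    """Return True if the edge string matches one of the two legal forms."""
    m = SUGAR_EDGE.match(edge)
    if m:
        left, right = m.group(1), m.group(5)
        return left not in NON_MONOSACCHARIDES and right not in NON_MONOSACCHARIDES
    m = NON_SUGAR_EDGE.match(edge)
    if m:
        left, right = m.group(1), m.group(4)
        return left not in NON_MONOSACCHARIDES and right in NON_MONOSACCHARIDES
    return False
```

A glycan would then be kept only if every one of its edges passes this check.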
Some glycans in the database have different IDs but identical structures and therefore
we denote them as "Synonym Glycans". Synonym Glycans were grouped together
and one glycan from each group was chosen to represent the entire group for further
analyses.
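Grouping Synonym Glycans amounts to bucketing glycan IDs by structure. The sketch below is a simplification under two assumptions that are not stated in the text: a structure is represented as a set of edge strings (which ignores node positions and is therefore weaker than a true graph comparison), and the representative of each group is the lexicographically smallest ID. The IDs shown are placeholders.

```python
def group_synonyms(structures):
    """Group glycan IDs that share an identical structure.

    structures: {glycan_id: hashable structure key}, here a frozenset of edges.
    Returns {representative_id: [all ids in the group, sorted]}.
    """
    groups = {}
    for glycan_id in sorted(structures):
        groups.setdefault(structures[glycan_id], []).append(glycan_id)
    # The first ID of each sorted group acts as the representative (assumption).
    return {ids[0]: ids for ids in groups.values()}

# Placeholder IDs and structures for illustration only.
toy = {
    "ID_C": frozenset({"Glc a1-3 Glc"}),
    "ID_A": frozenset({"Glc a1-3 Glc"}),
    "ID_B": frozenset({"Man a1-3 Glc", "Glc b1-4 Glc"}),
}
synonym_groups = group_synonyms(toy)
```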
Glycan Structure Definition
In order to process the database to conform to our subsequent Glycan Degradation
(GlyDe) reactions (see the section Reconstruction of Glycan Degradation
Reactions), we identify several types of glycans:
Regular glycans: these glycans are of a fixed and known length (Supplementary Figure 1c).
Linear repeating glycans: these glycans are built completely from a repeating sugar segment.
Repeating parts are marked by (*) and [] (Supplementary Figure 1d).
Non Linear repeating glycans: these glycans have a repeating linear segment but also
modifications on some of the sugars, which makes them non-linear (Supplementary Figure 1e).
Polysaccharides: a glycan was defined as a polysaccharide if it met one of the following
conditions:
- The glycan is a repeating glycan.
- The glycan has the value "Polysaccharide" in its Class field in the KEGG Glycan
database.
- The glycan has more than 10 nodes.
Enzyme Commission (EC) Numbers and Glycan Degradation rules
We obtained a list of 146 Carbohydrate-Active EnZymes (CAZymes) with a textual
description of their enzymatic function using the Carbohydrate Active Enzymes
(CAZy) database (http://www.cazy.org/). CAZy describes families of structurally-
related catalytic and carbohydrate-binding modules (or functional domains) of
enzymes that degrade, modify, or create glycosidic bonds. We retrieved from the
database all the EC numbers that belong to the following families:
EC 2.3.1 – Acyltransferases, transferring groups other than amino-acyl groups;
EC 2.4.1 – Glycosyl transferases;
EC 3.1.1 – Carboxylic Ester Hydrolases;
EC 3.2.1 – Glycoside hydrolases;
EC 3.5.1 – Hydrolases acting on carbon-nitrogen bonds, other than peptide bonds, in
linear amides;
EC 4.2.2 – Polysaccharide lyases.
Based on the information available for these EC numbers in ExPASy
(http://expasy.org/) and KEGG (http://www.genome.jp/kegg/) we manually
generated a table that links each EC number with the following fields: EC number,
Enzyme Name, Linkages Broken, Contained Sub-glycan (linkage must be part of the
sub-glycan), Contains Only (nodes), Glycan Released, Endo vs. Exo, DP preference
(# of nodes), Terminal Side Preference, Enzymatic Reaction, and Comments.
These fields were later used to generate glycan degradation reactions by defining and
implementing a set of rules analyzing the KCF file of all the glycans (described in the
section: Reconstruction of Glycan Degradation Reactions). Below is a description of
the fields.
EC number – The number of the enzyme reflecting its catalytic activity.
Enzyme Name – The accepted name of the enzyme.
KEGG Reactions - The reactions from KEGG mapped to this EC number.
Linkages Broken – In the case of glycosidic linkages the value is a string
representing two nodes and an edge that connects them based on the KCF graph
representation of the glycan structure. In the case of deacetylation reactions the value
given is "Ac-R", denoting the removal of an acetyl group from node R.
Contained Sub-glycan - A G# identifier of a glycan whose structure must be
contained within the structure of a larger glycan, e.g. "Glc b1-4 Glc" is a sub-glycan
of "Man a1-3 Glc b1-4 Glc".
Contains Only (nodes) - A G# identifier for one or more nodes which the glycan
must contain and only contain.
Glycan Released - A G# identifier that defines a glycan that must be one of the
products after the reaction of this enzyme takes place.
Endo vs. Exo - "exo" enzymes can remove only terminal sugars (the edges of the
terminal nodes), "endo" enzymes can break all glycosidic bonds except for terminal
ones (remove all edges except the ones of the terminal nodes), and "both" enzymes
can break any bond (remove any edge).
DP preference - This field reflects the degree of polymerization of the glycans this
enzyme works upon. This number is also the exact/minimal ("+" sign)/ maximal ("-"
sign) number of nodes that the KCF graph denoting the glycan structure must
contain.
Terminal Side Preference – This field is unique for exo-acting enzymes and
describes their specificity towards the reducing or non-reducing end. The KCF graph
is directional, hence "reducing" means only the removal of the rightmost node is
allowed, "non-reducing" refers to the leftmost, and "both" allows the removal of both
edges. Notice that the terminal node of the reducing end is always at position 1 in the
KCF graph, except in repeating glycans.
Enzymatic Reaction - A textual description of the reaction performed by this EC
number taken from http://enzyme.expasy.org/
Comments - Specific comments about this EC number taken from
http://enzyme.expasy.org/
The above fields contain special characters which are described below:
$ - signifies any number
R - signifies any type of the following sugars (nodes): Ara, Araf, D/LAra, D/LAraf,
LAra, LAraf, D/LAraf, D/LAra, Api, Apif, D/LApi, D/LApif, 3,6-Anhydro-LGal,
L3,6-Anhydro-Gal, 3,6-Anhydro-Gal, GalA, D/LGalA, GalNAc, GalfNAc,
D/LGalNAc, GalN, GlcNAc, D/LGlcNAc, GlcA, D/LGlcA, GlcN, D/LGlcN, Glc,
Glcf, D/LGlc, Fru, Fruf, D/LFru, D-Fruf, Man, Manf, D/LMan, ManA, Rha,
D/LRha, LRha, D/LRha, Gal, Galf, D/LGal, D/LGalf, Fuc, D/LFuc, Fucf, LFuc,
D/LFuc, Xyl, D/LXyl, Xylf, Neu, Neu5Ac, Neu5Gc, MurNAc.
# - denotes an OR association
& - denotes an AND association
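To make the Endo vs. Exo and DP preference fields described above concrete, the following sketch applies them to a linear chain. It is a simplification: real glycans are branched graphs, and the textual encoding of the DP field (e.g. "4", "4+", "4-") is an assumption based on the description of the "+" and "-" signs. The function names are hypothetical.

```python
def breakable_edges(n_nodes, mode):
    """Indices of the edges an enzyme may break in a linear chain.

    Edge i joins node i and node i + 1, so a chain has n_nodes - 1 edges.
    Terminal edges are the first and the last edge of the chain.
    """
    edges = list(range(n_nodes - 1))
    terminal = {0, n_nodes - 2} if edges else set()
    if mode == "exo":
        return [e for e in edges if e in terminal]
    if mode == "endo":
        return [e for e in edges if e not in terminal]
    if mode == "both":
        return edges
    raise ValueError(f"unknown mode: {mode!r}")

def dp_allows(dp_preference, n_nodes):
    """Check an exact ('4'), minimal ('4+') or maximal ('4-') node count."""
    if dp_preference.endswith("+"):
        return n_nodes >= int(dp_preference[:-1])
    if dp_preference.endswith("-"):
        return n_nodes <= int(dp_preference[:-1])
    return n_nodes == int(dp_preference)
```

For a five-node chain, an exo enzyme can act on the two outer edges and an endo enzyme on the two inner ones.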
Overall, this workflow resulted in 141561 glycan degradation reactions, of which
9325 are reactions that degrade KEGG glycans and newly reconstructed glycans, and
132236 intermediate glycan degrading reactions.
CAZyme annotation
We used sequence similarity to match the genes which belong to the HMP taxa with
specific Carbohydrate Active enZymes (CAZymes). We therefore BLASTed all of
the genomes of the HMP taxa against the bacterial protein sequences found in the
CAZy database. Each enzyme family in CAZy contains a set of manually curated
enzymes determined to execute a specific catalytic function. We used the NCBI Blast
utility and filtered out hits whose E-value exceeded 10e-10 and matches below 97% identity.
At this point we had a mapping between genes in the HMP taxa and CAZyme
families. Because many families contain a one-to-many mapping between a family
and its associated EC numbers we had to refine this annotation. We therefore
extracted the genes' predicted enzymatic annotations from 'The SEED' and KEGG
databases. While the CAZyme annotations are more comprehensive, they are
sometimes not as accurate as the manually curated ones. Thus, we integrated the
information obtained from all of these sources using the following logic: for proteins
that were mapped to families of one EC number in CAZy, we accepted this
annotation. For proteins that were mapped to families of multiple EC numbers, we
first checked if they had an available annotation in SEED or KEGG, and if that was
the case, then we checked if this annotation belonged to one of the multiple
annotations in CAZy. If it did then we accepted the KEGG/SEED annotation.
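The integration logic above can be written as a small decision function. This is a sketch: the function name and argument layout are hypothetical, and because the text does not say what happens when no source resolves a multi-EC family, the sketch returns None in that case.

```python
def resolve_ec(cazy_family_ecs, curated_ec=None):
    """Pick an EC number for a protein mapped to a CAZy family.

    cazy_family_ecs: set of EC numbers associated with the matched family
    curated_ec:      EC number from SEED or KEGG, or None if unavailable
    """
    if len(cazy_family_ecs) == 1:
        # Families with a single EC number: accept the CAZy annotation.
        return next(iter(cazy_family_ecs))
    if curated_ec is not None and curated_ec in cazy_family_ecs:
        # Multi-EC family: accept SEED/KEGG only if CAZy lists the same EC.
        return curated_ec
    return None  # unresolved; handling of this case is not described in the text

single = resolve_ec({"3.2.1.4"})
agreed = resolve_ec({"3.2.1.4", "3.2.1.8"}, "3.2.1.8")
conflict = resolve_ec({"3.2.1.4", "3.2.1.8"}, "3.2.1.21")
```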
Subcellular Localization annotation
To define the subcellular localization (SCL) of reactions we used the RAST genome
annotation as a first proxy. We mined the function and subsystem fields of the
annotation for special keywords. For our purposes we were only interested whether
the enzyme exerts its function inside the cell or outside. Enzymes were defined as
intracellular if their associated genes contained one of the keywords: cytoplasm, cytosol or
cytoplasmic. Enzymes were defined as cross-membrane if their associated genes
contained one of the keywords: periplasm, periplasmic, inner membrane or
cytoplasmic membrane. And finally enzymes were defined as extracellular if their
associated genes contained one of the keywords: cellulosome, outer membrane,
secreted, cell wall, or extracellular. For enzymes that were not associated with any
meaningful keyword we took advantage of the LOCtree localization prediction
software (https://rostlab.org/owiki/index.php/LOCtree). LOCtree uses a protein
amino acid sequence to predict its SCL. It supplies five possible SCLs: cytosol, inner
membrane, periplasmatic, outer membrane, and secreted. Enzymes with the value
cytosol were classified as intracellular, enzymes with the values secreted and outer
membrane were classified as extracellular, and enzymes with the values
periplasmatic or inner membrane were classified as cross-membrane, i.e. enzymes
that exert their function on the cross-membrane between the cell and its environment.
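The keyword search and the LOCtree fallback can be sketched as below. Two details are assumptions rather than facts from the text: keywords are matched as case-insensitive substrings, and cross-membrane keywords are tested first so that "cytoplasmic membrane" is not captured by the intracellular keyword "cytoplasmic". The function name is hypothetical.

```python
# Keyword groups from the text, ordered so that longer phrases win (assumption).
KEYWORDS = [
    ("cross-membrane", ("periplasm", "inner membrane", "cytoplasmic membrane")),
    ("extracellular", ("cellulosome", "outer membrane", "secreted",
                       "cell wall", "extracellular")),
    ("intracellular", ("cytoplasm", "cytosol")),
]

# Mapping of the five LOCtree outputs to the three classes used in the text.
LOCTREE_TO_SCL = {
    "cytosol": "intracellular",
    "secreted": "extracellular",
    "outer membrane": "extracellular",
    "periplasmatic": "cross-membrane",
    "inner membrane": "cross-membrane",
}

def localize(annotation_text, loctree_prediction=None):
    """Classify an enzyme by annotation keywords, else by a LOCtree call."""
    text = annotation_text.lower()
    for scl, words in KEYWORDS:
        # "periplasm" also matches "periplasmic"; "cytoplasm" matches "cytoplasmic".
        if any(word in text for word in words):
            return scl
    return LOCTREE_TO_SCL.get(loctree_prediction)
```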
To fix possible erroneous annotations we refined our localization selection based on
specific knowledge of the glycan degradation biochemistry. A literature survey
suggested that there are no polysaccharides within the bacterial cytoplasm (with
glycogen being the only exception). Thus, enzymes that were predicted as
intracellular or cross-membrane were filtered out if the glycan that they processed
was either repeating, defined as a polysaccharide or had more than 10 sub-
components.
Biological origin of glycans
We retrieved the CarbBank database [132] and mapped the KEGG glycans to it using
the KEGG Glycan ID (G number) as a cross-reference. CarbBank contains
detailed descriptions of where a specific glycan can be found in nature. We parsed
these data in order to define certain glycans as being either plant-derived or animal-
derived.
Degree of polymerization of glycans
Glycans are routinely categorized as one of four possible degree of polymerization
classes. With respect to classes, glycans were defined as disaccharides if they contain
2 nodes, oligosaccharides if they contain 3-10 nodes, short polysaccharides if they
contain >10 nodes, and long polysaccharides if they have a repeating structure.
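The four degree-of-polymerization classes can be expressed as a short function. In this sketch the repeating-structure test takes precedence over the node count, which is an assumption since the text does not state an order; the function name is hypothetical.

```python
def dp_class(n_nodes, is_repeating=False):
    """Return the degree-of-polymerization class of a glycan, or None."""
    if is_repeating:
        return "Long Polysaccharides"
    if n_nodes == 2:
        return "Disaccharides"
    if 3 <= n_nodes <= 10:
        return "Oligosaccharides"
    if n_nodes > 10:
        return "Short Polysaccharides"
    return None  # single nodes fall outside the four classes
```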
4.4.2. Construction of the CAZyme table (a key step in the
GlyDe pipeline)
We manually curated all of the CAZymes (146 EC numbers) and mapped each one
to a set of computer-based rules dictating the mechanism by which it can break a
given glycan (i.e. split its graph into two separate components). These rules account
for structural features such as the glycosidic linkages the enzyme can break, the
cleavage mechanism, the chemical neighborhood, and the degree of polymerization
of the glycan. We then executed these rules on all the glycans that appear in the
KEGG Glycan database, which yielded 141,561 glycan degradation (GlyDe)
reactions. In the following section we describe the logic behind the reconstruction of
glycan degradation reactions by identifying for each glycan which enzymes are able
to break it and what the degradation reaction will look like. GlyDe reactions are then
mapped back to CAZymes in order to produce a table where the rows are CAZymes,
the columns are glycans, and each entry contains a CAZyme score, calculated as
follows:
$$\text{CAZyme score}_{ij} = \frac{1}{g_k}, \quad \forall\, e_i$$
where ei is an enzyme that can degrade glycan j and gk is the number of glycans that
it breaks. The entire construction process is summarized in Figure 1b.
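One reading of this score is that every glycan an enzyme can break receives the reciprocal of the number of glycans that enzyme breaks, so that highly promiscuous enzymes contribute less per glycan. The sketch below follows that reading; the data layout, function name and toy identifiers are hypothetical.

```python
def cazyme_score_table(degrades):
    """Build the CAZyme-by-glycan score table.

    degrades: {enzyme: set of glycans the enzyme can break}
    Returns   {enzyme: {glycan: 1 / number of glycans the enzyme breaks}}
    """
    table = {}
    for enzyme, glycans in degrades.items():
        # An enzyme with no substrates simply gets an empty row.
        table[enzyme] = {glycan: 1.0 / len(glycans) for glycan in glycans}
    return table

# Toy identifiers for illustration only.
toy_table = cazyme_score_table({
    "EC_A": {"G1", "G2", "G3", "G4"},  # breaks four glycans: 0.25 each
    "EC_B": {"G1"},                    # breaks one glycan: 1.0
    "EC_C": set(),
})
```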
Glycan Degradation (GlyDe) rules
A reaction is represented by its:
- Substrate(s).
- Product(s).
- The enzyme(s) responsible for the catalysis.
- A sub-cellular localization.
For the GlyDe reactions generation process we used all the computer-based glycan
breaking rules explained below. For an EC number-related rule to break a glycan,
the glycan and the resultant reaction must comply with all the limitations defined in
the fields of the given rule, namely:
- The glycan must contain at least one of the glycosidic linkages (or nodes containing
an acetyl group in case of deacetylation reactions) described in the Linkages Broken
field.
- The glycosidic linkage hydrolyzed must appear in the terminal edges of the glycan if
the value in the Endo Vs. Exo field is set to Exo, and in its internal edges if the value
is set to Endo.
- The number of nodes the glycan contains must conform to the value described in the
DP Preference field.
- In case the Endo Vs. Exo field is set to Exo, the terminal side of the glycosidic linkage
hydrolyzed must be located on the right side of the graph of the glycan if the value in
the Terminal Side Preference field is set to Reducing and on the left side if this field
is set to Non-reducing. If this field is set to Both, then location of this linkage on either
side is allowed.
- The glycan must contain the structure of a glycan (nodes and edges) described in the
Contained Sub-glycan field, and the linkage being broken must also be part of this
sub-glycan.
- The reaction must contain the glycan described in the Glycan Released field as one
of its products.
Figure 1a gives an example of an Exo-acting enzyme breaking a regular glycan.
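A subset of these checks can be sketched in Python as follows. The sketch covers only the Linkages Broken, Endo Vs. Exo and DP Preference fields; the remaining fields are omitted. It assumes the usual convention that Exo-acting enzymes cleave terminal linkages and Endo-acting enzymes cleave internal ones, and it represents a glycan as a list of labeled edges. The class, field and function names, as well as the linkage labels, are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

Edge = Tuple[str, str, str]  # (node_id_a, node_id_b, linkage label)


@dataclass(frozen=True)
class Rule:
    linkages_broken: FrozenSet[str]
    mode: str    # "Endo" or "Exo"
    dp_min: int  # smallest allowed number of nodes
    dp_max: int  # largest allowed number of nodes


def breakable_edges(edges: List[Edge], rule: Rule) -> List[Edge]:
    """Return the edges of a glycan that the rule is allowed to break."""
    degree: Counter = Counter()
    for a, b, _ in edges:
        degree[a] += 1
        degree[b] += 1
    if not rule.dp_min <= len(degree) <= rule.dp_max:
        return []  # DP Preference not satisfied
    result = []
    for a, b, linkage in edges:
        if linkage not in rule.linkages_broken:
            continue  # linkage not listed in Linkages Broken
        is_terminal = degree[a] == 1 or degree[b] == 1
        if (rule.mode == "Exo") == is_terminal:
            result.append((a, b, linkage))
    return result


# Hypothetical linear glycan with four nodes: n1 - n2 - n3 - n4.
chain = [("n1", "n2", "a1-4"), ("n2", "n3", "a1-4"), ("n3", "n4", "b1-4")]
exo_rule = Rule(frozenset({"a1-4"}), "Exo", 2, 10)
endo_rule = Rule(frozenset({"a1-4"}), "Endo", 2, 10)
exo_hits = breakable_edges(chain, exo_rule)
endo_hits = breakable_edges(chain, endo_rule)
```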
Deacetylation rules
Some of the EC numbers we analyzed have a deacetylation activity, i.e. they have the
capability to remove acetyl groups. In KEGG Glycan, monosaccharides containing
an acetyl group are described as a single unique node, e.g. the node GlcNAc
corresponds to N-acetyl-glucosamine. Therefore, if an enzyme has the capability to
remove an acetyl group, we simply remove the substring "Ac" from the label of the
node and make it the product of the reaction, e.g. GlcNAc → GlcN + Ac.
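The label manipulation described above amounts to a one-line string operation. A minimal sketch, with a hypothetical function name:

```python
def deacetylate(node_label: str) -> str:
    """Drop the first "Ac" from a node label, e.g. "GlcNAc" -> "GlcN"."""
    if "Ac" not in node_label:
        raise ValueError(f"{node_label!r} carries no acetyl group")
    return node_label.replace("Ac", "", 1)
```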
Reconstruction of new glycans
We manually constructed a set of 107 glycans that we deemed important but
that were missing from the KEGG Glycan database. To distinguish these glycans from
the ones previously available in the database we gave them the prefix "TAU" instead
of the "G" prefix used by KEGG. Furthermore, in most cases the products of the
degradation reactions do not have a pre-existing G number, meaning they currently
do not exist in the KEGG Glycan database. Working under the assumption that most
glycans in nature are still uncharacterized in databases, we decided to add them
automatically. Thus, whenever a reaction produced a new glycan, we gave this
glycan a unique ID beginning with "TAUS" (to distinguish it from original glycans
beginning with "G" or "TAU").
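The bookkeeping for newly produced glycans could look like the sketch below. It assumes that every glycan structure can be reduced to a canonical string that serves as a lookup key, and it uses a zero-padded running number after the "TAUS" prefix; both the key format and the numbering scheme are illustrative assumptions, not the format actually used by GlyDe.

```python
from typing import Dict, Iterable


def assign_product_ids(known: Dict[str, str], products: Iterable[str]) -> Dict[str, str]:
    """Give every product structure an ID, minting TAUS IDs for novel ones."""
    ids = dict(known)  # structure -> existing "G" or "TAU" identifier
    next_index = 1
    for structure in products:
        if structure not in ids:
            ids[structure] = f"TAUS{next_index:05d}"  # hypothetical ID format
            next_index += 1
    return ids


# Hypothetical structures written as plain strings.
known_ids = {"Glc(a1-4)Glc": "G00001"}
all_ids = assign_product_ids(
    known_ids, ["Glc(a1-4)Glc", "Gal(b1-4)Glc", "Gal(b1-4)Glc", "Man"]
)
```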
4.4.3. Data Analysis
Single taxa data analysis.
Defining microbial genome-specific Glycan Degradation (GlyDe) scores.
After building the CAZyme table we associated these CAZymes with the genomes of
the HMP gut taxa. For every taxon-specific gene we calculated, based on its
enzymatic annotation and those enzymes' subcellular localization, a GlyDe score.
Given a bacterial taxon i and a glycan j, the GlyDe score is calculated as follows:
\[ \text{GlyDe score}_{ij} = \sum_{k} \frac{n_{ik}}{g_k}, \quad \forall\, e_k \]
where ek is an enzyme that can degrade glycan j, nik is the number of genes in its
genome which translate to enzyme ek, and gk is the number of glycans broken by
enzyme ek.
This metric down-weights the contribution of promiscuous CAZymes relative to
those specifically geared to degrade the glycan in question (Figure 1b).
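The following Python sketch computes these scores for a single genome, assuming that the score of a glycan is the sum, over the enzymes able to degrade it, of the enzyme's gene copy number divided by the number of glycans that enzyme breaks. The enzyme names, glycan IDs and gene counts are invented for illustration.

```python
from typing import Dict, Set


def glyde_scores(gene_counts: Dict[str, int],
                 breaks: Dict[str, Set[str]]) -> Dict[str, float]:
    """Return the GlyDe score of every glycan for one genome.

    gene_counts[enzyme] is the number of genes encoding that enzyme;
    breaks[enzyme] is the set of glycans the enzyme can break.
    """
    scores: Dict[str, float] = {}
    for enzyme, n_genes in gene_counts.items():
        glycans = breaks.get(enzyme, set())
        for glycan in glycans:
            # each gene copy is shared equally among the glycans it can break
            scores[glycan] = scores.get(glycan, 0.0) + n_genes / len(glycans)
    return scores


# Hypothetical genome with two CAZymes.
toy_breaks = {"EC_A": {"G1", "G2", "G3", "G4"}, "EC_B": {"G1"}}
toy_counts = {"EC_A": 2, "EC_B": 3}
toy_scores = glyde_scores(toy_counts, toy_breaks)
```

Here glycan G1 scores 2/4 + 3/1 = 3.5, while each of the other three glycans scores 0.5.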
For some categories of glycans such as "Long Polysaccharides" and "Plant-specific
glycans", we defined a category specific score GSic, which is the sum of GlyDe
scores for glycans that belong in that group:
\[ \text{GlyDe score}_{ic} = \sum_{j \in c} GS_{ij} \]
where GSij is the GlyDe score for genome i and glycan j, and c is the collection of
glycans that belong to category C.
This scoring system has the feature that summing the GlyDe scores over all the
glycans in a given genome gives the total number of CAZymes in the genome:
\[ \text{Total CAZymes}_i = \sum_{j \in J} GS_{ij} \]
where J is the set of all glycans, and i is the index of a specific taxon.
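This identity can be checked by swapping the order of summation. The short derivation below assumes that the GlyDe score of taxon i for glycan j sums n_{ik}/g_k over the enzymes e_k able to degrade j; the symbol J_k, denoting the set of g_k glycans broken by enzyme e_k, is notation introduced here for convenience.

```latex
\sum_{j \in J} GS_{ij}
  = \sum_{j \in J} \; \sum_{k \,:\, j \in J_k} \frac{n_{ik}}{g_k}
  = \sum_{k} \; \sum_{j \in J_k} \frac{n_{ik}}{g_k}
  = \sum_{k} g_k \cdot \frac{n_{ik}}{g_k}
  = \sum_{k} n_{ik}
```

The right-hand side counts every CAZyme gene of taxon i exactly once.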
Subsequently, we defined the GlyDe Profile of a bacterial taxon as:
\[ \text{GlyDe profile}_i = \left( GS_{i1}, GS_{i2}, \ldots, GS_{i,|J|-1}, GS_{i,|J|} \right) \]
where J is the set of all glycans, and i is the index of a specific taxon.
GlyDe reaction consistency check - cross validation.
When applied to all the glycans available in KEGG, the GlyDe pipeline produced a
list of 114,573 intermediate glycan products, most of which were novel and thus do
not appear in the original KEGG database. To test how consistent GlyDe is, we
performed a cross-validation process where we picked a random subset of 1,000
glycans from KEGG and applied GlyDe to degrade them. We then tested whether the
products obtained from these 1,000 glycans were enriched with known versus novel
intermediate glycans. A hypergeometric test indicated that the products were highly
enriched for known glycans (p-value = $10^{-19}$; see Supplementary Figure 2a). A
sensitivity analysis with random subsets of different compositions and sizes still resulted
in highly significant enrichments (data not shown). This result indicates that GlyDe is
capable of recapitulating the biochemical knowledge imprinted in the CAZymes that
constitute its computational foundation.
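For reference, a one-sided hypergeometric enrichment p-value can be computed with the standard library alone, as in the sketch below. How the four arguments map onto this analysis (for instance, population = all intermediate products, successes = products already known in KEGG, draws = products of the sampled glycans, observed = known products among them) is our illustrative assumption, and the numbers in the example are toy values, not the thesis data.

```python
from math import comb


def hypergeom_enrichment_pvalue(population: int, successes: int,
                                draws: int, observed: int) -> float:
    """P(X >= observed) for X ~ Hypergeometric(population, successes, draws)."""
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(comb(successes, x) * comb(population - successes, draws - x)
               for x in range(observed, upper + 1))
    return tail / total


# Toy example: 10 items, 4 of them "known", 3 drawn, at least 2 known observed.
toy_pvalue = hypergeom_enrichment_pvalue(10, 4, 3, 2)
```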
Principal Coordinates Analysis on GlyDe profiles
We calculated the pairwise Bray-Curtis dissimilarities between all the GlyDe profiles
and performed Principal Coordinates Analysis (PCoA) on the resulting dissimilarity
matrix to project the differences in degradation into two dimensions (Supplementary
Figure 2d).
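The dissimilarity step can be sketched as follows, assuming that every GlyDe profile is a vector of non-negative scores over the same ordered list of glycans. The subsequent PCoA step (double-centering the squared dissimilarities and taking the leading eigenvectors) is not shown, and the function names are hypothetical.

```python
from typing import List, Sequence


def bray_curtis(u: Sequence[float], v: Sequence[float]) -> float:
    """Bray-Curtis dissimilarity: sum|u_j - v_j| / sum(u_j + v_j)."""
    if len(u) != len(v):
        raise ValueError("profiles must have the same length")
    denominator = sum(a + b for a, b in zip(u, v))
    if denominator == 0:
        return 0.0  # two all-zero profiles are treated as identical
    return sum(abs(a - b) for a, b in zip(u, v)) / denominator


def dissimilarity_matrix(profiles: List[Sequence[float]]) -> List[List[float]]:
    """Symmetric matrix of pairwise Bray-Curtis dissimilarities."""
    return [[bray_curtis(p, q) for q in profiles] for p in profiles]


# Toy profiles over three glycans.
toy_matrix = dissimilarity_matrix([[1.0, 2.0, 3.0], [1.0, 0.0, 5.0]])
```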
GlyDe-related features definition.
Based on GlyDe, we extracted 6 features that characterize several dimensions of
a (meta)genome's glycan degradation potential. These features are the Plant-specific,
Animal-specific, Disaccharides, Oligosaccharides, Short Polysaccharides
and Long Polysaccharides GlyDe scores. Each feature represents the sum of GlyDe
scores for the glycans that belong in the class.
16S rRNA sequence data analysis.
The Human Microbiome Project (HMP) dataset.
We retrieved the 16S rRNA sequence data and metadata from fecal samples
belonging to 325 healthy human individuals from the HMP Data Analysis and
Coordination Center (DACC) [133]. Because we were not interested in time series
data we only used samples from the initial time point. 16S rRNA sequences were
mapped to the HMP genomes based on sequence similarity and further used to build
an OTU table describing the abundances of the HMP taxa in each sample. For this
purpose we used the QIIME software [134] with these exact commands:
pick_otus.py -i HMP_samples_seqs -r HMP_taxa_ref_seqs -m uclust_ref -C
make_otu_table.py -i pick_otus_output
Metagenomics sequence data analysis.
We retrieved the Muegge et al. [118] and Yatsunenko et al. [121] datasets
from MG-RAST [135]. For both datasets we downloaded the FragGeneScan gene
calling output, which maps each original read to 0, 1, or more ORFs. This way many
reads could be assigned to a single ORF and so the abundance of each ORF was
taken into account. Next, ORFs were assigned to CAZymes and to specific
subcellular localizations. We then constructed a CAZyme abundance table
describing the abundances of all the CAZymes in each sample. In order to calculate
a sample-specific GlyDe score we used the sample's CAZyme abundance table and the
pre-calculated CAZyme score table. Thus, the GlyDe score GSkj of glycan j in
sample k is:
\[ GS_{kj} = \frac{D_{max}}{D_k} \sum_{m} \frac{n_{km}}{g_m}, \quad \forall\, e_m \]
where em is an enzyme that can degrade glycan j, nkm is the number of genes in sample
k that map to enzyme em, and gm is the number of glycans broken by enzyme em. Dmax/Dk
is a normalization factor that denotes the ratio between the depth (i.e. the total
number of sequenced reads) of the sample with the maximum depth (Dmax) and the
depth of the current sample (Dk).
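The depth normalization can be sketched separately from the scoring itself. The function below takes unnormalized per-sample scores and the sequencing depth of each sample, and rescales every sample by the ratio of the maximum depth to its own depth. The sample names, scores and depths are invented, and the function name is hypothetical.

```python
from typing import Dict


def depth_normalize(raw_scores: Dict[str, Dict[str, float]],
                    depths: Dict[str, int]) -> Dict[str, Dict[str, float]]:
    """Scale each sample's scores by (maximum depth) / (depth of that sample)."""
    d_max = max(depths.values())
    return {
        sample: {glycan: score * d_max / depths[sample]
                 for glycan, score in scores.items()}
        for sample, scores in raw_scores.items()
    }


# Two hypothetical samples with the same raw score but different depths.
toy_raw = {"sample_1": {"G1": 3.0}, "sample_2": {"G1": 3.0}}
toy_depths = {"sample_1": 1000, "sample_2": 4000}
toy_normalized = depth_normalize(toy_raw, toy_depths)
```

The shallower sample is scaled up by a factor of four, while the deepest sample is left unchanged.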
Subsequently, we defined the GlyDe Profile of sample k as:
\[ GP_k = \left( GS_{k1}, GS_{k2}, \ldots, GS_{k,|J|-1}, GS_{k,|J|} \right) \]
We calculated all the GlyDe scores for the Muegge et al. dataset and for the
Yatsunenko et al. dataset. The GlyDe algorithm output tables contain the following
fields describing each sample in the data: Unique CAZymes, Total CAZymes,
KEGG-derived Glycans Degraded, Total GlyDe Score, Plant-specific GlyDe Score,
Animal-specific GlyDe Score, Bacteria-specific GlyDe Score, Disaccharides,
Oligosaccharides, Short Polysaccharides, Long Polysaccharides.
Multivariate regression between GlyDe features and bacterial abundance.
We analyzed the 16S rRNA sequences from the HMP fecal samples in order to
determine the abundance of our 203 bacterial taxa in each sample. To increase the
signal to noise ratio (SNR) we excluded species with high abundance variability
based on the following criterion:
\[ \mathrm{SNR} = \frac{\text{mean abundance}}{\text{standard deviation}} > 0.3 \]
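A filter of this kind can be sketched as below. We assume here that taxa satisfying the inequality are the ones retained, and that the standard deviation is the sample standard deviation taken across samples; both points are our reading of the criterion rather than details stated in the text. The abundance values are invented.

```python
from statistics import mean, stdev
from typing import Dict, List


def snr(abundances: List[float]) -> float:
    """Mean abundance divided by the sample standard deviation."""
    sd = stdev(abundances)
    return float("inf") if sd == 0 else mean(abundances) / sd


def filter_by_snr(table: Dict[str, List[float]],
                  threshold: float = 0.3) -> Dict[str, List[float]]:
    """Keep taxa whose SNR across samples exceeds the threshold."""
    return {taxon: values for taxon, values in table.items()
            if snr(values) > threshold}


# Hypothetical abundances: one stable taxon, one present in a single sample only.
toy_table = {
    "stable_taxon": [10, 11, 9, 10],
    "sporadic_taxon": [0] * 15 + [160],
}
kept = filter_by_snr(toy_table)
```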
The 6 features described in the previous section were used to build a linear
regression model with bacterial abundance as the dependent variable:
Bacterial abundance = -13.248 * Plant-specific GlyDe Score + 28.1822 * Disaccharides -
9.6206 * Oligosaccharides + 32.701 * Long Polysaccharides + 42.7354
To eliminate the possibility of over-fitting the data we used a standard 10-fold cross
validation method. All calculations were performed using WEKA [136].
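For illustration, the fitted model above can be applied to a feature vector as in the following sketch, which copies the coefficients and the intercept from the regression equation. The helper that splits items into 10 folds is a simplified, unshuffled stand-in and does not reproduce the cross-validation procedure implemented in WEKA; all function names are hypothetical.

```python
from typing import Dict, List

# Coefficients and intercept as reported in the regression equation above.
COEFFICIENTS = {
    "Plant-specific GlyDe Score": -13.248,
    "Disaccharides": 28.1822,
    "Oligosaccharides": -9.6206,
    "Long Polysaccharides": 32.701,
}
INTERCEPT = 42.7354


def predict_abundance(features: Dict[str, float]) -> float:
    """Evaluate the linear model for one taxon's GlyDe feature values."""
    return INTERCEPT + sum(c * features[name] for name, c in COEFFICIENTS.items())


def k_fold_indices(n_items: int, k: int = 10) -> List[List[int]]:
    """Split range(n_items) into k folds of near-equal size (no shuffling)."""
    return [list(range(i, n_items, k)) for i in range(k)]


# Made-up feature vector in which every feature equals 1.
unit_prediction = predict_abundance({name: 1.0 for name in COEFFICIENTS})
```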
To assess the added value of using the GlyDe features over genomic information
alone, we defined for each HMP taxon a vector containing the genomic copy number
of the CAZymes used in the analysis. We then built a similar linear regression model
with these 82 CAZymes used as features and bacterial abundance in the HMP
samples as the dependent variable.
Because of the high variability in bacterial taxa abundance across the samples, we
used the KMeans algorithm to cluster the samples. We chose to use 3 clusters
because this option resulted in the lowest Cubic Clustering Criterion (CCC).
However, one cluster was composed of only one outlier taxon, so we omitted it from
further analysis. We built a cluster specific linear regression model as described
above:
Cluster 1 Bacterial abundance = -40.6116 * Plant-specific GlyDe Score + 33.283 * Long
Polysaccharides + 31.6149
Cluster 2 Bacterial abundance = -15.9773 * Oligosaccharides + 69.8184 * Long
Polysaccharides + 70.697
We defined a list of 25 bacterial taxa whose abundance was predicted with high
accuracy. A taxon was included in the list if the error of its predicted abundance obeyed the
following rule for either cluster 1 or cluster 2:
\[ \frac{\text{predicted abundance} - \text{actual abundance}}{\text{actual abundance}} < 1 \]
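The inclusion rule can be sketched as follows. The sketch assumes that the numerator is meant as an absolute difference and that the actual abundance is positive; both are assumptions on our part, and the abundance values are invented.

```python
from typing import Dict


def relative_error(predicted: float, actual: float) -> float:
    """Absolute prediction error relative to the actual abundance (actual > 0)."""
    return abs(predicted - actual) / actual


def in_predictable_list(actual: float, predicted_by_cluster: Dict[str, float]) -> bool:
    """True if the rule holds for at least one cluster-specific model."""
    return any(relative_error(p, actual) < 1
               for p in predicted_by_cluster.values())


# Hypothetical taxa with predictions from the two cluster-specific models.
taxon_a = in_predictable_list(50.0, {"cluster 1": 40.0, "cluster 2": 160.0})
taxon_b = in_predictable_list(10.0, {"cluster 1": 35.0, "cluster 2": 40.0})
```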
Classification of dietary patterns using GlyDe scores.
The GlyDe scores of all herbivore and carnivore mammals from the Muegge et al.
dataset were used to train a binary Support Vector Machine (SVM) classifier. The
SMO implementation of this classification algorithm in WEKA [136] was used for
the computation. To estimate the accuracy of the classifier we used standard
leave-one-out cross-validation. To apply this classifier to the remaining human and
non-human omnivore samples, we hid the labels of the samples and classified them as
either carnivore or herbivore.
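The leave-one-out procedure itself is independent of the classifier and can be sketched with the standard library. Note that the classifier below is a nearest-centroid rule used purely as a simple stand-in; it is not the SVM/SMO implementation from WEKA that was used in this work, and the feature vectors are invented.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]  # (GlyDe feature vector, diet label)


def nearest_centroid(train: List[Sample], x: Sequence[float]) -> str:
    """Stand-in classifier: predict the label whose mean vector is closest to x."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1

    def sq_dist(label: str) -> float:
        return sum((s / counts[label] - xi) ** 2 for s, xi in zip(sums[label], x))

    return min(sums, key=sq_dist)


def leave_one_out_accuracy(
        data: List[Sample],
        classify: Callable[[List[Sample], Sequence[float]], str]) -> float:
    """Hold out each sample once, train on the rest, and count correct calls."""
    hits = 0
    for i, (features, label) in enumerate(data):
        train = data[:i] + data[i + 1:]
        hits += classify(train, features) == label
    return hits / len(data)


# Hypothetical two-feature GlyDe profiles.
toy_data: List[Sample] = [
    ((5.0, 1.0), "herbivore"), ((6.0, 1.5), "herbivore"), ((5.5, 0.5), "herbivore"),
    ((1.0, 4.0), "carnivore"), ((0.5, 5.0), "carnivore"), ((1.5, 4.5), "carnivore"),
]
toy_accuracy = leave_one_out_accuracy(toy_data, nearest_centroid)
```

An unlabeled omnivore sample would then be classified by calling the trained rule on its feature vector, for example `nearest_centroid(toy_data, (3.0, 2.0))`.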
Chapter 5
5. Discussion
At the heart of systems biology is the increasing acknowledgement that biological
systems are highly interconnected, and that studies of biological 'nodes' in isolation
are not generally sufficient to recapitulate the complex emergent properties of a full
network. Genome-scale metabolic modeling (GSSM) is an implementation of this
holistic approach of systems biology within the realm of single-cell
metabolism. Genome-scale metabolic models have proven to be crucial resources for
predicting organism phenotypes from genotypes in numerous medical,
bioengineering and bio-remediation applications. The effort needed to manually
develop models for new organisms, together with the lack of standard
nomenclatures for metabolites and reactions in manually curated models, has
restricted the focus of most research to date to isolated, non-interacting species;
however, it is well known that prokaryotes live and thrive in dense communities, and
the interactions of community members with each other as well as with the
environment determine much of the functionality, adaptability, and capabilities of the
whole group. While individual genome-scale models are adequate to predict the
behavior of cells in pure cultures, most natural systems on earth require modeling of
metabolic interactivity between species in order to capture the most relevant biology.
The appearance of semi-automatic tools for generation of metabolic models, as well
as the subsequent availability of models for thousands of sequenced prokaryotes, has
made it possible to climb one step up the 'holistic ladder', from genome-scale
metabolic modeling to community genome-scale metabolic modeling (C-GSSM).
Although automatically generated models are still not of equivalent quality to
manually curated ones, they nonetheless open new avenues in the types of questions we may
ask. Importantly, these models solve the problem of differing metabolite and
reaction nomenclature between models -- a major technical hurdle in the past -- and
thus enable seamless modeling at the community level and comparisons between its
members.
In this dissertation I focused on the computational study of some questions that can
now be addressed due to the availability of a large set of 'normalized' prokaryotic
metabolic models.
5.1. Answering new types of questions with the
large number of available bacterial metabolic
models
In this dissertation I tried to address different classes of 'community'-related
questions that can now be answered.
The first class of questions dealt with the metabolic relations between different
bacterial species. In Chapter 2, I conducted the largest study to date of interactions
between different microbial species. The second class of questions dealt with the
comparison between GSSMs of different species and the extraction of common features.
In Chapter 3, I found a way to predict the growth rates of prokaryotes using GSSMs
that outperforms the commonly used GSSM-based method of
measuring biomass yield. Prokaryotes often live within and interact with a larger
host. In Chapter 4, I started examining the interaction between the gut microbiota and
the human host.
5.1.1. Metabolic Interaction within Bacterial communities
It is well known that prokaryotes live and thrive in dense communities, and the
interactions of community members with each other as well as with the environment
determine much of the functionality, adaptability, and capabilities of the whole
group. To date it has been difficult to predict which bacteria can stably co-exist, let
alone cooperate metabolically, making the artificial design of beneficial microbial
consortia extremely difficult.
In Chapter 2, we focused on the metabolic interactions between species, and the way
they affect the actual in-vitro and in-vivo interactions. We suggested a generic
approach for the systematic description of inter-species interactions. Our method has
its limitations. It solely focuses on the metabolic dimension, ignoring regulation as
well as the numerous strategies that microorganisms have evolved to augment the
acquisition of resources. The analysis lacks information on the true metabolic
composition of the environments considered, and hence focuses on predicting the
overall potential inter-species interactions, rather than providing a direct account of
their actual in-vivo communications in one specific environment. And finally, the
automatic reconstruction procedure for the GSSMs results in a significant increase in
the number of genome scale metabolic models, which, while proving useful in
predicting a variety of phenotypes, are typically less accurate than manually curated
models [25]. Yet, despite these significant limitations, our generic approach succeeds
in delineating clear differences in the interaction patterns of ecologically associated
versus randomly associated communities, and reveals fundamental new ecological
principles.
With the increasing efforts to provide an abiotic description of different
environments, together with the expected rapid rise in the number of metabolic
models as well as the improvement in their quality, a community-level metabolic
modeling framework such as the one laid down here
provides a computational basis for many exciting future applications. These include
the artificial design of 'expert' communities for bioremediation, where currently the
selection of community species is done by intelligent guesswork. We can look at the
construction of 'expert' communities as an alternative to large-scale gene insertion into
a single species, as done today with limited success in many bioengineering projects.
Similarly, our work may be applied to the rational design of probiotic administration,
as well as to the identification of species that may metabolically out-compete
pathogenic species.
5.1.2. Extracting cell qualities from a large scale metabolic
analysis across a large number of species
The availability of GSSMs for thousands of prokaryotes with standard nomenclatures
for metabolites and reactions made it possible to compare between the different
GSSMs, and to try to explain phenotypic behaviors across species using these
models. In Chapter 3 we aimed to explain the principles that determine cell growth
by analyzing the behavior of many bacterial species across different growing media.
Understanding cell growth rate is a long-standing scientific goal, of interest in
biology and especially in biotechnology when designing bacteria-based
manufacturing plants. In this work, we presented a new method for predicting
cellular growth rate, termed SUMEX, which does not require any empirical variables
apart from a metabolic network (i.e., a GEM) and the growth medium. SUMEX is
calculated by maximizing the SUM of molar EXchange fluxes (hence SUMEX) in a
genome-scale metabolic model. SUMEX correlated significantly with the growth
rate of microbes across species, environments, and genetic conditions, outperforming
traditional cellular objectives (most notably, the convention assuming biomass
maximization). The success of SUMEX suggested that the ability of a cell to
catabolize substrates and produce a strong proton gradient enables fast cell growth.
Easily applicable heuristics for predicting growth rate, such as the one we demonstrate
with SUMEX, may contribute to numerous medical and biotechnological goals,
ranging from the engineering of faster-growing industrial strains and the modeling of mixed
ecological communities to the inhibition of cancer growth.
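Schematically, the SUMEX objective described above can be written as a linear program of the same form as standard flux balance analysis, with the biomass objective replaced by the sum of exchange fluxes. In the sketch below, S is the stoichiometric matrix, v the flux vector, E the set of exchange reactions, and the bounds encode the growth medium; these symbols are introduced here for illustration, and the sign convention for uptake versus secretion is an assumption. The precise formulation is the one given in Chapter 3.

```latex
\max_{v} \; \sum_{j \in E} v_j
\quad \text{subject to} \quad
S v = 0, \qquad v^{\min} \le v \le v^{\max}
```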
We expect that future work based on the principles listed here will lead to improved
biotechnology applications and perhaps even to new functional insights into cancer,
as well as filling in a basic gap in our understanding of cellular function. The work
done in Chapter 3 was the first work (to the best of my knowledge) comparing GSSMs
across a large number of species in order to learn about nature. There are many
questions and phenomena that can be researched using the same general principles. I will
introduce some of them later in this chapter.
5.1.3. Investigating the relations between gut bacterial
communities and their hosts
Many bacterial communities operate within a larger host and interact with it. Our
initial goal when we started what turned out to be Chapter 4 was to build a metabolic
model of the human gut with its microbiota and use it to explain gut-related diseases.
We found that all the existing prokaryote GSSMs, both manually and
automatically generated, had hardly any reference to glycans. This finding forced us
to change our initial goal. We then decided to focus on glycans, which form the
primary nutritional source of microbes in the human gut, and on developing a way to
automatically integrate them into the GSSMs. In Chapter 4 we presented a novel
computational pipeline for modeling Glycan Degradation (GlyDe), providing a broad
view of the usage of these compounds on genome and metagenome scales, and
integrating them into the GSSMs.
GlyDe predicts the usage patterns of thousands of glycans by each of the sequenced
individual gut bacteria deposited in the Human Microbiome Project (HMP) database.
We aimed to determine the association between diet and microbial glycan
metabolism in the gut. We found diet-driven adaptations at both the level of single
species and of communities. We found species belonging to Bacteroidetes to be the
most efficient degraders of animal-derived glycans and human milk
oligosaccharides. While this trend is apparent in both the Bacteroides and
Parabacteroides genera it is absent from Prevotella, another key member of that
phylum.
Diets that are high in animal protein have been associated with high levels of
Bacteroides, whereas enrichment of Prevotella was associated with diets that are rich
in plant-derived carbohydrates and very low in animal protein [121-123]. Given that
many dietary animal glycans are derived from proteins (i.e. glycoproteins and
proteoglycans), we proposed that the high capability of Bacteroides and
Parabacteroides to degrade animal glycans can explain why their abundance is
increased in westerners [123, 124].
We expect that the plethora of novel glycans generated by GlyDe, together with the
predicted glycan degradation efficiencies supplied by our method, will be a highly important
tool for designing prebiotic interventions. We showed, as an example, a strong
relationship between the degradation of long polysaccharides and the
abundance of Clostridia, which may help identify prebiotic interventions that could
prevent Clostridium difficile infections by increasing levels of non-pathogenic
clostridia. The identification of such degradation-abundance relationships by GlyDe
is highly promising for future applications.
To the best of our knowledge, this is the first time that a computational framework
has been able to discriminate between carnivores and herbivores
based on their glycan degradation profiles.
GlyDe also supplies an automatic tool that can add glycan degradation
support to GSSMs. Using GlyDe, we can now build a compound model of the human
gut with its microbiota and use it to further advance our understanding of human
dietary needs, to try and explain gut related diseases, and to suggest the design of
future nutritional interventions.
There are many other questions related to the relationship between a bacterial
community and a host, many of them related to diseases. The availability of
GSSMs for many of the pathogens enables us to model such systems using the
C-GSSMs.
5.2. Future directions
This dissertation contains some pioneering work done using the newly available
large set of prokaryotic GSSMs. The dissertation only scratched the surface of the
opportunities that opened when these models became available. Chapters 2-4 each
suggest additional future research avenues within the scope of their respective studies.
In the next few paragraphs, I explore a few additional ideas I find worth pursuing,
building upon the work presented here and the metabolic
models now available.
5.2.1. New methods for simulating communities
In this dissertation we focused on modeling communities in steady state in a
chemostat. There are additional methods of utilizing metabolic models which are
currently used for single species and can be extended to community research. One of
the methods that removes the requirement for steady state is 'dynamic flux balance
analysis' as described in [137], which is currently used only for single-species
models. This method can be viewed as a part of the 'cellular automata' family [138].
It is well suited to simulating batch processing conditions and can also
assist in simulating large communities in a chemostat, where the existing FBA-based
methods might fail due to the large size of the linear programming problem they create.
The problem with cellular automata-based methods is the number of parameters that
need to be specified in order to simulate an in-vivo test. However, these parameters may
be considerably fewer than the parameters needed when using full kinetic
methods.
5.2.2. Specialized bacterial communities
The ability to design and simulate the action of a given bacterial community opens
many opportunities in the fields of bio-engineering and bio-remediation.
Bacterial communities can be designed, using the methods used in this dissertation,
to manufacture materials without using complex gene knockouts or gene insertions.
Currently many bio-remediation operations are done by specific bacterial
communities; however, these communities were mostly selected by guesswork. Using
the tools listed in this work, better communities may be designed so that they will
outperform the existing ones.
5.2.3. Examples of organisms’ traits that can be extracted
using the large number of prokaryotes metabolic models
In Chapter 3 of this dissertation, we investigated 'growth rate' across a large number
of prokaryote species. There are many other interesting questions that can be
investigated using the existing models. Following is a small list of such questions:
- Is there a propagation of metabolic pathways across the phylogenetic tree?
- What is the correlation between the phylogenetic distance of species and their
growing environment?
- Are there metabolic factors that play an important role in determining the pH
sensitivity of species?
Currently the existing metabolic models are not accurate enough to give exact
answers at the species level; however, when doing a large-scale survey of many
species, the relevant qualities may provide a good signal in aggregate, and lead to
new insights.
5.2.4. Investigating relationships of specific communities of
bacteria and their host
In Chapter 4, we began to investigate the gut bacterial community within mammals
and especially in humans. There are many such bacterial communities in nature that
operate within the vicinity of a given host. Some are related to the research of
diseases and pathogens, and others are related to parasitic or symbiotic relationships.
An additional example, which I believe is very important, is that of specific
communities of bacteria that can act as fertilizers for plants, or that can act as
insecticides and help plants survive attacks from pathogens or insects.
5.3. Summary
The appearance of semi-automatic tools for generation of metabolic models, and the
subsequent availability of models for thousands of sequenced prokaryotes, have
opened a new research subfield within the field of metabolic modeling.
Comparing species using GSSMs and building C-GSSMs shows a strong
potential in answering questions regarding species behaviors and in assisting us in
better utilizing prokaryotes in bioengineering applications. This dissertation has only
begun to address some of the key questions arising in this new field.
The current results as shown in this dissertation are promising; however, it is clear
that the methods listed here are only partial and limited, mainly due to the fact that they
focus only on the metabolic aspect of the relationship between species, ignoring
regulation at the single-species level and cell specialization at the community
level. It is also expected that when modeling communities, the assumption of 'steady
state' built into Flux Balance Analysis will become more limiting than it was when
researching a single cell.
I expect that in the future, regulation will be introduced into C-GSSMs at all levels,
and that a more dynamic metabolic modeling approach (cellular automata) will be
more commonly used.
The opening of a new research field is always exciting, and I consider myself blessed to
live in such exciting times.
Appendix 1. Supplementary data for Chapter 2
A.1.1 Supplementary Figures
Appendix 1::Supplementary Figure S1. Cooperation and competition levels of the ecological groups at different levels of
competition and resource overlap.
Bars represent standard error (bars are not shown for sample sizes <2; for very small values of standard error bars are shown in
red).
Appendix 1::Supplementary Figure S2: The frequency of resource overlap values between ecologically associated (black) and
non-associated (white) species pairs.
As shown, ecologically associated pairs differ in the pattern of distribution of their resource overlap values. This further
supports a non-random distribution of metabolic-demand similarity between ecologically associated and non-associated pairs,
whereas the peak observed for moderate values (~0.5) arises mainly from the contribution of co-occurring pairs.
A.1.2 Supplementary Tables
Appendix 1::Supplementary Table S1. Description of model species and selected properties.
The table is sorted according to the fraction of winning events. Species seed ids are as in [25]. Fractions of regulatory genes were retrieved from [139]. General environmental complexity estimates (1- obligatory symbionts; 2- specialized; 3- aquatic; 4-
facultative host-associated; 5- multiple; 6- terrestrial species) were obtained from [140]. Minimal doubling time information was
retrieved from [78].
Species' name | Species' seed id | Maximal Biomass Production Rate (MBR) | Fraction of winning events | Fraction of regulatory genes | Estimate of environmental diversity | Minimal doubling time
Dehalococcoides ethenogenes 195 Core243164_3 4.3 0 19
Thiomicrospira crunogena XCL-2 Core39765_1 4 0 1
Bartonella bacilliformis KC583 Core360095_3 3.8 0
Mycoplasma genitalium G-37 Core243273_1 1.2 0 0 1 12
Anaplasma marginale str. St. Maries Core234826_3 4.7 0 21.6
Buchnera aphidicola str. APS (Acyrthosiphon pisum) Core107806_1 3.5 0 0 1
Treponema pallidum subsp. pallidum str. Nichols Core243276_1 0.9 0
Borrelia burgdorferi B31 Core224326_1 3.6 0 4 4
Bifidobacterium longum NCC2705 Core206672_1 11.4 0.1 0.1 4 1.51
Wolbachia sp. endosymbiont of Drosophila melanogaster Core163164_1 13 0.1
Coxiella burnetii RSA 493 Core227377_1 8.8 0.1 5 8
Ehrlichia ruminantium str. Gardel Core302409_3 14 0.1
Rickettsia prowazekii str. Madrid E Core272947_1 10.3 0.1
Gluconobacter oxydans 621H Core290633_1 8.1 0.1 0.94
Blochmannia floridanus Core203907_1 11.7 0.1 1 36
Thiomicrospira denitrificans ATCC 33889 Core326298_3 7.6 0.1
Tropheryma whipplei str. Twist Core203267_1 21.5 0.1
Aquifex aeolicus VF5 Core224324_1 8 0.1 0 2 1.8
Helicobacter pylori 26695 Core85962_1 23.3 0.2 0 4 2.4
Streptococcus pneumoniae R6 Core171101_1 31.3 0.2 0 4
Streptococcus pneumoniae TIGR4 Core170187_1 31.3 0.2 0 4 0.5
Thermoanaerobacter sp. X514 Core399726_4 34.5 0.2
Carboxydothermus hydrogenoformans Z-2901 Core246194_3 24 0.2 2
Onion yellows phytoplasma OY-M Core262768_1 20.3 0.2
Thermotoga maritima MSB8 Core243274_1 21 0.2 0 2 1.2
Mycoplasma pulmonis UAB CTIP Core272635_1 22 0.2 0 1 1.5
Streptococcus thermophilus CNRZ1066 Core299768_3 29.1 0.2
Ureaplasma parvum serovar 3 ATCC 700970 Core273119_1 16.1 0.2
Legionella pneumophila subsp. pneumophila str. Philadelphia 1 Core272624_3 24.9 0.2 0 3.3
Idiomarina loihiensis L2TR Core283942_3 26.9 0.2
Xylella fastidiosa 9a5c Core160492_1 16.5 0.3 0 1 5.13
Campylobacter jejuni subsp. jejuni 84-25 Core360110_3 26.6 0.3
Neisseria gonorrhoeae FA 1090 Core242231_4 25 0.3 0.58
Campylobacter jejuni subsp. jejuni NCTC 11168 Core192222_1 26.6 0.3 5 1.5
Chlamydia trachomatis D/UW-3/CX Core272561_1 23.2 0.3 1 24
Haemophilus influenzae Rd KW20 Core71421_1 41.2 0.3
Leifsonia xyli subsp. xyli str. CTCB07 Core281090_3 46 0.3 5
Symbiobacterium thermophilum IAM 14863 Core292459_1 33.8 0.3 4.2
Desulfovibrio desulfuricans G20 Core207559_3 24.6 0.3 0
Elusimicrobium minutum Pei191 Core445932_3 30.6 0.3
Chlamydophila pneumoniae AR39 Core115711_7 22.9 0.3 0 1
Campylobacter jejuni subsp. jejuni CF93-6 Core360111_3 26.6 0.3
Bdellovibrio bacteriovorus HD100 Core264462_1 21.9 0.3 1.4
Zymomonas mobilis subsp. mobilis ZM4 Core264203_3 21.8 0.3 2
Pseudomonas putida KT2440 Core160488_1 78.8 0.4 0.1 5 1.1
Bordetella pertussis Tohama I Core257313_1 46.9 0.4 0.1 1 3.8
Lactococcus lactis subsp. lactis Il1403 Core272623_1 63.8 0.4 0.1 5 0.7
Lactobacillus plantarum WCFS1 Core220668_1 80.7 0.4 0.1 4 1.6
Kineococcus radiotolerans SRS30216 Core266940_1 48.7 0.4
Bacteroides fragilis YCH46 Core295405_3 28 0.4 0.63
Clostridium tetani E88 Core212717_1 56.4 0.4 0 5 0.5
Cytophaga hutchinsonii ATCC 33406 Core269798_12 48 0.5 0
Salinibacter ruber DSM 13855 Core309807_5 20.7 0.5 14
Clostridium acetobutylicum ATCC 824 Core272562_1 83.8 0.5 0.1 5 0.58
Mannheimia succiniciproducens MBEL55E Core221988_1 87.4 0.5 0.6
Anaeromyxobacter dehalogenans 2CP-C Core290397_13 72.9 0.5 9.2
Francisella tularensis subsp. tularensis Schu 4 Core177416_3 57.3 0.5 3
Caulobacter crescentus CB15 Core190650_1 36.3 0.5 0.1 3 1.5
Nitrosococcus oceani ATCC 19707 Core323261_3 44.7 0.5
Listeria monocytogenes J0161 Core393130_3 55.5 0.5
Nitrosomonas europaea ATCC 19718 Core228410_1 41.3 0.5 0 5 18.5
Francisella tularensis subsp. novicida U112 Core401614_5 59.8 0.5 3
Frankia sp. Ccl3 Core106370_11 71.8 0.5
Neisseria meningitidis MC58 Core122586_1 30 0.5 0 4
Acinetobacter sp. ADP1 Core62977_3 92 0.5
Listeria monocytogenes FSL J1-194 Core393117_3 55.5 0.5
Nocardia farcinica IFM 10152 Core247156_1 53.2 0.6 3
Magnetospirillum magneticum AMB-1 Core342108_5 46.3 0.6 0
Staphylococcus aureus subsp. aureus N315 Core158879_1 94.8 0.7 0 4 0.4
Methylococcus capsulatus str. Bath Core243233_4 55.5 0.7 1.87
Acinetobacter baumannii ATCC 17978 Core400667_4 73.8 0.7
Streptomyces coelicolor A3(2) Core100226_1 94.3 0.7 0.1 5 2.2
Corynebacterium glutamicum ATCC 13032 Core196627_4 64.5 0.7 5 1.2
Flavobacterium johnsoniae UW101 Core376686_6 41.4 0.7
Thiobacillus denitrificans ATCC 25259 Core292415_3 61.5 0.7
Leptospira interrogans serovar Copenhageni str. Fiocruz L1-130 Core267671_1 64.1 0.7
Listeria innocua Clip11262 Core272626_1 94 0.7 0.1 5 0.6
Pseudomonas fluorescens PfO-1 Core205922_3 104.9 0.7
Pseudoalteromonas haloplanktis TAC125 Core326442_4 48.8 0.7 0.5
Rhizobium leguminosarum bv. viciae 3841 Core216596_1 105.3 0.8
Staphylococcus aureus subsp. aureus COL Core93062_4 123.9 0.8
Rhodopseudomonas palustris CGA009 Core258594_1 69.3 0.8 9
Methylobacillus flagellatus KT Core265072_7 76 0.8 2
Agrobacterium tumefaciens str. C58 Core176299_3 91.7 0.8 0.1 5
Rubrobacter xylanophilus DSM 9941 Core266117_6 70.7 0.8 3.85
Brucella melitensis 16M Core224914_1 72.8 0.8 0 4 2
Vibrio cholerae O395 Core345073_6 127.3 0.8 0.2
Burkholderia pseudomallei K96243 Core272560_3 99.5 0.8 1
Ralstonia solanacearum GMI1000 Core267608_1 135.8 0.8 0.1 5 4
Vibrio vulnificus YJ016 Seed196600_1 128.1 0.8 0
Staphylococcus aureus subsp. aureus NCTC 8325 Core93061_3 94.8 0.8
Yersinia pestis Pestoides F Core386656_4 104.2 0.8
Clostridium beijerinckii NCIMB 8052 Core290402_34 120.3 0.8
Pseudomonas putida GB-1 Core76869_3 104.7 0.8
Vibrio parahaemolyticus RIMD 2210633 Core223926_1 118.7 0.8 0 4 0.2
Yersinia pestis CO92 Core214092_1 103.5 0.8 0.1 5 1.25
Mycobacterium tuberculosis H37Rv Core83332_1 69.4 0.8 0 4 19
Polaromonas sp. JS666 Core296591_1 70.4 0.8
Staphylococcus aureus subsp. aureus Mu50 Core158878_1 94.8 0.8 0 4
Bradyrhizobium japonicum USDA 110 Core224911_1 107.2 0.9 0.1 4 20
Shewanella frigidimarina NCIMB 400 Seed318167_10 92.5 0.9
Bacillus subtilis subsp. subtilis str. 168 Opt224308_1 211.7 0.9 0.1 6 0.43
Escherichia coli W3110 Core316407_3 247.1 0.9 4
Pseudomonas aeruginosa PAO1 Core208964_1 158.4 0.9 0.1 5
Sinorhizobium meliloti 1021 Core266834_1 202.8 0.9 0.1 5 1.5
Burkholderia cepacia R1808 Core269482_1 150.6 0.9
Klebsiella pneumoniae MGH 78578 Core272620_3 205.4 0.9
Bacillus anthracis str. 'Ames Ancestor' Core261594_1 167.7 0.9
Listeria monocytogenes EGD-e Core169963_1 87.9 0.9 0.1 5 1
Shigella flexneri 2a str. 2457T Core198215_1 183.2 0.9 4
Shigella dysenteriae M131649 Core216598_1 145.6 0.9
Bacillus anthracis str. Ames Core198094_1 171.2 0.9 0.1 0.5
Photobacterium profundum SS9 Core298386_1 156.1 1 2.5
Photorhabdus luminescens subsp. laumondii TTO1 Core243265_1 116.5 1 0.1 0.5
Vibrio cholerae O1 biovar eltor str. N16961 Core243277_1 134.5 1 0 4 0.2
Salmonella typhimurium LT2 Core99287_1 229 1 0.1 4 0.4
Escherichia coli K12 Core83333_1 250.2 1 0.1 4 0.35
Shewanella oneidensis MR-1 Seed211586_1 91.3 1 0 5 0.66
Appendix 1::Supplementary Table S6. The list of EnvO niches used in the analysis and the number of assigned samples.
Envo ID Niche description Number of samples
ENVO:00000063 water body 541
ENVO:00001998 soil 489
ENVO:00002007 sediment 276
ENVO:00002006 water 258
ENVO:00002044 sludge 113
ENVO:01000009 biotic mesoscopic physical object 98
ENVO:00000023 stream 93
ENVO:00002002 food 66
ENVO:00002264 waste 58
ENVO:00000076 mine 52
ENVO:00002031 anthropogenic habitat 52
ENVO:00000176 elevation 48
ENVO:00002009 terrestrial habitat 48
ENVO:00001995 rock 40
ENVO:01000001 mud 37
ENVO:00002985 oil 36
ENVO:00000043 wetland 34
ENVO:00000131 glacial feature 25
ENVO:02000019 bodily fluid 23
ENVO:00000104 undersea feature 21
ENVO:00000013 cave system 20
ENVO:00000479 mouth 18
ENVO:00002204 contamination feature 18
ENVO:00002170 compost 17
ENVO:00000094 volcanic feature 14
ENVO:00000303 coast 13
ENVO:00000073 building 12
ENVO:00000291 drainage basin 12
ENVO:00000026 well 12
ENVO:00002005 air 10
ENVO:00000077 agricultural feature 10
ENVO:00003030 silage 9
ENVO:00002008 dust 8
ENVO:00003869 straw 8
ENVO:00000463 harbor 7
ENVO:00000182 plateau 6
ENVO:00000309 depression 6
ENVO:00000395 channel 5
ENVO:00000130 reef 5
ENVO:00002982 clay 5
ENVO:00010505 aerosol 3
ENVO:00000091 beach 3
ENVO:01000010 abiotic mesoscopic physical object 3
ENVO:00000097 desert 3
ENVO:00002272 waste treatment plant 3
ENVO:00000475 inlet 3
ENVO:00002226 borehole 3
ENVO:00002040 wood 2
ENVO:00000086 plain 2
ENVO:00000304 shore 2
ENVO:00000175 karst 2
ENVO:00002000 slope 2
ENVO:00000049 volcanic hydrographic feature 2
ENVO:00000062 populated place 1
ENVO:00002039 bone 1
ENVO:00000474 cut 1
ENVO:00000562 park 1
ENVO:00005738 foam 1
ENVO:00000098 island 1
A.1.3 Supplementary Methods
A.1.3.1 Computing the Maximal Biomass Production Rate
(MBR) of species
Constraint-Based Modeling (CBM) was used in order to simulate co-growth in two-
species systems, where species are represented by genome-scale metabolic models.
Briefly, in these models, a stoichiometric matrix (S) is used to encode the
information about the topology and mass balance in a metabolic network, including
the complete set of enzymatic and transport reactions in the system and its biomass
reaction. Reactions are inferred from genome annotations and specialized prediction
tools. Given a metabolic model, Constraint-Based Modeling (CBM) provides a
solution space in terms of predicted fluxes that is consistent with the constraints set
up by the model. Flux balance analysis (FBA)[7] is a CBM method that further
constrains the solution space by solving a linear problem of maximizing or
minimizing a biomass production rate objective function[141, 142]. The biomass
production rate describes the rate of production of a set of metabolites required for
cellular growth, where a higher biomass flux corresponds with a faster growth rate
of the organism[143].
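For reference, the FBA problem described above can be written compactly as a linear program. This is the standard textbook form, given here as a sketch; the symbols v (flux vector), v_BM (biomass flux), lb_j and ub_j (lower and upper flux bounds) are notation introduced for this illustration and are not taken from the thesis.

```latex
\begin{aligned}
\max_{v}\quad & v_{BM} \\
\text{subject to}\quad & S\,v = 0 \\
& lb_j \le v_j \le ub_j \quad \text{for every reaction } j
\end{aligned}
```

The equality constraint encodes steady-state mass balance, and the bounds encode reversibility and nutrient availability.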
Here, 160 metabolic models were retrieved from The Seed's metabolic models
section (http://seed-viewer.theseed.org/seedviewer.cgi?page=ModelViewer)[25]. The
models are automatically constructed by a pipeline that starts with a complete
genome sequence as an input and integrates numerous technologies such as genome
annotation, reaction network annotation and assembly, determination of reaction
reversibility, and model optimization to fit experimental data. For each species we
calculate its maximal biomass production rate (MBR) by assuming that all exchange
reactions can be potentially fully active (which is equivalent to assuming a rich
medium). The upper and lower bounds of exchange and non-exchange reactions are
conventionally set as follows:
For irreversible reactions:
Exchange reactions: 0 ≤ Vi,ex ≤ Vi,Max_ex (Vi,Max_ex = 1000)
Non-exchange reactions: 0 ≤ Vi ≤ Vi,Max (Vi,Max = 1000)
For reversible reactions:
Exchange reactions: Vi,Min_ex ≤ Vi,ex ≤ Vi,Max_ex (Vi,Min_ex = -50; Vi,Max_ex = 1000)
Non-exchange reactions: Vi,Min ≤ Vi ≤ Vi,Max (Vi,Min = -1000; Vi,Max = 1000)
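The bound convention above can be captured in a small helper. This is a minimal sketch for illustration only; the function name `default_bounds` and its boolean arguments are hypothetical and do not come from the thesis or from any modeling toolbox.

```python
def default_bounds(reversible: bool, exchange: bool) -> tuple[float, float]:
    """Return (lower, upper) flux bounds following the convention listed above."""
    upper = 1000.0
    if not reversible:
        # Irreversible reactions carry only non-negative flux.
        return (0.0, upper)
    # Reversible exchange reactions are limited to an uptake of 50;
    # other reversible reactions are effectively unbounded in both directions.
    lower = -50.0 if exchange else -1000.0
    return (lower, upper)
```

Keeping the convention in one function makes it easy to apply the same bounds to every model before computing the MBR.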
Simulations were run with the "ILOG CPLEX" solver on the "Condor"
platform[144]. After filtering out different strains of the same species, as well as
models that had no biomass reaction defined or whose biomass reaction
could not be activated, we were left with a final set of 118 models (see
Supplementary Table S1). In addition to the 118 bacterial models, a metabolic model
for H. walsbyi (archaea) was constructed by using the SEED tool. This model was
only used for growth simulations with Salinibacter ruber.
A.1.3.2 Generation of a multi-species system metabolic
model
Our approach for generating multi-species models follows the definition employed
by[47]. Briefly, we converted the model of each organism into a compartment in a
multi-species system. For two species A and B this system consists of:
[CA]=cytoplasm compartment of species A; [CB]=cytoplasm compartment of species
B; and [EAB]=Extra-cellular compartment of species A and B. CA and CB include all
non-exchange and transport reactions of the corresponding species. EAB includes the
union of the exchange reactions of A and B. The objective function of the multi-
species system was defined as the sum of the biomass reactions of the member
organisms. This method is used in our setup for simulating pairwise growth but is
applicable for any number of organisms. Applying the multi-species system analysis
to all possible pairwise combinations of our 118 metabolic models, we examined
6903 unique pairs whose growth can be simulated under a range of environments.
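The compartment construction above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the pipeline used in the thesis: the dictionary layout of a model, the "[e]" suffix marking extracellular metabolites, and the function name `merge_models` are all hypothetical.

```python
from itertools import combinations

def merge_models(model_a, model_b, tag_a="A", tag_b="B"):
    """Build a two-species model with a shared extracellular compartment.

    Each input model is a dict with three keys (a hypothetical layout):
      "internal": {reaction_id: {metabolite_id: coefficient}}  # non-exchange and transport
      "exchange": {reaction_id: {metabolite_id: coefficient}}  # act on extracellular metabolites
      "biomass":  id of the biomass reaction (a key of "internal")
    Metabolite ids ending in "[e]" are treated as extracellular and shared.
    """
    def rename(met, tag):
        # Cytoplasmic metabolites become species-specific; extracellular ones stay shared.
        return met if met.endswith("[e]") else f"{tag}:{met}"

    merged = {"internal": {}, "exchange": {}, "objective": []}
    for tag, model in ((tag_a, model_a), (tag_b, model_b)):
        for rxn, stoich in model["internal"].items():
            merged["internal"][f"{tag}:{rxn}"] = {rename(m, tag): c for m, c in stoich.items()}
        # Exchange reactions are pooled: identical ids collapse into one shared reaction.
        merged["exchange"].update(model["exchange"])
        # The objective is the sum of the member biomass reactions.
        merged["objective"].append(f"{tag}:{model['biomass']}")
    return merged

# Number of unordered pairs that can be formed from 118 models.
n_pairs = sum(1 for _ in combinations(range(118), 2))
```

Counting the unordered pairs of 118 models reproduces the 6903 combinations quoted in the text.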
A.1.3.3 Computing a Competition inducing Medium
(COMPM) for single species and multi-species systems
For a single species model A, COMPM is defined as the ranges of fluxes of the
exchange reactions (VCOMPM,A) that support its maximal biomass rate (MBR)
when all exchange metabolites are provided at the minimal required amount. The
latter is found by Flux Variability Analysis (FVA)[7], where Vi, Min_FVA denotes
the lower limit (maximal flux of metabolites into a compartment) of a given reaction.
Following our definitions, an increase in (the negative) Vi, Min_FVA will effectively
limit the flux of metabolites into the compartment and prevent species A from
reaching its MBR. Notably, for the large majority of exchange reactions (70%-80%)
Vi, Min_FVA = Vi, Max_FVA. To relate COMPM environments to real ecological
conditions we verified that species inhabiting similar environments tend to have
similar metabolic profiles, as previously demonstrated in[64]. As documented in
many laboratory experiments, typical limiting factors in COMPM environments
include oxygen, glucose and nitrogen sources (Supplementary Note 4). Finally,
computational simulations predicting the effect of removing chosen
metabolites on species growth were tested experimentally, supporting the ability of
the models to identify growth-limiting factors (Supplementary Note 7). The full
description of VCOMPM, A is provided at Supplementary Table S10.
A similar computation is done for species B and then, in the multi-species system of
A and B, we define:
VCOMPM, AB = VCOMPM, A ∪ VCOMPM, B
Vi, Min_FVA,AB = Min(Vi, Min_FVA,A , Vi, Min_FVA,B)
so that the lower bound of each reaction is set according to the lower FVA limit
of the species involved. By definition, at individual growth, COMPMAB
allows A and B to reach their MBR. However, at co-growth, any resource overlap
will prevent species A and B from simultaneously reaching their MBR, and reveal potential
sources of competition. The full description of VCOMPM, AB is provided at
Supplementary Table S11.
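The merging rule for the pairwise medium can be sketched as follows. This is a minimal illustration: the function name `merge_compm` and the dictionary representation (exchange-reaction id mapped to its FVA lower limit) are assumptions made here, not the data structures used in the thesis.

```python
def merge_compm(lower_a, lower_b):
    """Combine per-species COMPM lower bounds into the pairwise COMPM.

    lower_a and lower_b map exchange-reaction ids to the FVA lower limit
    (negative values denote uptake). The result covers the union of the
    reactions; where both species list a reaction, the smaller (more
    negative) limit is kept, i.e. Min(Vi,Min_FVA,A, Vi,Min_FVA,B).
    """
    merged = dict(lower_a)
    for rxn, lb in lower_b.items():
        merged[rxn] = min(lb, merged[rxn]) if rxn in merged else lb
    return merged
```

Because only the more permissive of the two limits is kept for a shared metabolite, each species alone can still reach its MBR, while the two together must divide that supply.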
A.1.3.4 Computing a Cooperation-inducing Medium
(COOPM)
A cooperation-inducing medium (COOPM) for a multi-species system is defined
here as a set of metabolites that allows the system to obtain a positive growth rate
(above a certain predetermined threshold, which may yet be far from optimal), and
such that the removal of any metabolite from the set would force the system to have
no such solution. A feasible solution in this context is defined as one achieving at
least 10% of the joint MBR obtained when grown on a rich medium (COMPM). The
use of other MBR thresholds is examined at Supplementary Note 5. COOPM is
calculated using mixed integer linear programming as in[145, 146], as described
below:
As a first step we start with VCOMPM, AB, a set of exchange reactions flux ranges as
defined above. We then solve a minimization problem which uses, in addition to the
usual FBA constraints: (i) a constraint on minimal growth rates, VBM-COOPM ≥ 0.1 ×
VBM-COMPM, where VBM-COOPM and VBM-COMPM are the biomass production rates on
COOPM and COMPM, respectively; and (ii) a constraint expressing whether or not an
exchange metabolite i is consumed: Vi, COOPM, AB + Vi,min θi ≥ Vi,min, where Vi, COOPM,
AB is the flux running through the exchange reaction i, and Vi, COOPM, AB ≤ 0 when
the metabolite i is consumed (negative flux). Here, the binary variable θi attains a
value of 1 if metabolite i is not consumed (Vi, COOPM, AB ≥ 0) by any of the organisms,
and 0 otherwise.
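The effect of the binary variable in constraint (ii) can be checked numerically. The helper below is a hypothetical illustration (the name `uptake_indicator_ok` is not from the thesis); it only evaluates the inequality for given values and does not solve the optimization problem.

```python
def uptake_indicator_ok(v, v_min, theta):
    """Check constraint (ii): v + v_min * theta >= v_min, with v_min < 0 and theta in {0, 1}."""
    return v + v_min * theta >= v_min
```

With v_min = -50, setting theta = 1 reduces the constraint to v ≥ 0, so any uptake (for example v = -5) violates it, whereas theta = 0 leaves the original lower bound v ≥ -50 in place. Maximizing the sum of the theta variables therefore switches off as many uptake reactions as the growth constraint permits.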
Identifying a minimal set of metabolites in a medium then amounts to maximizing
the sum of the θi variables over all metabolites in VCOMPM, AB. Overall, the
optimization problem can be expressed as follows:
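$$
\begin{aligned}
\max\quad & \sum_{i \in V_{COMPM,AB}} \theta_i \\
\text{subject to:}\quad & S \cdot V = 0 \\
& V_{j,\min} \le V_j \le V_{j,\max} \\
& V_{BM\text{-}COOPM} \ge V_{BM\text{-}COMPM} / 10 \\
& V_{i,COOPM,AB} + V_{i,\min}\,\theta_i \ge V_{i,\min} \\
& \theta_i \in \{0,1\}, \quad i \in V_{COMPM,AB} \\
& v_j \in V
\end{aligned}
$$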
The bounds on the active exchange reactions are set to their COMPM value. The full
description of VCOOPM, AB is provided at Supplementary Table S12.
A.1.3.5 Experimental and computational co-growth
analysis
Co-growth experiments were conducted for all pairwise combinations of
five species, all non-pathogenic and capable of growing in IMM. The
species and their seed models are the following: Listeria innocua Clip11262
(Core272626_1), Agrobacterium tumefaciens str. C58 (Core176299_3), Escherichia
coli K12 (Core83333_1), Pseudomonas aeruginosa PAO1 (Core208964_1), Bacillus
subtilis str. 168 (Opt224308_1).
Growth experiments for each individual and pairwise combinations were conducted
in IMM defined medium[147], in 96-well plates at 30°C with continuous shaking,
using the Biotek ELX808IU-PC microplate reader. Optical density was measured
every 15 minutes at a wavelength of 595nm.
A simulated medium was designed to match the defined medium with minimal
modifications allowing co-growth (Table SM-1). For L. innocua (strain Clip11262)
and A. tumefaciens (strain C58), we both predict and observe a neutral interaction in
the given medium. That is, the Sum of Individual Growth (SIG) approximately equals
the total Co-Growth in a multi-species system (CG). To simulate a negative shift
(SIG/CG > SIG/CGneutral_medium), co-growth simulations were conducted when adding
all one- and two-compound combinations of exchange metabolites to the simulated
IMM (considering all exchange metabolites of the given species). To simulate a
positive shift (SIG/CG < SIG/CGneutral_medium), co-growth simulations were
conducted when subtracting all one- and two-compound combinations from the
simulated IMM. Co-growth simulations were conducted across all
subtraction/addition combinations; subsequently we chose the media inducing the
most prominent shifts. A table describing co-growth patterns across all reductive
combinations is provided at Supplementary Table S13. Based on the selected
predictions, the experimental media were modified by adding thymidine and xylose
(for a negative shift) and by the subtraction of thiamine and glucose (positive shift).
The growth experiments for the additional 9 bacterial pairs are shown in Supplementary
Note 1. Growth experiments in additional selected shifted media are described at
Supplementary Note 7.
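The shift definitions used above (a negative shift when SIG/CG rises relative to the primary medium, a positive shift when it falls) can be expressed as a small classifier. This is a sketch for illustration; the function name `classify_shift` and the `tolerance` dead band are hypothetical and are not parameters described in the thesis.

```python
def classify_shift(ratio_primary, ratio_modified, tolerance=0.0):
    """Label the change in the SIG/CG ratio relative to the primary medium.

    An increase in SIG/CG means more competition ("negative" shift);
    a decrease means less competition ("positive" shift).
    `tolerance` is a hypothetical dead band, not a parameter from the text.
    """
    delta = ratio_modified - ratio_primary
    if delta > tolerance:
        return "negative"
    if delta < -tolerance:
        return "positive"
    return "none"
```

For example, with the predicted A-E ratios listed in Table SN1-2 (1.25 in the primary medium, 1.38 in the negative medium, 1.04 in the positive medium), the function labels the first change as a negative shift and the second as a positive shift.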
Appendix 1::Table SM-1. IMM defined medium and its in silico representation.
Modifications of IMM were done using the same algorithm used for selecting a minimal media (Supplementary Methods), aiming to find the minimal set of metabolites which are necessary to support co-growth. The same in silico media were used for
all pairwise combinations.
Metabolite In vitro medium In silico medium
Thiamin + +
D-Methionine + +
Magnesium + +
L-Valine + +
L-Isoleucine + +
L-Leucine + +
L-Histidine + +
Calcium + +
D-Glucose-6-phosphate + +
Potassium + +
Citrate + +
L-Arginine + +
L-Tryptophan + +
L-Phenylalanine + +
Biotin + +
Riboflavin + +
Adenine + +
Pyridoxal + +
Nicotinamide_D-ribonucleotide + +
L-Glutamine + +
L-Cysteine + +
Lipoic acid + -
para-aminobenzoic acid + -
Oxygen + +
Cytosine - +
Zinc - +
Cobalt - +
Fe2+ - +
Chloride - +
Sulfate - +
Copper2 - +
Manganese - +
Spermidine - +
gly-asn-L - +
sn-Glycerol-3-phosphate - +
octadecanoate - +
A.1.3.6 Finding closed cooperative loops in real and
random networks of give-take interactions and in real and
randomly drawn communities
Starting from the network of 'give-take' interactions we derived two sub-networks:
the ecologically-associated sub-network, including the edges between ecologically-
associated species, and the non-ecologically-associated sub-network, including the
edges between the non-associated species. The original network is provided at
Supplementary Table S2, indicating the type of ecological association corresponding
to each edge. The original network and the sub-networks of ecologically associated
and non-associated species are composed of 80, 66, and 80 nodes and 3160, 648 and
2512 edges, respectively. For each of the two sub-networks, the number of loops
was compared to the number found in 1000 random networks. Random networks
were generated by shuffling edges while retaining the number of nodes and their degrees. Notably,
in the network describing interactions between niche-associated pairs the number of
loops is significantly higher than random (t-test, P value < 0.001), unlike in the network
describing non-ecological associations. Real communities were derived from the
ecological distribution data where a community represents the set of species detected
in a given sample (as listed in Supplementary Table S7). The rate of occurrence of
closed cooperative cycles was recorded (1) across all true samples (194 appearances)
and (2) across 1000 data sets generated through random shuffling of the original
samples data while maintaining the same sample size distribution and the same rank
of species' appearances as in the original data.
A.1.4 Supplementary Notes
A.1.4.1 Supplementary Note 1: Experimental and
computational co-growth analyses for 10 bacterial pairs in
interaction-specific media.
Individual and co-growth experiments were conducted for five bacterial species, all
non-pathogenic and capable of growing in IMM and their 10 corresponding pairwise
combinations. Growth experiments for each individual and pairwise combination
were conducted in three media: IMM – a chemically defined minimal medium41
(termed primary medium), a "negative" medium designed to induce a negative shift
towards increased competition (by adding thymidine and xylose) and a "positive"
medium designed to induce a positive shift towards less competition (by subtracting
thiamine and glucose). The "negative" and "positive" media were designed as
described in the Supplementary Methods section, representing the most generic
media for the induction of a shift in the pattern of co-growth across most growth
combinations (Supplementary Table S13). The observed shift in the co-growth
pattern (in comparison to co-growth in the primary medium) was successfully
predicted in 65% of the experiments, with precision 0.75 and recall 0.8 (Table SN1-1).
The species and their seed models are the following: Listeria innocua Clip11262
(Core272626_1), Agrobacterium tumefaciens str. C58 (Core176299_3), Escherichia
coli K12 (Core83333_1), Pseudomonas aeruginosa PAO1 (Core208964_1), Bacillus
subtilis str. 168 (Opt224308_1).
Appendix 1::Table SN1-1. Predicted and observed co-growth shifts.
For predicted and observed co-growth combinations we compared the ratio between the Sum of the Individual Growths (SIG) and the co-growth (CG) across the three media (primary, negative, and positive). The SIG/CG ratio in the negative and positive
media is compared to the ratio in the primary media where negative and positive shifts refer to an increase or decrease in this ratio, respectively. Colored columns represent a predicted directional shift in the corresponding interaction-designed media. Red
indicates a predicted negative shift in the negative media and green indicates a predicted positive shift in the positive media.
Observations: Table entries marked with '√' and 'X' represent corresponding or non-corresponding shifts as observed in laboratory co-growth experiments. Colored '√' symbols represent TP predictions; non-colored '√' symbols represent TN
predictions; colored 'X' symbols represent FP predictions; non-colored 'X' symbols represent FN predictions. For observations, SIG/CG
ratio was calculated according to OD values recorded in logarithmic growth and the corresponding growth rates, as described in Table SN1-3. Predicted and observed SIG/CG values are shown in Table SN1-2. Growth curves are presented at Figure SN1-1.
Positive shift Negative shift Species-pair
√ √ Agrobacterium tumefaciens-Listeria innocua
√ √ Agrobacterium tumefaciens-Escherichia coli
√ √ Agrobacterium tumefaciens-Pseudomonas aeruginosa
X √ Agrobacterium tumefaciens-Bacillus subtilis
√ X Listeria innocua-Escherichia coli
X X Listeria innocua-Pseudomonas aeruginosa
X X Listeria innocua-Bacillus subtilis
√ √ Escherichia coli-Pseudomonas aeruginosa
X √ Escherichia coli-Bacillus subtilis
√ √ Pseudomonas aeruginosa-Bacillus subtilis
6/10 7/10 True predictions
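The summary statistics quoted for this table can be cross-checked with simple arithmetic: 6 + 7 = 13 correct predictions out of 20 gives the reported 65%. The sketch below goes one step further and enumerates the confusion-matrix counts that would be consistent with the reported precision (0.75) and recall (0.8). The individual TP, FP, FN and TN counts are not listed in the thesis; the result is an inference that assumes the reported precision and recall are exact rather than rounded, and the function name `consistent_counts` is hypothetical.

```python
def consistent_counts(n_total, n_correct, precision=(3, 4), recall=(4, 5)):
    """List (TP, FP, FN, TN) tuples matching the reported summary statistics.

    precision and recall are given as exact fractions (numerator, denominator)
    so that the comparison uses integer arithmetic only.
    """
    p_num, p_den = precision
    r_num, r_den = recall
    solutions = []
    for tp in range(1, n_correct + 1):
        tn = n_correct - tp
        for fp in range(n_total - n_correct + 1):
            fn = n_total - n_correct - fp
            # precision = TP / (TP + FP); recall = TP / (TP + FN)
            if p_den * tp == p_num * (tp + fp) and r_den * tp == r_num * (tp + fn):
                solutions.append((tp, fp, fn, tn))
    return solutions
```

Calling `consistent_counts(20, 13)` yields a single combination, 12 TP, 4 FP, 3 FN and 1 TN, under the exactness assumption stated above.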
Appendix 1::Table SN1-2 Calculated values for predicted and observed co-growth shifts.
Values show the SIG/CG ratio (SIG: Sum of the Individual Growth; CG: Co Growth). PRM, NM, PM: Primary, Negative and Positive Medium, respectively. L: Listeria innocua; A: Agrobacterium tumefaciens ; E: Escherichia coli; P: Pseudomonas
aeruginosa; B: Bacillus subtilis.
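Columns, left to right: species pair; computational predictions (PRM, NM, PM); experimental observations, growth rate ratio‡ (PRM, NM, PM); experimental observations, growth ratio† (PRM, NM, PM)
Pair PRM NM PM PRM NM PM PRM NM PM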
A-L 1.06 1.12 0 0.73 2.08 0.5 0.91 2.09 0.64
A-E 1.25 1.38 1.04 1.71 2.39 0.89 1.54 2.69 1.49
A-P 1.1 1.12 0.98 1.26 1.5 1.15 1.65 2.47 1.41
A-B 1.34 1.43 0.77 1.75 2.42 4.15 2.66 4.27 2.85
L-E 1.21 1.21 0.9 1.56 1.41 0.53 1.5 1.64 0.9
L-P 0.91 0.91 0.7 1.31 1.37 0.62 1.0 1.05 1.0
L-B 1.22 1.20 0.78 1.1 1.18 1.11 1.03 1.34 1.36
E-P 1.43 1.36 1.4 1.1 1.03 0.65 1.38 0.87 1.27
E-B 1.49 1.56 1.28 1.2 1.59 1.63 1.54 1.78 2.05
P-B 1.17 1.18 1.06 2.43 2.47 1.33 2.18 2.36 1.61
‡ Growth rate ratio was calculated by comparing ΔOD/Δtime ratio in SIG and CG
during exponential growth. Exponential growth was determined for each experiment
independently as shown in Figure SN1-1. For SIG, ΔOD was calculated as sum ΔOD
of both species.
† Growth ratio was calculated by comparing the OD in SIG and CG at a constant
time point (half time of the experiment). For SIG, OD was calculated as sum OD of
both species at the selected time point. OD values at time0 (the beginning of the
experiments) were subtracted.
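The growth ratio (†) computation can be sketched as follows. This is an illustration under stated assumptions: OD readings are given as equally spaced lists, the "half time of the experiment" is taken as the middle index of the series, and the function name `growth_ratio` is hypothetical.

```python
def growth_ratio(od_first, od_second, od_co):
    """SIG/CG at the midpoint of the experiment, after subtracting the time-0 OD.

    Each argument is a list of OD readings taken at the same time points.
    """
    mid = len(od_co) // 2
    # SIG: sum of the two individual cultures, each corrected for its initial OD.
    sig = (od_first[mid] - od_first[0]) + (od_second[mid] - od_second[0])
    # CG: the co-culture, corrected in the same way.
    cg = od_co[mid] - od_co[0]
    return sig / cg
```

A ratio above 1 indicates that the two species grown separately accumulate more biomass than the co-culture, which is the signature of competition used in the tables above.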
Appendix 1::Figure SN1-1. Growth curves of individual and pair-wise combinations across different media.
Growth combinations are ordered as in Tables SN1-2 and SN1-3. The title of each graph indicates the species combination and
the medium (that is LA,PRM indicates Listeria innocua- Agrobacterium tumefaciens in primary medium). Abbreviations are as
in Table SN1-3. Gold and orange lines represent the individual growth of the first and second pair members, respectively (that
is, in LA combination L. innocua is shown in gold and A. tumefaciens is shown in orange). Blue line represents co-growth. Red
line represents the sum of individual growth at the manually selected exponential growth phase (OD values of first species at
time0 were subtracted). Only successful predictions are shown (√ at Table SN1-2).
A.1.4.2 Supplementary Note 2: using systematic data
sources for estimating the ecological relevance of win-lose
predictions
For each pair of species we looked at the outcome of the competition and defined
species as winners or losers according to their growth rate (the faster species is the
winner as in Figure 2a in the main text). For each species we calculated its fraction of
winning events across all its co-growth experiments. The list of species' competition
values is provided at Supplementary Table S1. Top "winners" include ecologically
diverse fast growers such as Escherichia coli, Salmonella typhimurium, Vibrio
cholerae and Pseudomonas aeruginosa. Species with a low mean competition score
include slow-growing pathogens such as Mycoplasma genitalium and Borrelia
burgdorferi and obligatory symbionts such as Buchnera aphidicola.
Maximal growth rate information is available for 66 species in the data, retrieved
through a manual survey of the scientific literature[78]. The matrix at Figure 2b
contains all the species for which doubling time information is available, sorted
according to their doubling time. To study the statistical significance of the win-lose
division in the experimentally-derived matrix we compared the strength of the green-red
division (win-lose) in the original matrix to the green-red division in 1000 random
matrices. To produce such random matrices, the order of species was randomly
permuted 1000 times and for each corresponding matrix we counted how many
winners were mapped to the upper triangle ("green" side). In the original matrix we
observe 1256 true classifications (the experimentally faster species is the winner), which is
higher than the number of true classifications observed in 998 random matrices
(T-test, P value 0.002).
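The permutation procedure described above can be sketched with the standard library. This is an illustrative sketch, not the code used in the thesis: the function names, the `winner_of` mapping and the fixed random seed are assumptions, and the significance is reported here as an empirical fraction of permutations rather than through the T-test mentioned in the text.

```python
import random

def count_upper_triangle_wins(order, winner_of):
    """Count pairs whose predicted winner comes first in `order`.

    `winner_of` maps a frozenset({a, b}) to the predicted winner of that pair.
    """
    count = 0
    for i, first in enumerate(order):
        for second in order[i + 1:]:
            if winner_of.get(frozenset((first, second))) == first:
                count += 1
    return count

def permutation_p_value(order, winner_of, n_perm=1000, seed=0):
    """Fraction of random species orders scoring at least as high as `order`."""
    rng = random.Random(seed)
    observed = count_upper_triangle_wins(order, winner_of)
    shuffled = list(order)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if count_upper_triangle_wins(shuffled, winner_of) >= observed:
            hits += 1
    return observed, hits / n_perm
```

Here `order` would list the species sorted by experimental doubling time, fastest first, so that a winner placed before its opponent corresponds to a "green" upper-triangle entry.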
To systematically study the biological relevance of the competition score we looked
at the correlation between competition values and environmental diversity,
considering two independent measures. Fractions of regulatory genes were taken
from[139], describing the fraction of transcription factors out of the total number of
genes in the genome, an indicator of environmental variability[140]. General
environmental complexity estimates were also obtained from[148], where the natural
environments of bacterial species were categorized based on the NCBI classification
for bacterial lifestyle[148] and ranked according to the complexity of each category (1-
obligatory symbionts; 2- specialized; 3- aquatic; 4- facultative host-associated; 5-
multiple; 6- terrestrial species). We observe a significant correlation between
environmental diversity and winning potential, considering both absolute and partial
mean competition score. Correlation values in a Spearman correlation test between
competition values and the two measures of environmental diversity, as well as doubling time:
Mean competition score versus the fraction of regulatory genes: 0.6 (P value 9e-6)
Mean competition score versus NCBI estimate for environmental diversity: 0.46 (P value 2e-3)
Mean competition score versus experimentally recorded minimal doubling time: -0.34 (P value 5e-3)
Experimental growth rates and lifestyle annotations are provided at Supplementary
Table S1.
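For readers who wish to recompute such rank correlations from Supplementary Table S1, a dependency-free sketch of the Spearman coefficient is given below. It is an illustration only: the function names are hypothetical, ties receive average ranks, and no P value is computed.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

A negative coefficient between competition score and doubling time, as reported above, means that species with shorter doubling times tend to have higher competition scores.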
A.1.4.3 Supplementary Note 3: Simulating co-growth of
Salinibacter ruber and Haloquadratum walsbyi.
We studied the effect of the media on the type of interaction between Salinibacter
ruber and Haloquadratum walsbyi, two halophilic species that co-exist in salterns.
We chose to focus on these species as a synergistic interaction between them was
documented, where it was suggested that the improved growth of H. walsbyi can be
explained by the uptake of dihydroxyacetone (DHA) produced by S. ruber[58]. We
computationally studied the interaction between the species in a poor medium, with
and without DHA. Starting from the competition-inducing medium (COMPM),
reduction was done using the algorithm used for computing cooperation-inducing
media (Methods, main text and Supplementary Methods), where we looked for the set
of metabolites that allows a feasible solution for each of the species individually
(achieving at least 10% of the corresponding growth in COMPM). In our
simulations, co-growth of both species in a rich medium (COMPM, Methods and
Supplementary Methods) revealed no cooperative interaction (PCMS > 0, Table
SN3-1). Looking at the content of the metabolites in the media we observed that
DHA, externally provided to the system, is consumed by the multi-species system
and hence the contribution of its transfer between the species to their growth
potential is concealed. The dependence of H. walsbyi on S. ruber for DHA
supply is revealed when reducing the medium. As suggested in the experimental
studies, we observed that the growth of H. walsbyi becomes possible only when
adding DHA into the medium or by adding S. ruber to the community (Table SN3-1).
Appendix 1::Table SN3-1. Interactions between Salinibacter ruber and Haloquadratum walsbyi across different media.
Medium | Presence of DHA in the media | Co-growth | Individual growth of H. walsbyi | Individual growth of S. ruber | PCMS | Cooperative interaction
COMPM | + | 69.1 | 60.2 | 20.1 | 0.57 | -
Reduced media (+ DHA) | + | 27.3 | 11.7 | 8.3 | -0.88 | +
Reduced media (-DHA) | - | 27.3 | 0 | 8.3 | NA | +
A.1.4.4 Supplementary Note 4: Relating the designed
media to true ecological conditions
Throughout most of this analysis, simulations are conducted in computationally-
derived, designed media. In order to examine the ecological relevance of the
designed media we first tested whether ecologically related species exhibit similarity
in their media (VCOMPM,A, Supplementary Methods), as can be expected from the
demonstrated similarity in the metabolic pathways of co-occurring species[64].
Indeed, we observe that the resource overlap (Methods) between ecologically
associated species is significantly higher than the resource overlap between
non-ecologically related species (P value 1.5e-8 in a one-sided Wilcoxon test; median
values for resource overlap are 0.41 and 0.46, respectively). We then characterized
the rate of occurrence of different metabolites across species-specific
computationally-designed environments, as well as identified typical growth-limiting
factors. Notably, the 10 most frequent compounds (listed at Table SN4-1) include
essential inorganic compounds such as metals and salts. In contrast, species show high
diversity in their carbon sources (Table SN4-1).
In order to identify species-specific limiting factors we looked at the typical flux of
each metabolite (that is, the mean Vi, Min_FVA across all species, Supplementary
Methods), where a low typical flux indicates that a compound is a limiting factor.
Typical limiting factors include oxygen, glucose and nitrogen sources, in
correspondence with experimental knowledge. In contrast, the widely-distributed
inorganic compounds in Table SN4-1 are all consumed at typically low levels (all
show mean Vi, Min_FVA > -1 in comparison to mean Vi, Min_FVA < -20 for the highly
consumed metabolites in the left column, Table SN4-2).
Finally, we studied the distribution of metabolites in the pair-specific, rich (VCOMPM,
AB) and poor (VCOOPM, AB) environments (Methods and Supplementary Methods).
Metabolites which are frequent in both pair-wise media are inorganic essential
compounds (Table SN4-2). Notably, metabolites which are typical of the rich media
but are missing from the poor, cooperation-inducing media are typically derivatives of
amino acids, representing a set of metabolic products that can be produced by one of
the species and then transferred to its pair members[61].
Appendix 1::Table SN4-1. Characterization of species-specific metabolic computationally-designed environments
The full list of species-specific environments is provided at Supplementary Table S10.
10 most frequent metabolites across the 118 species-specific environments | 10 most rare metabolites across the 118 species-specific environments | 10 metabolites with the highest* mean flux at optimal conditions (typical limiting factors)
Copper2 | beta-Methylglucoside_C7H14O6 | Oxygen
Sulfate | D-Glucosamine_C6H14NO5 | H+
Fe3 | Decanoic_acid_C10H19O2 | D-Glucose
Magnesium | D-O-Phosphoserine_C3H7NO6P | L-Glutamate
Zinc | (R,R)-Tartaric_acid_C4H4O6 | sn-Glycerol_3-phosphate
Manganese | Propanoate_C3H5O2 | NH3
Cobalt | Isocitrate_C6H5O7 | Fumarate
Potassium | (R,R)-Butane-2,3-diol_C4H10O2 | D-Fructose
Calcium | Nicotinamide_C6H6N2O | Nitrate
Fe2 | beta-Methylglucoside_C7H14O6 | L-Serine
* Highest absolute value
Appendix 1::Table SN4-2. Characterization of pair-specific metabolic environments
The full lists of pair-specific rich and poor environments are provided at Supplementary Tables S10 and S11, respectively.
10 most frequent metabolites across both minimal and rich pairwise environments* | 10 frequent metabolites in rich media that are absent in poor, cooperation inducing, media*
Magnesium_Mg | ala-L-glu-L_C8H13N2O5
Sulfate | gly-glu-L_C7H11N2O5
Chloride_Cl | Ala-Gln_C8H15N3O4
Potassium_K | H+_H
Calcium_Ca | ala-L-asp-L_C7H11N2O5
Fe2+_Fe | Ala-Leu_C9H18N2O3
Manganese_Mn | Sodium_Na
Cobalt_Co | Gly-Leu_C8H16N2O3
Copper2_Cu | gly-pro-L_C7H12N2O3
Zinc_Zn | L-alanylglycine_C5H10N2O3
* Found across all environments
** calculated by subtracting the frequency of metabolite at poor media from its
frequency in rich media
A.1.4.5 Supplementary Note 5: The use of various
thresholds for determining a feasible growth solution in
minimal, cooperation-inducing, media
A cooperation-inducing medium (COOPM) is defined here as a set of metabolites that
allows a feasible solution with a positive growth rate, such that the removal of any
metabolite from the set would make such a solution infeasible. We examined several
growth-requirement thresholds, ranging from 10% of the biomass production rate (BPR)
found in the competition-inducing (COMPM), rich media (the minimal media reported in
the main text) to 100% (as in the original COMPM environment, main text). All reduced
media reveal cooperative solutions (Table SN5-1). In all solutions, ecologically
associated pairs of species, and in particular mutually exclusive pairs, exhibit a higher
level of cooperation in comparison to non-associated pairs. The same trends were
observed when a minimal, cooperation-inducing, medium was calculated as the
intersection of the exchange reactions in COMPM.
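
The irreducibility condition in the definition above (no metabolite can be removed without losing feasibility) can be illustrated with a greedy reduction. The following is a minimal sketch under stated assumptions, not necessarily the procedure used in this work: `is_feasible` is a hypothetical stand-in for the flux-balance feasibility check at a chosen biomass threshold, and the toy growth rule and metabolite names are invented for illustration.

```python
from typing import Callable, FrozenSet, Iterable


def irreducible_medium(
    metabolites: Iterable[str],
    is_feasible: Callable[[FrozenSet[str]], bool],
) -> FrozenSet[str]:
    """Greedily drop metabolites while the medium still supports growth.

    `is_feasible` is a hypothetical oracle (for example an LP feasibility
    check with a lower bound on biomass production); it is an assumption of
    this sketch. The result is irreducible when feasibility is monotone,
    i.e. adding metabolites never makes a feasible medium infeasible.
    """
    medium = frozenset(metabolites)
    if not is_feasible(medium):
        raise ValueError("the starting medium does not support growth")
    for met in sorted(medium):  # fixed order keeps the result reproducible
        candidate = medium - {met}
        if is_feasible(candidate):
            medium = candidate
    return medium


def toy_oracle(medium: FrozenSet[str]) -> bool:
    # Invented rule: growth needs sulfate plus at least one carbon source.
    return "sulfate" in medium and bool({"glucose", "fructose"} & medium)


rich = ["fructose", "glucose", "sulfate", "zinc"]
minimal = irreducible_medium(rich, toy_oracle)
```

Different removal orders can yield different irreducible sets, so a result of this kind is one minimal medium rather than a unique one.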
In a multi-species system, we defined a symmetrical interaction as one in which both
A and B are "givers" (and "takers"), that is, both species improve their growth in
comparison to individual growth. Notably, this definition is permissive, as A and B
can show variability in the extent to which their growth is improved. Despite the
permissiveness of the definition, the majority of species show asymmetrical
cooperative directionality with a single giver (Table SN5-1). We explored the
symmetrical interaction on different growth media. Growth media were determined
by setting thresholds on the feasible solution, which required achieving at least X% of
the corresponding growth in COMPM. The thresholds were set on both the biomass
production of the multi-species system (as in Table SN5-1) and on the contained
organisms (i.e. requiring that each compartment/organism has a biomass
production rate higher than a threshold, in comparison to its growth in COMPM).
This indeed raised the number of symmetrical events (Table SN5-2), though it
reduced the ecological signal: similar fractions of cooperative events are
observed for the groups of niche-associated and non-associated pairs, which argues
against the ecological relevance of enhancing the propensity of symmetrical
solutions via such means.
Appendix 1::Table SN5-1. Frequency of symmetrical interaction events under minimal growth media with different
thresholds for biomass production of the system.

| Medium | %BPR (out of BPR in COMPM) | Total number of cooperative events (fraction of unidirectional events¤), §N=3160 | Non-associated, N=2512 | Niche-associated‡, N=536 | Co-occurring, N=84 | Mutually-exclusive, N=28 |
| --- | --- | --- | --- | --- | --- | --- |
| Rich media (COMPM) | 100% | 0 | 0 | 0 | 0 | 0 |
| Reduced media | 75% | 1814 (0.65) | 0.52 | 0.77 | 0.85 | 0.93 |
| | 50% | 1466 (0.88) | 0.4 | 0.69 | 0.71 | 0.86 |
| | 25% | 1279 (0.93) | 0.36 | 0.58 | 0.6 | 0.79 |
| | 10% | 1293 (0.94) | 0.37 | 0.57 | 0.51 | 0.71 |
| | Intersection† | 656 (0.96) | 0.16 | 0.38 | 0.35 | 0.36 |

The four rightmost columns give the fraction of cooperative events within different ecological groups.
¤ Unidirectional events refer to cooperative interactions where only one of the pair
members is a giver and the other is a taker.
§ N represents all possible combinations in a specific group
‡ Niche-associated pairs do not include co-occurring and mutually exclusive pairs
† Intersection medium for a pair of species is calculated as the intersection of uptake
reactions from their individual COMPMs
Appendix 1::Table SN5-2. Frequency of symmetrical interaction events under minimal growth media with different
thresholds for biomass production of the system and the compartments in the system.

| %BPR (out of BPR in COMPM) | Total number of cooperative events (fraction of unidirectional events), N=3160 | Non-associated, N=2512 | Niche-associated, N=536 | Co-occurring, N=84 | Mutually-exclusive, N=28 |
| --- | --- | --- | --- | --- | --- |
| 10%‡ (10%†) | 1630 (0.37) | 0.52 | 0.51 | 0.44 | 0.71 |
| 25%‡ (25%†) | 1768 (0.40) | 0.54 | 0.61 | 0.61 | 0.79 |
| 50%‡ (50%†) | 2325 (0.51) | 0.72 | 0.8 | 0.82 | 0.89 |
| 75%‡ (75%†) | 1156 (0.52) | 0.4 | 0.25 | 0.26 | 0.18 |

The four rightmost columns give the fraction of cooperative events within different ecological groups.
‡ the threshold for the feasible solution of the multi-species system
† the threshold for the feasible solution in each compartment of the multi-species
system
A.1.4.6 Supplementary Note 6: Frequency of directional
give-take relationships across bacterial families (top 10
combinations).
Appendix 1::Table SN6-1. Frequency of inter-family give-take interactions
The table describes the frequency of inter-family give-take interactions, considering the total number of pairwise inter-class
combinations. In order to have groups of similar size, some groups (e.g., Actinobacteria) describe the phylum level
classification.

| Giving family | Receiving family | Total number of inter-family interactions | Total number of directional give-take inter-family interactions | Fraction of directional give-take inter-family interactions |
| --- | --- | --- | --- | --- |
| Lactobacillales | Alpha/others proteobacteria | 24 | 13 | 0.54 |
| Clostridia | Bacillales | 36 | 20 | 0.56 |
| Clostridia | Betaproteobacteria | 40 | 24 | 0.6 |
| Clostridia | Hyperthermophilic bacteria | 8 | 5 | 0.63 |
| Spirochete | Alpha/others proteobacteria | 8 | 5 | 0.63 |
| Clostridia | Deltaproteobacteria | 12 | 8 | 0.67 |
| Clostridia | Bacteroides | 12 | 9 | 0.75 |
| Clostridia | Actinobacteria | 32 | 26 | 0.81 |
| Clostridia | Alpha/others proteobacteria | 16 | 13 | 0.81 |
| Clostridia | Epsilonproteobacteria | 12 | 11 | 0.92 |
A.1.4.7 Supplementary Note 7: Experimental and
computational growth analyses of Listeria innocua and
Agrobacterium tumefaciens across pre-designed media
In order to explore the predictive power of our co-growth simulations in changing
environments, we first identified the limiting factors of Listeria innocua and
Agrobacterium tumefaciens at their simulated IMM (i.e. where the metabolites are
consumed at the maximal threshold determined, Vi, Min_FVA =-50, Supplementary
Methods). As can be expected from the neutral interactions between the two
organisms (predicted and observed, Supplementary Methods), most of their limiting
factors at the simulated-IMM do not overlap (Table SN7-1). The predicted limiting
factors of L. innocua are glutamine, glucose and cysteine. The predicted limiting
factors of A. tumefaciens are isoleucine, glutamine and histidine. Simulations were
then conducted while decreasing the level of these metabolites (that is, increasing Vi,
Min_FVA) at different thresholds until their full removal (Vi, Min_FVA = 0). For the four
amino acids studied, decreasing the corresponding fluxes slowed down the growth
rate of the relevant species (cysteine and glutamine for L. innocua, and histidine,
glutamine and isoleucine for A. tumefaciens), but had only a minor effect on the pattern
of co-growth (Table SN7-1). The predictions for the growth and co-growth patterns
following the removal of cysteine and histidine, predicted to have the most
significant effect on growth (Table SN7-1), were further tested experimentally.
Laboratory observations indicate that the effect of metabolite removal on both
growth and co-growth patterns can be fully predicted: the growth of A. tumefaciens is
affected by the removal of histidine but not cysteine, and the growth of L. innocua is
affected by the removal of cysteine but not histidine (Table SN7-2). In both cases, the
co-growth pattern remains relatively similar to the pattern observed in the original
media (Table SN7-2).
The computational predictions indicate that decreasing the level of glucose is likely
to affect both individual and co-growth patterns (Table SN7-1). At individual
growth, decreasing the level of glucose is likely to affect only L. innocua, slowing
down its growth, but at co-growth it is predicted to increase the inter-species
competition (possibly due to the resource overlap induced by the shortage in
glucose). The full removal of glucose is predicted to prevent the growth of L.
innocua (again, with no effect on A. tumefaciens), where co-growth is predicted to
induce a modest level of cooperation (Table SN7-2). Reassuringly, experimental
tests support the computational predictions where the removal of glucose from the
media prevents the growth of L. innocua but has no effect on A. tumefaciens. As
predicted, decreasing the level of glucose increases the competition at co-growth.
However, at full removal we do not observe the predicted mild cooperation.
Overall, in most growth experiments (7/8) predictions and observations correlate
(Table SN7-2). When looking at co-growth predictions, the most significant growth
ratio change occurs at the partial and full removal of glucose. In agreement with the
predictions, the partial elimination of glucose induced the most drastic elevation in
growth rate ratio, where, as predicted, a weaker effect is observed for the removal of
amino acids. With the exception of a single experimental measure (the co-growth
pattern following full removal of glucose), we observe an overall agreement
between predictions and observations. Hence, this set of experiments
supports the ability of the metabolic models to predict the growth pattern of species
in varying environments.
Appendix 1::Table SN7-1. Computational predictions for the effect of reducing and removing computationally-predicted
limiting factors from IMM media.
L, A – predictions for the individual growth of Listeria innocua and Agrobacterium tumefaciens, respectively, across the
media tested; LA – co-growth prediction. Values in red indicate a change >±0.05 in growth and growth ratio in
comparison to values predicted for the original IMM.

| Metabolite | Reduced metabolite (Vi, Min_FVA = -10*): L | A | LA | Growth ratio (L+A)/LA | Full removal of the metabolite (Vi, Min_FVA = 0): L | A | LA | Growth ratio (L+A)/LA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Growth at the original IMM** | 36 | 52 | 83 | 1.06 | 36 | 2 | 3 | 1.06 |
| Isoleucine | 6 | 8 | 9 | 1.06 | 36 | 7 | 8 | 1.05 |
| Histidine† | 6 | 9 | 7 | 1.1 | 36 | 0 | 6 | 1 |
| Glutamine | 2 | 8 | 6 | 1.05 | 30 | 6 | 3 | 1.04 |
| Cysteine† | 1 | 2 | 7 | 1.08 | 29 | 2 | 5 | 1.08 |
| Glucose | 1 | 2 | 5 | 1.12 | 0 | 2 | 1 | 0.85 |

* Similar behavior is observed for additional Vi, Min_FVA (-50 < Vi, Min_FVA < 0).
** For all metabolites in the table Vi, Min_FVA = -50.
† Full reduction of histidine and cysteine has the most drastic effect on the growth
predictions of Agrobacterium tumefaciens and Listeria innocua, respectively.
Appendix 1::Table SN7-2. Predicted and observed growth and co-growth shifts.
Each cell's color represents the computationally-predicted growth shift in the designed media: red indicates reduced growth
(growth predictions section) and reduced growth ratio (growth ratio section); grey represents no growth reduction (growth
predictions section) and no growth ratio change (growth ratio section; black color in the corresponding cells at Table SN7-1);
dark green represents elevated ratio (red color in the corresponding cells at Table SN7-1). Observations: Table entries marked
with '√' and 'X' represent corresponding or non-corresponding shifts as observed in laboratory co-growth experiments. Growth
shift is defined as a change of >±0.25 in growth and growth ratio in comparison to values detected at the original IMM. The
corresponding experimental results are provided at Table 3. L - Listeria innocua; A - Agrobacterium tumefaciens.

| Medium | Growth predictions: L | Growth predictions: A | Growth ratio (L+A)/LA |
| --- | --- | --- | --- |
| Histidine (Vi, Min_FVA = 0) | √ | √ | √ |
| Cysteine (Vi, Min_FVA = 0) | √ | √ | √ |
| Glucose (Vi, Min_FVA = -10) | X | √ | √ |
| Glucose (Vi, Min_FVA = 0) | √ | √ | X |
Appendix 1::Table SN7-3. Observed growth and co-growth shifts. Values indicate the maximal OD in the experiments.
Experiments were conducted as described in the Supplementary Methods section and in Supplementary Note 1.

| Medium | L | A | LA | (L+A)/LA |
| --- | --- | --- | --- | --- |
| Growth at the original IMM** | 0.2 | 0.19 | 0.5 | 0.8 |
| Histidine (Vi, Min_FVA = 0) | 0.43 | 0.13 | 0.59 | 0.95 |
| Cysteine (Vi, Min_FVA = 0) | 0.02 | 0.22 | 0.28 | 0.86 |
| Glucose (Vi, Min_FVA = -10) | 0.21 | 0.21 | 0.3 | 1.4 |
| Glucose (Vi, Min_FVA = 0) | 0.01 | 0.19 | 0.15 | 1.25 |

** For all metabolites in the table Vi, Min_FVA = -50.
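
As a worked illustration of the growth ratio (L+A)/LA and of the ±0.25 shift criterion used in Tables SN7-2 and SN7-3, the short sketch below recomputes the ratio for two rows of Table SN7-3. The function names are hypothetical, and treating a "shift" as an absolute difference from the original-IMM value is an assumption of this sketch.

```python
def growth_ratio(od_l: float, od_a: float, od_la: float) -> float:
    """(L + A) / LA: summed individual growth divided by co-growth."""
    return (od_l + od_a) / od_la


def is_shift(value: float, reference: float, threshold: float = 0.25) -> bool:
    """Assumed reading of the criterion: absolute change above the threshold."""
    return abs(value - reference) > threshold


# Maximal OD values taken from two rows of Table SN7-3.
imm = growth_ratio(0.20, 0.19, 0.50)              # original IMM
partial_glucose = growth_ratio(0.21, 0.21, 0.30)  # glucose, Vi,Min_FVA = -10

shifted = is_shift(partial_glucose, imm)
```

With these inputs the ratio rises from 0.78 at the original IMM to 1.4 under partial glucose removal, which exceeds the 0.25 threshold.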
Appendix 2. Supplementary data for Chapter 3
A.2.1 List of Abbreviations
GEM: genome-scale metabolic model
FBA: flux balance analysis
SUMEX: maximization of the sum of metabolic exchange fluxes (with the
convention that outward fluxes are positive)
PMAX: maximization of proton exchange (from inside to outside the cell)
GR: growth rate
ds66: a dataset of growth rates for 66 organisms in rich media
ds57: a dataset of 57 growth rates of E. coli with varying knockouts and media
ds18: a dataset of growth rates for 6 organisms grown in 3 media
ds24: a dataset of growth rates of E. coli grown in 24 media
A.2.2 Supplementary results
A.2.2.1 Sensitivity analysis of SUMEX and Biomass
SUMEX does not assume known uptake rates. This is an important strength of the
metric, because traditional methods (most notably, FBA using a biomass objective)
can predict growth rate only when the uptake rates of key compounds are known, owing
to the rate-yield relationship (Growth rate = Substrate uptake rate * Yield), which
holds at least in substrate-limited conditions. However, in optimizing any
objective function in a GEM (including SUMEX), it is necessary to set bounds on the
uptake reactions (or at least on some reactions) in order to obtain computationally
feasible solutions. We chose to set standard bounds on uptakes of all compounds in a
given medium at a value of -50 units (see Supplementary Methods for full
characterization of the bounds). We set the same standard bounds for all metrics we
tried, unless otherwise noted. In order to test how dependent SUMEX is on these
bounds, we performed a sensitivity analysis across the 3 datasets, testing both SUMEX
and biomass. Briefly, we altered each uptake bound across all models in a given dataset
by a random amount within either ±10% or ±50% (uniformly distributed) of its
standard value, and then re-assessed the correlation of the metric against growth rates
for that dataset (see Fig. S1).
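
The perturbation step described above can be sketched as follows. This is only an illustration under stated assumptions: the function name and the metabolite names are hypothetical, each bound is scaled independently by a uniformly drawn factor, and the re-optimization and correlation steps that would follow are omitted.

```python
import random
from typing import Dict


def perturb_uptake_bounds(
    bounds: Dict[str, float],
    fraction: float,
    rng: random.Random,
) -> Dict[str, float]:
    """Scale each uptake bound by a factor drawn uniformly from
    [1 - fraction, 1 + fraction]."""
    return {
        met: bound * rng.uniform(1.0 - fraction, 1.0 + fraction)
        for met, bound in bounds.items()
    }


STANDARD_UPTAKE = -50.0  # standard uptake bound stated in the text
# Illustrative medium; the metabolite names are placeholders.
medium = {
    "glucose": STANDARD_UPTAKE,
    "oxygen": STANDARD_UPTAKE,
    "ammonia": STANDARD_UPTAKE,
}

rng = random.Random(0)  # fixed seed so the sketch is reproducible
varied_10 = perturb_uptake_bounds(medium, 0.10, rng)  # bounds within -55 to -45
varied_50 = perturb_uptake_bounds(medium, 0.50, rng)  # bounds within -75 to -25
```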
We found the correlation of SUMEX with growth rate to be highly robust to changes
in the uptake bounds, and indeed to be significantly more robust than biomass on two
of the three datasets given the same random distributions of uptakes (P=2e-31 and
P=2e-4 in F-tests on ds18 and ds57, respectively, at 50% variation; there was no
distinguishable difference in ds66 – see Table S1). In the rich media conditions of
ds66, the correlation of SUMEX vs. GR varied less than 10% even with 100%
variance in uptake bounds. For completeness, we repeated the same test on the
secretion bounds and achieved similar results (see Fig. S1 & Table S1).
[Figure S1: grid of scatter plots, one row per dataset (ds18, ds66, ds57) and one column per variance in bounds:
uptake bounds (open) at ±10% (-45 to -55) and ±50% (-25 to -75); secretion bounds (all) at 10% (900 to 1000) and
50% (500 to 1000). x-axis: Spearman's rho, SUMEX vs. GR; y-axis: Spearman's rho, Biomass vs. GR.]
Appendix 2::Figure S1. Sensitivity analysis of GEM bounds. The Spearman’s rhos (2-tailed) of growth rate versus
both SUMEX (x-axis) and max Biomass (y-axis) are shown for 3 bacterial datasets (ds18, ds66, and ds57), when uptake
bounds of all open metabolites (i.e., metabolites that are allowed to be taken up in a given medium) are randomly varied
by ±10% (1st column) or ±50% (2nd column) of the standard bound (which is -50 for all allowed uptakes), and when
secretion bounds of all exchanged metabolites are randomly varied by 10% (3rd column) or 50% (4th column) of the
standard secretion bound (+1000). SUMEX displays significant robustness to changes in bounds. The green line in each
plot has a slope of 1.
| Statistic | Amount bounds varied* | Uptake bounds (open): ds18 | ds66 | ds57 | Secretion bounds (all): ds18 | ds66 | ds57 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| RSD, Spearman's rho of SUMEX vs. GR | 10% | 0.7% | 0.3% | 16.8% | 0.0% | 0.2% | 5.1% |
| | 50% | 1.5% | 1.3% | 35.5% | 1.1% | 0.7% | 7.3% |
| RSD, Spearman's rho of Biomass vs. GR | 10% | 7.4% | 0.4% | 53.4% | 3.0% | 0.0% | 62.9% |
| | 50% | 28.7% | 1.8% | 628.9% | 2.8% | 0.4% | 55.8% |
| p-val in F-test that variance of Spearman's rho vs. growth rate is lower for SUMEX than Biomass | 50% | 1.9E-31 | 0.1 | 1.2E-04 | 0.8 | 1.0 | 1.6E-07 |
Appendix 2::Table S1. Statistics of GEM sensitivity analysis. Summary statistics are presented from Fig. S1. The top
four rows show the Relative Standard Deviation, RSD = abs((std(rho)/mean(rho)))*100, of SUMEX or Biomass versus
GR across random variations in model uptake bounds or variations in secretion bounds (as labeled). Cases in which
RSD is less than 10% of the variation in bounds are highlighted grey. The bottom row shows the significance (p-val) of
an F-test that the correlation of SUMEX versus growth rate varies less across 50% variations in model bounds than the
correlation of Biomass versus growth rate. The F-test shows high significance for uptake bounds in ds18 and ds57, and
secretion bounds in ds57. *As in Fig. S1, uptake bounds were varied to ±(%) while secretion bounds were varied
between the standard value and –(%).
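
The RSD statistic defined in the caption can be computed as in the sketch below. The correlation values are invented for illustration, and the use of the population standard deviation (`pstdev`) is an assumption, since the caption does not state which estimator was used; `statistics.stdev` would give the sample version.

```python
from statistics import mean, pstdev
from typing import Sequence


def relative_std_dev(rhos: Sequence[float]) -> float:
    """RSD = abs(std(rho) / mean(rho)) * 100, as defined in the caption."""
    return abs(pstdev(rhos) / mean(rhos)) * 100.0


# Invented Spearman coefficients from five hypothetical bound perturbations.
example_rhos = [0.60, 0.62, 0.58, 0.61, 0.59]
rsd = relative_std_dev(example_rhos)  # in percent
```

Because the RSD divides by the mean correlation, it can become very large when the mean is close to zero, which is worth keeping in mind when reading the larger entries in the table.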
A.2.2.2 Expanded analysis of obligate fermenters and
respirers in ds66
As noted in the main text, we split ds66 into two groups: obligate fermenters and
organisms that respire (see Table S4 for the breakdown). We found that SUMEX is
predictive of growth rate for the respirers, but not for the obligate fermenters. Of
note, although SUMEX does not significantly correlate with growth rate for the 9
obligate fermenters, its significance is also not significantly lower than that of
randomly chosen sets of 9 organisms from ds66. Unlike SUMEX, biomass yield
does show a significant correlation versus GR for the 9 obligate fermenters (rho =
0.66, p = 0.03 in 1-sided Spearman test), although this significance is also not
significantly above that expected if we choose 9 organisms at random. This suggests
that while biomass is a poor predictor of growth rate in respirers, it may be
appropriate for predicting the GR of obligate fermenters. Among the set of 9
obligate fermenters, there was one organism, Lactobacillus plantarum, for which
evidence has been found for respiration when the organism is provided exogenous
heme and menaquinone [84]. Therefore, it is possible that L. plantarum should be
re-categorized as a respirer. Removing L. plantarum from the fermenter set and
calculating Spearman correlations on the remaining 8 organisms resulted in
significance for both biomass and SUMEX versus GR (see Fig. 3d). Due to these
considerations, a larger dataset of obligate fermenters will be required in order to
allow more definite statements about the application of SUMEX or biomass to
predicting their growth (none of the other datasets treated in this chapter include
obligate fermenters).
Interestingly, SUMEX and PMAX significantly under-predict the growth rates of
obligate fermenters compared with respirers (all fermenter datapoints lie below the
trendline in Fig 3a-b). This suggests that, since growth of fermenters relies on
mechanisms independent from their ability to produce a strong proton gradient, a
proton gradient-dependent predictor (such as SUMEX) under-represents their
capability for fast growth.
A.2.2.3 The relationship between flux and molecular
weight in SUMEX
In order to check if maximizing SUMEX indeed causes uptake of high molecular
weight compounds and the output of low molecular weight compounds, we
calculated the correlation between the molecular weights of exchanged metabolites
(with nonzero fluxes) and their average exchange fluxes (as determined by flux
variability analysis) when calculating SUMEX for E. coli on rich medium, as well as
for all exchanged metabolites in all models across the ds18 dataset. We achieved
strong negative correlations between molecular weight and outward exchange flux in
both cases (ρ=-0.73, P = 4e-23 and ρ=-0.56, P = 1e-34 for the two analyses),
confirming our hypothesis.
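
A correlation of this kind can be computed as in the sketch below, assuming that ρ denotes the Spearman rank correlation used elsewhere in this appendix. The implementation is a plain-Python illustration, and the molecular weights and fluxes are invented values rather than results from the models.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Invented values: molecular weights and mean outward exchange fluxes.
mol_weight = [18.0, 44.0, 180.2, 146.1, 89.1]
mean_flux = [30.0, 25.0, -10.0, -4.0, -6.0]
rho = spearman_rho(mol_weight, mean_flux)
```

In this toy input the light compounds are secreted (positive flux) and the heavy ones are taken up (negative flux), so the coefficient is strongly negative, mirroring the direction of the reported result.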
A.2.2.4 Ranging of biomass% lower bound
Cellular growth involves an intrinsic tradeoff between growth rate and biomass yield.
In calculating SUMEX, we enforce a small flux (5% of the maximum possible in the
given condition) through the biomass yield reaction, since some yield is necessary to
sustain growth. In order to more fully understand the relationship of SUMEX with
growth yield, we varied this lower bound on biomass yield between 0 and 100% of
the maximum (i.e., the maximum biomass yield computed in the model on a given
medium) in all of the datasets dealt with in this chapter.
In the bacterial datasets, we found the correlation of SUMEX with growth rate to be
typically robust to changes in the yield, except when biomass approaches 100%,
at which point the correlation drops off in several datasets (see Fig. S2). This
suggests that the correlation of SUMEX with growth rate is robust to changes in
yields in the model, at least within physiological ranges [16,24]. In contrast, the
drop-off near 100% biomass yield poses an interesting parallel with the results of [80]
(see Fig. S2). Flux variability analysis of ds18 confirmed that flux variability in
maximal SUMEX decreases as the lower bound on biomass yield increases from
70% to 100% (see Fig. S3).
Interestingly, in the NCI60 cell lines, the correlation of SUMEX with growth rate
was more sensitive than in the bacterial datasets to the lower bound on biomass yield,
and dropped off sharply as biomass yield lower limit increased (see Fig. S2). This
suggests that cancer cells, which are not as evolutionarily well-tuned as bacteria for
efficient growth, might have to sacrifice more of their biomass yield than bacteria to
attain maximal growth rates.
Also intriguingly, we found a peak in the correlation between SUMEX and growth
rate in ds66, ds18 and ds24 (peaks were at 90%, 55%, and 80% max biomass for the
three datasets; see Fig. S2). These peaks in correlation with growth rate suggest that
certain percentages of maximum biomass yields may be dominant across the
different conditions in each dataset.
[Figure S2: five panels, a. ds66, b. ds18, c. ds57, d. ds24, e. NCI60 cancer cell lines. x-axis: Biomass LB (%max),
0 to 100; y-axis: ρ, SUMEX vs. GR.]
Appendix 2::Figure S2: Effect of biomass lower bound on SUMEX. The correlation of SUMEX versus growth rate as
lower bound (LB) on biomass is varied. (a) ds66, (b) ds18, (c) ds57, (d) ds24 (E. coli grown on 24 carbon sources, from
[35]), and (e) NCI60 cancer cell lines. ds24 was calculated with the iAF1260 E. coli model.
A.2.2.5 Network flexibility in SUMEX as biomass lower
bound approaches 100%
In order to assess the flexibility of the metabolic networks as biomass approaches
100%, we performed a flux variability analysis (FVA) of all reactions in ds18 under
optimal SUMEX conditions with biomass set to equal 100%, 90%, 80%, and 70% of its
max, and then we assessed the change in flux variability (∆FV) of the flux range of each
reaction across the 18 conditions (6 organisms x 3 media). The ∆FV metric was
calculated for each reaction/condition as the slope (change) of the flux range when
the biomass value increases from 70% to 100% of its max. A positive ∆FV means
that, as biomass increases between 70% and 100%, the range within which fluxes of
a given reaction can vary increases (a negative ∆FV means that it decreases), and the
magnitude of ∆FV indicates the strength of the increase/decrease. Fig. S3 shows the
results: it is clear that FV decreases as biomass trends towards 100%.
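
The ∆FV calculation for a single reaction and condition can be sketched as below. This is an illustration under stated assumptions: the slope is taken as an ordinary least-squares fit over the four biomass levels (the text does not specify the fitting procedure), the function names are hypothetical, and the FVA bounds are invented.

```python
from typing import List, Sequence, Tuple


def flux_range(fva_min: float, fva_max: float) -> float:
    """Width of the interval a reaction's flux can span, from FVA."""
    return fva_max - fva_min


def delta_fv(biomass_pct: Sequence[float], ranges: Sequence[float]) -> float:
    """Least-squares slope of flux range versus biomass lower bound."""
    n = len(biomass_pct)
    mean_x = sum(biomass_pct) / n
    mean_y = sum(ranges) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(biomass_pct, ranges))
    sxx = sum((x - mean_x) ** 2 for x in biomass_pct)
    return sxy / sxx


# Invented FVA (min, max) pairs for one reaction at each biomass level.
levels = [70.0, 80.0, 90.0, 100.0]
fva: List[Tuple[float, float]] = [
    (-30.0, 30.0),
    (-20.0, 20.0),
    (-10.0, 10.0),
    (0.0, 0.0),
]
ranges = [flux_range(lo, hi) for lo, hi in fva]
slope = delta_fv(levels, ranges)  # negative: variability shrinks toward 100%
```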
[Figure S3: horizontal bar chart. x-axis: # reactions (0 to 500). Legend: FV increases as biomass -> 100%; FV decreases
as biomass -> 100%. Pathway bins, in plotted order: exchange; Porphyrin and chlorophyll metabolism; Reductive
carboxylate cycle (CO2 fixation); Streptomycin biosynthesis; Fructose and mannose metabolism; Tryptophan
metabolism; Fatty acid metabolism; Sulfur metabolism; Glycerophospholipid metabolism; One carbon pool by folate;
Valine, leucine and isoleucine degradation; Arginine and proline metabolism; Thiamine metabolism; Lysine
biosynthesis; Valine, leucine and isoleucine biosynthesis; Pantothenate and CoA biosynthesis; Glycerolipid
metabolism; Butanoate metabolism; Cysteine metabolism; Glycolysis / Gluconeogenesis; Methionine metabolism;
Glyoxylate and dicarboxylate metabolism; Aminosugars metabolism; Folate biosynthesis; Glutathione metabolism;
Propanoate metabolism; Nitrogen metabolism; Citrate cycle (TCA cycle); Pentose phosphate pathway; Glutamate
metabolism; Pyruvate metabolism; Glycine, serine and threonine metabolism; Nicotinate and nicotinamide
metabolism; Carbon fixation in photosynthetic organisms; Starch and sucrose metabolism; Urea cycle and metabolism
of amino groups; Pyrimidine metabolism; Purine metabolism; none.]
Appendix 2::Figure S3: Flux variability in SUMEX solution as function of biomass lower bound. FVA was performed on
the optimal solution space of SUMEX at lower bounds of biomass between 100% and 70%. Reactions whose flux
variability increased or decreased more than a set cutoff are binned into pathways and plotted. Overall, this shows a
general decrease in flux variability as biomass approaches 100%.
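| Metric | rho (1-sided Spearman test) | P (1-sided Spearman test) |
| --- | --- | --- |
| FBAwMC | 0.26 | 1.1E-01 |
| PMAX | 0.38 | 3.6E-02 |
| Biomass | 0.40 | 2.9E-02 |
| MOMENT | 0.47 | 1.0E-02 |
| SUMEX | 0.47 | 1.2E-02 |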
Appendix 2::Table S2: Analysis of ds24. We obtained growth rates of E. coli in batch culture under 24 minimal media
conditions from [35]. For this analysis, we used the iAF1260 E. coli model, in order to be consistent with the other
metrics. SUMEX was computed over the 23 media for which the carbon source was present extracellularly in the
standard iAF1260 model (the excluded metabolite was glucosamine). Values listed are for SUMEX with the standard
5% lower bound of biomass. Also listed is the maximization of extracellular proton production (PMAX), which displays
significance, but below that of the top 3 metrics.
A.2.2.6 Summing exchange fluxes in the optimal biomass
solution space predicts growth rate
We were interested in doing an independent validation of SUMEX, based on
changing uptake bounds in the model. Our logic is as follows:
In the bacterial datasets, we did not have detailed measurements of uptake and
secretion fluxes. However, because of the property that biomass yield corresponds to
growth rate at steady state if uptake rates are exactly known (from mass
conservation: $\mu = \frac{1}{m_{\mathrm{biomass}}} \sum_i m_i v_i$, where $\mu$ = growth rate and $m_i$ and $v_i$ are
the mass and flux of each exchanged component, i), we hypothesized that if we tune the
uptake rates to increase the correlation of biomass with growth rate, we would bias
towards realistic uptake rates and be able to independently validate SUMEX by
summing, but not maximizing, exchange fluxes.
Restated, we summed extrapolated uptake and secretion rates without doing a
maximization, rather than computing the maximum achievable sum (as in SUMEX).
In practice, this meant sampling for in silico media variants that gave significant
(P≤0.05) positive correlations between maximal biomass and growth rate, and in
these cases, also checking the correlation to growth rate of the sum of mean
allowable exchange fluxes (SUMofEX) that support optimal biomass. Because any
individual flux vector supporting maximal biomass is not unique, we calculated
SUMofEX by summing the means of the flux variability ranges (computed by FVA)
of each exchange component, as calculated within the biomass solution space.
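
The SUMofEX calculation can be sketched as follows. The sketch assumes that the "mean of a flux variability range" is the midpoint of the FVA minimum and maximum, and that the FVA ranges have already been computed within the optimal-biomass solution space; the reaction identifiers and numbers are invented for illustration.

```python
from typing import Dict, Tuple


def sum_of_ex(fva_ranges: Dict[str, Tuple[float, float]]) -> float:
    """Sum of the midpoints of the FVA ranges of all exchange reactions.

    Each value is a (min, max) pair; outward fluxes are positive, following
    the convention stated for SUMEX in the list of abbreviations.
    """
    return sum((lo + hi) / 2.0 for lo, hi in fva_ranges.values())


# Invented FVA ranges for three hypothetical exchange reactions.
example = {
    "EX_glucose": (-50.0, -30.0),  # net uptake
    "EX_co2": (20.0, 40.0),        # net secretion
    "EX_acetate": (0.0, 10.0),
}
total = sum_of_ex(example)
```

Unlike SUMEX, nothing is maximized here: the sum is read off from the ranges allowed by the biomass optimum.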
We did this for ds18, ds66, and ds57, and achieved significant correlations between
SUMofEX (calculated within the maximal biomass yield solution space) and growth
rate in all three datasets, and indeed generally stronger correlations than the biomass
objective achieved on the same data, as shown by points being to the right of the
green line in Fig. S4. This analysis again independently validated the correlation of
the sum of exchange fluxes with growth rates.
[Figure S4: three scatter-plot panels, a. ds18, b. ds66, c. ds57. x-axis: ρ, sum of extrapolated exchanges supporting
max biomass (i.e., SUMofEX) vs. GR; y-axis: ρ, max biomass vs. GR; both axes span -1 to 1.]
Appendix 2::Figure S4: Extrapolating bounds for biomass. Allowed uptake bounds were randomly varied via uniform
distribution in the range [-50, 0] across all models for (A) ds18, (B) ds66, and (C) ds57. The variation was done such that
for a single iteration, the uptake lower bound of a compound C1 was fixed to the same randomly determined value for
all conditions in which C1 could be taken up, while the uptake of a compound C2 would take a different randomly
determined uptake than C1 across all models, etc. Maximum biomass yield was computed for each model. Then, the
sum of exchange fluxes (SUMofEX) was computed as the sum of the means of the flux variability ranges (calculated by
FVA) of all exchange reactions, under the condition of optimal biomass (i.e., SUMEX wasn’t maximized, but rather it
was summed from the means of allowed exchange fluxes that support optimal Biomass). The plots show only media
variants in which biomass correlated significantly (P≤0.05) to growth rate, as we conjectured that these points would
give the most accurate uptake bounds. Each dot represents the correlation coefficient (Spearman ρ) of growth rate vs.
SUMofEX (x-axis) or optimal biomass (y-axis) for a single variant of the medium uptake bounds. Dots show points for
which SUMofEX correlates significantly with growth rate, and red crosses show points for which SUMofEX does not
correlate significantly. The green lines have a slope of 1, so points to the right of the lines denote variants of the uptake
bounds for which SUMofEX correlated better than biomass yield with growth rate.
A.2.2.7 Gene Expression of pathways contributing to
SUMEX
We obtained gene expression data from [85], in which E. coli was grown on minimal
media supplemented with 6 different carbon sources. We excluded the acetate-
supplemented medium since the SEED's automatically-generated model is unable to
use it as a carbon source, indicating an evident flaw in the model. Thus, we were left
with data concerning 5 different growth media.
For each medium, we quantified the beneficial or deleterious effect each gene had
with respect to the realization of each of the two hypothetical cellular objectives
(SUMEX vs. biomass yield). This was done by TOX, a variant on the EDGE method
recently developed in our lab (Wagner et al., manuscript accepted for publication at
PNAS).
We considered only genes that are non-essential for biomass production in the E. coli
reconstruction, since SUMEX requires, by definition, nominal biomass yield.
TOX is defined as follows: Given a non-essential gene g, a pre-defined medium m,
and a hypothetical cellular objective f (always biomass yield or SUMEX in this
chapter):
1. Calculate the maximal value of f following KO of gene g, in medium m.
Denote it $f^{KO}$. KO of g is implemented by constraining all of its
associated reactions to carry a zero flux.
2. Calculate the maximal value of f, given that gene g is active, on medium m.
Denote it $f^{UP}$. Activation of g is implemented by constraining all of its
associated reactions to carry at least an ε flux (in absolute value), and at least
one of them to carry exactly an ε flux (in absolute value), maximizing f
under each of these sets of constraints, and then choosing the smallest f
obtained as $f^{UP}$. We used ε = 0.1.
3. Return $\mathrm{TOX}(g, m, f) := f^{UP} - f^{KO}$.
$\mathrm{TOX}(g, m, f) > 0$ signifies that g contributes towards the realization of f on
medium m, whereas $\mathrm{TOX}(g, m, f) < 0$ signifies that g has a deleterious effect on f
under the same conditions.
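The scoring step can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the study: it assumes the optimal objective values have already been obtained from an LP solver, and it assumes the score is the difference between the "active" and "knocked-out" optima, the reading under which a positive score marks a beneficial gene. The function and argument names are hypothetical.

```python
def tox_score(f_ko, f_up_candidates):
    """Return f_UP - f_KO for one gene, medium and objective.

    f_ko            -- maximal objective value with the gene knocked out
    f_up_candidates -- maximal objective values obtained under each of the
                       "gene active" constraint sets (one per associated
                       reaction forced to carry exactly epsilon flux)
    """
    if not f_up_candidates:
        raise ValueError("need at least one activation scenario")
    # f_UP is the smallest optimum over the activation scenarios.
    f_up = min(f_up_candidates)
    return f_up - f_ko


def classify(score, tol=1e-9):
    """Label a gene by the sign of its score (tol is an assumed tolerance)."""
    if score > tol:
        return "beneficial"
    if score < -tol:
        return "deleterious"
    return "neutral"
```

Taking the minimum over the activation scenarios makes the score conservative: a gene counts as beneficial only if every way of forcing it on still beats the knockout.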
We next calculated a one-sided P-value for the Wilcoxon rank-sum test to determine
whether genes with positive TOX scores have significantly higher expression levels
than those with negative scores, which would serve as a confirmation that the a
priori objective f is predictive of actual gene expression. With SUMEX as the
hypothetical objective, significant scores were obtained in all media, save one that
obtained borderline significance (Table S3). With biomass yield as the objective
function, none of the media showed any significance.
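For readers who want to reproduce this kind of test, the following is a minimal pure-Python sketch of a one-sided rank-sum test. It uses the normal approximation with no tie or continuity correction, which is an assumption made here for brevity; the source does not state which variant was used, and the function names are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_one_sided(high, low):
    """Test the alternative that `high` tends to exceed `low`.

    Returns (U statistic of `high`, approximate one-sided p-value).
    """
    n1, n2 = len(high), len(low)
    ranks = average_ranks(list(high) + list(low))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    p = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail normal probability
    return u, p
```

Here `high` would hold the expression levels of genes with positive TOX scores and `low` those of genes with negative scores.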
We then verified that metabolic pathways predicted to be active by SUMEX are
indeed expressed. We say that “an objective f predicts gene g to be active on medium
m” if $\mathrm{TOX}(g, m, f) > 0$. Taking the set of highly-expressed genes as the set of active
genes on each medium, we find that the predictions of SUMEX outperform those of
biomass yield, both in terms of precision and of recall (Table S3).
Highly-expressed genes on a given medium are defined as ones whose expression
have a z-score greater or equal to 1
0 .8 on that medium. We verified that our
results were insensitive to the choice of ε and expression z-score threshold within
reasonable bounds.
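The precision and recall computation can be illustrated with a short sketch. The names are hypothetical, and the z-score threshold is left as a parameter rather than fixed to a particular value.

```python
def highly_expressed_genes(z_scores, threshold):
    """Genes whose expression z-score meets the (caller-supplied) threshold."""
    return {gene for gene, z in z_scores.items() if z >= threshold}


def predicted_active_genes(tox_scores):
    """Genes an objective predicts to be active: strictly positive score."""
    return {gene for gene, score in tox_scores.items() if score > 0}


def precision_recall(predicted, actual):
    """Return (precision, recall) of a predicted gene set against the actual set."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall
```

Precision is the fraction of predicted-active genes that are highly expressed, and recall is the fraction of highly-expressed genes that were predicted active.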
C-source:    p-val:             precision:         recall:
             Biomass  SUMEX     Biomass  SUMEX     Biomass  SUMEX
Glucose      0.71     0.01      0.00     0.17      0.00     0.59
Glycerol     0.72     0.06      0.08     0.20      0.01     0.63
Succinate    0.84     0.05      0.13     0.20      0.03     0.63
L-Alanine    0.59     0.02      0.08     0.22      0.01     0.68
L-Proline    0.30     0.04      0.08     0.19      0.01     0.62
Average:     0.63     0.04      0.08     0.19      0.01     0.63
Appendix 2::Table S3: Association of global gene expression with SUMEX. Significance of correlation between (first)
the medium-dependent contribution of each gene to the realization of each of the two hypothetical cellular objectives
(biomass yield or SUMEX) and (second) measured expression of the gene on minimal media supplemented with one of
the specified carbon sources. The leftmost table specifies the p-values obtained for a 1-sided rank-sum test for the
alternative that non-essential genes that are beneficial with respect to the cellular objective (either SUMEX or biomass
yield) have higher expression than those deleterious towards that objective. The two other tables report the precision
and recall values when predicting highly-active genes (whose expression has a score of at least 1
0 .8
by picking genes
that were beneficial towards either SUMEX or biomass yield. Method from Wagner et al., personal communication.
A.2.2.8 Correlation of SUMEX and Biomass:
We calculated the correlation of SUMEX and Biomass (2-sided Spearman test), and
found that they correlate highly significantly on ds66, weakly on ds57, and
insignificantly on ds18 (see Fig. S5).
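As an illustration of the statistic reported here, the Spearman coefficient is the Pearson correlation of the ranks. The sketch below is a minimal pure-Python version with average ranks for ties; it computes only ρ, not the P-value, and the function names are hypothetical.

```python
import math


def _average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = _average_ranks(x), _average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)  # undefined (division by zero) if either input is constant
```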
[Figure S5 panels, each plotting SUMEX (y-axis) against Biomass (x-axis). (a) ds66, SUMEX vs. Biomass: Spearman ρ = 0.90, P < 1E-7. (b) ds57, SUMEX vs. Biomass: Spearman ρ = 0.41, P = 1.2E-3. (c) ds18, SUMEX vs. Biomass: Spearman ρ = 0.36, P = 0.14.]
Appendix 2::Figure S5: Correlation of Biomass with SUMEX. Plots of SUMEX versus Biomass are presented for (a)
ds66, (b) ds57, and (c) ds18, along with Spearman tests.
A.2.3 Supplementary Methods:
A.2.3.1 Models
Unless otherwise noted, analyses were done on genome-scale metabolic
reconstructions (GEMs) as obtained from SEED [25], at
http://seed-viewer.theseed.org/. The 66 organisms in ds66 were chosen because (1) their GEMs
were available from SEED and published in [25], and (2) their optimal doubling
times were available from [78]. For analysis of ds24, the iAF1260 E. coli model was
used, and the NCI60 cancer cell line analysis used custom-made models based on the
generic human model (described below). Table S4 lists the names of the ds66
models and organisms.
A.2.3.2 General methods
Linear Programming (LP) and Quadratic Programming (QP) calculations were done
using IBM Cplex software on an Intel based machine running Linux. The Spearman
correlation calculations and other analyses were done using either Matlab software or
Java.
Optimizations were run in in silico environments consistent with the known media,
where all exchange metabolites for a given species were available at a fixed rate of
-50.0. In the case of ds66, the environment was 'rich', so we allowed uptake flux in
all exchange reactions for all organisms. Other constraints are described in the
following section.
By convention, exchange fluxes denoting entrance of a metabolite into the cell
(uptake) are negative valued, while exchanges denoting exit of a metabolite from the
cell (output / secretion) are positive valued. Therefore, maximizing the total
exchange flux (i.e. the SUMEX metric) would denote maximizing the output at the
expense of the input (output exchanges – input exchanges).
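The effect of this sign convention can be seen in a small example. The reaction names and flux values below are invented purely for illustration.

```python
def sum_of_exchange(exchange_fluxes):
    """Plain signed sum of exchange fluxes (uptake < 0, secretion > 0)."""
    return sum(exchange_fluxes.values())


def uptake_and_secretion(exchange_fluxes):
    """Return (total uptake, total secretion) as non-negative magnitudes."""
    uptake = -sum(v for v in exchange_fluxes.values() if v < 0)
    secretion = sum(v for v in exchange_fluxes.values() if v > 0)
    return uptake, secretion


# Hypothetical flux distribution: one nutrient taken up, two products secreted.
fluxes = {"EX_glucose": -10.0, "EX_acetate": 4.0, "EX_co2": 8.0}
uptake, secretion = uptake_and_secretion(fluxes)
total = sum_of_exchange(fluxes)
```

Because uptake enters the sum with a negative sign, the signed sum equals secretion minus uptake, so maximizing it rewards output and penalizes input at once.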
A.2.3.3 Reactions constraints and optimal environment setting
Unless stated otherwise, we used the following constraints on the reaction fluxes,
and in the definition of rich media:
For irreversible reactions:
Exchange reactions:
0 ≤ Vi,ex ≤ Vi,Max_ex (Vi,Max_ex = 1000)
Non-exchange reactions:
0 ≤ Vi ≤ Vi,Max (Vi,Max = 1000)
For reversible reactions:
Exchange reactions:
Vi,Min_ex ≤ Vi,ex ≤ Vi,Max_ex (Vi,Min_ex = -50, Vi,Max_ex = 1000)
Non-exchange reactions:
Vi,Min ≤ Vi ≤ Vi,Max (Vi,Min = -1000, Vi,Max = 1000)
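These default bounds can be written as a small lookup function. The sketch below encodes only the four cases listed above; the function name is hypothetical.

```python
def default_bounds(is_exchange, is_reversible):
    """Return (lower, upper) flux bounds for a reaction under the rich-media defaults."""
    upper = 1000.0
    if not is_reversible:
        # Irreversible reactions, exchange or not, carry only non-negative flux.
        return 0.0, upper
    # Reversible exchange reactions allow uptake down to -50;
    # reversible internal reactions may run backwards down to -1000.
    lower = -50.0 if is_exchange else -1000.0
    return lower, upper
```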
A.2.3.4 Building NCI60 cancer cell models
Our method to reconstruct the NCI60 cancer cell lines (based on the yet unpublished
methods in Yizhak et al, personal communication) required several key inputs: (a)
the generic human model [20], (b) gene expression data for each cancer cell line
from [87], and (c) growth rate measurements (Note: the growth rates were used only
to determine which genes should be used in constraining the models, in order to
obtain models that were as physiologically relevant as possible; they were not used to
determine the weight on the bounds, etc.). The algorithm then reconstructs a specific
metabolic model for each cell line by modifying the upper bounds of reactions in
accordance with the expression of the individual gene microarray values.
Specifically, the model reconstruction process is as follows:
(1) Decompose reversible reactions into unidirectional forward and backward
reactions.
(2) Evaluate the correlation between the expression of each reaction in the
network and the measured growth rate. The expression of a reaction is
defined as the mean over the expression of the enzymes catalyzing it.
(3) Modify upper bounds on reactions demonstrating significant correlation
to the growth rate (after correcting for multiple hypothesis testing using
FDR) in a manner that is linearly related to expression value.
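Steps (1) and (2) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the unpublished method itself: the `_fwd`/`_bwd` naming and the handling of genes without expression data are choices made here, and step (3) is omitted because the exact linear mapping is not given.

```python
def split_reversible(name, lower, upper):
    """Decompose a reaction into unidirectional parts with non-negative flux.

    A reaction that cannot run backwards (lower >= 0) is returned unchanged.
    """
    if lower >= 0:
        return [(name, lower, upper)]
    forward = (name + "_fwd", 0.0, max(upper, 0.0))
    backward = (name + "_bwd", 0.0, -lower)
    return [forward, backward]


def reaction_expression(enzyme_genes, gene_expression):
    """Mean expression over the enzymes catalyzing a reaction.

    Genes missing from the expression data are skipped (an assumption);
    returns None if no catalyzing gene has a measurement.
    """
    values = [gene_expression[g] for g in enzyme_genes if g in gene_expression]
    if not values:
        return None
    return sum(values) / len(values)
```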
We were able to produce feasible models for 60 cell lines using this procedure, and
these 60 were used for the analyses presented in the chapter. The NCI60 models thus
described were optimized for either biomass yield or SUMEX to obtain the results in
Fig. 3.
A.2.3.5 Computation of metrics
Following is an explanation of the exact way we calculated each of the metrics listed
in Fig. 1A:
Sum of exchange fluxes (SUMEX):
The sum of exchange fluxes (SUMEX) follows this procedure:
1. In addition to standard uptake constraints (see previous sections), we set a
lower limit on biomass yield at 5% of its maximum, as determined by FBA
on the given medium.
2. We search within this space for the max achievable exchange flux (secretion
– uptake; calculated as the sum of exchange fluxes). SUMEX is the optimal
value.
This can be represented mathematically as:
\[
\max_{v_j \in V} \; \sum_{i=1}^{n} V_i^{exchange}
\]
Subject to:
\[
V_{biomass} \ge V_{biomass}^{min} = 0.05 \, V_{biomass}^{max}, \qquad
S \cdot V = 0, \qquad
v_{j,min} \le v_j \le v_{j,max}
\]
Where S is the stoichiometric matrix of metabolites and V is the vector of reactions
that together define the metabolic model. SV = 0 defines the steady state of the
metabolic model, and the limits on Vi are as defined in the reactions constraints
section. Vbiomass is the flux through the biomass reaction, and Vmaxbiomass is the
maximal achievable biomass yield, as determined through maximization of the
biomass objective function (see next section). Because all exchange fluxes by
definition point outwards (i.e., positive flux denotes secretion), the sum of exchanges
intrinsically minimizes metabolic uptake and maximizes metabolic secretion in a
single optimization. In practice we exclude Vbiomass from the Vexchange vector when
calculating SUMEX, but adding Vbiomass back in has no significant effect on the
solution (since Vbiomass is typically very small in the SUMEX solution). The process
is illustrated schematically in Fig. S6.
[Figure S6 schematic. Panel labels: Catabolism + H+ gradient; Anabolism. Example sums of exchange (SUMEX) shown: 2 - 1 = 1 and 1 - 2 = -1. Growth rate scale: low to high.]
Appendix 2::Figure S6: Schematic of SUMEX. The summing of molar fluxes through exchange reactions, i.e., the
quantity maximized in SUMEX, is illustrated. A high SUMEX value is achieved by high output fluxes and low input
fluxes. This is achieved mathematically through a single optimization, due to the sign convention that all exchange
fluxes by default point outwards.
Maximal biomass objective function
This is the standard method for determining maximal biomass yield in a given
environment using GEMs. We have taken the biomass function defined by the
automatic metabolic models generator [25] and we calculated its value when each of
the organisms was grown in its given media.
The objective function solved was:
\[
\max_{v_j \in V} \; V_{biomass}
\]
Subject to:
\[
S \cdot V = 0, \qquad
v_{j,min} \le v_j \le v_{j,max}
\]
Where S is the stoichiometric matrix of metabolites and V is the vector of reactions
that together define the metabolic model. SV = 0 defines the steady state of the
metabolic model, and the limits on Vi are as defined in the reactions constraints
section. This metric has been described extensively elsewhere (e.g., [5, 149]).
Codon usage bias:
This metric was described in [78]. Codon usage biases for the 66 organisms of
interest for our study were kindly provided by Vieira-Silva and Rocha.
Uptake exchange reactions count
This topological metric provides a simple sum of the number of uptake exchange
reactions in a model (i.e., exchange reactions through which flux can enter the
organism).
All exchange reactions count
This topological metric provides a simple sum of the total number of exchange
reactions of the organism.
Maximize biomass with all critical uptake metabolites limited
This metric assesses the maximal biomass achievable under a limited uptake
environment. For this analysis, all critical uptake reactions had their fluxes limited to
-10.0 (negative indicating entrance into the cell, by standard convention). 'Critical'
uptake reactions are those whose metabolites are fully consumed when the organism
is grown in an optimal environment. Other than the change in constraints, the
maximization was identical to the maximization of biomass metric.
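One way to operationalize this is sketched below. It assumes that "fully consumed" means the exchange flux sits at its negative lower bound in the optimal solution; that reading, the tolerance, and the function names are assumptions made here for illustration.

```python
def critical_uptake_reactions(exchange_fluxes, lower_bounds, tol=1e-6):
    """Exchange reactions whose uptake flux sits at its negative lower bound."""
    return sorted(
        rxn for rxn, flux in exchange_fluxes.items()
        if lower_bounds[rxn] < 0 and abs(flux - lower_bounds[rxn]) <= tol
    )


def limit_critical_uptake(lower_bounds, critical, limit=-10.0):
    """Return a copy of the bounds with critical uptakes restricted to `limit`."""
    limited = dict(lower_bounds)
    for rxn in critical:
        # max() tightens the bound: -10 admits less uptake than -50.
        limited[rxn] = max(limited[rxn], limit)
    return limited
```

Biomass would then be maximized again under the tightened bounds.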
Minimize molar carbon consumption per biomass unit
This metric is predicated on the hypothesis that evolution has driven selection for the
most efficient usage of carbon in production of biomass. It is based on a metric from
[77], except that instead of 'glucose' we minimize molar carbon uptake, as our models
are grown in complex media.
We calculated this objective function in 2 steps:
Step 1)
Calculate the maximum biomass of the organisms when grown in a given medium (see
biomass objective description for details).
Step 2)
Calculate the maximum of the sum of exchange reactions that contain carbon and
that are able to carry flux, while fixing the biomass flux at its maximum value.
Note: Because of the sign conventions on fluxes, when maximizing the flux of an
uptake exchange reaction we are actually minimizing the uptake of the specific
exchange metabolite represented by this reaction, as uptake fluxes have a negative
value in our models.
The Linear program solved by the second step is:
\[
\max_{v_j \in V} \; \sum_{i \in V_{sc}} V_i
\]
Subject to:
\[
V_{biomass} = V_{biomass}^{max}, \qquad
S \cdot V = 0, \qquad
v_{j,min} \le v_j \le v_{j,max}
\]
Where Vsc is the group of uptake exchange reactions that are able to carry flux and
that contain carbon in their exchange metabolite.
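Selecting this group requires knowing which exchange metabolites contain carbon. The sketch below assumes each exchange metabolite comes with a simple molecular formula such as C5H10O5; it does not handle parentheses or charges, and the function names are hypothetical.

```python
import re

# An element symbol is one capital letter optionally followed by one lowercase
# letter, so "Ca", "Cl" and "Co" are not mistaken for carbon.
_ELEMENT = re.compile(r"([A-Z][a-z]?)(\d*)")


def atom_counts(formula):
    """Count atoms per element in a flat molecular formula."""
    counts = {}
    for element, number in _ELEMENT.findall(formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def carbon_uptake_reactions(exchange_formulas, lower_bounds):
    """Exchange reactions that can carry uptake flux and whose metabolite has carbon."""
    return sorted(
        rxn for rxn, formula in exchange_formulas.items()
        if atom_counts(formula).get("C", 0) > 0 and lower_bounds[rxn] < 0
    )
```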
Reactions count
Here we took the total count of reactions in the model, with the idea that a larger
metabolism might correlate with a faster growth rate.
Maximize sum of network flux
Here we determined the sum of fluxes in the network, as an indicator of the general
activity level of the metabolic network. We computed as follows:
\[
\max_{v_j \in V} \; \sum_{j} V_j
\]
Subject to:
\[
S \cdot V = 0, \qquad
v_{j,min} \le v_j \le v_{j,max}
\]
Maximal biomass per squared flux unit
This method assesses the ability of a GEM to produce biomass while minimizing
enzyme usage, as measured through the following formula:
\[
\max \; \frac{V_{biomass}}{\sum_i V_i^2}
\]
We calculated this objective function in 2 steps:
Step 1)
Maximum biomass was calculated in an optimal environment.
Step 2)
Fixing biomass to its maximum value, we minimized the sum of the squared fluxes of
the organism:
\[
\min_{v_j \in V} \; \sum_{i \in \text{all reactions}} V_i^2
\]
Subject to:
\[
V_{biomass} = V_{biomass}^{max}, \qquad
S \cdot V = 0, \qquad
v_{j,min} \le v_j \le v_{j,max}
\]
Maximize biomass under limited phosphate molar uptake
For this metric and a number of others, we assessed maximal biomass under limited
nutrient uptake conditions. In this metric, we limited phosphate uptake. Specifically,
we solved the following optimization problem:
\[
\max_{v_j \in V} \; V_{biomass}
\]
Subject to:
\[
\sum_{p \,\in\, \text{seed exchange reactions containing phosphate}} v_p \cdot \mathrm{Phosphate\_count}(V_p) \ge -10.0, \qquad
S \cdot V = 0, \qquad
v_{j,min} \le v_j \le v_{j,max}
\]
Where Phosphate_count (Vp) is the molar count of phosphate in the uptake exchange
reactions that are able to carry flux.
We limited the total molar amount of phosphate to a value of -10.0, as we observed
that providing higher levels did not limit growth, while reducing the limit was too
limiting for some organisms, leading to minimal growth and reduced correlation.
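The limited quantity is a weighted sum of uptake fluxes, each multiplied by the number of phosphate groups in its metabolite. The sketch below checks a flux distribution against such a limit. It assumes the limit acts as a lower bound on the signed sum (uptake being negative); the reaction names and values are invented for illustration.

```python
def total_molar_uptake(fluxes, molar_counts):
    """Signed sum of v_p * count_p over reactions currently carrying uptake (v_p < 0)."""
    return sum(
        flux * molar_counts.get(rxn, 0)
        for rxn, flux in fluxes.items()
        if flux < 0
    )


def within_uptake_limit(fluxes, molar_counts, limit=-10.0, tol=1e-9):
    """True if the total molar uptake does not exceed the limit in magnitude."""
    return total_molar_uptake(fluxes, molar_counts) >= limit - tol


# Hypothetical example: two phosphate-containing nutrients and one without.
fluxes = {"EX_phosphate": -4.0, "EX_g6p": -3.0, "EX_glucose": -20.0, "EX_co2": 5.0}
phosphate_counts = {"EX_phosphate": 1, "EX_g6p": 1, "EX_glucose": 0}
```

The same check applies to the nitrogen and carbon limits by swapping in the corresponding molar counts and limit values.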
Maximize biomass under limited nitrogen molar uptake
This metric is the same as the phosphate limitation metric, except nitrogen is limited
instead. We limited the total molar amount of nitrogen to a value of -100.0 for the
rationale described for the phosphate limitation metric.
Maximize biomass under limited carbon molar uptake
This metric is the same as the phosphate limitation metric, except carbon is limited
instead. We limited the total molar amount of carbon to a value of -1000.0 for the
rationale described for the phosphate limitation metric.
Maximize ATP maintenance (i.e., hydrolysis) reaction
This metric assesses the maximal molar amount of ATP that can be charged from
ADP in the cell, given a set of inputs as media. ATP production, which is a measure
of efficiency of energy production, is often considered as an alternative metric to
biomass in genome-scale models. Production of more energy from a fixed set of
cellular uptakes would thus logically be associated with stronger or faster growth.
The rationale behind this metric is that evolution drives maximal energetic efficiency.
As none of the models contained an ATP maintenance (i.e., hydrolysis) reaction, we
added that reaction:
ATP + H2O -> ADP + H + Phosphate.
The linear problem computed is to maximize this ATP hydrolysis (also called 'ATP
maintenance') reaction:
\[
\max_{v_j \in V} \; V_{atp}
\]
Subject to:
\[
S \cdot V = 0, \qquad
v_{j,min} \le v_j \le v_{j,max}
\]
(Where Vatp is the ATP maintenance reaction).
Maximal ATP maintenance (i.e., hydrolysis) per squared flux unit
This method is based on a hypothesis that cells operate to maximize ATP
maintenance yield (i.e., the total amount of ATP that can be charged in a given
environment) while minimizing enzyme usage. The total metric can be stated as
follows:
\[
\max \; \frac{V_{ATP}}{\sum_i V_i^2}
\]
We calculated this objective function in 2 steps:
Step 1)
Calculate maximum ATP maintenance flux as described under 'Maximize ATP
maintenance reaction.'
Step 2)
Calculate the minimum sum of squares of all fluxes of the organism when we fix the
ATP maintenance flux of the organism at the value from Step 1, using the following
optimization:
\[
\min_{v_j \in V} \; \sum_{i \in V_{all\ reaction}} V_i^2
\]
Subject to:
\[
S \cdot V = 0, \qquad
v_{j,min} \le v_j \le v_{j,max}
\]
where Vall reaction is the set of all the reactions in the metabolic model of the organism.
A.2.3.6 Growth experiments of 6 organisms on 3 defined IMM media (ds18)
To validate SUMEX, we performed in vitro experiments to measure the growth rates
of a number of organisms (listed in Table S5) in multiple environments. Growth
experiments were conducted in 96-well plates at 30°C, with continuous shaking,
using a Biotek ELX808IU-PC microplate reader. Optical density was measured
every 15 minutes at a wavelength of 595 nm. Growth rates were determined during
early to mid exponential growth phase by taking the slope of a linear fit through the
natural log of the data.
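The growth rate estimate described here is the least-squares slope of ln(OD) against time. A minimal sketch follows; it assumes the readings passed in have already been restricted to the early-to-mid exponential window and that time is expressed in hours, neither of which is specified in the text.

```python
import math


def growth_rate(times, optical_densities):
    """Least-squares slope of ln(OD) versus time."""
    if len(times) != len(optical_densities) or len(times) < 2:
        raise ValueError("need at least two paired readings")
    logs = [math.log(od) for od in optical_densities]
    n = len(times)
    mean_t = sum(times) / n
    mean_l = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (l - mean_l) for t, l in zip(times, logs))
    return sxy / sxx


# Synthetic readings every 15 minutes (0.25 h) from a culture growing at 0.3 per hour.
times = [0.25 * k for k in range(8)]
ods = [0.01 * math.exp(0.3 * t) for t in times]
rate = growth_rate(times, ods)
```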
Using models taken from SEED [25], we calculated various growth metrics (see Fig.
1A) in in silico environments mirroring the environments from the in vitro
experiments. Table S6 contains the environment [147] used in vitro (and in silico)
and the changes done to it in the different experiments.
A.2.4 Tables
Appendix 2::Table S4: Description of ds66. Description of the 66 organisms that were used in the article, including
categorization into respirers and obligate fermenters (and the sources used to determine those categories). Biomass and
doubling times are for growth in an optimal environment. (The doubling times are from [78]).
The table will be provided upon request to mail: [email protected].
Organism: Medium: Growth Rate:
Agrobacterium tumefaciens str. c58 IMM 0.09
Bacillus subtilis subsp. subtilis str. 168_4 IMM 0.32
Escherichia coli W3110 IMM 0.17
Listeria innocua Clip11262 IMM 0.09
Pseudomonas aeruginosa PAO1 IMM 0.48
Serratia marcescens IMM 0.45
Agrobacterium tumefaciens str. c58 IMM-gt 0.05
Bacillus subtilis subsp. subtilis str. 168_4 IMM-gt 0.13
Escherichia coli W3110 IMM-gt 0.04
Listeria innocua Clip11262 IMM-gt 0.00
Pseudomonas aeruginosa PAO1 IMM-gt 0.58
Serratia marcescens IMM-gt 0.25
Agrobacterium tumefaciens str. c58 IMMxt 0.23
Bacillus subtilis subsp. subtilis str. 168_4 IMMxt 0.32
Escherichia coli W3110 IMMxt 0.15
Listeria innocua Clip11262 IMMxt 0.21
Pseudomonas aeruginosa PAO1 IMMxt 0.36
Serratia marcescens IMMxt 0.40
Appendix 2::Table S5: In vitro growth experiments (i.e., ds18). This table provides a list of in vitro growth experiments performed in our lab for validation of SUMEX. The table lists the species and the environments used. Simulations for Serratia
marcescens were done using an in silico model of S. odorifera 4Rx14.796
Metabolite In vitro medium In silico medium
Thiamin + +
D-Methionine + +
Magnesium + +
L-Valine + +
L-Isoleucine + +
L-Leucine + +
L-Histidine + +
Calcium + +
D-Glucose-6-phosphate + +
Potassium + +
Citrate + +
L-Arginine + +
L-Tryptophan + +
L-Phenylalanine + +
Biotin + +
Riboflavin + +
Adenine + +
Pyridoxal + +
Nicotinamide_D-ribonucleotide + +
L-Glutamine + +
L-Cysteine + +
Lipoic acid + -
para-aminobenzoic acid + -
Oxygen + +
Cytosine - +
Zinc - +
Cobalt - +
Fe2+ - +
Chloride - +
Sulfate - +
Copper2 - +
Manganese - +
Spermidine - +
gly-asn-L - +
sn-Glycerol-3-phosphate - +
Octadecanoate - +
Additions done for the enlarged IMM environment (IMMxt):
Xylose C5H10O5
Deoxythymidine C10H14N2O5
Removals done for the reduced IMM environment (IMM-gt):
Thiamin C12H17N4OS
D-Glucose_6-phosphate C6H12O9P
Appendix 2::Table S6: IMM defined medium.
This table provides the IMM defined medium [147] and its in silico representation. IMM was also modified to generate two alternate media.
Appendix 3. Supplementary data for Chapter 4
A.3.1 Figures
Appendix 3::Supplementary Figure 1: KEGG Glycans. a-e: Examples of different types of glycans found in the KEGG
Glycans database. (a) A textual representation of the KCF graph for glycan G00010. (b) A visual representation of the
KCF graph for glycan G00010. (c) A regular glycan. (d) A linear repeating glycan. (e) A non-linear repeating glycan. (f)
Reconstruction of GlyDe reactions. The GlyDe algorithm receives a glycan graph structure and an EC number as input
and generates the appropriate glycan products, while considering the following rules: Glycosidic linkages hydrolyzed,
Endo vs. exo acting enzyme, Degree of polymerization preference, Reducing vs. non-reducing end preference, Contained
sub-glycan, Glycan Released. In the example depicted in the figure, these fields had the following values respectively:
Glc a1-2 Glc, exo, 10+, non-reducing, unknown, TAU00015 (the glycan ID of glucose).
Appendix 3::Supplementary Figure 2: Glycan Degradation of the gut microbiota reference genomes. (a) A cross-
validation process was performed to see that GlyDe reaction products are enriched with existing glycans rather than
hypothetical ones. The Venn diagram depicts a significant overlap between the products created by GlyDe (green) and
the original glycans in the KEGG Glycan database (purple). (b) Principal Coordinates Analysis (PCoA) of the glycan
degradation profiles of the species, colored according to their respective phyla. (c) The bar chart depicts the median
GlyDe score of the species belonging to each genus, colored according to their respective phyla. (d) A heatmap
denoting the average relative glycan degradation efficiency of the different bacterial genera in the study. Each entry was
calculated based on the average sum of GlyDe scores per genus for a specific degree of polymerization (DP) category
(e.g. Disaccharides) and normalized by the overall sum of GlyDe scores of all the genera for the same DP category.
[Supplementary Figure 3, panels a and b. Axis label: PCo1 = 46.07%. Legend: Carnivores, Herbivores, Omnivores.]
Appendix 3::Supplementary Figure 3: The connection between glycan degradation and diet. (a) The Muegge et al.
dataset. Principal Coordinate Analysis of the GlyDe profiles of all the samples. The first principal coordinate shows a
gradient from Herbivores (red) through Omnivores (green) to Carnivores (blue). (b & c) The Yatsunenko
et al. dataset. (b) The box plots represent the variation in GlyDe profiles of the samples over the first principal
coordinate, separated into bins according to the age of the host. (c) Box plots showing the differences between the
Animal to Plants-specific GlyDe score ratios of adults in Malawi, Venezuela and United States of America.
A.3.2 Tables
Due to their size, the supplementary tables will be supplied upon request to mail:
Appendix 3::Supplementary Table 1: CAZymes degradation rules. A description of the manually curated GlyDe rules
that describe each CAZyme (EC number) by the following fields: Enzyme Name, KEGG Reactions, Glycosidic
Linkages Hydrolyzed, Contained Sub-glycan, Glycan Released, Endo vs. Exo, DP preference, Terminal Side Preference,
Enzymatic Reaction, and Comments.
Appendix 3::Supplementary Table 2: Comparison between KEGG and non-KEGG GlyDe Scores. A comparison of all
GlyDe scores (column 3) to whether they are degraded in KEGG or not (column 4). The first column indicates the SEED
ID of the HMP taxon and the second column indicates the Glycan ID in KEGG.
Appendix 3::Supplementary Table 3: The GlyDe outputs for all the HMP taxa. The GlyDe outputs for all the HMP taxa
including their NCBI Taxon IDs, scientific names, number of unique CAZymes, number of total CAZymes, GlyDe
scores, number of glycans degraded according to KEGG, number of glycans degraded according to GlyDe, the Animal-,
Plant- and Bacteria-specific GlyDe scores, and GlyDe scores for Disaccharides, Oligosaccharides, Short Polysaccharides
and Long Polysaccharides.
Appendix 3::Supplementary Table 4: The GlyDe scores of 8 Human Milk Oligosaccharides (HMOs) available in KEGG. MSMFLNnH (G02805), LNFP-I (G00535), LNFP-II (G00623), LNFP-III (G00557), MFLNH-I (G02935), MFLNH-III
(G00680), MFpLNH-IV (G01926), IFLNH-I (G12225). KEGG glycans were manually classified as HMOs based on
structural similarity to glycans found in Zivkovic et al. [150].
Appendix 3::Supplementary Table 5: The CAZyme table. The values in the table represent the efficiency in which
different CAZymes (rows) break different glycans (columns). A detailed explanation is given in Figure 1b and the
methods section.
Appendix 3::Supplementary Table 6: A detailed account of the glycans used throughout this analysis. The table
includes: Glycan ID, Name, Composition, Degree of Polymerization, Class and Biological Origin.
Appendix 3::Supplementary Table 7: The full list of monosaccharides and other basic chemical entities used as nodes in
the graphs representing glycan structures in KEGG and incorporated into our system. Glycans which contain unknown
nodes were filtered out. Other columns indicate the internal ID used by our platform, the KEGG Compound number
and the SEED cpd number.
Appendix 3::Supplementary Table 8: An OTU table representing HMP taxa (columns) found in HMP samples (rows)
according to 16S rRNA sequence similarity and reconstructed using QIIME (Methods).
Appendix 3::Supplementary Table 9: The HMP bacterial reference genomes glycan degradation (GlyDe) matrix. Each
entry in the matrix represents the GlyDe score of a glycan (column) by a bacterial taxon (row). The IDs of the bacterial
taxa are taken from SEED [112].
Appendix 3::Supplementary Table 10: The Muegge et al. samples CAZymes matrix. Each entry in the matrix represents
the abundance of a CAZyme (column) by a sample (row).
Appendix 3::Supplementary Table 11: The Yatsunenko et al. samples CAZymes matrix. Each entry in the matrix
represents the abundance of a CAZyme (column) by a sample (row).
Appendix 3::Supplementary Table 12: The Muegge et al. samples GlyDe matrix. Each entry in the matrix represents
the GlyDe score of a glycan (column) by a sample (row).
Appendix 3::Supplementary Table 13: The Yatsunenko et al. samples GlyDe matrix. Each entry in the matrix
represents the GlyDe score of a glycan (column) by a sample (row).
Appendix 3::Supplementary Table 14: The Muegge et al. samples GlyDe output report. The report includes the sample ID, number
of unique CAZymes, the number of total CAZymes, the Total GlyDe scores, the number of glycans degraded according
to KEGG, the number of glycans degraded according to GlyDe, the Plant-, Animal- and Bacteria-specific GlyDe scores,
and GlyDe scores for Disaccharides, Oligosaccharides, and short and long Polysaccharides.
Appendix 3::Supplementary Table 15: The Yatsunenko et al. samples GlyDe output report. The fields are similar to
Supplementary Table 14.
Appendix 3::Supplementary Table 16: Host diet predictions for 18 human samples taken from Muegge et al. An SVM
classifier was trained based on the GlyDe profiles of herbivore and carnivore animals and used to classify the human
samples.
Appendix 3::Supplementary Table 17: Taxa with highly predictable abundance. The table shows the standard error in
predicted abundance of the HMP bacterial taxa in the two abundance-derived clusters.
Bibliography
1. Kell, D.B., Systems biology, metabolic modelling and metabolomics in drug
discovery and development. Drug Discov Today, 2006. 11(23-24): p. 1085-
92.
2. Hucka, M., et al., The systems biology markup language (SBML): a medium
for representation and exchange of biochemical network models.
Bioinformatics, 2003. 19(4): p. 524-31.
3. Erdi, P. and J. Toth, Mathematical models of chemical reactions: theory
and applications of deterministic and stochastic models. 1989, Princeton,
N.J.: Princeton University Press. xxiv, 259 p.
4. Price, N.D., J.L. Reed, and B.O. Palsson, Genome-scale models of microbial
cells: evaluating the consequences of constraints. Nat Rev Microbiol, 2004.
2(11): p. 886-97.
5. Orth, J.D., I. Thiele, and B.O. Palsson, What is flux balance analysis? Nat
Biotechnol, 2010. 28(3): p. 245-8.
6. Kauffman, K.J., P. Prakash, and J.S. Edwards, Advances in flux balance
analysis. Curr Opin Biotechnol, 2003. 14(5): p. 491-6.
7. Mahadevan, R. and C.H. Schilling, The effects of alternate optimal solutions
in constraint-based genome-scale metabolic models. Metab Eng, 2003. 5(4):
p. 264-76.
8. Almaas, E., Z.N. Oltvai, and A.L. Barabasi, The activity reaction core and
plasticity of metabolic networks. PLoS Comput Biol, 2005. 1(7): p. e68.
9. Burgard, A.P., et al., Flux coupling analysis of genome-scale metabolic
network reconstructions. Genome Res, 2004. 14(2): p. 301-12.
10. Reed, J.L. and B.O. Palsson, Genome-scale in silico models of E. coli have
multiple equivalent phenotypic states: assessment of correlated reaction
subsets that comprise network states. Genome Res, 2004. 14(9): p. 1797-805.
11. Segre, D., D. Vitkup, and G.M. Church, Analysis of optimality in natural and
perturbed metabolic networks. Proc Natl Acad Sci U S A, 2002. 99(23): p.
15112-7.
12. Shlomi, T., O. Berkman, and E. Ruppin, Regulatory on/off minimization of
metabolic flux changes after genetic perturbations. Proc Natl Acad Sci U S
A, 2005. 102(21): p. 7695-700.
13. Ashburner, M., et al., Gene ontology: tool for the unification of biology. The
Gene Ontology Consortium. Nat Genet, 2000. 25(1): p. 25-9.
14. Kanehisa, M., et al., The KEGG resource for deciphering the genome.
Nucleic Acids Res, 2004. 32(Database issue): p. D277-80.
15. Karp, P.D., et al., The MetaCyc Database. Nucleic Acids Res, 2002. 30(1): p.
59-61.
16. Aziz, R.K., et al., SEED servers: high-performance access to the SEED
genomes, annotations, and metabolic models. PLoS One, 2012. 7(10): p.
e48053.
17. Thiele, I. and B.O. Palsson, A protocol for generating a high-quality genome-
scale metabolic reconstruction. Nat Protoc, 2010. 5(1): p. 93-121.
18. Feist, A.M., et al., A genome-scale metabolic reconstruction for Escherichia
coli K-12 MG1655 that accounts for 1260 ORFs and thermodynamic
information. Mol Syst Biol, 2007. 3: p. 121.
19. Forster, J., et al., Genome-scale reconstruction of the Saccharomyces
cerevisiae metabolic network. Genome Res, 2003. 13(2): p. 244-53.
20. Duarte, N.C., et al., Global reconstruction of the human metabolic network
based on genomic and bibliomic data. Proc Natl Acad Sci U S A, 2007.
104(6): p. 1777-82.
21. McCloskey, D., B.O. Palsson, and A.M. Feist, Basic and applied uses of
genome-scale metabolic network reconstructions of Escherichia coli. Mol
Syst Biol, 2013. 9: p. 661.
22. Shlomi, T., M.N. Cabili, and E. Ruppin, Predicting metabolic biomarkers of
human inborn errors of metabolism. Mol Syst Biol, 2009. 5: p. 263.
23. Jerby, L., T. Shlomi, and E. Ruppin, Computational reconstruction of tissue-
specific metabolic models: application to human liver metabolism. Mol Syst
Biol, 2010. 6: p. 401.
24. Zhuang, K., et al., The design of long-term effective uranium bioremediation
strategy using a community metabolic model. Biotechnol Bioeng, 2012.
109(10): p. 2475-83.
25. Henry, C.S., et al., High-throughput generation, optimization and analysis of
genome-scale metabolic models. Nat Biotechnol, 2010. 28(9): p. 977-82.
26. Maglott, D., et al., Entrez Gene: gene-centered information at NCBI. Nucleic
Acids Res, 2011. 39(Database issue): p. D52-7.
27. Stolyar, S., et al., Metabolic modeling of a mutualistic microbial community.
Mol Syst Biol, 2007. 3: p. 92.
28. Wintermute, E.H. and P.A. Silver, Emergent cooperation in microbial
metabolism. Mol Syst Biol, 2010. 6: p. 407.
29. Monod, J., The growth of bacterial cultures. Annu Rev Microbiol, 1949. 3: p.
371-394.
30. Chao, L., B.R. Levin, and F.M. Stewart, Complex Community in a Simple
Habitat - Experimental-Study with Bacteria and Phage. Ecology, 1977.
58(2): p. 369-378.
31. Whiteley, M., et al., Effects of community composition and growth rate on
aquifer biofilm bacteria and their susceptibility to betadine disinfection.
Environ Microbiol, 2001. 3(1): p. 43-52.
32. Maynard Smith, J., Optimization Theory in Evolution. Annual Review of
Ecology and Systematics, 1978. 9: p. 31-56.
33. Yim, H., et al., Metabolic engineering of Escherichia coli for direct
production of 1,4-butanediol. Nat Chem Biol, 2011. 7(7): p. 445-52.
34. Fong, S.S. and B.O. Palsson, Metabolic gene-deletion strains of Escherichia
coli evolve to computationally predicted growth phenotypes. Nat Genet,
2004. 36(10): p. 1056-8.
35. Adadi, R., et al., Prediction of microbial growth rate versus biomass yield by
a metabolic network with kinetic parameters. PLoS Comput Biol, 2012. 8(7):
p. e1002575.
36. Freilich, S., et al., Competitive and cooperative metabolic interactions in
bacterial communities. Nat Commun, 2011. 2: p. 589.
37. Gause, G.F., Experimental studies on the struggle for existence. Journal of
Experimental Biology, 1932. 9: p. 389-402.
38. Darwin, C., On the origin of species by means of natural selection, or, The
preservation of favoured races in the struggle for life. Dover giant thrift ed.
2006, Mineola, NY: Dover Publications.
39. Boer, P.J.d., The present status of the competitive exclusion principle. Trends
in Ecology & Evolution, 1986. 1(1): p. 25-28.
40. Marx, C.J., Microbiology. Getting in touch with your friends. Science, 2009.
324(5931): p. 1150-1.
41. Fuhrman, J.A., Microbial community structure and its functional
implications. Nature, 2009. 459(7244): p. 193-9.
42. Lotem, A., M.A. Fishman, and L. Stone, Evolution of cooperation between
individuals. Nature, 1999. 400(6741): p. 226-7.
43. Hibbing, M.E., et al., Bacterial competition: surviving and thriving in the
microbial jungle. Nat Rev Microbiol, 2010. 8(1): p. 15-25.
44. Gross, K., Positive interactions among competitors can produce species-rich
communities. Ecol Lett, 2008. 11(9): p. 929-36.
45. Diamond, J.M. and M.E. Gilpin, Examination of the "null" model of Connor
and Simberloff for species co-occurrences on islands. Oecologia, 1982. 52(1):
p. 64-74.
46. Connor, E.F. and D. Simberloff, The assembly of species communities: Chance or
competition? Ecology, 1979. 60: p. 1132-1140.
47. Klitgord, N. and D. Segre, Environments that induce synthetic microbial
ecosystems. PLoS Comput Biol, 2010. 6(11): p. e1001002.
48. Ebenhoh, O. and T. Handorf, Functional classification of genome-scale
metabolic networks. EURASIP J Bioinform Syst Biol, 2009: p. 570456.
49. Freilich, S., et al., Decoupling Environment-Dependent and Independent
Genetic Robustness across Bacterial Species. PLoS Comput Biol, 2010. 6(2):
p. e1000690.
50. Freilich, S., et al., Metabolic-network-driven analysis of bacterial ecological
strategies. Genome Biol, 2009. 10(6): p. R61.
51. Gerhardson, B., Biological substitutes for pesticides. Trends Biotechnol,
2002. 20(8): p. 338-43.
52. Mani, R., et al., Defining genetic interaction. Proc Natl Acad Sci U S A,
2008. 105(9): p. 3461-6.
53. Klitgord, N. and D. Segre, Ecosystems biology of microbial metabolism. Curr
Opin Biotechnol, 2011: p. 541-546.
54. MacArthur, R., Species packing and competitive equilibrium for many
species. Theor Popul Biol, 1970. 1(1): p. 1-11.
55. Tilman, D., Resource competition between planktonic algae: experimental
and theoretical approach. Ecology, 1977. 58: p. 338–348.
56. Cherif, M. and M. Loreau, Stoichiometric constraints on resource use,
competitive interactions, and elemental cycling in microbial decomposers.
Am Nat, 2007. 169(6): p. 709-24.
57. Orr, H.A., Fitness and its role in evolutionary genetics. Nat Rev Genet, 2009.
10(8): p. 531-9.
58. Elevi Bardavid, R. and A. Oren, Dihydroxyacetone metabolism in
Salinibacter ruber and in Haloquadratum walsbyi. Extremophiles, 2008.
12(1): p. 125-31.
59. Mowery, D.C., J.E. Oxley, and B.S. Silverman, Technological overlap and
interfirm cooperation: implications for the resource-based view of the firm.
Research Policy, 1998. 27(5): p. 507-523.
60. Schink, B., Synergistic interactions in the microbial world. Antonie Van
Leeuwenhoek, 2002. 81(1-4): p. 257-61.
61. Wintermute, E.H. and P.A. Silver, Dynamics in the mixed microbial
concourse. Genes Dev, 2010. 24(23): p. 2603-14.
62. Labrenz, M. and J.F. Banfield, Sulfate-reducing bacteria-dominated biofilms
that precipitate ZnS in a subsurface circumneutral-pH mine drainage system.
Microb Ecol, 2004. 47(3): p. 205-17.
63. Kato, S., et al., Stable coexistence of five bacterial strains as a cellulose-
degrading community. Appl Environ Microbiol, 2005. 71(11): p. 7099-106.
64. Chaffron, S., et al., A global network of coexisting microbes from
environmental and whole-genome sequence data. Genome Res, 2010.
65. Bell, T., et al., The contribution of species richness and composition to
bacterial services. Nature, 2005. 436(7054): p. 1157-60.
66. Gomes, N.C., et al., Effects of the inoculant strain Pseudomonas putida
KT2442 (pNF142) and of naphthalene contamination on the soil bacterial
community. FEMS Microbiol Ecol, 2005. 54(1): p. 21-33.
67. Hansen, S.K., et al., Evolution of species interactions in a biofilm community.
Nature, 2007. 445(7127): p. 533-6.
68. Wandersman, C. and P. Delepelaire, Bacterial iron sources: from
siderophores to hemophores. Annu Rev Microbiol, 2004. 58: p. 611-47.
69. Ley, R.E., et al., Microbial ecology: human gut microbes associated with
obesity. Nature, 2006. 444(7122): p. 1022-3.
70. Hirschman, L., et al., Habitat-Lite: a GSC case study based on free text terms
for environmental metadata. OMICS, 2008. 12(2): p. 129-36.
71. Ogata, H., et al., KEGG: Kyoto Encyclopedia of Genes and Genomes.
Nucleic Acids Res, 1999. 27(1): p. 29-34.
72. Altschul, S.F., et al., Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs. Nucleic Acids Res, 1997. 25(17): p.
3389-402.
73. Oberhardt, M.A., B.O. Palsson, and J.A. Papin, Applications of genome-scale
metabolic reconstructions. Mol Syst Biol, 2009. 5: p. 320.
74. Oh, Y.K., et al., Genome-scale reconstruction of metabolic network in
Bacillus subtilis based on high-throughput phenotyping and gene essentiality
data. J Biol Chem, 2007.
75. Schuster, S., T. Pfeiffer, and D.A. Fell, Is maximization of molar yield in
metabolic networks favoured by evolution? J Theor Biol, 2008. 252(3): p.
497-504.
76. Knorr, A.L., R. Jain, and R. Srivastava, Bayesian-based selection of
metabolic objective functions. Bioinformatics, 2006.
77. Schuetz, R., L. Kuepfer, and U. Sauer, Systematic evaluation of objective
functions for predicting intracellular fluxes in Escherichia coli. Mol Syst
Biol, 2007. 3: p. 119.
78. Vieira-Silva, S. and E.P. Rocha, The systemic imprint of growth and its uses
in ecological (meta)genomics. PLoS Genet, 2010. 6(1): p. e1000808.
79. Groeneveld, P., A.H. Stouthamer, and H.V. Westerhoff, Super life--how and
why 'cell selection' leads to the fastest-growing eukaryote. FEBS J, 2009.
276(1): p. 254-70.
80. Schuetz, R., et al., Multidimensional optimality of microbial metabolism.
Science, 2012. 336(6081): p. 601-4.
81. Vazquez, A., et al., Impact of the solvent capacity constraint on E. coli
metabolism. BMC Syst Biol, 2008. 2: p. 7.
82. Soga, N., et al., Kinetic equivalence of transmembrane pH and electrical
potential differences in ATP synthesis. J Biol Chem, 2012. 287(12): p. 9633-
9.
83. Fischer, S. and P. Graber, Comparison of ΔpH- and Δφ-driven ATP
synthesis catalyzed by the H(+)-ATPases from Escherichia coli
or chloroplasts reconstituted into liposomes. FEBS Lett, 1999. 457(3): p.
327-32.
84. Brooijmans, R., et al., Heme and menaquinone induced electron transport in
lactic acid bacteria. Microb Cell Fact, 2009. 8: p. 28.
85. Liu, M., et al., Global transcriptional programs reveal a carbon source
foraging strategy by Escherichia coli. J Biol Chem, 2005. 280(16): p. 15921-
7.
86. Discovery_Sciences. Available from:
http://dtp.nci.nih.gov/docs/misc/common_files/cell_list.html.
87. Lee, J.K., et al., A strategy for predicting the chemosensitivity of human
cancers and its application to drug discovery. Proc Natl Acad Sci U S A,
2007. 104(32): p. 13086-91.
88. Lozupone, C., et al., Diversity, stability and resilience of the human gut
microbiota. Nature, 2012. 489(7415): p. 220-230.
89. O'Keefe, S.J., Nutrition and colonic health: the critical role of the
microbiota. Curr Opin Gastroenterol, 2008. 24(1): p. 51-8.
90. Goodman, A. and J. Gordon, Our unindicted coconspirators: human
metabolism from a microbial perspective. Cell metabolism, 2010. 12(2): p.
111-116.
91. Holmes, E., et al., Gut microbiota composition and activity in relation to host
metabolic phenotype and disease risk. Cell metabolism, 2012. 16(5): p. 559-
564.
92. Tremaroli, V. and F. Bäckhed, Functional interactions between the gut
microbiota and host metabolism. Nature, 2012. 489(7415): p. 242-249.
93. Nicholson, J., et al., Host-gut microbiota metabolic interactions. Science
(New York, N.Y.), 2012. 336(6086): p. 1262-1267.
94. Clemente, J., et al., The impact of the gut microbiota on human health: an
integrative view. Cell, 2012. 148(6): p. 1258-1270.
95. Holmes, E., et al., Therapeutic modulation of microbiota-host metabolic
interactions. Science translational medicine, 2012. 4(137).
96. Gevers, D., et al., Bioinformatics for the Human Microbiome Project. PLoS
Computational Biology, 2012. 8.
97. Macfarlane, G. and S. Macfarlane, Models for intestinal fermentation:
association between food components, delivery systems, bioavailability and
functional interactions in the gut. Current opinion in biotechnology, 2007.
18(2): p. 156-162.
98. Van den Abbeele, P., et al., Microbial community development in a dynamic
gut model is reproducible, colon region specific, and selective for
Bacteroidetes and Clostridium cluster IX. Applied and environmental
microbiology, 2010. 76(15): p. 5237-5246.
99. Elia, M. and J. Cummings, Physiological aspects of energy metabolism and
gastrointestinal effects of carbohydrates. European journal of clinical
nutrition, 2007. 61 Suppl 1: p. 74.
100. Koropatkin, N., E. Cameron, and E. Martens, How glycan metabolism shapes
the human gut microbiota. Nature reviews. Microbiology, 2012. 10(5): p.
323-335.
101. Mahowald, M., et al., Characterizing a model human gut microbiota
composed of members of its two dominant bacterial phyla. Proceedings of the
National Academy of Sciences of the United States of America, 2009.
106(14): p. 5859-5864.
102. Cantarel, B., V. Lombard, and B. Henrissat, Complex carbohydrate
utilization by the healthy human microbiome. PloS one, 2012. 7(6).
103. Flint, H., et al., Interactions and competition within the microbial community
of the human colon: links between diet and health. Environmental
microbiology, 2007. 9(5): p. 1101-1111.
104. Barcenilla, A., et al., Phylogenetic relationships of butyrate-producing
bacteria from the human gut. Applied and environmental microbiology,
2000. 66(4): p. 1654-1661.
105. Macfarlane, G. and S. Macfarlane, Fermentation in the human large
intestine: its physiologic consequences and the potential contribution of
prebiotics. Journal of clinical gastroenterology, 2011. 45 Suppl: p. 7.
106. Broekaert, W.F., et al., Prebiotic and Other Health-Related Effects of Cereal-
Derived Arabinoxylans, Arabinoxylan-Oligosaccharides, and
Xylooligosaccharides. Critical Reviews in Food Science and Nutrition, 2011.
51.
107. Sonnenburg, J. and M. Fischbach, Community health care: therapeutic
opportunities in the human microbiome. Science translational medicine,
2011. 3(78).
108. Borenstein, E., Computational systems biology and in silico modeling of the
human microbiome. Briefings in bioinformatics, 2012. 13(6): p. 769-780.
109. Hashimoto, K., et al., KEGG as a glycome informatics resource.
Glycobiology, 2006. 16(5).
110. Altschul, S.F., et al., Basic local alignment search tool. J Mol Biol, 1990.
215(3): p. 403-10.
111. Cantarel, B., et al., The Carbohydrate-Active EnZymes database (CAZy): an
expert resource for Glycogenomics. Nucleic acids research, 2009.
37(Database issue): p. 8.
112. Aziz, R., et al., SEED servers: high-performance access to the SEED
genomes, annotations, and metabolic models. PloS one, 2012. 7(10).
113. Peterson, J., et al., The NIH Human Microbiome Project. Genome Research,
2009. 19.
114. Marcobal, A., et al., Bacteroides in the infant gut consume milk
oligosaccharides via mucus-utilization pathways. Cell host & microbe, 2011.
10(5): p. 507-514.
115. Sonnenburg, J.L., Glycan Foraging in Vivo by an Intestine-Adapted Bacterial
Symbiont. Science, 2005. 307.
116. Sonnenburg, E., et al., Specificity of polysaccharide use in intestinal
bacteroides species determines diet-induced microbiota alterations. Cell,
2010. 141(7): p. 1241-1252.
117. Marcobal, A. and J. Sonnenburg, Human milk oligosaccharide consumption
by intestinal microbiota. Clinical microbiology and infection : the official
publication of the European Society of Clinical Microbiology and Infectious
Diseases, 2012. 18 Suppl 4: p. 12-15.
118. Muegge, B.D., et al., Diet Drives Convergence in Gut Microbiome Functions
Across Mammalian Phylogeny and Within Humans. Science, 2011. 332.
119. Muegge, B., et al., Diet drives convergence in gut microbiome functions
across mammalian phylogeny and within humans. Science (New York, N.Y.),
2011. 332(6032): p. 970-974.
120. Ley, R.E., et al., Evolution of Mammals and Their Gut Microbes. Science,
2008. 320.
121. Yatsunenko, T., et al., Human gut microbiome viewed across age and
geography. Nature, 2012. 486(7402): p. 222-227.
122. Kelsen, J.R. and G.D. Wu, The gut microbiota, environment and diseases of
modern society. Gut Microbes, 2012. 3.
123. Wu, G., et al., Linking long-term dietary patterns with gut microbial
enterotypes. Science (New York, N.Y.), 2011. 334(6052): p. 105-108.
124. De Filippo, C., et al., Impact of diet in shaping gut microbiota revealed by a
comparative study in children from Europe and rural Africa. Proceedings of
the National Academy of Sciences, 2010. 107.
125. Ng, K.M., et al., Microbiota-liberated host sugars facilitate post-antibiotic
expansion of enteric pathogens. Nature, 2013.
126. FAOSTAT, Food and Agriculture Organization of the United Nations. Statistical
Database, 2009.
127. Langille, M.G., et al., Predictive functional profiling of microbial
communities using 16S rRNA marker gene sequences. Nat Biotechnol, 2013.
128. Henry, C., et al., High-throughput generation, optimization and analysis of
genome-scale metabolic models. Nature biotechnology, 2010. 28(9): p. 977-
982.
129. Freilich, S., et al., Competitive and cooperative metabolic interactions in
bacterial communities. Nature communications, 2011. 2: p. 589.
130. Thiele, I., A. Heinken, and R. Fleming, A systems biology approach to
studying the role of microbes in human health. Current opinion in
biotechnology, 2013. 24(1): p. 4-12.
131. Schellenberger, J., et al., Quantitative prediction of cellular metabolism with
constraint-based models: the COBRA Toolbox v2.0. Nature protocols, 2011.
6(9): p. 1290-1307.
132. Doubet, S. and P. Albersheim, CarbBank. Glycobiology, 1992. 2(6): p. 505.
133. Markowitz, V., et al., IMG/M-HMP: a metagenome comparative analysis
system for the Human Microbiome Project. PloS one, 2012. 7(7).
134. Caporaso, J., et al., QIIME allows analysis of high-throughput community
sequencing data. Nature …, 2010.
135. Glass, E.M., et al., Using the metagenomics RAST server (MG-RAST) for
analyzing shotgun metagenomes. Cold Spring Harb Protoc, 2010. 2010(1): p.
pdb prot5368.
136. Hall, M., et al., The WEKA data mining software: an update. ACM SIGKDD
Explorations Newsletter, 2009. 11(1): p. 10-18.
137. Mahadevan, R., J.S. Edwards, and F.J. Doyle, 3rd, Dynamic flux balance
analysis of diauxic growth in Escherichia coli. Biophys J, 2002. 83(3): p.
1331-40.
138. Wolfram, S., A new kind of science. 2002, Champaign, IL: Wolfram Media.
xiv, 1197 p.
139. Madan Babu, M., S.A. Teichmann, and L. Aravind, Evolutionary dynamics of
prokaryotic transcriptional regulatory networks. J Mol Biol, 2006. 358(2): p.
614-33.
140. Parter, M., N. Kashtan, and U. Alon, Environmental variability and
modularity of bacterial metabolic networks. BMC Evol Biol, 2007. 7: p. 169.
141. Edwards, J.S., R. Ramakrishna, and B.O. Palsson, Characterizing the
metabolic phenotype: a phenotype phase plane analysis. Biotechnol Bioeng,
2002. 77(1): p. 27-36.
142. Varma, A., B.W. Boesch, and B.O. Palsson, Biochemical production
capabilities of Escherichia coli. Biotechnol Bioeng, 1993. 42(1): p. 59-73.
143. Varma, A. and B.O. Palsson, Stoichiometric flux balance models
quantitatively predict growth and metabolic by-product secretion in wild-type
Escherichia coli W3110. Appl Environ Microbiol, 1994. 60(10): p. 3724-31.
144. Thain, D., T. Tannenbaum, and M. Livny, Distributed computing in practice:
the Condor experience. Concurrency - Practice and Experience, 2005. 17(2-4):
p. 323-356.
145. Burgard, A.P., S. Vaidyaraman, and C.D. Maranas, Minimal reaction sets for
Escherichia coli metabolism under different growth requirements and uptake
environments. Biotechnol Prog, 2001. 17(5): p. 791-7.
146. Suthers, P.F., et al., A genome-scale metabolic reconstruction of Mycoplasma
genitalium, iPS189. PLoS Comput Biol, 2009. 5(2): p. e1000285.
147. Phan-Thanh, L. and T. Gormon, A chemically defined minimal medium for
the optimal culture of Listeria. Int J Food Microbiol, 1997. 35(1): p. 91-5.
148. Maglott, D., et al., Entrez Gene: gene-centered information at NCBI. Nucleic
Acids Res, 2005. 33(Database issue): p. D54-8.
149. Oberhardt, M.A., A.K. Chavali, and J.A. Papin, Flux balance analysis:
interrogating genome-scale metabolic networks. Methods Mol Biol, 2009.
500: p. 61-80.
150. Zivkovic, A., et al., Human milk glycobiome and its impact on the infant
gastrointestinal microbiota. Proceedings of the National Academy of
Sciences of the United States of America, 2011. 108 Suppl 1: p. 4653-4658.
The Raymond and Beverly Sackler Faculty of Exact Sciences
The Blavatnik School of Computer Science
Metabolic modeling of a bacterial community reveals new insights
into the relationships within the community and in the organisms themselves
Thesis submitted for the degree of "Doctor of Philosophy"
by
Raphy Zarecki
This work was carried out under the supervision of
Professor Eytan Ruppin
Submitted to the Senate of Tel Aviv University
November 2013
Abstract
Genome-scale metabolic models (GSMMs) have long proven their usefulness in
predicting the behavior of organisms based solely on the genomic knowledge about
them. Despite the many applications in which metabolic models are used, the pace of
building metabolic models has not kept up with the pace of organism sequencing.
This fact, together with the lack of shared identifiers for describing metabolites and
reactions in manually built metabolic models, has so far restricted computational
metabolic research mainly to the study of organisms as isolated entities. Yet it is
known that prokaryotes in general, and bacteria in particular, live in diverse
communities, and that the interactions among community members strongly affect the
functionality of both the individuals and the community. While the metabolic models
available today are adequate for predicting the behavior of homogeneous populations
made up of a single type of organism, most systems in nature require metabolic
modeling of a larger number of organisms in order to understand the mechanism by
which they operate.
Semi-automatic generators of metabolic models, which have appeared recently,
together with thousands of sequenced prokaryotes, now make it possible to build
models of bacterial communities. Although the models produced by the generators are
not of the same quality as some of the manually built models, their existence makes it
possible to answer questions that could not be answered before. The simple task of
comparing a large number of organisms on the basis of their metabolic models, which
was almost impossible in the era of manually built models, has now become
straightforward. Building a model of a bacterial community of any size has likewise
become feasible.
The research presented in this work harnesses the metabolic models produced by the
generators to answer several basic questions related to the modeling of bacterial
communities. The work includes research and development of tools for predicting
metabolic interactions between different types of bacteria in different growth
environments, research on general properties of prokaryotes that can be discovered
only by comparing the metabolic behavior of a large number of organisms, and
research on the interaction between bacterial communities and the organisms that
host them.
This work demonstrates the enormous potential of metabolic modeling of bacterial
communities for advancing the basic understanding of biological processes and for
designing biotechnological systems. Specifically, the work includes: the development
of a computational tool that models bacterial communities on the basis of the
metabolic models of their members, in order to predict competition/cooperation
relationships among the members in different growth environments, together with
experimental and ecological validation of the tool; a new and easily computed method
for predicting the growth rate of bacteria from their metabolic models; and a new
algorithm for predicting the nature of glycan degradation in bacterial communities
residing in the digestive systems of mammals.
General summary
Background
For a long time, computers were used in biology mainly to process biological
information that had been collected and analyzed by biologists. The emergence of the
interdisciplinary science of "systems biology", which examines biological systems
from a holistic perspective and tries to decipher their basic principles in order to
understand and predict their behavior, significantly changed the relationship between
computer science researchers and biology researchers. Within systems biology,
computer scientists, in collaboration with biologists, developed complex
computational models that attempted to simulate biological phenomena at different
levels of detail, and biologists were asked to validate the predictions these models
provided. The holistic approach of building computational models is what separated
"systems biology" from the more established approaches of molecular biology, which
were more reductionist and accounted for most of the biological research carried out
in the second half of the last century [1].
Analysis of metabolic networks
This work belongs to the field of "systems biology". Specifically, it focuses on
modeling metabolic networks in prokaryotes by building models that describe the
entire metabolic network of the entity under study. The method for building metabolic
models has been well defined for some time, and a machine-readable language for
representing the models has been developed for it. This language is called SBML [2].
Tools have also been developed for measuring the fluxes of metabolic reactions and
the concentrations of metabolites, proteins and genes, which together serve to validate
the models being developed [1].
Understanding the metabolic processes in living systems has great economic potential
in industry in general, and in the fields of biomedicine and bioremediation in
particular. In industry, metabolic models are used for food engineering, fuels, and the
production of amino acids (as food supplements). In medicine, metabolism plays a
central role in a large number of diseases. Today diabetes and obesity, both of which
are metabolism-related diseases, are a significant cause of human mortality and
morbidity.
Biological background
The term cellular metabolism refers to the set of chemical transformations, from
substrates to products, that take place inside the cell. Most of the chemical reactions in
the cell are carried out by special proteins called enzymes. Together, the reactions in
the cell form a complex metabolic network containing thousands of reactions.
Mathematical modeling of cellular metabolism
Mathematical modeling of metabolism can be done in several ways.
Kinetic models
Kinetic models are usually represented by a set of differential equations that compute
the time derivative of the metabolite concentrations, taking into account the rates of
the chemical reactions being modeled. These models require knowledge of many
parameters concerning the concentrations and rates of the modeled reactions, and they
are suited mainly to the analysis of a small number of reactions.
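As a minimal illustration of what such a kinetic model looks like, the sketch below integrates a single hypothetical reaction A -> B with forward Euler steps. The mass-action rate law, the rate constant `k` and the function name `simulate_a_to_b` are assumptions chosen for the example; they are not taken from this work.

```python
def simulate_a_to_b(a0, b0, k, dt, steps):
    """Integrate d[A]/dt = -k*[A], d[B]/dt = +k*[A] with forward Euler.

    Returns a list of (time, [A], [B]) tuples, including the initial state.
    """
    a, b = a0, b0
    trajectory = [(0.0, a, b)]
    for step in range(1, steps + 1):
        rate = k * a      # assumed mass-action rate of A -> B
        a -= rate * dt    # A is consumed
        b += rate * dt    # B is produced at the same rate
        trajectory.append((step * dt, a, b))
    return trajectory


# Hypothetical parameters: 1.0 units of A, no B, k = 0.5 per time unit.
trajectory = simulate_a_to_b(a0=1.0, b0=0.0, k=0.5, dt=0.1, steps=50)
final_time, final_a, final_b = trajectory[-1]
```

Even this one-reaction example needs an initial concentration and a rate constant, which hints at why the parameter burden grows quickly for networks with thousands of reactions.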
Constraint-based models (constraint based modeling - CBM)
When modeling metabolic networks that contain a large number of reactions, the use
of kinetic models becomes impractical, and constraint-based modeling is needed
instead. The basis of CBM is that biological cells are subject to physical constraints
that limit their behavior [3]. By enforcing these physical constraints we can identify,
in a large metabolic network, which solutions are feasible and which are not. Today
CBM is used widely and successfully to predict gene essentiality, the rate at which
cells take up metabolites, and the rate at which cells secrete metabolites. In CBM the
metabolic model is represented by a matrix:
$S \in \mathbb{R}^{m \times n}$
where n is the number of reactions in the model and m is the number of metabolites
in the model. Column j of the matrix represents the linear equation of reaction j (for
example, a reaction that consumes metabolite A and produces metabolite B at a 1:1
ratio contains -1 and +1, respectively, in the rows representing these metabolites).
The entry $S_{i,j}$ of the matrix is the stoichiometric coefficient of metabolite i in
reaction j.
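The matrix representation can be made concrete with a small sketch. The three-reaction toy network below (uptake of A, conversion of A to B, secretion of B) and the helper name `build_stoichiometric_matrix` are hypothetical and serve only to illustrate the row and column convention described above.

```python
def build_stoichiometric_matrix(metabolites, reactions):
    """Return S as m rows (metabolites) by n columns (reactions)."""
    row_of = {name: i for i, name in enumerate(metabolites)}
    s = [[0.0] * len(reactions) for _ in metabolites]
    for j, stoichiometry in enumerate(reactions):
        for metabolite, coefficient in stoichiometry.items():
            # Negative coefficients are consumed, positive ones produced.
            s[row_of[metabolite]][j] = coefficient
    return s


METABOLITES = ["A", "B"]
REACTIONS = [
    {"A": 1.0},             # column 0: uptake,     -> A
    {"A": -1.0, "B": 1.0},  # column 1: conversion, A -> B at a 1:1 ratio
    {"B": -1.0},            # column 2: secretion,  B ->
]

S = build_stoichiometric_matrix(METABOLITES, REACTIONS)
conversion_column = [row[1] for row in S]  # [-1.0, 1.0]
```

Here m = 2 and n = 3, and the conversion column holds -1 in the row of A and +1 in the row of B, matching the example in the text.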
Using linear programming applied to the matrix representation of the metabolic
model, it is possible to predict feasible values for the fluxes through the reactions that
define the metabolic network. Biological constraints imposed on the matrix then make
it possible to narrow the space of feasible solutions. Among the existing constraints
are:
- A constraint that prevents the accumulation of metabolites inside the cell,
S · V = 0, where S is the matrix and V is the vector of fluxes of the reactions in the
matrix
- Thermodynamic constraints that dictate the direction of the various reactions and
maximum values for the fluxes
- A minimal medium for growing the organism
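The constraints listed above can be sketched as a feasibility check on a candidate flux vector. The toy matrix, the bounds and the function name `is_feasible` are assumptions made for illustration; a linear programming solver would search this feasible set for an optimum, whereas the sketch only tests membership in it.

```python
def is_feasible(s, v, lower, upper, tol=1e-9):
    """Check S·v = 0 and lower <= v <= upper for a candidate flux vector v."""
    # Steady state: no metabolite accumulates inside the cell.
    for row in s:
        if abs(sum(coeff * flux for coeff, flux in zip(row, v))) > tol:
            return False
    # Direction and capacity bounds on every reaction.
    return all(lo - tol <= flux <= hi + tol
               for flux, lo, hi in zip(v, lower, upper))


# Hypothetical toy network: uptake of A, A -> B, secretion of B.
S = [
    [1.0, -1.0, 0.0],   # metabolite A
    [0.0, 1.0, -1.0],   # metabolite B
]
LOWER = [0.0, 0.0, 0.0]         # all three reactions assumed irreversible
UPPER = [10.0, 1000.0, 1000.0]  # medium assumed to supply at most 10 units of A

balanced = is_feasible(S, [5.0, 5.0, 5.0], LOWER, UPPER)
accumulating = is_feasible(S, [5.0, 4.0, 4.0], LOWER, UPPER)
over_uptake = is_feasible(S, [20.0, 20.0, 20.0], LOWER, UPPER)
reversed_flux = is_feasible(S, [-1.0, -1.0, -1.0], LOWER, UPPER)
```

In this sketch the medium is expressed as an upper bound on the uptake reaction, which is one common way to encode growth conditions; other encodings are possible.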
The process of building metabolic models
The process of building a metabolic model is based on several sources of information,
including genomic, biochemical and physiological information. The process includes
defining biochemical functions that represent reactions (there are more than 20,000 of
these in the GO database [4]) and associating them with enzymes (proteins) produced
by genes. The gene-protein-reaction (GPR) triplet is the building block of metabolic
models. The full set of triplets representing the metabolic genes of an organism forms
the core of its metabolic model. Well-known GPR databases are KEGG [5], MetaCyc
[6] and The SEED [7]. There is a well-known, well-defined iterative protocol for
building metabolic models [8]. Until a few years ago, metabolic models were built
exclusively by hand. This was lengthy work that took about 1-2 years.
Metabolic models of bacteria
Prokaryotes are single-celled organisms that lack a nucleus. They consist mostly of
the bacteria and the archaea. They play a central role in the world's ecosystem. They
were the first creatures to arise, and even today their biomass exceeds that of plants
and animals combined. The human body contains 10 times more bacterial cells than
cells of its own. Bacteria are critical to the cycling of nutrients. Although some
bacteria are pathogenic, most can assist in many applications, from the production of
food and fuels to cleaning up toxins and degrading harmful substances. Bacteria live
in large and diverse communities. Understanding the relationships among the bacteria
that make up these communities, and between them and the organism that contains
them (such as the gut bacteria in the mammalian digestive system), is critical for
understanding their biology and that of the host, and for finding efficient ways to build
bacterial communities for dedicated purposes.
From a research standpoint, prokaryotes are relatively simple compared with
eukaryotes, and they have therefore been studied in depth at the level of both
phenotype and genotype. The extensive knowledge accumulated about them led to the
development of a metabolic model generator for them, on which this work is based.
The leading generator, called Model SEED [9], is a Web-based application that
enables automatic building of models. This generator implements many of the steps
required for building a metabolic model [8]. With it, a metabolic model of a bacterium
can now be produced in less than a day.
Large-scale metabolic modeling of bacteria
Most of the research done today in the field of metabolic models is at the level of the
single organism. Now, with the appearance of model generators for prokaryotes and
the great interest in metagenomic information, the time seems right to begin studying
the behavior of bacterial communities with the help of metabolic models. This work
does exactly that.
Metabolic modeling of prokaryotic communities
As noted, most of the research done today on metabolic models of prokaryotes is at
the level of the single organism, and it assumes that the different species live as
isolated entities that do not interact with other entities. In reality, however, prokaryotes
live in dense and diverse communities whose members interact strongly. These
relationships, together with the interaction with the environment in which the
community operates, affect the functionality, survival and capabilities of the
community as a whole. The community itself can be viewed as a multicellular entity
with capabilities and goals of its own.
While models of specific organisms are sufficient for making predictions about how
the organisms behave in pure culture, making predictions about communities requires
other models, which can simulate the interactions among the different organisms and
between the organisms and the environment. Several basic attempts to build such
models have been made in the past [10, 11], but in those works the research was
carried out mainly for specific pairs of organisms, and no large-scale study was made
of the interactions between different types of prokaryotes. The work done in Chapter 2
presents a computational platform and methods for building models of communities
of any size, in which the building blocks of the community models are the automatic
models built by the SEED metabolic model generator.
Summary of the chapters in this work
Competition and cooperation in bacterial communities
Based on the paper:
Competitive and cooperative metabolic interactions in bacterial communities
Shiri Freilich, Raphy Zarecki, Omer Eilam, Ella Shtifman Segal, Christopher S. Henry, Martin Kupiec, Uri Gophna, Roded Sharan & Eytan Ruppin
The first two authors contributed equally to this paper.
The paper was published in Nature Communications on 13 December 2011 [12].
It is known that prokaryotes live and thrive in dense and diverse communities, and
that the relationships among community members, as well as the interaction of the
prokaryotes with the external environment, substantially affect the capabilities of the
community. To this day it is hard to predict whether a given bacterium can coexist or
cooperate metabolically with other bacteria in the same community. This makes the
artificial design of bacterial communities for biotechnological and medical purposes
very difficult.
In Chapter 2 we focus on the metabolic relationships between different types of
bacteria. We propose a computational system that relies exclusively on the metabolic
dimension, builds a metabolic model of a bacterial community, and uses this model to
try to predict whether different bacteria are able to cooperate or instead compete with
each other in different growth environments. The method is generic and allows any
number of different bacteria living together to be modeled.
As part of validating the tool's predictions, we computationally examined the
relationships between 118 pairs of bacteria whose models were built by a metabolic
model generator [9] and whose quality has been measured and published. The results
were compared with ecological data derived from metagenomic information. They
showed a strong correlation between bacteria that can cooperate with each other and
the presence of those same bacteria in the same ecological environments. The tool was
also used to predict the outcome of competition between known bacteria in a given
growth environment, and to predict growth environments in which particular bacteria
cooperate and others in which they compete. Laboratory experiments were then
carried out to validate the results. These experiments showed a high, though not
perfect, agreement between the predicted results and the experimental results.
The methods and the tool that was developed have limitations that reduce their
accuracy. The tool uses only the metabolic dimension of the community and ignores
regulatory activity that can affect the metabolic function of a bacterium. The tool also
ignores strategies, such as toxins, that many microorganisms have developed in order
to win the competition with other microorganisms. The tool relies on metabolic
models produced by a model generator. These models are still not as accurate as some
of the manually developed models, and this too harms the accuracy of the predictions.
Yet despite all the limitations noted above, we see a clear signal in the results that the
system predicts.
The quality of the tool's results is expected to improve with the ongoing
improvements in the quality of the metabolic models produced by the model
generators, and with better knowledge of the abiotic structure of the various growth
environments. We expect the tool to serve as an important infrastructure for designing
dedicated bacterial communities: for biotechnological needs, to produce materials; for
medical needs, where it will be possible to find bacteria that compete with pathogenic
bacteria and make it hard for them to thrive; and for detoxification in various
ecological environments. We see community engineering as a partial alternative to
genetic engineering at the level of the single bacterium, which is in wide use these
days and which has many limitations when one tries to make many changes to the
genomic structure of a given bacterium.
Predicting the maximal entropy of oxygen-consuming microorganisms in a given
environment provides a good measure of their growth rate
Based on the paper:
Maximal Sum of metabolic exchange fluxes outperforms biomass yield as a predictor of growth rate of microorganisms
Raphy Zarecki, Matthew A. Oberhardt, Keren Yizhak, Allon Wagner, Ella Shtifman Segal, Shiri Freilich, Christopher S. Henry, Uri Gophna and Eytan Ruppin
The first two authors contributed equally to this paper.
The paper was submitted for publication to PLoS Computational Biology and was also presented at the conference:
Predicting cell metabolism and phenotypes (CA, USA 4-6/3/2013)
The paper is an example of using a large number of metabolic models to identify
hidden properties of organisms.
In Chapter 3 we attempt to explain and predict the growth rate of organisms by comparing their metabolic models. Understanding the principles that determine the growth rate of cells is a highly important question with many implications for medicine and biotechnology. We argue that prokaryotes exploit a large part of the entropic potential stored in the nutrients they take up for growth, and that computing the one makes it possible to predict the other. We present formulas based on the laws of thermodynamics, and a simpler formula called SUMEX (SUM of Exchanges), which maximizes the difference between the fluxes of the metabolites that the organism secretes and those that it takes up, as measures for predicting growth rate. The significant advantage of these formulas is that they do not use empirical data on reaction flux rates (most of which are unknown), and the simpler formula also does not require data on the Gibbs free energy values of the various metabolites.
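One way to write down the SUMEX objective described above is sketched here. The notation is an assumption taken from standard flux balance analysis and is not spelled out in this summary: S is the stoichiometric matrix, v the flux vector, E the set of exchange reactions, and v^min, v^max the flux bounds. The sketch also assumes the sign convention that secretion fluxes are positive and uptake fluxes are negative, so that the sum over exchange fluxes equals total secretion minus total uptake.

```latex
\max_{v}\ \mathrm{SUMEX}(v) \;=\; \sum_{i \in E} v_i
  \;=\; \sum_{i \in E,\ v_i > 0} v_i \;-\; \sum_{i \in E,\ v_i < 0} \lvert v_i \rvert
\qquad \text{subject to} \qquad
S\,v = 0, \qquad v^{\min} \le v \le v^{\max}
```

Under these assumptions no measured flux rates and no Gibbs free energies appear in the objective, which is the property emphasized in the text.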
By analyzing biological data on the growth rates of a large number of bacteria and cancer cells, and through experiments we performed in the laboratory, we showed that the proposed method outperformed all the metabolic-model-based computational methods in use today. These include the measure that computes the flux through the organism's biomass reaction, which serves as the leading predictor of growth rate even though it computes the growth yield rather than the growth rate of the organism [13]. The results of our experiments and tests show the importance of reactions that exploit the proton gradient in enabling growth in aerobic organisms.
Beyond the importance of this study for understanding biological processes, its results can be used in biotechnological applications, for example to determine the optimal growth environment for bacteria intended to produce desired materials. In medicine, it can recommend a diet that significantly reduces the growth rate of pathogens and even of cancer cells.
Using the glycan metabolism of the mammalian gut bacterial community predicts the bacterial species in the gut and reveals community adaptations related to the mammals' main type of diet
Based on the paper:
Glycan metabolism of the mammalian gut microbiota predicts bacterial species abundance and reveals diet-specific adaptations
Omer Eilam, Raphy Zarecki, Matthew Oberhardt, Martin Kupiec, Uri Gophna & Eytan Ruppin
In this paper the first two authors contributed equally.
The paper was submitted for publication to Nature Methods and was also presented at the conference:
Exploring human host-microbiome interactions in health and disease
(8-10.5.2012 Cambridge UK)
The paper is an example of using metabolic models to analyze the bacterial community in the mammalian digestive system.
Glycans are the main nutrient source of the bacteria in mammalian digestive systems, so understanding the glycan metabolism of these bacteria is critical for studying this bacterial community. Until now, the metabolic models of bacteria have not accounted for glycans. Chapter 4 of this work presents a novel and complex computational framework for adding glycan-degrading reactions to the metabolic models of the bacteria and for analyzing glycan utilization capacity in mammalian digestive systems, based on metagenomic data taken from the digestive systems of the hosts of those communities. The computational framework is an important extension of the existing metabolic model generators. It predicts, for the first time, the ability to digest thousands of glycans by the hundreds of bacteria that were mapped by the Human Microbiome Project (HMP) and for which a metabolic model exists, and it provides a new metabolic view of this community.
Using the framework we developed, we show that the ability of a bacterium to degrade glycans of the polysaccharide type is strongly correlated with its abundance, which indicates that the ability to degrade glycans is a selective advantage for those bacteria. We use glycan degradation capacity to explain the differences between the bacterial communities found in herbivores and in carnivores. Using the framework, we built a classifier that could identify with high accuracy the diet type of a mammal from the glycans degraded by the bacterial community in its digestive system, and that did so much better than classifiers that used only the metagenomic information of the gut bacterial community. When we applied the same classifier to samples of gut bacterial communities from American citizens, we found that most of the bacterial communities show a preference for degrading glycans of animal origin, in contrast to gut bacterial communities from citizens of Venezuela, in which a preference for degrading glycans of plant origin was found.
The computational platform presented here opens the door to probiotic and prebiotic predictions that can affect our health.
Analysis of the results
At the heart of the "systems biology" approach is the idea that biological systems are interconnected, and that studying and examining each component separately is not enough to understand the overall complexity of the system; a holistic approach is required to understand the biology of the system. Cell-level metabolic models (GSSM) are an implementation of the holistic approach represented by "systems biology" at the level of the single cell. Metabolic models have proven themselves as a tool that provides predictions of organism behavior in many biotechnological and medical processes and in detoxification systems. The great effort required to build metabolic models of bacteria manually, together with the lack of a uniform standard for the names of metabolites and reactions in the resulting models, restricted research with metabolic models to analyzing individual bacteria acting alone or in pure cultures. It is known, however, that prokaryotes live and thrive in dense and diverse communities, and that the relationships among the community members, and between them and the environment, strongly affect the activity of the community as a whole and of the individual organisms in particular. An infrastructure was therefore needed that makes it possible to model the activity of communities of organisms, in order to move up a level beyond the single cell.
The emergence of metabolic model generators for prokaryotes, which supplied metabolic models for thousands of organisms while keeping a uniform naming convention for the metabolites and reactions that make up the models, made it possible to take the required step up the "holistic ladder" from the level of the single genome (Genome Scale Metabolic Model: GSSM) to the level of the community (Community genome-scale metabolic model: C-GSSM). Although the automatically generated models are not yet of the same quality as some of the manually generated models, they make it possible to ask questions at the community level that could not be asked before. The main part of this work was to provide computational answers to some of the questions that the very existence of these prokaryote models made it possible to ask.
Answering different types of questions using metabolic models of thousands of prokaryotes
In this work I tried to answer examples of different types of questions about bacterial communities, questions that can now be answered thanks to the large number of prokaryote metabolic models that exist.
The first group of questions dealt with relationships of competition and cooperation between different types of bacteria in different growth environments. In Chapter 2 a large-scale study (the largest to date) was carried out on the interactions between different pairs of bacteria.
The second group of questions dealt with comparing the metabolic models of different prokaryotes and with the traits that can be learned from this comparison. In Chapter 3 we found a way to predict the growth rate of prokaryotes in a given growth environment based on thermodynamic principles. The method turned out to be much more accurate than previous methods for predicting growth rate that use metabolic models. It gave results similar to those obtained by kinetic methods, which required more knowledge about parameters of the growth environment and were much more complicated to compute.
The third group of questions dealt with the relationships between bacterial communities and their host organism. In Chapter 4 we dealt with the relationships between the bacterial community in the mammalian digestive system and its host.
Future directions
This work contains several groundbreaking studies, made possible by the availability of the thousands of prokaryote metabolic models created by the metabolic model generators. The work has only scratched the surface of the space of questions that these models make it possible to answer. Chapters 2-4 present suggestions for extensions within the specific research questions they tried to answer, but in this section I will try to present several additional topics that I think are worth investigating, based on the research done in this work and on the existing metabolic models.
New methods for simulating communities
In this work we focused on modeling communities at steady state in a chemostat. There are additional methods that use metabolic models, which today are used only for modeling single organisms and which can be extended to model communities. One of these methods drops the requirement of modeling at steady state; it is called dynamic flux balance analysis (dynamic FBA) and is described in [14]. This method is an application of algorithms from the cellular automata family [15]. In this method we introduce the element of time, and the changes in the environment, into the metabolic computation. The main problem with this approach is the large number of possible parameters that may affect the activity of the simulated system; nevertheless, this approach requires far fewer parameters than are needed to build a full dynamic simulation of communities. In my opinion the "chemostat" assumption is not reasonable when working with complex environments such as communities, and work with algorithms from the cellular automata family is therefore required.
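The idea of bringing time and a changing environment into the metabolic computation can be illustrated with a minimal sketch of a dynamic-FBA-style loop for two species sharing one substrate. Everything in it is an assumption made for illustration and does not come from this work: the species names, the kinetic parameters, the Michaelis-Menten uptake bound, and above all the `toy_fba` function, which replaces the linear program solved on a genome-scale model by the closed-form optimum of a one-substrate toy model.

```python
def uptake_bound(substrate, vmax, km):
    """Michaelis-Menten upper bound on substrate uptake (hypothetical kinetics)."""
    return vmax * substrate / (km + substrate)


def toy_fba(max_uptake, biomass_yield):
    """Stand-in for the FBA step: maximize growth subject to the uptake bound.

    In this one-substrate toy model the optimum sits at the bound, so the
    linear program has a closed-form solution. A real model would call an
    LP solver here. Returns (growth_rate, uptake_flux).
    """
    return biomass_yield * max_uptake, max_uptake


def simulate(species, substrate, dt=0.1, steps=200):
    """Euler-integrate the biomass of each species and the shared substrate."""
    biomass = {name: p["x0"] for name, p in species.items()}
    history = [(0.0, substrate, dict(biomass))]
    for step in range(1, steps + 1):
        # 1. Re-solve the (toy) optimization for the current environment.
        fluxes = {}
        for name, p in species.items():
            bound = uptake_bound(substrate, p["vmax"], p["km"])
            fluxes[name] = toy_fba(bound, p["yield"])
        # 2. Scale consumption down if the species demand more than is left.
        demand = sum(u * biomass[n] * dt for n, (_, u) in fluxes.items())
        scale = min(1.0, substrate / demand) if demand > 0 else 0.0
        # 3. Update biomass and the shared environment.
        for name, (mu, _) in fluxes.items():
            biomass[name] += scale * mu * biomass[name] * dt
        substrate = max(0.0, substrate - scale * demand)
        history.append((step * dt, substrate, dict(biomass)))
    return history


# Hypothetical parameters, chosen only to make the example run.
species = {
    "fast": {"x0": 0.01, "vmax": 10.0, "km": 0.5, "yield": 0.1},
    "slow": {"x0": 0.01, "vmax": 6.0, "km": 0.5, "yield": 0.1},
}
history = simulate(species, substrate=10.0)
t_end, s_end, x_end = history[-1]
print(f"t={t_end:.1f} h  substrate={s_end:.3f}  biomass={x_end}")
```

The only parameters beyond the metabolic model itself are the uptake kinetics and the initial conditions, which is the sense in which this approach needs fewer parameters than a full dynamic simulation.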
Building dedicated communities
The ability to design optimal structures of bacterial communities for various purposes opens up many opportunities in the biotechnology of material production, in toxin degradation, and in creating communities that promote the thriving of beneficial bacteria and competition against pathogens.
Using bacterial communities for these purposes can be an alternative to the complex operations of removing and inserting genes in individual bacteria that are carried out today as part of genetic engineering.
The methods and tools presented in this work make it possible to design dedicated bacterial communities for required purposes.
Examples of prokaryote traits that can be learned from a joint analysis of their metabolic models
In Chapter 3 of this work we examined the growth-rate trait of prokaryotes using their metabolic models. There are many more questions that can be investigated with the same models. Some examples of such questions follow:
Is there diffusion of metabolic processes along the phylogenetic tree?
Is there a correlation between the phylogenetic distance of prokaryotes and their growth environment?
Are there metabolic components that play a role in determining the pH sensitivity of prokaryotes?
Today the existing metabolic models are not accurate enough to give a precise answer at the level of the single organism, but when a large number of models are analyzed, some traits can provide a clear signal in aggregate and lead to new insights and discoveries, as happened in Chapter 3.
Investigating the relationships between specific bacterial communities and their host
In Chapter 4 we began to investigate the community of bacteria found in the digestive system of mammals in general and of humans in particular. There are many bacterial communities in nature that operate inside a host; some are associated with diseases and pathogens, and others with parasitic and symbiotic relationships.
Another example that I believe is important is the design of bacterial communities that can serve as plant fertilizers or as plant pesticides. The importance of research in this direction needs no explanation at a time when the world population is growing rapidly and there are doubts about the ability to meet its food demands.
Summary
The emergence of metabolic model generators for prokaryotes, and with them the appearance of thousands of metabolic models for different prokaryotes, opened a new subfield within metabolic modeling research. Comparing the metabolic models of different organisms and building models of prokaryote communities (C-GSSMs) shows high potential for answering questions about the behavior of prokaryotes in nature and for designing communities for dedicated purposes, as the results of this work show, even though it has only scratched the surface of this subfield.
The results presented in this work are very encouraging, but it is clear that the results and the methods presented are partial and limited, mainly because they focus only on the metabolic aspect of the relationships between organisms, ignoring regulation at the level of the single cell and specialization at the level of the community. It is also expected that the "steady state" assumption, which was fundamental to this work, will be dropped as more dynamic models come into use in community modeling.
Opening a new sub-branch of research is always fascinating, and I consider myself fortunate to take part in it.
Bibliography for the Hebrew summary
1. Kell, D.B., Systems biology, metabolic modelling and metabolomics in drug discovery and development. Drug Discov Today, 2006. 11(23-24): p. 1085-92.
2. Hucka, M., et al., The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics, 2003. 19(4): p. 524-31.
3. Price, N.D., J.L. Reed, and B.O. Palsson, Genome-scale models of microbial cells: evaluating the consequences of constraints. Nat Rev Microbiol, 2004. 2(11): p. 886-97.
4. Ashburner, M., et al., Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 2000. 25(1): p. 25-9.
5. Kanehisa, M., et al., The KEGG resource for deciphering the genome. Nucleic Acids Res, 2004. 32(Database issue): p. D277-80.
6. Karp, P.D., et al., The MetaCyc Database. Nucleic Acids Res, 2002. 30(1): p. 59-61.
7. Aziz, R.K., et al., SEED servers: high-performance access to the SEED genomes, annotations, and metabolic models. PLoS One, 2012. 7(10): p. e48053.
8. Thiele, I. and B.O. Palsson, A protocol for generating a high-quality genome-scale metabolic reconstruction. Nat Protoc, 2010. 5(1): p. 93-121.
9. Henry, C.S., et al., High-throughput generation, optimization and analysis of genome-scale metabolic models. Nat Biotechnol, 2010. 28(9): p. 977-82.
10. Stolyar, S., et al., Metabolic modeling of a mutualistic microbial community. Mol Syst Biol, 2007. 3: p. 92.
11. Wintermute, E.H. and P.A. Silver, Emergent cooperation in microbial metabolism. Mol Syst Biol, 2010. 6: p. 407.
12. Freilich, S., et al., Competitive and cooperative metabolic interactions in bacterial communities. Nat Commun, 2011. 2: p. 589.
13. Adadi, R., et al., Prediction of microbial growth rate versus biomass yield by a metabolic network with kinetic parameters. PLoS Comput Biol, 2012. 8(7): p. e1002575.
14. Mahadevan, R., J.S. Edwards, and F.J. Doyle, 3rd, Dynamic flux balance analysis of diauxic growth in Escherichia coli. Biophys J, 2002. 83(3): p. 1331-40.
15. Wolfram, S., A new kind of science. 2002, Champaign, IL: Wolfram Media. xiv, 1197 p.