Applying graph theoretic approaches to microbial metagenomes: Ecological perspectives on function
Wolfgang Hankeln1,2, Pier Luigi Buttigieg1,2, Ivaylo Kostadinov1,2, Renzo Kottmann1, Pelin Yilmaz1,2, Melissa Beth Duhaime1,2, Frank Oliver Glöckner1,2
1 Max Planck Institute for Marine Microbiology, Celsiusstraße 1, 28359 Bremen, Germany
2 Jacobs University Bremen gGmbH, Campus Ring 1, 28759 Bremen, Germany
ABSTRACT Graph theoretic approaches offer a powerful means to visualise and examine data arising from microbial environmental metagenomes. Here, inter-correlations of protein domains with no known function detected in marine microbial metagenomes were visualised as networks. The topology of these networks combined with putative functional assignments of some participant protein domains allowed the generation of hypotheses concerning the ecological functions these networks may reveal. These methods may be employed to offer an ecological perspective in future attempts to describe the function of uncharacterised protein domains.
Categories and Subject Descriptors J.3 [LIFE AND MEDICAL SCIENCES]: Visualization of environmental metagenomic data – protein network analysis in metagenomic datasets
General Terms Algorithms, Measurement, Experimentation, Theory.
Keywords Ecology, Ecological Genomics, Unknown Function, Function Inference.
1. INTRODUCTION
Ecological communities comprise a range of organisms and populations occupying intricately interwoven niches in a given environment. These communities are constantly under external and internal pressures to maintain functions that contribute to their survival and fitness while dispensing with those that hinder such success. As a result of this selective pressure, genes contributing to common phenotypes are often co-selected across environments [1,2]. Particularly in microbial studies, datasets emerging from environmental genomics (or metagenomics) offer unique perspectives on the association of genomic elements across environments and ecosystems. The Global Ocean Sampling (GOS) expedition [3] metagenomes describe globally distributed marine microbial communities and offer an excellent opportunity to detect ecologically preserved functional modules. This project investigated the distribution of protein domains with unknown function across the GOS metagenomes and used graph theoretic approaches to detect modular structure from which functional hypotheses were derived.
2. METHODS
A collection of 10,133,846 unassembled reads from the GOS sampling sites GS000a-GS023, GS025-GS051, GS108a-GS117b, GS119-GS123, GS148-GS149, and MOVE858 was downloaded from the CAMERA web portal [4] and queried with all hidden Markov models (HMMs) belonging to protein domains of unknown function present in the PFAM 23.0 pfam_fs database [5] using a single TimeLogic DeCypher card (Active Motif, Inc., Carlsbad, CA). Hits with an upper E-value threshold of 1 × 10⁻¹⁰ were used in further analysis.
Results were stored in MegDB [6] and converted to a “sites × domain” matrix in which the abundance of each domain present at a given site was recorded. This matrix was then imported into the R statistical computing environment [7].
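The conversion of filtered hits into a “sites × domain” abundance matrix can be sketched as follows. This is a minimal illustration in Python rather than the MegDB and R workflow used in this study; the function name `build_abundance_matrix`, the `(site, domain, E-value)` tuple layout, the domain labels, and the inclusive treatment of the threshold are all assumptions made for the example.

```python
from collections import Counter

# Upper E-value threshold stated in the text; treating it as inclusive is an assumption.
E_VALUE_MAX = 1e-10


def build_abundance_matrix(hits):
    """Count hits per (site, domain) pair and return (sites, domains, matrix).

    `hits` is an iterable of (site, domain, evalue) tuples (hypothetical layout).
    matrix[i][j] is the number of retained hits of domains[j] at sites[i].
    Sites or domains with no retained hit do not appear in the output.
    """
    counts = Counter(
        (site, domain) for site, domain, evalue in hits if evalue <= E_VALUE_MAX
    )
    sites = sorted({site for site, _ in counts})
    domains = sorted({domain for _, domain in counts})
    matrix = [[counts[(s, d)] for d in domains] for s in sites]
    return sites, domains, matrix


# Toy input: the site names follow the GOS naming pattern, the domain labels are made up.
example_hits = [
    ("GS000a", "DUF_A", 1e-20),
    ("GS000a", "DUF_A", 1e-12),
    ("GS000a", "DUF_B", 1e-5),   # discarded: E-value above the threshold
    ("GS001", "DUF_B", 1e-30),
]
sites, domains, matrix = build_abundance_matrix(example_hits)
```

Each column of the resulting matrix is the abundance profile of one domain across sites, which is the unit compared in the correlation step.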
The correlation between each pair of domains detected in the GOS dataset was determined using Spearman’s rho statistic with the rcorr() function from the package Hmisc [8].
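For readers unfamiliar with the statistic, Spearman’s rho is the Pearson correlation of the rank-transformed abundance profiles, with tied values receiving their average rank. The sketch below is a plain Python illustration of that definition, not the Hmisc implementation; the helper names are hypothetical, and the p-values that rcorr() also reports are omitted here.

```python
import math


def average_ranks(values):
    """Return 1-based ranks of `values`, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient; NaN if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rho between two abundance profiles of equal length."""
    return pearson(average_ranks(x), average_ranks(y))


# Two made-up domain profiles across five sites; they rise together monotonically.
rho = spearman_rho([1, 2, 3, 4, 5], [1, 4, 9, 16, 25])
```

Because only ranks enter the calculation, any monotonic relationship between two abundance profiles yields a rho of magnitude 1, regardless of its shape.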
These correlations were used to generate adjacency matrices in which vertices were connected by an edge if their correlation across sampling sites exceeded a rho threshold of 0.80 (not shown) or 0.90, with p-values lower than 0.01. Distances (1 - rho) were assigned to all edges. Graphs were generated with the igraph [9] R package. The igraph function layout.fruchterman.reingold() was applied (default settings) to embed graphs using force-directed placement after the Fruchterman-Reingold algorithm [10].
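The thresholding step can be illustrated with a short Python sketch. It assumes a symmetric correlation matrix and, optionally, a matching matrix of p-values, both indexed by domain; the function name `build_edges` and the toy values are hypothetical, and the force-directed layout performed in igraph is not reproduced.

```python
def build_edges(rho, p_values=None, rho_min=0.90, alpha=0.01):
    """Return {(i, j): distance} for vertex pairs that pass both thresholds.

    `rho` and `p_values` are symmetric square matrices (lists of lists).
    An edge is kept when rho[i][j] > rho_min and, if p-values are supplied,
    p_values[i][j] < alpha. The edge weight is the distance 1 - rho.
    """
    n = len(rho)
    edges = {}
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only: the graph is undirected
            significant = p_values is None or p_values[i][j] < alpha
            if rho[i][j] > rho_min and significant:
                edges[(i, j)] = 1.0 - rho[i][j]
    return edges


# Toy 3-domain example with made-up correlations and p-values.
rho = [
    [1.00, 0.95, 0.50],
    [0.95, 1.00, 0.91],
    [0.50, 0.91, 1.00],
]
p_values = [
    [0.000, 0.001, 0.500],
    [0.001, 0.000, 0.020],
    [0.500, 0.020, 0.000],
]
edges = build_edges(rho, p_values)  # only the pair (0, 1) passes both thresholds
```

Using 1 - rho as the weight means that the most strongly correlated domains are joined by the shortest edges, which is what the subsequent minimum spanning tree step exploits.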
Minimum spanning trees (MSTs) were calculated for each graph with Prim’s algorithm for weighted graphs [11] using the igraph function minimum.spanning.tree().
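As an illustration of the technique, the following Python sketch applies Prim’s algorithm to a weighted, possibly disconnected graph and returns a minimum spanning forest (one tree per connected component). It is not the igraph implementation; the function name, the integer vertex numbering, and the edge dictionary layout are assumptions for the example.

```python
import heapq


def minimum_spanning_forest(n_vertices, edges):
    """Prim's algorithm, restarted for every unvisited component.

    `edges` maps (u, v) pairs of vertex indices to a non-negative weight,
    here the distance 1 - rho. Returns a list of (u, v, weight) tree edges.
    """
    adjacency = {v: [] for v in range(n_vertices)}
    for (u, v), weight in edges.items():
        adjacency[u].append((weight, v))
        adjacency[v].append((weight, u))

    visited = set()
    forest = []
    for root in range(n_vertices):
        if root in visited:
            continue
        visited.add(root)
        # Candidate edges leaving the tree grown so far, ordered by weight.
        heap = [(weight, root, v) for weight, v in adjacency[root]]
        heapq.heapify(heap)
        while heap:
            weight, u, v = heapq.heappop(heap)
            if v in visited:
                continue  # this edge would close a cycle
            visited.add(v)
            forest.append((u, v, weight))
            for next_weight, nxt in adjacency[v]:
                if nxt not in visited:
                    heapq.heappush(heap, (next_weight, v, nxt))
    return forest


# Toy graph: a triangle (0, 1, 2) plus a separate pair (3, 4).
toy_edges = {(0, 1): 0.05, (1, 2): 0.09, (0, 2): 0.02, (3, 4): 0.01}
forest = minimum_spanning_forest(5, toy_edges)
```

In the toy graph the heaviest edge of the triangle, (1, 2), is dropped, so only the shortest distances, that is the strongest correlations, remain in each tree.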
PFAM domain descriptions indicating potential functions were used, when possible, to guide the assignment of domains into broad functional categories outlined in the SEED project [12]. These assignments were then used as vertex attributes and represented as colours in the network graphs and MSTs.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ACM-BCB 2010, Niagara Falls, NY, USA. Copyright 2010 ACM ISBN 978-1-4503-0438-2... $10.00.
Clusters of neighbouring vertices were considered to be putative ecological functional modules. Hypotheses of the corresponding functions were then generated based on the placement and connectivity of vertices representing categorised domains. Vertices corresponding to uncategorised domains within or connected to putative modules were considered to participate in the hypothesised function.
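One simple way to make this step concrete is to treat each connected component of the thresholded graph as a putative module and to list, per module, the categories already assigned and the vertices still uncategorised. The Python sketch below does this; equating "clusters of neighbouring vertices" with connected components is an assumption of the example, as are the function names and the toy category assignments.

```python
from collections import deque


def connected_components(n_vertices, edges):
    """Group vertices 0..n_vertices-1 into connected components using BFS."""
    neighbours = {v: set() for v in range(n_vertices)}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    seen, components = set(), []
    for start in range(n_vertices):
        if start in seen:
            continue
        seen.add(start)
        queue, members = deque([start]), []
        while queue:
            v = queue.popleft()
            members.append(v)
            for w in sorted(neighbours[v]):
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(sorted(members))
    return components


def summarise_modules(components, category):
    """Describe each multi-vertex component as a putative module.

    `category` maps a vertex to its functional category; vertices absent
    from the mapping are treated as uncategorised and are listed as
    candidates for the function hypothesised for their module.
    """
    summaries = []
    for members in components:
        if len(members) < 2:
            continue  # isolated vertices are not treated as modules here
        known = sorted({category[v] for v in members if v in category})
        unknown = [v for v in members if v not in category]
        summaries.append(
            {"members": members, "categories": known, "uncategorised": unknown}
        )
    return summaries


# Toy graph with six vertices; the category assignments are made up.
toy_edges = [(0, 1), (1, 2), (3, 4)]
toy_category = {0: "Photosynthesis", 2: "Regulation and Cell Signaling"}
modules = summarise_modules(connected_components(6, toy_edges), toy_category)
```

In the toy example, vertex 1 is uncategorised but sits in a module containing Photosynthesis and regulatory domains, so it would be a candidate for participation in that hypothesised function.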
[Figure 1 graphic: vertex labels of the network graph are not reproduced here. Colour legend entries recovered from this part of the figure: Unknown; Photosynthesis; Respiration; Regulation and Cell Signaling; Cell Division and Cell Cycle.]
3. RESULTS & DISCUSSION
The Spearman’s rank correlation (rho > 0.90, p < 0.01) between functionally uncharacterised protein domains detected across the GOS metagenomes is visualised as a Fruchterman-Reingold embedded network in Figure 1A. In some cases, domains were assigned putative functional roles represented by the colour scheme in Figure 1. Five sub-networks were noted. Sub-network I comprises domains with a range of putative functions, prominently: Respiration, Cell Division and Cell Cycle, Regulation and Cell Signalling, and Membrane Transport. Together, these assignments suggest a functional constellation suitable for a general response to changing nutrient availability in the ecosystem. Sub-network II is almost entirely composed of domains putatively linked to Photosynthesis, with one domain assigned to Regulation and Cell Signalling. This suggests a common function tied to primary productivity, most likely under the regulation of the domain represented by vertex 1662. Sub-network III features domains known to co-occur, including one with putative functionality in Secondary Metabolism. Sub-networks IV and V each contain two domains, with one of the former’s putatively associated with Phosphorus Metabolism. Due to the small size of sub-networks III, IV and V, it is difficult to adequately hypothesise on their function. The domains in sub-network III, however, are often observed together, suggesting they may comprise a protein complex or participate closely in a shared mechanism.

Figure 1: Network visualization of protein domain correlation across Global Ocean Sampling expedition metagenomes. A) Fruchterman-Reingold placement of protein domains. Vertices represent protein domains with no known function and are connected by an edge when correlated with a Spearman’s rho > 0.90 (p < 0.01). When possible, domains were assigned to a putative metabolic category and coloured accordingly. B) Corresponding MST calculated with Prim’s algorithm for weighted graphs. See text for details.
[Figure 1 graphic: vertex labels and sub-network labels I to V are not reproduced here. Further colour legend entries recovered from the figure: DNA Metabolism; RNA Metabolism; Phosphorus Metabolism; Amino Acids and Derivatives; Protein Metabolism; Fatty Acids, Lipids and Isoprenoids; Metabolism of Aromatic Compounds; Cofactors, Vitamins, Prosthetic Groups, Pigments; Secondary Metabolism; Cell Wall and Capsule; Membrane Transport; Motility and Chemotaxis; Stress Response.]

A forest calculated using Prim’s MST algorithm for weighted graphs (where weights were defined by 1 - rho for each pair of connected vertices) is shown in Figure 1B. Hence, only those edges representing the strongest correlations (rendered as the shortest paths between the vertices) are considered, revealing the most prominent co-occurrence detected. This approach may reveal ‘hubs’ of each hypothesised response. For example, the Regulation and Cell Signalling domain represented by vertex 6, central to sub-network I, has been characterised as an EamA domain (PF00892), found in the protein PecM. This protein is involved in pectinase, cellulase and blue pigment regulation. The regulation of lytic enzymes and pigments is likely to be central in responding to varying nutrient availability and primary productivity (particularly in the form of algal and cyanobacterial compounds) as well as irradiation in the water column. Thus, investigators attempting to characterise the putative primary productivity and photosynthesis domains in this network may benefit from considering the regulatory role of the EamA domain. A similar topology is observed around vertex 1662 in sub-network II. The hypotheses generated above are summarised in Table 1.
Table 1. Keyword summaries of hypotheses generated to account for sub-network membership in Figure 1.
Sub-network Hypotheses (keywords)
I Population booms; Nutrient transport and metabolism
II Primary productivity; Photosynthesis
III Protein complex / Mechanism associated with ligase activity
IV NA
V NA
4. CONCLUSION
Environmental genomics is continuing to enrich the scientific understanding of ecosystem composition and function. Here, graph-theoretic approaches effectively embedded results describing correlations of protein domain distribution across a global set of marine, epipelagic microbial communities. Combined with quality metadata, these visualisations allowed hypotheses concerning the function of individual domains to be propagated across sub-networks. In this manner, a new ecological perspective and context are put forth to assist future attempts at characterising these domains of unknown function.
5. ACKNOWLEDGEMENTS
The authors would like to acknowledge the Max Planck Society for funding this work.
6. AUTHOR CONTRIBUTIONS
WH and PLB contributed equally to the design and implementation of the analysis and wrote the manuscript. IK and RK conducted the metagenome processing with the DeCypher system and data management. PY, MBD, and PLB performed the annotation and categorisation of PFAM domains as well as hypothesis generation. FOG supervised the work and revised the manuscript.
7. REFERENCES
[1] Martinez, A., Tyson, G.W., and Delong, E.F. 2010. Widespread known and novel phosphonate utilization pathways in marine bacteria revealed by functional screening and metagenomic analyses. Environ. Microbiol. 12, 1: 222-38.
[2] Kurokawa, K., Itoh, T., Kuwahara, T., et al. 2007. Comparative metagenomics revealed commonly enriched gene sets in human gut microbiomes. DNA Res. 14, 4: 169-81. DOI= 10.1093/dnares/dsm018.
[3] Rusch, D. B., Halpern, A. L., Sutton, G., Heidelberg, K. B., et al. 2007. The Sorcerer II Global Ocean Sampling expedition: northwest Atlantic through eastern tropical Pacific. PLoS Biol. 5: e77. DOI= http://dx.doi.org/10.1371/journal.pbio.0050077.
[4] Seshadri, R., Kravitz, S. A., Smarr, L., Gilna, P., and Frazier, M. 2007. CAMERA: a community resource for metagenomics. PLoS Biol. 5: e75. DOI= http://dx.doi.org/10.1371/journal.pbio.0050075.
[5] Bateman, A., Coin, L., Durbin, R., Finn, R. D., Hollich, V., Griffiths-Jones, S., Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E. L. L., Studholme, D. J., Yeats, C., and Eddy, S. R. 2004. The Pfam protein families database. Nucleic Acids Res. 32: D138-D141. DOI= http://dx.doi.org/10.1093/nar/gkh121.
[6] Kottmann, R., Kostadinov, I., Duhaime, M. B., Buttigieg, P. L., Yilmaz, P., Hankeln, W., Waldmann, J., and Glöckner, F. O. 2009. Megx.net: integrated database resource for marine ecological genomics. Nucleic Acids Res. DOI= http://dx.doi.org/10.1093/nar/gkp918.
[7] R Development Core Team. 2008. R: A Language and Environment for Statistical Computing. http://www.R-project.org.
[8] Harrell, F. E., with contributions from many other users. 2008. Hmisc: Harrell Miscellaneous. http://biostat.mc.vanderbilt.edu/s/Hmisc.
[9] Csardi, G. and Nepusz, T. 2006. The igraph software package for complex network research. InterJournal Complex Systems: 1695. http://igraph.sf.net.
[10] Fruchterman, T. M. J. and Reingold, E. M. 1991. Graph drawing by force-directed placement. Software Pract Ex. 21: 1129-1164. http://doi.wiley.com/10.1002/spe.4380211102.
[11] Prim, R. C. 1957. Shortest connection networks and some generalizations. Bell System Technical Journal: 1389-1401.
[12] Overbeek, R., Begley, T., Butler, R. M., et al. 2005. The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res. 33: 5691-5702. DOI= http://dx.doi.org/10.1093/nar/gki866.