
[ACM Press the First ACM International Conference - Niagara Falls, New York (2010.08.02-2010.08.04)] Proceedings of the First ACM International Conference on Bioinformatics and Computational


Applying graph theoretic approaches to microbial metagenomes: Ecological perspectives on function

Wolfgang Hankeln 1,2, Pier Luigi Buttigieg 1,2, Ivaylo Kostadinov 1,2, Renzo Kottmann 1, Pelin Yilmaz 1,2, Melissa Beth Duhaime 1,2, Frank Oliver Glöckner 1,2

1 Max Planck Institute for Marine Microbiology, Celsiusstraße 1, 28359 Bremen, Germany
2 Jacobs University Bremen gGmbH, Campus Ring 1, 28759 Bremen, Germany

ABSTRACT

Graph theoretic approaches offer a powerful means to visualise and examine data arising from microbial environmental metagenomes. Here, inter-correlations of protein domains with no known function detected in marine microbial metagenomes were visualised as networks. The topology of these networks combined with putative functional assignments of some participant protein domains allowed the generation of hypotheses concerning the ecological functions these networks may reveal. These methods may be employed to offer an ecological perspective in future attempts to describe the function of uncharacterised protein domains.

Categories and Subject Descriptors

J.3 [LIFE AND MEDICAL SCIENCES]: Visualization of environmental metagenomic data – protein network analysis in metagenomic datasets

General Terms

Algorithms, Measurement, Experimentation, Theory.

Keywords

Ecology, Ecological Genomics, Unknown Function, Function Inference.

1. INTRODUCTION

Ecological communities comprise a range of organisms and populations occupying intricately interwoven niches in a given environment. These communities are constantly under external and internal pressures to maintain functions that contribute to their survival and fitness while dispensing with those that hinder such success. As a result of this selective pressure, genes contributing to common phenotypes are often co-selected across environments [1,2]. Particularly in microbial studies, datasets emerging from environmental genomics (or metagenomics) offer unique perspectives on the association of genomic elements across environments and ecosystems. The Global Ocean Sampling (GOS) expedition [3] metagenomes describe globally distributed marine microbial communities and offer an excellent opportunity to detect ecologically preserved functional modules. This project investigated the distribution of protein domains with unknown function across the GOS metagenomes and used graph theoretic approaches to detect modular structure from which functional hypotheses were derived.

2. METHODS

A collection of 10,133,846 unassembled reads from the GOS sampling sites GS000a-GS023, GS025-GS051, GS108a-GS117b, GS119-GS123, GS148-GS149, and MOVE858 was downloaded from the CAMERA web portal [4] and queried with all hidden Markov models (HMMs) belonging to protein domains of unknown function present in the PFAM 23.0, pfam_fs database [5] using a single TimeLogic DeCypher card (Active Motif, Inc., Carlsbad, CA). Hits passing an upper E-value threshold of 1 × 10^-10 were used in further analysis.

Results were stored in MegDB [6] and converted to a “sites × domain” matrix wherein the abundance of each domain present at a given site was recorded. This matrix was then imported into the R statistical computing environment [7].
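The filtering and tabulation step can be illustrated with a minimal Python sketch. It is not the authors' pipeline (they used MegDB and R): the (site, domain, E-value) record layout, the domain names and the treatment of the threshold as inclusive are all assumptions made for illustration.

```python
from collections import defaultdict

E_VALUE_CUTOFF = 1e-10  # upper E-value threshold stated in the text


def build_site_domain_matrix(hits, cutoff=E_VALUE_CUTOFF):
    """Count accepted hits per (site, domain) and return a dense matrix.

    `hits` is an iterable of (site, domain, e_value) tuples. This record
    layout is hypothetical and is not the MegDB schema.
    """
    counts = defaultdict(int)
    for site, domain, e_value in hits:
        if e_value <= cutoff:  # assumption: the threshold is inclusive
            counts[(site, domain)] += 1
    sites = sorted({site for site, _ in counts})
    domains = sorted({domain for _, domain in counts})
    # Rows are sites, columns are domains, cells are hit counts (abundance).
    matrix = [[counts.get((s, d), 0) for d in domains] for s in sites]
    return sites, domains, matrix


# Toy records; the domain names are placeholders, not PFAM accessions.
toy_hits = [
    ("GS000a", "DUF_A", 1e-30),
    ("GS000a", "DUF_A", 1e-12),
    ("GS000a", "DUF_B", 1e-5),   # rejected: E-value above the cutoff
    ("GS001", "DUF_B", 1e-15),
]
sites, domains, matrix = build_site_domain_matrix(toy_hits)
```

Note that in this sketch a site or domain with no accepted hit does not appear in the matrix at all; whether the original matrix kept such empty rows or columns is not stated in the text.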

The correlation between each of the domains detected in the GOS dataset was determined using Spearman’s rho statistic with the rcorr() function from the package Hmisc [8].
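For readers unfamiliar with the statistic, Spearman's rho is the Pearson correlation of the rank-transformed abundances, with tied values receiving the mean of their ranks. The following standard-library Python sketch shows the calculation on made-up abundance vectors; it is an illustration only and does not reproduce rcorr(), which also reports p-values.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors.

    Raises ZeroDivisionError if either vector is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


# Made-up abundances of three domains across five sites.
domain_a = [0, 3, 5, 9, 12]
domain_b = [1, 2, 8, 20, 50]   # rises with domain_a
domain_c = [7, 7, 4, 1, 0]     # falls as domain_a rises, with one tie

rho_ab = spearman_rho(domain_a, domain_b)
rho_ac = spearman_rho(domain_a, domain_c)
```

Because only ranks matter, domain_a and domain_b are perfectly correlated here even though their raw values scale very differently, which is one reason a rank statistic suits abundance counts.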

These correlations were used to generate adjacency matrices in which vertices were connected by an edge if their correlation across sampling sites exceeded a rho threshold of 0.80 (not shown) or 0.90, with p-values lower than 0.01. Distances (1 - rho) were assigned to all edges. Graphs were generated in the igraph [9] R package. The igraph function layout.fruchterman.reingold() was applied (default settings) to embed graphs using force-directed placement following the Fruchterman-Reingold algorithm [10].
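The thresholding rule can be sketched as follows. This is a plain-Python illustration rather than the igraph code used in the study; the function name, the toy matrices and the domain names are hypothetical, and the p-values are assumed to have been computed beforehand (as rcorr() does in R).

```python
def threshold_edges(names, rho, p_values, rho_min=0.90, p_max=0.01):
    """Return [(u, v, distance)] for pairs with rho > rho_min and p < p_max.

    `rho` and `p_values` are symmetric square matrices (lists of lists)
    indexed in the same order as `names`. The distance is 1 - rho.
    """
    edges = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):  # upper triangle: undirected graph
            if rho[i][j] > rho_min and p_values[i][j] < p_max:
                edges.append((names[i], names[j], 1.0 - rho[i][j]))
    return edges


# Toy inputs; values are invented for illustration.
names = ["DUF_A", "DUF_B", "DUF_C"]
rho = [
    [1.00, 0.95, 0.40],
    [0.95, 1.00, 0.92],
    [0.40, 0.92, 1.00],
]
p_values = [
    [0.000, 0.001, 0.300],
    [0.001, 0.000, 0.050],
    [0.300, 0.050, 0.000],
]
edges = threshold_edges(names, rho, p_values)
```

In this toy case DUF_B and DUF_C exceed the rho threshold but fail the p-value filter, so only the DUF_A to DUF_B edge is kept, with a distance of roughly 0.05.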

Minimum spanning trees (MSTs) were calculated for each graph with Prim’s algorithm for weighted graphs [11] using the igraph function minimum.spanning.tree().
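As an illustration of the technique, the sketch below implements Prim's algorithm with a binary heap in standard-library Python. It is not the igraph implementation; the restart from every unreached vertex is added here so that a disconnected graph yields a spanning forest, and the vertex names and weights are invented.

```python
import heapq
from collections import defaultdict


def prim_forest(edges):
    """Minimum spanning forest of an undirected weighted graph.

    `edges` is a list of (u, v, weight). Prim's algorithm is restarted from
    every vertex not yet reached, so a disconnected graph yields one tree
    per connected component.
    """
    adjacency = defaultdict(list)
    for u, v, w in edges:
        adjacency[u].append((w, u, v))
        adjacency[v].append((w, v, u))

    visited, forest = set(), []
    for root in sorted(adjacency):
        if root in visited:
            continue
        visited.add(root)
        frontier = list(adjacency[root])
        heapq.heapify(frontier)
        while frontier:
            w, u, v = heapq.heappop(frontier)  # lightest edge leaving the tree
            if v in visited:
                continue  # both ends already in the tree: would form a cycle
            visited.add(v)
            forest.append((u, v, w))
            for edge in adjacency[v]:
                if edge[2] not in visited:
                    heapq.heappush(frontier, edge)
    return forest


# Toy graph with two components; weights play the role of 1 - rho distances.
toy_edges = [
    ("a", "b", 0.02),
    ("b", "c", 0.05),
    ("a", "c", 0.09),  # dropped: the heaviest edge of the a-b-c triangle
    ("d", "e", 0.03),
]
forest = prim_forest(toy_edges)
```

With distances defined as 1 - rho, the lightest edges are the strongest correlations, so the forest retains for each vertex its most strongly correlated route into the rest of its component.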

PFAM domain descriptions indicating potential functions were used, when possible, to guide the assignment of domains into broad functional categories outlined in the SEED project [12]. These assignments were then used as vertex attributes and represented as colours in the network graphs and MSTs.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ACM-BCB 2010, Niagara Falls, NY, USA. Copyright 2010 ACM ISBN 978-1-4503-0438-2... $10.00.

Clusters of neighbouring vertices were considered to be putative ecological functional modules. Hypotheses of the corresponding functions were then generated based on the placement and connectivity of vertices representing categorised domains. Vertices corresponding to uncategorised domains within or connected to putative modules were considered to participate in the hypothesised function.
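One way to make this procedure concrete is sketched below in Python. It rests on a simplifying assumption that is not stated in the text: a "cluster of neighbouring vertices" is approximated by a connected component of the thresholded graph. The vertex names and category assignments are invented, and in the study the hypotheses were formed by inspection rather than by an automated rule.

```python
from collections import defaultdict, deque


def connected_components(edges):
    """Return the connected components of an undirected graph as sorted lists."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, components = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        seen.add(start)
        queue, members = deque([start]), []
        while queue:  # breadth-first search from the unvisited start vertex
            node = queue.popleft()
            members.append(node)
            for neighbour in sorted(adjacency[node]):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(members))
    return components


def module_hypotheses(components, categories):
    """Pair each module with the categories of its annotated members.

    Uncategorised members inherit the module's candidate functions simply
    by belonging to the same component.
    """
    return [
        (members, sorted({categories[m] for m in members if m in categories}))
        for members in components
    ]


# Hypothetical graph: v2, v3 and v5 have no functional category.
toy_edges = [("v1", "v2"), ("v2", "v3"), ("v4", "v5")]
toy_categories = {"v1": "Photosynthesis", "v4": "Respiration"}

modules = connected_components(toy_edges)
hypotheses = module_hypotheses(modules, toy_categories)
```

Here the uncategorised vertices v2 and v3 would be hypothesised to participate in a photosynthesis-related function because they share a module with v1.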


[Figure 1 image residue: numeric vertex labels of the network graph omitted. Legend entries recovered here: Unknown; Photosynthesis; Respiration; Regulation and Cell signaling; Cell Division and Cell Cycle.]

3. RESULTS & DISCUSSION

The Spearman’s rank correlation (rho > 0.90, p < 0.01) between functionally uncharacterised protein domains detected across the GOS metagenomes is visualised as a Fruchterman-Reingold embedded network in Figure 1A. In some cases, domains were assigned putative functional roles, represented by the colour scheme in Figure 1. Five sub-networks were noted. Sub-network I comprises domains with a range of putative functions, prominently: Respiration, Cell Division and Cell Cycle, Regulation and Cell Signalling, and Membrane Transport. Together, these assignments suggest a functional constellation suitable for a general response to changing nutrient availability in the ecosystem. Sub-network II is composed almost entirely of domains putatively linked to Photosynthesis, with one domain assigned to Regulation and Cell Signalling. This suggests a common function tied to primary productivity, most likely under the regulation of the domain represented by vertex 1662. Sub-network III features domains known to co-occur, including one with putative functionality in Secondary Metabolism. Sub-networks IV and V each contain two domains, with one of the former’s putatively associated with Phosphorus Metabolism. Due to the small size of sub-networks III, IV and V, it is difficult to adequately hypothesise on their function. The domains in sub-network III, however, are often observed together, suggesting they may comprise a protein complex or participate closely in a shared mechanism.

[Figure 1 image: numeric vertex labels omitted; sub-networks are labelled I to V in panels A and B. Legend entries recovered here: DNA Metabolism; RNA Metabolism; Phosphorus Metabolism; Amino Acids and Derivatives; Protein Metabolism; Fatty Acids, Lipids and Isoprenoids; Metabolism of Aromatic Compounds; Cofactors, Vitamins, Prosthetic Groups, Pigments; Secondary Metabolism; Cell Wall and Capsule; Membrane Transport; Motility and Chemotaxis; Stress Response.]

Figure 1: Network visualization of protein domain correlation across Global Ocean Sampling expedition metagenomes. A) Fruchterman-Reingold placement of protein domains. Vertices represent protein domains with no known function and are connected by an edge when correlated with a Spearman’s rho > 0.90 (p < 0.01). When possible, domains were assigned to a putative metabolic category and coloured accordingly. B) Corresponding MST calculated with Prim’s algorithm for weighted graphs. See text for details.

A forest calculated using Prim’s MST algorithm for weighted graphs (where weights were defined by 1 - rho for each pair of connected vertices) is shown in Figure 1B. Hence, only those edges representing the strongest correlations (rendered as the shortest paths between the vertices) are considered, revealing the most prominent co-occurrence detected. This approach may reveal ‘hubs’ of each hypothesised response. For example, the Regulation and Cell Signalling domain represented by vertex 6, central to sub-network I, has been characterised as an EamA domain (PF00892), found in the protein PecM. This protein is involved in pectinase, cellulase and blue pigment regulation. The regulation of lytic enzymes and pigments is likely to be central in responding to varying nutrient availability and primary productivity, particularly in the form of algal and cyanobacterial compounds, as well as irradiation in the water column. Thus, investigators attempting to characterise the putative primary productivity and photosynthesis domains in this network may benefit from considering the regulatory role of the EamA domain. A similar topology is observed around vertex 1662 in sub-network II. The hypotheses generated above are summarised in Table 1.
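The text identifies hubs by inspecting the MST and does not give a quantitative criterion. One simple proxy, assumed here purely for illustration, is vertex degree within the tree: a vertex that many others attach to through their strongest correlation will have a high degree. The vertex names below are placeholders, not vertices from Figure 1.

```python
from collections import Counter


def degree_ranking(tree_edges):
    """Return (vertex, degree) pairs, highest degree first.

    Ties are broken alphabetically by vertex name for a stable order.
    """
    degree = Counter()
    for u, v in tree_edges:
        degree[u] += 1
        degree[v] += 1
    return sorted(degree.items(), key=lambda item: (-item[1], str(item[0])))


# Hypothetical spanning tree in which "hub" is the central vertex.
toy_tree = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("c", "d")]
ranking = degree_ranking(toy_tree)
top_vertex, top_degree = ranking[0]
```

Other centrality measures (for example betweenness) could serve the same purpose; degree is used here only because it is the simplest to read off a tree.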

Table 1. Keyword summaries of hypotheses generated to account for sub-network membership in Figure 1.

Sub-network   Hypotheses (keywords)
I             Population booms; Nutrient transport and metabolism
II            Primary productivity; Photosynthesis
III           Protein complex / Mechanism associated with ligase activity
IV            NA
V             NA

4. CONCLUSION

Environmental genomics is continuing to enrich the scientific understanding of ecosystem composition and function. Here, graph-theoretic approaches effectively embedded results describing correlations of protein domain distribution across a global set of marine, epipelagic microbial communities. Combined with quality metadata, these visualisations allowed hypotheses concerning the function of individual domains to be propagated across sub-networks. In this manner, a new, ecological perspective and context is put forth to assist future attempts in characterising these domains of unknown function.

5. ACKNOWLEDGEMENTS

The authors would like to acknowledge the Max Planck Society for funding this work.

6. AUTHOR CONTRIBUTIONS

WH and PLB contributed equally to the design and implementation of the analysis and wrote the manuscript. IK and RK conducted the metagenome processing with the DeCypher system and data management. PY, MBD, and PLB performed the annotation and categorisation of PFAM domains as well as hypothesis generation. FOG supervised the work and revised the manuscript.

7. REFERENCES

[1] Martinez, A., Tyson, G.W., and Delong, E.F. 2010. Widespread known and novel phosphonate utilization pathways in marine bacteria revealed by functional screening and metagenomic analyses. Environ. Microbiol. 12, 1: 222-38.

[2] Kurokawa, K., Itoh, T., Kuwahara, T., et al. 2007. Comparative metagenomics revealed commonly enriched gene sets in human gut microbiomes. DNA Res. 14, 4: 169-81. DOI= 10.1093/dnares/dsm018.

[3] Rusch, D. B.; Halpern, A. L.; Sutton, G.; Heidelberg, K. B.; et al. 2007. The Sorcerer II Global Ocean Sampling expedition: northwest Atlantic through eastern tropical Pacific. PLoS Biol. 5:e77 DOI= http://dx.doi.org/10.1371/journal.pbio.0050077.

[4] Seshadri, R.; Kravitz, S. A.; Smarr, L.; Gilna, P. & Frazier, M. 2007. CAMERA: a community resource for metagenomics. PLoS Biol. 5:e75 DOI= http://dx.doi.org/10.1371/journal.pbio.0050075.

[5] Bateman, A.; Coin, L.; Durbin, R.; Finn, R. D.; Hollich, V.; Griffiths-Jones, S.; Khanna, A.; Marshall, M.; Moxon, S.; Sonnhammer, E. L. L.; Studholme, D. J.; Yeats, C. & Eddy, S. R. 2004. The Pfam protein families database. Nucleic Acids Res. 32:D138-D141 DOI= http://dx.doi.org/10.1093/nar/gkh121.

[6] Kottmann, R.; Kostadinov, I.; Duhaime, M. B.; Buttigieg, P. L.; Yilmaz, P.; Hankeln, W.; Waldmann, J. & Glöckner, F. O. 2009. Megx.net: integrated database resource for marine ecological genomics. Nucleic Acids Res., DOI= http://dx.doi.org/10.1093/nar/gkp918.

[7] R Development Core Team. 2008. R: A Language and Environment for Statistical Computing. http://www.R-project.org.

[8] Harrell, F. E. & with contributions from many other users. 2008. Hmisc: Harrell Miscellaneous. http://biostat.mc.vanderbilt.edu/s/Hmisc.

[9] Csardi, G. & Nepusz, T. 2006. The igraph software package for complex network research. InterJournal. Complex Systems:1695 http://igraph.sf.net.

[10] Fruchterman, T. M. J. & Reingold, E. M. 1991. Graph drawing by force-directed placement. Software Pract Ex. 21:1129-1164 http://doi.wiley.com/10.1002/spe.4380211102.

[11] Prim, R. C. 1957. Shortest connection networks and some generalizations. Bell Systems Technical Journal. :1389-1401.

[12] Overbeek, R.; Begley, T.; Butler, R. M.; et al. 2005. The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res. 33:5691-5702 DOI= http://dx.doi.org/10.1093/nar/gki866.
