PUBLIC LIBRARY of SCIENCE | plosbiology.org | Special Collection | MARCH 2007

Oceanic Metagenomics

A collection of articles from the J. Craig Venter Institute’s Global Ocean Sampling expedition


committed to making scientific and medical literature a public resource

www.plos.org


Publisher Information

PLoS Biology (ISSN-1544-9173, eISSN-1545-7885) is published monthly by the Public Library of Science. All works published in PLoS journals are open access, subject to the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.5/). Copyright is retained by the authors. PLoS Biology is freely available online: http://plosbiology.org

Correspondence

Public Library of Science
185 Berry St., Ste. 3100
San Francisco, CA 94107 USA
email: [email protected]
tel: +1 415.624.1200
fax: +1 415.546.4090

PLoS European Editorial Office
7 Portugal Place
Cambridge CB5 8AF UK
email: [email protected]
tel: +44 (0)1223-463-330
fax: +44 (0)1223-463-348

Display Advertising

Patric Donaghy, [email protected]
tel: +1 415.564.8612

Manuscript Submission

online: http://biology.plosjms.org


Editorial Staff

Hemai Parthasarathy, Managing Editor; Natalie Bouaravong, Editorial Assistant; Jami Milton Dantzker, Associate Editor; Jacob Evans, Editorial Assistant; Liza Gross, Science Writer; Emma Hill, Associate Editor; Catriona MacCallum, Senior Editor; Robert Shields, Senior Editor; Janelle Weaver, Associate Editor

Board of Directors

Harold Varmus, Chairman & Co-founder; Patrick O. Brown, Co-founder; Michael B. Eisen, Co-founder; Brian Druker; Paul Ginsparg; Allan Golston; Calestous Juma; Lawrence Lessig; Elizabeth Marincola; Richard Smith; Rosalind L. Smyth; Beth Weil

Editorial Board

Anurag Agrawal, Julie Ahringer, Shizuo Akira, Richard Aldrich, Göran Arnqvist, James Ashe, Anthony Barnosky, Nick Barton, Konrad Basler, Michael Bate, Peter Becker, Pamela Björkman, Peer Bork, Henry Bourne, Lon Cardon, James Carrington, Lars Chittka, Joanne Chory, Jeffrey Dangl, Titia De Lange, Frans de Waal, Joseph DeRisi, Andrew Dillin, Andy Dobson, Ford Ebner, Sean Eddy, Thomas Edlund, Thomas Egwang, Jonathan Eisen, Steve Elledge, Stephen Ellner, Michael Emerman, Manfred Fahle, Susan Gasser, Mikhail Gelfand, Richard Gibbs, Margaret Goodell, Douglas Green, Bryan Grenfell, James Haber, Hiroshi Hamada, William Harris, Paul Harvey, Nicholas Hastie, R. Scott Hawley, Anders Hedenström, Joseph Heitman, Dan Herschlag, Winston Hide, David Hillis, Brigid Hogan, Fred Hughson, Tim Hunt, Laurence Hurst, Gerald Joyce, Jim Kadonaga, Laurent Keller, Christopher Kemp, Chaitan Khosla, Joel Kingsolver, Thomas Kirkwood, Tom Kornberg, Mark Krasnow, Arthur D. Lander, Andre Levchenko, Michael Lichten, Susan Lindquist, David Lipman, Edison Liu, Michel Loreau, Georgina Mace, Philippa Marrack, Alfonso Martínez-Arias, Rowena Matthews, Markus Meister, Bénédicte Michel, Emmanuel Mignot, Tom Misteli, Nancy Moran, Craig Moritz, David Nemazee, Eric Nestler, Mohamed Noor, Roel Nusse, Steve O’Rahilly, Svante Pääbo, Nipam Patel, David Penny, Greg Petsko, Lennart Philipson, Ron Plasterk, Dietmar Plenz, Hidde Ploegh, Walt Reid, Callum Roberts, Richard Roberts, Sarah Rowland-Jones, Gerry Rubin, Mick Rugg, Ueli Schibler, Manfred Schliwa, David Schneider, Matthew P. Scott, Idan Segev, Ben Sheldon, Daniel Simberloff, Kai Simons, Mandyam Srinivasan, Derek Stemple, Charles Stevens, Bill Sugden, Sally Temple, Janet Thornton, Chris Tyler-Smith, Leslie Ungerleider, Joan Valentine, Matt van de Rijn, Antonio Vidal-Puig, Herbert W. Virgin, Matt Waldor, Peter Walter, Gary Ward, Detlef Weigel, Jonathan S. Weissman, Marv Wickens, Ken H. Wolfe, Phillip D. Zamore, Robert Zatorre, Huda Y. Zoghbi

Production

Ashley Clark, Anthony Flores, Alexis Wynne Mogul, Chelsea E. Scholl

Marketing

Liz Allen, Director; Allison Hawxhurst; Catherine Silvestre

IT & Web

Richard Cave, Director; Andrew Bergeron; Susanne DeRisi; Josh Klavir; Céline Nadeau; Tim Sullivan; Russell Uman; Elisa Webb

Staff

Mark Gritton, Chief Executive Officer; Mark Patterson, Director of Publishing; Steve Borostyan, Chief Financial Officer; Barbara Cohen, PLoS Executive Editor; Janice Pettey, Development Director; Isis Choto; Donna Okubo; Robert Viera

PLoS Biology | www.plosbiology.org


Editorial

Global Ocean Sampling Collection (e83)
Hemai Parthasarathy, Emma Hill, Catriona MacCallum

Synopses of Research Articles

Untapped Bounty: Sampling the Seas to Survey Microbial Biodiversity (e85)
Liza Gross

Feature

Sorcerer II: The Search for Microbial Diversity Roils the Waters (e74)
Henry Nicholls

Essay

Environmental Shotgun Sequencing: Its Potential and Challenges for Studying the Hidden World of Microbes (e82)
Jonathan A. Eisen

Community Page

CAMERA: A Community Resource for Metagenomics (e75)
Rekha Seshadri, Saul A. Kravitz, Larry Smarr, Paul Gilna, Marvin Frazier

Research Articles

The Sorcerer II Global Ocean Sampling Expedition: Northwest Atlantic through Eastern Tropical Pacific (e77)
Douglas B. Rusch, Aaron L. Halpern, Granger Sutton, et al.

The Sorcerer II Global Ocean Sampling Expedition: Expanding the Universe of Protein Families (e16)
Shibu Yooseph, Granger Sutton, Douglas B. Rusch, et al.

Structural and Functional Diversity of the Microbial Kinome (e17)
Natarajan Kannan, Susan S. Taylor, Yufeng Zhai, et al.


Every paper we publish is freely available online for you to read, download, copy, distribute and use, no permissions required. All articles are archived in PubMed Central.

About the Cover

Aboard the Sorcerer II, the Global Ocean Sampling expedition made its way from Canada, through the Panama Canal, into the South Pacific, collecting genomic sequences from marine microorganisms. The resulting data are explored in three papers in this special collection from the March 2007 issue of PLoS Biology (see Rusch et al., e77; Yooseph et al., e16; and Kannan et al., e17).

Cover credit: Image provided by the J. Craig Venter Institute

doi:10.1371/journal.pbio.0050088.g001


Editorial

Special Section from March 2007 | Volume 5 | Issue 3 | e83

Today, PLoS Biology publishes landmark metagenomics papers from the J. Craig Venter Institute’s Global Ocean Sampling expedition [1–3]. These papers describe the initial analyses of several gigabasepairs’ worth of sequence data from oceanic microbes collected during the Sorcerer II expedition, as the ship made her way down from Canada, through the Panama Canal, and finally out beyond the Galapagos Islands well into the tropical Pacific and the South Pacific Gyre. Results from the first foray of this research mission into the Sargasso Sea were published three years ago [4]. As described in the accompanying Synopsis [5], the new voyage has added information from multiple biomes and several-fold more data.

Analysis of these data poses not only scientific challenges [6], but also significant legal hurdles. Craig Venter is no stranger to issues of intellectual property: his previous incarnation as the president of Celera saw him embroiled in controversy over the decision to “privatize” aspects of his company’s work in sequencing the human genome. Now, at the head of the Global Ocean Sampling project, Venter finds himself on the side of greater accessibility, negotiating the claims of individual governments on the genomic wealth within their waters. In particular, as of this writing, there is an active negotiation with the Ecuadorian government (which has seen more than one change of power since the expedition began) over restricting commercial reuse of these data. Henry Nicholls describes this tangled legal landscape in an accompanying Feature [7].

Although extensive in scope, the papers presented here only touch the surface of the wealth of information to be gleaned from these data, which are freely available for all to explore from their desktops: the trace reads and processed data have been deposited in the National Center for Biotechnology Information’s Trace Archive (http://www.ncbi.nlm.nih.gov/Traces) (with the exception of that fraction of the trace data acquired from Ecuadorian coastal waters), annotated with extensive geographical and physicochemical metadata. The assemblies and associated annotated peptides will be delivered to GenBank (http://www.ncbi.nlm.nih.gov/Genbank) around the time of publication, and will become available after GenBank has processed them. More immediately, and potentially more usefully, these data are also freely available through a specially built database, CAMERA (Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis; http://camera.calit2.net), which provides greater annotation and analysis capabilities [8]. (CAMERA was funded by the Gordon and Betty Moore Foundation, which also supports PLoS.)

The proponents of open-access publishing, ourselves included, often cite as an inspiration the power that open access to DNA sequence databases has had in transforming scientific discovery. As our founders noted in the inaugural issue of PLoS Biology, “With great foresight, it was decided in the early 1980s that published DNA sequences should be deposited in a central repository, in a common format, where they could be freely accessed and used by anyone. Simply giving scientists free and unrestricted access to the raw sequences led them to develop the powerful methods, tools, and resources that have made the whole much greater than the sum of the individual sequences....Now imagine the possibilities if the same creative explosion that was fueled by open access to DNA sequences were to occur for the much larger body of published scientific results.” [9]

But the publishing reality in genomics research has been less inspiring. Although sequence data are publicly available and free to be reused by the community, the same creative license has not yet been awarded to the key papers resulting from the major genome projects, which are commonly published in subscription-based journals. Many of these genomics papers are “freely” available from publisher Web sites, but their use remains restricted, and to claim that freedom to read an article is the main benefit of open access is to miss the promise inspired by DNA sequence databases.

While we and other open-access journals have both enjoyed and been grateful for strong support from the genomics community, we are also disappointed that authors of landmark genomics papers, who adamantly support open access to sequence data, have not taken the opportunity to provide further leadership for their community by promoting open access to the scientific literature. We encourage all researchers to apply the same standards to their papers as they would to their data, regardless of the publisher. As Jensen et al. stated in a recent review about the benefits of text mining for the scientific community, “It is the restricted access to the full text of papers…that is currently the greatest limitation…” [10].

Global Ocean Sampling Collection

Hemai Parthasarathy*, Emma Hill, Catriona MacCallum

Citation: Parthasarathy H, Hill E, MacCallum C (2007) Global Ocean Sampling collection. PLoS Biol 5(3): e83. doi:10.1371/journal.pbio.0050083

Copyright: © 2007 Parthasarathy et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Hemai Parthasarathy is Managing Editor, Emma Hill is Associate Editor, and Catriona MacCallum is Senior Editor at PLoS Biology.

* To whom correspondence should be addressed. E-mail: [email protected]

This article is part of the Oceanic Metagenomics collection in PLoS Biology. The full collection is available online at http://collections.plos.org/plosbiology/gos-2007.php.


Acknowledgments

PLoS Biology relies on the support of our academic editors and reviewers in selecting and improving manuscripts for publication. We would like to extend particular thanks to our editorial board members Sean Eddy, Jonathan Eisen, and Nancy Moran, our guest editors Simon Levin and Tony Pawson, and our anonymous peer reviewers for their contributions to this collection of articles.

References

1. Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Northwest Atlantic through eastern tropical Pacific. PLoS Biol 5: e77. doi:10.1371/journal.pbio.0050077

2. Yooseph S, Sutton G, Rusch DB, Halpern AL, Williamson SJ, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Expanding the universe of protein families. PLoS Biol 5: e16. doi:10.1371/journal.pbio.0050016

3. Kannan N, Taylor SS, Zhai Y, Venter JC, Manning G (2007) Structural and functional diversity of the microbial kinome. PLoS Biol 5: e17. doi:10.1371/journal.pbio.0050017

4. Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, et al. (2004) Environmental genome shotgun sequencing of the Sargasso Sea. Science 304: 66–74.

5. Gross L (2007) Untapped bounty: Sampling the seas to survey microbial biodiversity. PLoS Biol 5: e85. doi:10.1371/journal.pbio.0050085

6. Eisen JA (2007) Environmental shotgun sequencing: The potential and challenges of random and fragmented sampling of the hidden world of microbes. PLoS Biol 5: e82. doi:10.1371/journal.pbio.0050082

7. Nicholls H (2007) Sorcerer II: The search for microbial diversity roils the waters. PLoS Biol 5: e74. doi:10.1371/journal.pbio.0050074

8. Seshadri R, Kravitz SA, Smarr L, Gilna P, Frazier M (2007) CAMERA: A community resource for metagenomics. PLoS Biol 5: e75. doi:10.1371/journal.pbio.0050075

9. Brown PO, Eisen MB, Varmus HE (2003) Why PLoS became a publisher. PLoS Biol 1: e36. doi:10.1371/journal.pbio.0000036

10. Jensen LJ, Saric J, Bork P (2006) Literature mining for the biologist: From information retrieval to biological discovery. Nat Rev Genet 7: 119–129.


Being invisible to the naked eye, microbes managed to escape scientific scrutiny until the mid-17th century, when Leeuwenhoek invented the microscope. These cryptic organisms continued to thwart scientists’ efforts to probe, describe, and classify them until about 40 years ago, owing largely to a limited morphology that defies traditional taxonomic methods and an enigmatic physiology that makes them notoriously difficult to cultivate.

Most of what we know about the biochemical diversity of microbes comes from the tiny fraction that submit to lab investigations. Not until scientists determined that they could use molecular sequences to identify species and determine their evolutionary heritage, or phylogeny, did it begin to become apparent just how diverse microbes are. We now know that microbes are the most widely distributed organisms on earth, having adapted to environments as diverse as boiling sulfur pits and the human gut. Accounting for half of the world’s biomass, microbes provide essential ecosystem services by cycling the mineral nutrients that support life on earth. And marine microbes remove so much carbon dioxide from the atmosphere that some scientists see them as a potential solution to global warming.

Yet even as scientists describe seemingly endless variations on the cosmopolitan microbial lifestyle, the concept of a bacterial species remains elusive. Some bacterial species (such as anthrax) appear to have little genetic variation while in others (such as Escherichia coli) individuals can have completely different sets of genes, challenging scientists to explain the observed diversity.

The emerging field of environmental genomics (or metagenomics) aims to capture the full measure of microbial diversity by trading the lens of the microscope (and biochemistry) for the lens of genomics (and bioinformatics). By recovering communities of microbial genes where they live, environmental genomics avoids the need to culture uncooperative organisms. And by linking these data to details relating to sequence collection sites, such as pH, salinity, and water temperature, it sheds light on the biological processes encoded in the genes.

The largest metagenomic dataset collected so far comes from the Sorcerer II expedition, named after the yacht J. Craig Venter transformed into a marine research vessel. In a pilot study of the Sargasso Sea, Venter’s team identified 1.2 million genes and inferred the presence of at least 1,800 bacterial species. But the genetic and taxonomic diversity of the data imposed new challenges on existing genome assembly methods and other analysis techniques. The researchers designed the Sorcerer II Global Ocean Sampling (GOS) expedition to see if collecting more samples would improve their assembly and lead to a better estimate of the number and diversity of microbial genes in the oceans.

And now, in three new studies, Venter’s team has combined the expedition’s latest bounty (6.5 million sequencing “reads”) with the Sargasso Sea data. The result is a geographically diverse environmental genomic dataset of 6.3 billion base pairs, twice the size of the human genome. (To learn about the voyage and sampling methods, see Box 1.) In the first paper, Douglas Rusch, Aaron Halpern, and colleagues attempt to describe the immense amount of microbial diversity in the seas, and determine how, or if, that diversity is structured and what might be shaping that structure. In the second paper, Shibu Yooseph et al. study the millions of proteins in the GOS sequences to see if we’re close to discovering all the proteins in nature. And in the third study, Natarajan Kannan, Gerard Manning, and colleagues classify thousands of kinases into 20 distinct families, revealing their structural and functional diversity and an unexpected importance in prokaryotic regulation.

Extracting Meaning from Metagenomic Datasets

The GOS samples used in the Rusch et al. study were collected over the course of a year from a wide range of aquatic environments (including estuaries, lakes, and open oceans), then pumped through serial filters. After extracting the genetic material from the microbe-encrusted filters, Rusch et al. used shotgun sequencing to study the genes present in the samples. In this technique, DNA is forced through a tiny nozzle that smashes it into bits; the fragments are cloned and the letters of the genetic code are scanned from both ends to create “reads.” Reads are then assembled, much like a jigsaw puzzle, starting with contiguous fragments (“contigs”) that are then mapped onto “scaffolds,” which order and orient sets of contigs on a chromosome. (For more on shotgun sequencing, see Box 2.) Under a conservative sequence similarity requirement, most reads failed to assemble, suggesting that the samples contained great microbial diversity.
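The jigsaw-puzzle idea can be illustrated with a minimal Python sketch of greedy overlap assembly. This is a toy under stated assumptions, not the assembler used by Rusch et al.: the function names `overlap` and `greedy_assemble`, the `min_overlap` parameter, and the example reads are all hypothetical, and overlaps here must match exactly, whereas real assemblers tolerate sequencing errors and use mate-pair information.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no overlap of at least `min_len` bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_overlap: int) -> list[str]:
    """Repeatedly merge the two sequences with the largest exact overlap.

    Assumption for illustration: reads are error-free and all come from
    the same strand. Sequences that never overlap stay as separate contigs.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_overlap)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical toy reads: the first three overlap by 4 bases, the last is unrelated.
toy_reads = ["ATGGCGT", "GCGTACC", "TACCTGA", "GGGTTTA"]
print(greedy_assemble(toy_reads, min_overlap=4))
print(greedy_assemble(toy_reads, min_overlap=5))
```

With a 4-base overlap requirement the first three toy reads merge into one contig and the fourth stays alone; raising the requirement to 5 bases leaves every read unassembled, a small-scale analogue of how a conservative similarity requirement leaves most reads from a diverse sample unassembled.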

With only a bare bones assembly to guide their investigation, Rusch et al. tried a different approach. They used the 584 completed and draft microbial genomes already available in public databases as points of reference and relaxed search parameters to detect even remote similarity to GOS sequences. Although the majority of GOS reads matched up with one or more of the reference genomes, the loose criteria prevented the researchers from drawing meaningful inferences about kinship.

To boost their inference power, they required that similarity to a reference genome extend nearly the full length of a read (producing “recruited reads”). A substantial majority of reads failed this criterion, with only 30% of the GOS data being recruited. The bulk of these aligned to three genera of widely distributed marine microbes (Pelagibacter, Synechococcus, and Prochlorococcus), which accounted for about 15% of the recruited reads. The remaining recruited reads appear to signal conserved genes rather than closely related organisms. Most of the GOS sequences failed to be identified, in part because so few surface water microbes have been sequenced.

Untapped Bounty: Sampling the Seas to Survey Microbial Biodiversity

Liza Gross | doi:10.1371/journal.pbio.0050085

Special Section from March 2007 | Volume 5 | Issue 3 | e85

Box 1. Following the Sorcerer II’s Hunt for Microbes

The Sorcerer II expedition was inspired by the British Challenger expedition (1872–1876), a pioneering oceanography research project that discovered hundreds of new genera and nearly 5,000 new marine species. Its gun stations replaced with research stations, the Challenger circumnavigated the oceans, stopping every 320 kilometers to recover specimens from bottom, intermediate, and surface depths to explore the diversity of macroscopic marine life. At each stop, the crew recorded the location, what they used to extract the sample, the depth of the sample, and several observations related to water and atmospheric conditions. The Sorcerer II followed a similar sampling schedule, traveling nearly 9,000 kilometers to collect samples of microbial marine life and record the water’s location, depth, pH, salinity, and temperature.

The GOS crew collected samples from surface waters of diverse, mostly marine aquatic environments. The samples were collected between August 2003 and May 2004 during a six-leg journey that followed a path from northeastern Canada to the South Pacific Gyre. Venter’s crew collected microbial samples by pumping 200 liters of surface seawater through a series of increasingly fine filters, which they labeled, froze, and sent back to the lab of the J. Craig Venter Institute in Maryland.

After a stop in the Gulf of Maine, the expedition sampled three sites along Nova Scotia, including a “highly eutrophic” coastal embayment in Halifax. The crew set sail again in November, starting in Newport Harbor, Rhode Island, and ending in the Delaware Bay, one of several estuaries targeted on the journey. The next leg began in Chesapeake Bay. The largest US estuary, Chesapeake Bay contains a rich mix of freshwater and marine organisms. Estuaries are complex hydrodynamic environments that are highly sensitive to runoff from agricultural and urban development (which can dump massive amounts of nitrogen and phosphorus into watersheds). Microbial communities collected from estuaries promise to provide valuable insights into the metabolic and physiological adaptations required by such environments. Continuing down the Atlantic seaboard, the expedition stopped near Cape Hatteras, North Carolina, and the Florida Keys before passing through the Caribbean and ending near Panama, where the crew collaborated with scientists at the Smithsonian Tropical Research Institute.

The fourth leg of the voyage sampled sites in the Eastern Pacific, including Cocos Island, about 500 kilometers southwest of Costa Rica. A highly productive ecosystem inhabits the waters off the island, a result of ocean currents buffeting the coast and causing nutrient upwellings that mix with warm surface waters. The crew made one last stop in the open ocean, then headed for the Galapagos Islands.

Owing in part to its position near major ocean currents and atmospheric transition zones, the Galapagos Archipelago sits within a hydrographically complex region. Unique oceanographic features there support a diverse set of habitats and endemic species, found within several discrete zones distinguished by temperature. This microbial mother lode held the crew’s attention for two months, while they extensively sampled the region.

By early March 2004, the crew had collected the last three samples used in these studies, from two open ocean sites and a lagoon in a coral reef in the South Pacific Gyre. Follow these links to learn more about the Sorcerer II (http://www.sorcerer2expedition.org/version1/HTML/main.htm) and the Challenger (http://hercules.kgs.ku.edu/hexacoral/expedition/challenger_1872-1876/challenger.html) expeditions.

doi:10.1371/journal.pbio.0050085.g001

H.M.S. Challenger (Image: NOAA, Steve Nicklas)

doi:10.1371/journal.pbio.0050085.g002

The Sorcerer II (Image: J. Craig Venter Institute)

A novel comparative genomic method. Focusing on the reads that recruited to these most abundant genera, Rusch et al. generated “fragment recruitment plots.” These graphics represent relatedness and diversity of environmental sequences to a reference genome by showing where a read aligns with the reference genome (indicated by a horizontal bar) and its degree of similarity to the reference sequence (indicated by its vertical position). Recruited reads were color-coded based on sample origin to indirectly depict their associated metadata (for example, salinity and pH). These plots provided a visual tool to explore genetic diversity at the sequence and gene level, genome structure and evolution, and taxonomic and evolutionary relationships. (For more on fragment recruitment plots, see the accompanying poster, doi:10.1371/journal.pbio.0050077.sd001.)
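The two coordinates of a fragment recruitment plot (where a read lands on the reference, and how similar it is there) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the method of Rusch et al.: the function name `recruit`, the 70% default threshold, and the toy sequences are hypothetical, and the comparison is an ungapped, full-length sliding match rather than a real sequence alignment.

```python
def recruit(read: str, reference: str, min_identity: float = 70.0):
    """Place `read` on `reference` and report plot coordinates.

    Returns (start, end, percent_identity) for the best full-length,
    ungapped placement, or None when the best placement falls below
    `min_identity` percent. Assumption for illustration: no gaps, one
    strand only, ties resolved in favor of the leftmost placement.
    """
    best = None
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        matches = sum(r == w for r, w in zip(read, window))
        identity = 100.0 * matches / len(read)
        if best is None or identity > best[2]:
            best = (start, start + len(read), identity)
    if best is None or best[2] < min_identity:
        return None  # read is not recruited to this reference
    return best


# Hypothetical toy reference and reads.
reference = "ACGTACGTTAGCCGATAGGCTA"
for read in ["TAGCCGAT", "TAGCTGAT", "GGGGGGGG"]:
    print(read, recruit(read, reference))
```

In a plot built from such tuples, the start and end positions give the horizontal bar and the percent identity gives the vertical position; a color per sample, as described above, would be added from the read's metadata.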

Distinct recruitment patterns, easily detected by bands of color, emerged for each organism. In some cases, a single reference genome had multiple color bands, distinguished by their similarity and sample provenance. Because bands appeared to represent unique, closely related, and geographically distinct populations (and showed a novel level of diversity across the entire genome), the researchers termed each band a subtype. A tremendous amount of sequence diversity appeared in the subtypes, which also harbored substantial sequence variation at the protein level, some likely reflecting adaptations to local environments. This finding reveals a potential locus of microbial diversity, at the level of subtype rather than at the level of species, or ribotype (based on a segment of a ribosomal RNA gene called 16S rRNA), and offers clues to why it emerged (perhaps in response to local pressures) and how it evolved.

A novel sequence assembly method. Because such high levels of sequence diversity among organisms confound standard whole genome assembly software, and most of the GOS data correspond to organisms for which there is no appropriate reference genome, Rusch et al. used an “extreme assembly” approach to investigate the genomes of other abundant GOS populations. They used greatly reduced requirements for sequence similarity in the assemblers to generate longer contigs and capture more of the GOS data in an assembly. While some of the resulting larger assemblies corresponded to known reference genomes, others did not, allowing the researchers to study microbes without cultivated or sequenced counterparts. And because these larger assemblies could potentially provide functional insights into uncharacterized organisms, they might identify conditions that would allow scientists to grow them in the lab.

Many of the large contigs failed to align in any significant way with known genomes, so the researchers tried to match them with “seed fragments” from known taxonomic groups. By starting assembly from reads mated to the 16S rRNA gene (one of the most common marker genes used for classifying microbes), they could generate large contigs associated with many of the abundant GOS ribotypes. Fragment recruitment plots of these assemblies again revealed multiple subtypes, providing further support for the presence of multiple evolutionarily distinct subtypes within a given ribotype.

Evidence for environmental adaptations. A computational approach designed to identify groups of samples with similar genomic content revealed that tropical and temperate samples shared the least amount of genomic material. Some samples, however, were very similar.

While untangling all the factors that may affect genetic makeup of a sample is beyond current datasets and methods, the researchers demonstrated that specific genetic differences can be related to environmental factors. Several genes occurred up to seven times more frequently in a pair of samples from the Caribbean than they did in a pair from the eastern Pacific, even though both pairs had similar ribotype and genetic profiles. Many of these genes govern the metabolism and transport of phosphate (required for microbial growth), likely reflecting functional adaptations in the microbial communities to the measured differences in phosphate availability in the Caribbean and Pacific samples.

The researchers also explored diversity at the gene level by looking for evidence of functional differences in one gene family, proteorhodopsins, light-activated proton pumps with a slightly murky biological role. Proteorhodopsins were abundant in all the GOS and Sargasso Sea samples. In keeping with the diverse light environments sampled during the expedition, the researchers found a strong correlation between sequence variation and sample provenance. They hypothesize that the distribution of given variants reflects adaptation to the most abundant light spectra in their habitats.

Altogether, these results reveal the power of metagenomic approaches to capture the true measure of microbial diversity by uncovering genomic differences that would not have been apparent using traditional marker-based approaches. The breadth of this newly revealed diversity may come as a surprise to even inveterate microbe hunters.

The Expanding Protein Universe

Along with insights into microbial diversity, metagenomics promises to help us understand the vast number of proteins in nature. By randomly sampling DNA sequences from communities of organisms, metagenomic studies overcome selection and culturing biases that arise from focusing on a particular organism or a set of proteins, to provide an expansive view of protein diversity and evolution.

Proteins are typically grouped into families based on their evolutionary relationship, which can then be used to guide investigations of their biological roles. Proteins in the same family share similar amino acid sequences and three-dimensional conformations. Using amino acid sequence similarity as a measure to identify and group protein sequences from the GOS data with sequences from a comprehensive set of known proteins, Shibu Yooseph et al. evaluated the impact of the GOS data on our understanding of known proteins and studied the rate of discovery of protein families with new sequences. To group related sequences and predict proteins, they developed a novel sequence clustering technique based on full-length sequence similarity.

Special Section from March 2007 | Volume 5 | Issue 3 | e85

Identifying proteins in metagenomics data. Hypothetical proteins can be predicted by searching for open reading frames (ORFs), sequences flanked by nucleotide triplets (called codons) that signal the beginning and end of translation but don’t necessarily encode a protein. Because the GOS data contain many fragmentary sequences, Yooseph et al. allowed ORFs to be terminated at the end of a sequence, resulting in a partial or truncated ORF. They used the ORFs to generate a set of predicted proteins based on the results of a series of clustering steps and statistical analyses.
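The ORF-calling rule described above can be illustrated with a short Python sketch. It is not the pipeline Yooseph et al. used: it scans only the three forward reading frames, assumes ATG as the sole start codon, and the function name `find_orfs` and the `min_codons` cutoff are hypothetical choices made for illustration. The point of interest is the final branch, which keeps an ORF that runs off the end of a read and flags it as truncated.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(read, min_codons=2):
    """Return (frame, start, end, truncated) tuples for the forward strand.

    An ORF opens at the first ATG seen in a frame and closes at the next
    in-frame stop codon. If the read ends first, the ORF is still kept
    and flagged as truncated (illustrative rule, not the study's code).
    """
    read = read.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Index just past the last complete codon in this frame.
        end_of_frame = frame + (len(read) - frame) // 3 * 3
        for pos in range(frame, end_of_frame, 3):
            codon = read[pos:pos + 3]
            if start is None and codon == START:
                start = pos
            elif start is not None and codon in STOPS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((frame, start, pos + 3, False))
                start = None
        # The read ended before a stop codon: keep a partial ORF.
        if start is not None and (end_of_frame - start) // 3 >= min_codons:
            orfs.append((frame, start, end_of_frame, True))
    return orfs


# Toy read: one complete ORF followed by one cut off by the fragment end.
orfs = find_orfs("ATGAAACCCTAAATGGCTGCAGC")
for orf in orfs:
    print(orf)
```

On this toy read, the first ORF closes at a stop codon, while the second is cut off by the end of the fragment and is reported with the truncated flag set.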

After performing pairwise comparisons (of every sequence against every other sequence) of the more than 28 million sequences in the combined dataset, the researchers identified conserved groups of sequences after accounting for redundancy due to identical and near-identical sequences. They then used profile methods to merge and expand these groups of sequences. While pairwise comparisons capture the most closely related sequences (or homologs), profile methods (the researchers used both PSI-BLAST and hidden Markov models) detect more distantly related sequences by combining homologs into multiple sequence alignments to generate “profiles.” (For more on these methods, see Box 2.)
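A minimal sketch of how identical and near-identical sequences might be collapsed before families are built. It assumes a greedy, representative-based scheme and uses Python's `difflib.SequenceMatcher` as a crude stand-in for a real alignment score; the 0.9 threshold, the toy sequences, and the function names are illustrative assumptions, not the method or parameters used in the study.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Crude similarity in [0, 1]; a stand-in for a real alignment score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def collapse_redundant(seqs, threshold=0.9):
    """Greedy clustering of redundant sequences.

    Each sequence joins the first representative it matches at or above
    `threshold`; otherwise it founds a new cluster of its own.
    """
    clusters = []  # list of (representative, members) pairs
    for seq in sorted(seqs, key=len, reverse=True):
        for rep, members in clusters:
            if similarity(rep, seq) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))
    return clusters


toy = [
    "MKTAYIAKQRQISFVKSHFS",  # reference sequence
    "MKTAYIAKQRQISFVKSHFS",  # exact duplicate
    "MKTAYIAKQRQLSFVKSHFS",  # one substitution
    "GSWDPLLEVNNCHRYQATMG",  # unrelated
]
clusters = collapse_redundant(toy)
for rep, members in clusters:
    print(rep, len(members))
```

Comparing each sequence only against cluster representatives keeps the toy simple, but the result then depends on input order; a production tool would use a proper alignment score and a more careful strategy.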

From the clusters obtained by the above procedure, clusters of spurious sequences (that overlap true protein regions on the genome) were identified in addition to clusters of noncoding conserved sequences (based on tests showing no selection on their codons). Sequences in these clusters were removed; those remaining were labeled as predicted proteins. The researchers identified nearly 6 million proteins in the GOS dataset, 1.8 times the number already in public databases. Comparing the predicted protein clusters to known prokaryotic and nonprokaryotic protein databases revealed GOS counterparts in nearly all known prokaryotic protein families; nearly 2,000 clusters appeared unique to the GOS dataset.

Since they couldn’t use sequence similarity to infer function for the unique GOS sequences, the researchers relied on the assumption that proteins with similar roles are more likely to reside in the same genomic neighborhood. This analysis implicated several GOS-only clusters in photosynthesis or electron transport. Such clusters may come from viruses, as many viral parasites of photosynthetic bacteria express the photosynthetic genes of their hosts. Interestingly, though most of the sequences in GOS-only clusters appeared to be bacterial, a higher than expected proportion of them were flagged as viral. If such novel GOS protein families pan out as viral, the researchers argue, “we are far from exploring the molecular diversity of viruses.”

Insights into evolutionary and functional diversity. To compare ocean versus terrestrial life at the biochemical level, Yooseph et al. compared GOS sequences to those of land-dwelling prokaryotes. Nearly 70% of protein domains varied between the two classes of microbes, mostly reflecting the distinct biochemical requirements of the two environments, as well as the different taxonomic groupings in the two datasets. The researchers were surprised to find little evidence of domains specific to gram-positive bacteria (defined by their unique cell wall), even though this bacterial group makes up nearly 12% of the GOS dataset. They also found a relative dearth of components related to flagella (the whip-like tail of microbial motility), possibly reflecting the reduced need for self-propulsion in the ocean.

Box 2. Bioinformatic Methods at a Glance

Bioinformatics relies on statistics and computer power to synthesize and interpret huge datasets. Here’s a brief introduction to some of the environmental genomics methods used in the GOS studies.

Shotgun sequencing decodes genetic material by randomly shredding it into millions of fragments. The DNA sequence of each end of a fragment is determined; the two ends of a given fragment (or insert) can be associated, and constitute a “mate pair.” These random sequencing “reads” are then reassembled with a computer. Based on sequence similarity, overlapping reads are identified and merged into longer sequences called “contigs.” Contigs are organized into larger (but not necessarily continuous) pieces of a genome, called “scaffolds,” based on mate pairs. The resulting assemblies can link genes to their regulatory elements, guide investigations of biological pathways, and connect unknown sequences with taxonomic markers to suggest evolutionary relationships.

Sequence similarity detection allows functional and taxonomic characterization of genomic sequences. Once the shotgunned sequences have been organized into a library of sequence “scaffolds” and translated into hypothetical proteins, the next step uses sequence similarity to figure out what the proteins are and to identify families. Similarity can also associate a new sequence with an approximate location on the tree of life.

Sequence–sequence (pairwise) methods, the first step for identifying closely related sequences, compare all sequences to all other sequences in a pairwise manner. These methods (such as BLAST) allow all collected sequences to be compared with one another (and with all sequences already available in public databases) and reliably clustered into families of related sequences with high sequence similarity, or homology.

Profile methods are used to identify more remote relationships. Profile methods use multiple sequence alignments of previously identified families to compute “position-specific scoring matrices” (PSSMs). Each position in the alignment is associated with a set of scores that reward or penalize the alignment of a given amino acid to the position. Profile methods can be more sensitive than simple sequence similarity methods because they give more weight to signals at sites that are conserved within a protein family and less weight to more variable positions.

Initially, the advantages of profile methods for detecting remote homology were limited to well-characterized families, as construction of a profile required some expertise. However, this changed with the fully automated integration of this step into PSI-BLAST. PSI-BLAST begins with a pairwise (sequence–sequence) similarity search, but then iteratively runs alternating steps of building a profile from the current set of similar sequences and using the profile to re-search the database for additional matching sequences.

Hidden Markov models (HMMs) employ statistical methods to model the likelihood of different amino acids at any given position of the sequence in an underlying alignment. Like some profile methods, HMMs use a probability-based method to determine the score of aligning an observed amino acid to a given position in a protein family, but HMMs improve upon profiles by more sophisticated modeling of variation in protein length, storing the probabilities of insertions or deletions at each position of the model. HMMs have a good track record for identifying more distantly related protein sequences.

Profile–profile methods are the most recent enhancement to sequence homology detection methods. As the name suggests, profile–profile methods compare one profile to another. Because each profile implicitly encodes more information than a single sequence, these methods identify relationships that cannot be detected by comparing individual sequences.
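To make the profile idea from Box 2 concrete, here is a minimal Python sketch of a position-specific scoring matrix built from a tiny, gap-free alignment. It assumes a uniform background frequency for the 20 amino acids and a simple pseudocount, and the toy family and function names are hypothetical; real tools such as PSI-BLAST use sequence weighting, substitution-matrix-based pseudocounts, and gapped alignments.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {amino acid: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, seq):
    """Sum the per-position scores for an ungapped sequence."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


# Hypothetical toy family: positions 1-3 are invariant, position 4 varies.
family = ["MKVLA", "MKVIA", "MKVVA", "MKVLS"]
pssm = build_pssm(family)

print(round(score(pssm, "MKVLA"), 2))  # family-like sequence
print(round(score(pssm, "GSTCW"), 2))  # unrelated sequence
```

Residues at the invariant positions earn higher scores than the most common residue at the variable position, which is the sense in which a profile gives more weight to conserved sites.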

Using a comprehensive protein family database (called Pfam), the researchers compared the kingdom distribution of known protein domains in the GOS data to that of proteins in public databases. In this process, some families that were previously thought to be single-kingdom turned out to have members in multiple kingdoms. For example, indoleamine 2,3-dioxygenase (IDO), an enzyme linked to the immune system in mammals, was considered unique to eukaryotes. But the IDO Pfam search turned up matches to ten GOS sequences identified as bacterial, suggesting that the proteins may have arisen much earlier than previously thought, or perhaps arose through lateral gene transfer (from an unrelated organism).

The sheer size of the GOS dataset, which nearly doubles the number of known proteins, greatly expands the functional diversity of known protein families, providing valuable insights into their evolution. For example, the researchers found a 10-fold increase in the number and type of proteins involved in repairing ultraviolet radiation damage, likely reflecting the hazards of living in surface waters. A similar boost in phosphatases (which function in such fundamental biological processes as cell signaling, development, and cell division) highlighted important differences in the way one phosphatase (protein phosphatase 2C) functions in prokaryotes and eukaryotes.

And the unexpected abundance of a nitrogen metabolism catalyst typically associated with eukaryotes (type II glutamine synthetase) suggested two possible evolutionary mechanisms: either lateral gene transfer from eukaryotes, or gene duplication prior to the divergence of prokaryotes and eukaryotes. (The researchers suspect gene duplication.) The diversity of the GOS sequences also promises to characterize sequences with no similarity to known sequences (known as ORFans): over 6,000 ORFans pair up with GOS sequences representing some 600 organisms, paving the way for further study of their identity and function.

As GOS protein predictions are tested, some of these proteins will expand existing protein families while others will carve out GOS-specific families. Both results will help researchers determine priority targets for structural studies, an essential strategy for dealing with the flood of protein discoveries. And given that the GOS sequences represent mostly microbes from the ocean’s surface (yet point to substantial viral diversity as well), the rate of protein discovery indicates that a comprehensive catalog of proteins in nature is far from complete.

Variations on a Theme: A Single Fold Spawns a Diverse Kinase Superfamily

Cellular life chugs along under the power of enzymes, proteins that catalyze the scores of chemical reactions required for life. One of the largest protein families in eukaryotes, the eukaryotic protein kinases (ePKs), regulates the activity of a large fraction of all proteins and almost all biological pathways by phosphorylating proteins. Phosphorylation activates its target by transferring a phosphate group from adenosine triphosphate (ATP) to a specific amino acid on the protein, releasing energy and inducing structural changes that alter the protein’s activity. (Dephosphorylation removes the phosphate group, restoring the protein to its original conformation and inactive state.) One cell can contain hundreds of different protein kinases, each charged with phosphorylating one or many different proteins.

Bacteria and other prokaryotes, conventional wisdom held, rely mostly on structurally distinct kinases (histidine kinases) to mediate protein phosphorylation and cell signaling. But it now emerges that ePK-like kinases (ELKs), once thought to be minor players, are more prevalent and widespread than the histidine kinases. Although ePKs and ELKs typically exhibit very low sequence similarity, they share similar phosphorylation mechanisms and the same structural fold (the protein kinase–like, or PKL, fold).

Since PKL kinases conserve both fold and mechanism of action, they provide a robust model for determining how sequence variation corresponds to functional diversity. Unfortunately, comprehensive comparisons had been frustrated by a lack of sequence information for the prokaryotic ELK families relative to the well-studied eukaryotic domains. But now, thanks to the Sorcerer II expedition, sequence databases are brimming with microbial sequences, including a 3-fold increase in ELK sequences. Taking advantage of the bounty, Natarajan Kannan, Gerard Manning, and colleagues surveyed the global PKL landscape, and identified over 45,000 PKLs, which they classified into 20 families. Surprisingly, PKLs appear to usurp the histidine kinases as the core regulator of prokaryotic signaling and cell behavior.

Cataloging the number and diversity of PKL families. To detect kinase sequences, Kannan et al. searched over 17 million predicted proteins in the GOS dataset and 5 million-plus predicted and known protein sequences in public databases. Kinase sequences were detected using hidden Markov model (HMM) profiles of known PKLs along with a model that predicts kinases on the basis of a few ultra-conserved motifs. The sensitivity of the HMMs allowed the researchers to discover very remote new members of these families and to classify and organize the tens of thousands of sequences. Both approaches iterate through multiple runs of the clustered results to refine the family alignments and to classify clusters with little similarity to known PKL families as potentially novel. (For more on these methods, see Box 2.)

The public databases, it turned out, harbored nearly 25,000 ePKs and over 5,000 ELKs. Over 16,000 GOS sequences fell into 20 PKL families, doubling the size of most families. Three main superfamily clusters emerged, distinguished by the most abundant members: choline and aminoglycoside kinases (CAKs), a “particularly diverse” family harboring kinases that facilitate colonization by beneficial and pathogenic bacteria; ePKs, almost exclusively eukaryotic except for a similar bacterial kinase (pknB); and a cluster of kinases, including Rio and Bud32, that are conserved between archaea and eukaryotes. Three families bore no sequence similarity to any other families save for a group of key motifs.

Overall, the 20 families exhibit significant functional and sequence diversity. Most of the families have not yet been fully investigated, though they do include some characterized members. Those with known kinase activity target small molecules (such as lipids and amino acids) and seem to play regulatory roles, in contrast to many other structurally unrelated small molecule kinases, which affect metabolism.

Functional diversity springs from a set of core residues. Because sequence similarity ranged from “very low to almost undetectable,” the researchers used sequence profiles (models built from entire families to highlight their core characteristics) to both discover and classify kinase sequences. They found several novel families, and greatly extended the breadth of previously defined families. With these methods to refine the relationships within and between PKL families, the researchers explored the traits that unite or distinguish them.

Ten key amino acid residues of the catalytic domain consistently turned up in each family. This “core pattern of conservation,” the researchers explain, represents an ancient evolutionary innovation, spanning not just the three divisions of life—which diverged 1–2 billion years ago—but also the diverse families. The conservation of these residues across and within the families suggests that they play an essential role. And, indeed, six of those already characterized mediate ATP binding and catalysis.

Yet despite the seemingly universal presence of the ten residues, their occurrence in individual subfamilies showed a surprising pattern: all but one of these “core” residues had either disappeared or changed in individual families (though the proteins retained their fold and function), suggesting an unexpected flexibility for catalytic cores. To test this possibility, the researchers focused on one of the ten residues, the catalytic lysine K72, which repositions ATP’s phosphates. Present in ePKs, K72 is replaced by a different conserved amino acid in three CAK subfamilies. These subfamilies had corresponding substitutions near other key motifs, and structural modeling showed how these coordinated replacements could still result in an active enzyme.
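The distinction between residues conserved across all families and residues conserved only within a family can be sketched in a few lines of Python. Everything here is a hypothetical illustration: the toy alignments, the 0.8 dominance cutoff, and the labels are assumptions, and the sketch presumes the families already share one alignment coordinate system, which in practice is the hard part.

```python
from collections import Counter


def consensus(column, min_fraction=0.8):
    """Return the dominant residue of a column, or None if none dominates."""
    residue, count = Counter(column).most_common(1)[0]
    return residue if count / len(column) >= min_fraction else None


def family_consensus(alignment):
    """Per-column consensus residues for one family's alignment."""
    return [consensus(column) for column in zip(*alignment)]


def classify(per_family):
    """Label each aligned position by comparing family consensus residues."""
    labels = []
    for residues in zip(*per_family.values()):
        if None in residues:
            labels.append("variable")         # not conserved in some family
        elif len(set(residues)) == 1:
            labels.append("core")             # same residue in every family
        else:
            labels.append("family-specific")  # conserved, but differently
    return labels


# Hypothetical toy alignments, four aligned positions each.
families = {
    "famA": ["GKDE", "GKDQ", "GKDE", "GKDE", "GKDE"],
    "famB": ["GKDA", "GKDS", "GKDT", "GKDA", "GKDC"],
    "famC": ["GRDL", "GRDL", "GRDM", "GRDL", "GRDL"],
}
per_family = {name: family_consensus(aln) for name, aln in families.items()}
labels = classify(per_family)
print(labels)
```

In the toy data, the second position is invariant within each family but carries a different residue in one of them, loosely mirroring a core residue that has been replaced by a different conserved amino acid in particular subfamilies.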

A number of features (including amino acid motifs and secondary structure) emerged as family-specific, being highly conserved within but not between families. And as was seen in the CAK analysis, many family-specific residues occur near one of the ten key residues, suggesting that they may help direct substrates to the catalytic core or influence the nature of the reaction.

Evolutionary insights and beyond. Altogether these results reveal the vast functional and phylogenetic diversity that can occur in even just a subset of proteins, even though they retain a common catalytic fold and function. The massive sequence comparisons in this study not only identified the core of the PKL kinase, but also revealed the specific motifs underlying each family, including the ePKs. And the flexibility of several key regions within ePKs may underlie the huge expansion of these enzymes in eukaryotes. This structural flexibility may give kinases the ability to integrate multiple regulatory signals, and account for their almost universal involvement in the regulation of eukaryotic pathways.

These results set the stage for more in-depth structural and biochemical studies to elucidate the diverse functions carried out by these critical regulators of cell behavior. This study also demonstrates how metagenomic datasets, by covering an unbiased diversity of life, can refine our understanding of well-studied protein families, such as the ePKs, and shed light on their evolution. Kannan et al. hope that others take advantage of the environmental metagenomic largesse to pursue “similar insights into virtually every gene family with prokaryotic relatives.”

Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Northwest Atlantic through eastern tropical Pacific. doi:10.1371/journal.pbio.0050077

Yooseph S, Sutton G, Rusch DB, Halpern AL, Williamson SJ, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Expanding the universe of protein families. doi:10.1371/journal.pbio.0050016

Kannan N, Taylor SS, Zhai Y, Venter JC, Manning G (2006) Structural and functional diversity of the microbial kinome. doi:10.1371/journal.pbio.0050017

This article is part of the Oceanic Metagenomics collection in PLoS Biology. The full collection is available online at http://collections.plos.org/plosbiology/gos-2007.php.


Feature

Special Section from March 2007 | Volume 5 | Issue 3 | e74

Craig Venter is not short of ambition. With the human genome fresh off the sequencing machines, he set his sights on a project of even grander scale: to describe the immense wealth of genetic information living in the world’s oceans. This voyage into biologically uncharted waters was, according to the Web site of the expedition vessel Sorcerer II, inspired in part by the voyage of H. M. S. Beagle [1]. Venter, it seems, would like to be remembered as the Charles Darwin of the 21st century (Figure 1).

This is the largest effort to describe the genetic diversity in the world’s oceans. The voyage around national and international waters, collecting from around 150 sites and interrogating samples at the level of the gene rather than at the level of the organism, has already turned up between 5 and 6 million genes. Most of these genes have never been seen before, says Venter. Analysing this immense collection of data, the researchers discovered that many of the genes encode proteins that fall outside standard classification schemes. Proteins grouped within their own unique kingdoms are turning up in other kingdoms as well, forcing the team to reconsider the evolutionary relationships of established kingdoms. “This project is revealing some of the biggest discoveries about the environment,” says Venter. (For more on these discoveries see the synopsis of the research articles [2].)

Untapped Diversity

The Sorcerer II probably captured only a tiny fraction of the genetic diversity out there, says Mitchell Sogin, Director of the Josephine Bay Paul Center in Comparative Molecular Biology and Evolution at the Marine Biological Laboratory in Woods Hole, Massachusetts. In August 2006, Sogin and his colleagues published a detailed analysis of variable stretches of ribosomal RNA collected from the marine microbial world (Figure 2) [3]. “We estimate there are at least 25,000 different kinds of microbes per litre of seawater,” says Sogin. “But I wouldn’t be surprised if it turns out there are 100,000 or more.” A few of these microbes are common, and Venter will probably use them to recover complete gene sequences, he says. “The vast majority of low-abundance organisms are going undetected.”

Venter is more than aware that there’s a lot more to be discovered, but for the moment the goal is to sequence as many genes, in their entirety, as possible from these ecologically rich environments. These data raise a host of intriguing questions: in particular, what is the structure and function of the novel proteins these genes encode, and what role do they play in the metabolism of these undescribed microbes? Just as Darwin’s work drove a change in the way we see the world, so Venter is hoping these marine data will do the same in years to come.

Legal Framework

But times have changed. In the 21st century, there are plenty of hurdles to clear before the collecting and describing of biodiversity—even microscopic biodiversity—can go ahead.

Sorcerer II: The Search for Microbial Diversity Roils the Waters

Henry Nicholls

Citation: Nicholls H (2007) Sorcerer II: The search for microbial diversity roils the waters. PLoS Biol 5(3): e74. doi:10.1371/journal.pbio.0050074

Copyright: © 2007 Henry Nicholls. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abbreviations: UNCLOS, United Nations Convention on the Law of the Sea

Henry Nicholls is a freelance science journalist based in London, United Kingdom. His book Lonesome George was nominated for the 2006 Guardian First Book Award. E-mail: [email protected]

This article is part of the Oceanic Metagenomics collection in PLoS Biology. The full collection is available online at http://collections.plos.org/plosbiology/gos-2007.php.

doi:10.1371/journal.pbio.0050074.g001

Figure 1. Two of a Kind?

The young Charles Darwin (left) and Craig Venter (right). (Photo: J. Craig Venter Institute)


The 1982 United Nations Convention on the Law of the Sea (UNCLOS) endowed coastal nations with the sovereign right to explore and exploit all resources within their “exclusive economic zone”—usually a body of water stretching 200 nautical miles out to sea [4]. Most coastal states exercise this right, granting permits to outsiders wanting to conduct research in their waters.

The 1992 Convention on Biological Diversity went on to set out some basic principles that might encourage sharing of benefits arising from genetic resources [5]. Where parties to the convention have got round to incorporating these principles into their own legislation, the result has been that anyone wishing to conduct research on these resources must agree to terms set by the host government.

Beyond national waters (with a few exceptions) are the “high seas”. Here, there is little regulation. According to UNCLOS, mineral resources on the deep seabed are considered the “common heritage of mankind”; this means that any benefits deriving from them should be shared with the international community. But when it comes to biological resources, just about anything goes.

The Rise of Bioprospecting

In areas beyond national jurisdiction, there has been an increase in so-called bioprospecting, the search for and exploitation of commercially valuable compounds from genetic resources. In 2005, researchers at the United Nations University scoured patent office databases for inventions based on the genomic features of deep seabed organisms [6]. They found that private companies such as Roche, Diversa, and New England Biolabs are after patents on DNA polymerases developed from deep-sea thermophilic bacteria that promise to enhance the molecular biologist’s expanding toolbox. Others like Sederma (based in France) and California Tan (based in the US) have used enzymes from similar microorganisms to develop skin products boasting UV- and heat-resistant properties.

There are plenty of not-for-profit organisations interested in the applications of discoveries from the deep. For example, Harbor Branch Oceanographic Institution, an oceanographic research and education institution based in Florida, is after compounds from marine organisms that might have biomedical potential. The institution has patents on, among others, potential anti-cancer agents derived from the marine sponges Discodermia dissoluta and Forcepia triabilis (Figure 3).

Deep-sea exploration, and the lengthy research and development that follows, is an expensive business. This means it’s a realistic option for only the world’s wealthiest nations. At least that’s the concern being expressed by some developing countries that would like to see a piece of this action, says David Leary of the Centre for Environmental Law at Macquarie University in Sydney, Australia.

These countries are seeking a change to UNCLOS that requires biological resources to be treated in the same way as mineral resources and any benefits deriving from them to be shared with the wider community. But others fear tighter regulation of such activities will only stifle pure marine scientific research. The Philippines was one of the first countries to regulate access to its genetic resources, says Sam Johnston, an expert on international environmental law based in Melbourne, Australia, and a senior research fellow at the United Nations University Institute of Advanced Studies in Yokohama, Japan. “It basically closed down all research,” he says. “A lot of researchers around the world have found the red tape prohibitive.”

Finding a balance between the unregulated status quo and cumbersome controls over research on marine biodiversity is now the concern of a United Nations working group [7]. “Some countries see this as the early stage of negotiating a new UNCLOS,” says Leary. But, he warns, “this could take 10 or 15 years before we see a result.”

One compromise might be for coastal states to allow all research on their genetic resources with the proviso that exploitation of any commercial application is subject to further negotiation. Another possibility is for the patent system to take responsibility for seeing that benefits are shared fairly, only granting patents based on biological resources if a royalty is paid into a global commons trust fund.

Ecological Impact

Whilst the UN goes in search of this kind of middle ground, both pure and applied research in the high seas continues apace, and this is cause for another concern. “There’s a number of sites that are so popular that there’s concern about the intensity of research,” says Leary. Repeated visits to the same deep-sea spot could not only result in unsustainable collection of some species and influence local hydrological and environmental conditions, but increase the likelihood that one person’s experiment will influence that of another. So far, little thought has been devoted to this consequence of unregulated access, says Leary. “I haven’t yet seen any clear scientific data on the extent of the environmental impact of bioprospecting or marine scientific research,” he says.

doi:10.1371/journal.pbio.0050074.g002

Figure 2. A Remotely Operated Platform Samples Vent Fluids from the Northeast Pacific Ocean

(Photo: NOAA, http://oceanexplorer.noaa.gov)

Clearly, the environmental impact of carrying off 150-odd barrels of seawater for analysis isn’t something that Venter and his colleagues had to worry about. But navigating the complex legal territory was. “If Darwin were alive today trying to do his experiments, he would not have been allowed to,” says Venter.

At least, that is, without help from a lawyer. Sorcerer II collected samples in the waters of 17 coastal states and obtained all necessary permits, says Bob Friedman, Vice President for Environmental and Energy Policy at the J. Craig Venter Institute. Some countries required detailed agreements thrashing out how benefits deriving from these data would be shared. All of these are posted on the Sorcerer II Web site, says Friedman [8]. Most countries, however, have not decided how they might regulate access to their genetic resources, he says.

In addition to getting the paperwork in order, Venter encouraged collaboration with local scientists. What’s more, the entire metagenomic database will be put in the public domain. The gene sequences should be of tremendous value to each of the countries involved, says Venter. In particular, it will help them monitor and manage the health of their marine ecosystems more effectively, he predicts. To ensure that this vast dataset will be available to all, the Gordon and Betty Moore Foundation has stumped up $24.5 million for a seven-year project to design a new database to host it and new tools to interrogate it (Box 1).

Yet, it seems, all these undertakings and assurances have not been enough to steer this expedition clear of controversy. In 2004, when the Sorcerer II dropped anchor just off Hiva Oa, an island in the Marquesas archipelago in the Pacific Ocean, tensions escalated. Although the plan to sample seawater around the islands had the backing of local French Polynesian authorities and scientists, the French government in Paris had other ideas, says Venter. “We were placed under house arrest.” Eventually, after a further round of intense negotiations, the Sorcerer II was allowed out of the harbour to collect its seawater samples and continue on its way.

Last year, a Canadian-based non-governmental organisation, the Action Group on Erosion, Technology and Concentration, which is dedicated to “the advancement of cultural and ecological diversity and human rights”, labelled Venter a “biopirate”, accusing him of “flagrant disregard for national sovereignty over biodiversity” [9]. In several countries, there’s real concern about how he managed his collecting, claims Pat Mooney, Executive Director of the group. Although the data are going into the public domain, it is laboratories like Venter’s that are best placed to exploit them, he argues. “There’s a handful of folk around the planet that can understand such stuff,” says Mooney.

Box 1. Zooming in on CAMERA

CAMERA is the convenient acronym for the cumbersomely named Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis. “This resource will focus on providing easy-to-use tools for uploading, downloading, searching, and analysis of genomic datasets,” says Paul Gilna, CAMERA’s executive director, based at the California Institute for Telecom and Information Technology in La Jolla, California.

Researchers will also be able to clothe the bare genetic sequences in a wealth of other data, such as GPS coordinates and depth of collection, the water temperature, its oxygen content, salinity and pH. The site could well draw upon other resources that enrich these metadata, says Gilna. For example, satellite imagery associated with the sampling sites, and other data types, such as microscopy stills and high-definition video, could become important metadata that help researchers characterise the environments from which samples were taken.

Crucially, CAMERA will allow researchers to record the source of each genetic sequence. Many coastal countries now want a share of commercial applications that derive from their marine resources. Countries may be happy to see genetic sequences placed in CAMERA provided they are acknowledged and commercial exploitation of their sequence is not permitted without their consent.

But handling such immense datasets poses considerable technological challenges. The GOS database alone contains around 6 billion bases, the equivalent of two entire human genomes. And the number and size of this kind of database will only mushroom in coming years, making it necessary to develop high-speed optical networks, grid-based computing, and new visualisation technologies. “We are quickly approaching a ‘tipping point’,” says Gilna. “These datasets will start to follow exponential, rather than linear trends, much as was the case for DNA sequencing.”

Finally, there’s the tricky task of satisfying all researchers who could benefit from this resource. “The scientific communities, from studies on biodiversity and biogeochemistry to evolution and genomes, have different interests, different data expectations, different vocabularies, and different levels of experience with using computational tools and databases,” says John Wooley, a pharmacologist at the University of California, San Diego, who is working on CAMERA. “Before metagenomics, no effort ever attempted to incorporate data from such vastly divergent sources to meet the needs of such a wide range of scientific interests.” For more on CAMERA, see the Community Page article by Seshadri et al. [13].

doi:10.1371/journal.pbio.0050074.g003

Figure 3. Marine Sponges That Have Generated Products with Anti-Cancer Promise

(A) Discodermia dissoluta. (Photo: NOAA) (B) Forcepia triabilis. (Photo: T. Piper, NOAA)

Venter is adamant that this whole project is just pure, clean marine scientific research. Indeed, the Sorcerer II Web site explicitly states that “no intellectual property rights will be sought by the Venter Institute on these genomic sequence data” [10]. Venter sums up the goal of the project: “We were just trying to answer some basic questions about the diversity of microbes on the planet,” he says.

But, says environmental lawyer Johnston, the distinction between pure and applied research is becoming increasingly blurred. To illustrate this, he cites a strain of thermophilic Bacillus collected from Antarctica in the early 1980s as part of a study into the worldwide distribution and characteristics of such extremophiles. Years later, the same sample, taken out of storage and subjected to further study, turned out to contain a talented enzyme that has the promise to revolutionise DNA extraction for forensic analysis [11]. “The collector undertook the act in the purest form but ultimately the use of it has changed in the course of two decades,” says Johnston. “So much depends on the perspective at which you look at the issue.”

This means that there are likely to be several different takes on the same research. What for one person is pure marine scientific research can be another person’s bioprospecting and yet another’s biopiracy. There are very few cases where everyone agrees there has been outright theft of a biological resource and very few cases where everyone is happy there’s been proper benefit sharing, says Johnston. “Even the best-designed programmes where there’s enormous consultation with the local people have found it’s difficult to get the right kind of consensus and buy-in,” he says [12].

So, keen as Venter might be to put the controversy of his human-genome-sequencing days behind him, this kind of research strays into unknown biological, legal, and ethical territory. And in this environment, allegations of biopiracy are almost inevitable. This, however, is unlikely to deter a man like Venter. “If it’s in the Darwin school of biopiracy, then fine,” he says.

References

1. Sorcerer II Expedition (2005) Expedition info: Environmental genomics. Available: http://www.sorcerer2expedition.org. Accessed 19 January 2007.

2. Gross L (2007) Untapped bounty: Sampling the seas to survey microbial biodiversity. PLoS Biol 5: e85. doi:10.1371/journal.pbio.0050085

3. Sogin ML, Morrison HG, Huber JA, Welch DM, Huse SM, et al. (2006) Microbial diversity in the deep sea and the underexplored “rare biosphere”. Proc Natl Acad Sci U S A 103: 12115–12120.

4. United Nations (1982) United Nations convention on the law of the sea of 10 December 1982. New York: United Nations. Available: http://www.un.org/Depts/los/convention_agreements/convention_overview_convention.htm. Accessed 18 January 2007.

5. Secretariat of the Convention on Biological Diversity (2002) Bonn guidelines on access to genetic resources and fair and equitable sharing of the benefits arising out of their utilization. Montreal: Secretariat of the Convention on Biological Diversity. Available: https://www.biodiv.org/doc/publications/cbd-bonn-gdls-en.pdf. Accessed 16 January 2007.

6. Arico S, Salpin C (2005) UNU-IAS report: Bioprospecting of genetic resources in the deep seabed: Scientific, legal and policy aspects. Yokohama (Japan): United Nations University Institute of Advanced Studies. Available: http://www.ias.unu.edu/binaries2/DeepSeabed.pdf. Accessed 16 January 2007.

7. International Institute for Sustainable Development (2006) Ad Hoc Open-ended Informal Working Group to study issues relating to the conservation and sustainable use of marine biological diversity beyond areas of national jurisdiction. Winnipeg (Canada): International Institute for Sustainable Development. Available: http://www.iisd.ca/oceans/marinebiodiv. Accessed 16 January 2007.

8. Sorcerer II Expedition (2005) Collaborative agreements. Available: http://www.sorcerer2expedition.org/permits. Accessed 22 January 2007.

9. Coalition Against Biopiracy (2006) Captain Hook awards for biopiracy 2006. Available: http://www.captainhookawards.org/winners/2006_pirates. Accessed 18 January 2007.

10. Sorcerer II Expedition (2005) Agreements. Available: http://www.sorcerer2expedition.org. Accessed 19 January 2007.

11. Moss D, Harbison AS, Saul DJ (2003) An easily automated, closed-tube forensic DNA extraction procedure using a thermostable proteinase. Int J Legal Med 117: 340–349.

12. Laird SA, Wynberg R, Johnston S (2006) Recent trends in the biological prospecting. 29th Antarctic Treaty Consultative Meeting. Available: http://www.ias.unu.edu/binaries2/ATCM29_May2006.doc. Accessed 16 January 2007.

13. Seshadri R, Kravitz SA, Smarr L, Gilna P, Frazier M (2007) CAMERA: A community resource for metagenomics. PLoS Biol 5: e75. doi:10.1371/journal.pbio.0050075


Since their discovery in the 1670s by Anton van Leeuwenhoek, an incredible amount has been learned about microorganisms and their importance to human health, agriculture, industry, ecosystem functioning, global biogeochemical cycles, and the origin and evolution of life. Nevertheless, it is what is not known that is most astonishing. For example, though there are certainly at least 10 million species of bacteria, only a few thousand have been formally described [1]. This contrasts with the more than 350,000 described species of beetles [2]. This is one of many examples indicative of the general difficulties encountered in studying organisms that we cannot readily see or collect in large samples for future analyses. It is thus not surprising that most major advances in microbiology can be traced to methodological advances rather than scientific discoveries per se.

Examples of these key revolutionary methods (Table 1) include the use of microscopes to view microbial cells, the growth of single types of organisms in the lab in isolation from other types (culturing), the comparison of ribosomal RNA (rRNA) genes to construct the first tree of life that included microbes [3], the use of the polymerase chain reaction (PCR) [4] to clone rRNA genes from organisms without culturing them [5–7], and the use of high-throughput “shotgun” methods to sequence the genomes of cultured species [8]. We are now in the midst of another such revolution, this one driven by the use of genome sequencing methods to study microbes directly in their natural habitats, an approach known as metagenomics, environmental genomics, or community genomics [9].

In this essay I focus on one particularly promising area of metagenomics: the use of shotgun genome methods to sequence random fragments of DNA from microbes in an environmental sample. The randomness and breadth of this environmental shotgun sequencing (ESS), first used only a few years ago [10,11] and now being used to assay every microbial system imaginable from the human gut [12] to waste water sludge [13], has the potential to reveal novel and fundamental insights into the hidden world of microbes and their impact on our world. However, the complexity of analysis required to realize this potential poses unique interdisciplinary challenges, challenges that make the approach both fascinating and frustrating in equal measure.

Who Is Out There? Typing and Counting Microbes in the Environment

One of the most important and conceptually straightforward steps in studying any ecosystem involves cataloging the types of organisms and the numbers of each type. For a long time, such typing and counting was an almost insurmountable problem in microbiology. This is largely because physical appearance does not provide a valid taxonomic picture in microbes. Appearance evolves so rapidly that two closely related taxa may look wildly different and two distantly related taxa may look the same. This vexing problem was partially overcome in the 1980s through the use of rRNA-PCR (Table 1). This method allows microorganisms in a sample to be phylogenetically typed and counted based on the sequence of their rRNA genes, genes that are present in all cell-based organisms. In essence, a database of rRNA sequences [14,15] from known organisms functions like a bird field guide, and finding an rRNA-PCR product is akin to seeing a bird through binoculars. Rather than counting species, this approach focuses on “phylotypes,” which are defined as organisms whose rRNA sequences are very similar to each other (a cutoff of >97% or >99% identical is frequently used). The ability to use phylotyping to determine who was out there in any microbial sample has revolutionized environmental microbiology [16], led to many discoveries [e.g., 17], and convinced many people (myself included) to become microbiologists.
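The phylotype definition above can be made concrete with a small sketch. It assumes the rRNA sequences have already been aligned to equal length, and it uses made-up toy sequences rather than real rRNA genes. The function names and the greedy grouping strategy are illustrative assumptions, not the procedure used by any particular rRNA database.

```python
def percent_identity(a, b):
    """Fraction of aligned columns where a and b carry the same character.

    Columns that are gaps ('-') in both sequences are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def cluster_phylotypes(seqs, cutoff=0.97):
    """Greedily group sequences whose identity to a group's first member exceeds cutoff."""
    clusters = []  # list of (representative sequence, [member names])
    for name, seq in seqs.items():
        for representative, members in clusters:
            if percent_identity(seq, representative) > cutoff:
                members.append(name)
                break
        else:
            clusters.append((seq, [name]))
    return [members for _, members in clusters]


def mutate(seq, positions):
    """Toy helper: swap the base at each position (A<->C, G<->T)."""
    swap = {"A": "C", "C": "A", "G": "T", "T": "G"}
    chars = list(seq)
    for i in positions:
        chars[i] = swap[chars[i]]
    return "".join(chars)


# Hypothetical 100-base "rRNA" sequences, invented for illustration only.
reference = "ACGT" * 25
toy = {
    "ref": reference,
    "close": mutate(reference, [10, 50]),            # 98% identical to ref
    "distant": mutate(reference, range(0, 100, 10)),  # 90% identical to ref
}

print(cluster_phylotypes(toy, cutoff=0.97))
print(cluster_phylotypes(toy, cutoff=0.99))
```

With the 97% cutoff the two near-identical sequences fall into one phylotype, while the stricter 99% cutoff splits them, which is why the choice of cutoff changes the counts one reports.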

Essay

Special Section from March 2007 | Volume 5 | Issue 3 | e82

Environmental Shotgun Sequencing: Its Potential and Challenges for Studying the Hidden World of Microbes

Jonathan A. Eisen

Citation: Eisen JA (2007) Environmental shotgun sequencing: Its potential and challenges for studying the hidden world of microbes. PLoS Biol 5(3): e82. doi:10.1371/journal.pbio.0050082

Series Editor: Simon Levin, Princeton University, United States of America

Copyright: © 2007 Jonathan A. Eisen. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abbreviations: ESS, environmental shotgun sequencing; PCR, polymerase chain reaction; rRNA, ribosomal RNA

Jonathan A. Eisen is at the University of California Davis Genome Center, with joint appointments in the Section of Evolution and Ecology and the Department of Medical Microbiology and Immunology, Davis, California, United States of America. Web site: http://phylogenomics.blogspot.com.

This article is part of the Oceanic Metagenomics collection in PLoS Biology. The full collection is available online at http://collections.plos.org/plosbiology/gos-2007.php.

Essays articulate a specific perspective on a topic of broad interest to scientists.


The selective targeting of a single gene makes rRNA-PCR an efficient method for deep community sampling [18]. However, this efficiency comes with limitations, most of which are complemented or circumvented by the randomness and breadth of ESS. For example, examination of the random samples of rRNA sequences obtained through ESS has already led to the discovery of new taxa that were completely missed by PCR because of its inability to sample all taxa equally well (e.g., [19]). In addition, ESS provides the first robust sampling of genes other than rRNA, and many of these genes can be more useful for some aspects of typing and counting. Some universal protein coding genes are better than rRNA both for distinguishing closely related strains (because of third position variation in codons) and for estimating numbers of individuals (because they vary less in copy number between species than do rRNA genes) [10]. Perhaps most significantly, ESS is providing groundbreaking insights into the diversity of viruses [20,21], which lack rRNA genes and thus were left out of the previous revolution.
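To see why copy number matters when counting individuals, consider a minimal sketch in which observed gene counts are divided by the number of copies of that gene each taxon carries per genome. The taxon names, counts, and copy numbers below are hypothetical values chosen for illustration, not measurements from any study.

```python
def estimate_relative_abundance(gene_counts, copies_per_genome):
    """Convert gene counts to relative abundances of genomes.

    Each count is divided by the gene's copy number in that taxon,
    giving "genome equivalents", which are then normalized to sum to 1.
    """
    genome_equivalents = {
        taxon: gene_counts[taxon] / copies_per_genome[taxon]
        for taxon in gene_counts
    }
    total = sum(genome_equivalents.values())
    return {taxon: g / total for taxon, g in genome_equivalents.items()}


# Hypothetical sample: taxon_A carries 7 rRNA gene copies per genome, taxon_B only 1.
rrna_counts = {"taxon_A": 700, "taxon_B": 100}
rrna_copies = {"taxon_A": 7, "taxon_B": 1}

# Naive estimate: treat every observed gene as one individual.
total_genes = sum(rrna_counts.values())
naive = {taxon: count / total_genes for taxon, count in rrna_counts.items()}

# Corrected estimate: account for copy number.
corrected = estimate_relative_abundance(rrna_counts, rrna_copies)

print(naive)
print(corrected)
```

In this toy case the naive estimate suggests taxon_A dominates the sample, whereas the copy-number-corrected estimate shows the two taxa are equally abundant. A marker gene that is present in a single copy in every taxon would not need this correction.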

Certainly, many challenges remain before we can fully realize the potential of ESS for the typing and counting of species, including making automated yet accurate phylogenetic trees of every gene, determining which genes are most useful for which taxa, combining data from different genes even when we do not know if they come from the same organisms, building up databases of genes other than rRNA, and making up for the lack of depth of sampling. If these challenges are met, ESS has the potential to rewrite much of what we thought we knew about the phylogenetic diversity of microbial life.

What Are They Doing? Top Down and Bottom Up Approaches to Understanding Functions in Communities

A community is, of course, more than a list of types of organisms. One approach to understanding the properties and functioning of a microbial community is to start with studies of the different types of organisms and build up from these individuals to the community. Ideally, to do this one would culture each of the phylotypes and study its properties in the lab. Unfortunately, many, if not most, key microbes have not yet been cultured [22]. Thus, for many years, the only alternative was to make predictions about the biology of particular phylotypes based on what was known about related organisms. Unfortunately, this too does not work well for microbes since very closely related organisms frequently have major biological differences. For example, Escherichia coli K12 and E. coli O157:H7 are strains of the same species (and considered to be the same phylotype), with genomes containing only about 4,000 genes, yet each possesses hundreds of functionally important genes not seen in the other strain [23]. Such differences are routine in microbes, and thus one cannot make any useful inferences about what particular phylotypes are doing (e.g., type of metabolism, growth properties, role in nutrient cycling, or pathogenicity) based on the activities of their relatives.

These difficulties (the inability to culture most microbes and the functional disparities between close relatives) led to one of the first kinds of metagenomic analyses, wherein predictions of function were made from analysis of the sequence of large DNA fragments from representatives of known phylotypes. This approach has provided some stunning insights, such as the discovery of a novel form of phototrophy in the oceans [24]. However, this large insert approach has the same limitation as predicting properties from characterized relatives: a single cell cannot possibly represent the biological functions of all members of a phylotype.

Table 1. Some Major Methods for Studying Individual Microbes Found in the Environment

Method: Microscopy
Summary: Microbial phenotypes can be studied by making them more visible. In conjunction with other methods, such as staining, microscopy can also be used to count taxa and make inferences about biological processes.
Comments: The appearance of microbes is not a reliable indicator of what type of microbe one is looking at.

Method: Culturing
Summary: Single cells of a particular microbial type are grown in isolation from other organisms. This can be done in liquid or solid growth media.
Comments: This is the best way to learn about the biology of a particular organism. However, many microbes are uncultured (i.e., have never been grown in the lab in isolation from other organisms) and may be unculturable (i.e., may not be able to grow without other organisms).

Method: rRNA-PCR
Summary: The key aspects of this method are the following: (a) all cell-based organisms possess the same rRNA genes (albeit with different underlying sequences); (b) PCR is used to make billions of copies of basically each and every rRNA gene present in a sample; this amplifies the rRNA signal relative to the noise of thousands of other genes present in each organism’s DNA; (c) sequencing and phylogenetic analysis places rRNA genes on the rRNA tree of life; the position on the tree is used to infer what type of organism (a.k.a. phylotype) the gene came from; and (d) the numbers of each microbe type are estimated from the number of times the same rRNA gene is seen.
Comments: This method revolutionized microbiology in the 1980s by allowing the types and numbers of microbes present in a sample to be rapidly characterized. However, there are some biases in the process that make it not perfect for all aspects of typing and counting.

Method: Shotgun genome sequencing of cultured species
Summary: The DNA from an organism is isolated and broken into small fragments, and then portions of these fragments are sequenced, usually with the aid of sequencing machines. The fragments are then assembled into larger pieces by looking for overlaps in the sequence each possesses. The complete genome can be determined by filling in gaps between the larger pieces.
Comments: This has now been applied to over 1,000 microbes, as well as some multicellular species, and has provided a much deeper understanding of the biology and evolution of life. One limitation is that each genome sequence is usually a snapshot of one or a few individuals.

Method: Metagenomics
Summary: DNA is directly isolated from an environmental sample and then sequenced. One approach to doing this is to select particular pieces of interest (e.g., those containing interesting rRNA genes) and sequence them. An alternative is ESS, which is shotgun genome sequencing as described above, but applied to an environmental sample with multiple organisms, rather than to a single cultured organism.
Comments: This method allows one to sample the genomes of microbes without culturing them. It can be used both for typing and counting taxa and for making predictions of their biological functions.

doi:10.1371/journal.pbio.0050082.t001

ESS provides an alternative, more global way of assessing biological functions in microbial communities. As when using the large insert approach, functions can be predicted from sequences. However, in this case the predicted functions represent a random sampling of those encoded in the genomes of all the organisms present. This approach has unquestionably been wildly successful in terms of gene discovery. For example, analysis of ESS data has revealed novel forms of every type of gene family examined, as well as a great number of completely novel families (e.g., [25]). However, there is a major caveat when using ESS data to make community-level inferences. Ecosystems are more than just a bag of genes; they are made up of compartments (e.g., cells, chromosomes, and species), and these compartments matter. The key challenge in analyzing ESS data is to sort the DNA fragments (which are usually less than 1,000 base pairs long relative to genome sizes of millions or billions of bases) into bins that correspond to compartments in the system being studied.

A recent study by myself and colleagues illustrates the importance of compartments when interpreting ESS data. When we analyzed ESS data from symbionts living inside the gut of the glassy-winged sharpshooter (an insect that has a nutrient-limited diet), we were able to bin the data to two distinct symbionts [26]. We then could infer from those data that one of the symbionts synthesizes amino acids for the host while the other synthesizes the needed vitamins and cofactors. Modeling and understanding of this ecosystem are greatly enhanced by the demonstration of this complementary division of labor, in comparison to simply knowing that amino acids, vitamins, and cofactors are made by “symbionts.”

How does one go about binning ESS data? A variety of approaches have been developed, some of which are described in Table 2. In considering the different binning methods and their limitations, the first question one needs to ask is, what are we trying to bin? Is it fragments from the same chromosome from a single cell, which would be useful for studying chromosome structure? If so, then perhaps genome assembly methods are the best. What if instead, as in the sharpshooter example, we are trying to have each bin include every fragment that came from a particular species, knowledge which may be useful for predicting community metabolic potential? If the level of genetic polymorphism among individual cells from the same species is high, then genome assembly methods may not work well (the polymorphisms will break up assemblies). A better approach might be to look for species-specific “word” frequencies in the DNA, such as ones created by patterns in codon usage. The challenge is, how do we tune the methods to find the right target level of resolution? If we are too stringent, most bins will include only a few fragments. But if we are too relaxed, we will create artificial constructs that may prove biologically misleading, such as grouping together sequences from different species. To make matters more complex, most likely the stringency needed will vary for different taxa present in the sample.
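The stringency trade-off described above can be illustrated with a minimal sketch of word-frequency binning. It assumes two-letter words, Euclidean distance between frequency profiles, and a simple greedy grouping rule, which stands in here for the clustering algorithms or principal component analysis used in practice. The fragments are short, artificial sequences invented for illustration, and all function and parameter names are hypothetical.

```python
from itertools import product
from math import sqrt


def word_frequencies(fragment, k=2):
    """Return the relative frequency of every DNA word of length k in a fragment."""
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(words, 0)
    total = 0
    for i in range(len(fragment) - k + 1):
        word = fragment[i:i + k]
        if word in counts:  # skip words containing ambiguous bases
            counts[word] += 1
            total += 1
    return [counts[w] / total if total else 0.0 for w in words]


def distance(u, v):
    """Euclidean distance between two frequency profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def bin_fragments(fragments, max_distance, k=2):
    """Greedy binning: join the first bin whose seed profile is within max_distance."""
    bins = []  # list of (seed profile, [fragment names])
    for name, seq in fragments.items():
        profile = word_frequencies(seq, k)
        for seed, members in bins:
            if distance(profile, seed) <= max_distance:
                members.append(name)
                break
        else:
            bins.append((profile, [name]))
    return [members for _, members in bins]


# Artificial fragments from two imaginary "species" with different word usage.
toy_fragments = {
    "at_rich_1": "ATATATATATAT",
    "at_rich_2": "TATATATATATA",
    "gc_rich_1": "GCGCGCGCGCGC",
    "gc_rich_2": "CGCGCGCGCGCG",
}

print(bin_fragments(toy_fragments, max_distance=0.05))  # too stringent
print(bin_fragments(toy_fragments, max_distance=0.5))   # intermediate
print(bin_fragments(toy_fragments, max_distance=1.5))   # too relaxed
```

With the smallest threshold every fragment ends up in its own bin, with the largest threshold both "species" are merged into one bin, and only the intermediate setting recovers the two groups. Real fragments are far noisier than these toy strings, so the workable range of thresholds is much narrower and may differ between taxa.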

Another critical issue is the diversity of the system under study. Generally, binning works better when there are few different phylotypes present, all of which are distantly related and form discrete populations. This is why binning works well for the sharpshooter system and other relatively isolated, low diversity environments. Binning increases in difficulty exponentially as the number of species increases: the populations and species start to merge together, and the populations get more and more polymorphic and variable in relative abundance (such as in the paper about the Global Ocean Sampling expedition in this issue [27]). Further complicating binning is the phenomenon of lateral gene transfer, where genes are exchanged between distantly related lineages at rates that are high enough that random sampling of a genome will frequently include genes with multiple histories.

Table 2. Methods of Binning

Method: Genome assembly
Description: Identify regions of overlap between different fragments from the same organism to build larger contiguous pieces (contigs).
Comments: Getting deep enough sampling for this to work is very expensive except for low diversity systems or for very abundant taxa.

Method: Reference genome alignment
Description: Identify ESS fragments or contigs that are very similar to already assembled sections of the genome of single microbial types.
Comments: (a) One of the most effective ways to sort through ESS data, if the reference genome is very closely related to an organism in the sample; (b) the reason why more reference genomes are needed; (c) does not handle regions present in uncultured organisms but not in the reference.

Method: Phylogenetic analysis
Description: Build evolutionary trees of genes encoded by ESS fragments or contigs. Assign fragments or contigs to taxonomic groups based on nearest neighbor(s) in trees.
Comments: (a) Very powerful, but level of resolution depends on whether fragments encode useful phylogenetic markers and on how well sampled the database is for the neighbor analysis; (b) would work much better if more genomes were available from across the tree of life.

Method: Word frequency and nucleotide composition analysis
Description: Measure word frequency and composition of each fragment. Group by clustering algorithms or principal component analysis.
Comments: (a) Has the potential to work because organisms sometimes have “signatures” of word frequencies that are found throughout the genome and are different between species; (b) very challenging for small fragments.

Method: Population genetics
Description: Build alignments of fragments or contigs with similarity to each other (but not as much as needed for assembly). Examine haplotype structure, predicted effective population size, and synonymous and non synonymous substitution patterns.
Comments: May be most useful as a way of subdividing bins created by other methods.

Note that some methods can be applied to ESS fragments or to bins identified by other methods.

doi:10.1371/journal.pbio.0050082.t002

Despite these challenges, I believe we can develop effective binning methods for complex communities. First, we can combine different approaches together, such as using one method to sort in a relaxed manner and then using another to subdivide the bins provided by the first method. Second, we can incorporate new approaches such as population genetics into the analysis [28]. In addition, the lessons learned here can be applied to other aspects of metagenomics (e.g., the counting and typing discussed above) and provide insights into the nature of microbial genomes and the structure of microbial populations and communities.

Comparative Metagenomics

So far, I have discussed issues relating mostly to intrasample analysis of ESS data. However, the area with perhaps the most promise involves the comparative analysis of different samples. This work parallels the comparative analysis of genomes of cultured species. Initial studies of that type compared distantly related taxa with enormous biological differences. What has been learned from these studies pertains mostly to core housekeeping functions, such as translation and DNA metabolism, and to other very ancient processes [29,30]. It was not until comparisons were made between closely related organisms that we began to understand events that occurred on shorter time scales, such as selection, gene transfer, and mutation processes [31].

Similarly, the initial comparisons of ESS data involved comparisons of wildly different environments [32], yielding insights into the general structure of communities. But as more comparisons are made between similar communities [33,34], such as those sampled during vertical and horizontal ocean transects [27,35–37], we will begin to learn about shorter time scale processes such as migration, speciation, extinction, responses to disturbance, and succession. It is from a combination of both approaches—comparing both similar and very divergent communities—that we will be able to understand the fundamental rules of microbial ecology and how they relate to ecological principles seen in macro-organisms.

Conclusions

In promoting some of the exciting opportunities with ESS, I do not want to give the impression that it is flawless. It is helpful in this respect to compare ESS to the Internet. As with the Internet, ESS is a global portal for looking at what occurs in a previously hidden world. Making sense of it requires one to sort through massive, random, fragmented collections of bits of information. Such searches need to be done with caution because any time you analyze such a large amount of data, patterns can be found. In addition, as with the Internet, there is certainly some hype associated with ESS that gives relatively trivial findings more attention than they deserve. Overall, though, I believe the hype is deserved. As long as we treat ESS as a strong complement to existing methods, and we build the tools and databases necessary for people to use the information, it will live up to its revolutionary potential.

Acknowledgments

I thank Simon Levin, Joshua Weitz, Jonathan Dushoff, Maria-Inés Benito, Doug Rusch, Aaron Halpern, and Shibu Yooseph for helpful discussions, and Melinda Simmons, Merry Youle, and three anonymous reviewers for helpful comments on the manuscript. The writing of this paper was supported by National Science Foundation Assembling the Tree of Life Grant 0228651 to Jonathan A. Eisen and by the Defense Advanced Research Projects Agency under grants HR0011-05-1-0057 and FA9550-06-1-0478.

References

1. Gould SJ (1996) Full house: The spread of excellence from Plato to Darwin. New York: Harmony Books. 244 p.

2. Evans AV, Bellamy CL (1996) An inordinate fondness for beetles. New York: Holt. 208 p.

3. Woese C, Fox G (1977) Phylogenetic structure of the prokaryotic domain: The primary kingdoms. Proc Natl Acad Sci U S A 74: 5088–5090.

4. Mullis K, Faloona F (1987) Specific synthesis of DNA in vitro via a polymerase-catalyzed chain reaction. Methods Enzymol 155: 335–350.

5. Reysenbach AL, Giver LJ, Wickham GS, Pace NR (1992) Differential amplification of rRNA genes by polymerase chain reaction. Appl Environ Microbiol 58: 3417–3418.

6. Medlin L, Elwood HJ, Stickel S, Sogin ML (1988) The characterization of enzymatically amplified eukaryotic 16S-like ribosomal RNA-coding regions. Gene 71: 491–500.

7. Weisburg W, Barns S, Pelletier D, Lane D (1991) 16S ribosomal DNA amplification for phylogenetic study. J Bacteriol 173: 697–703.

8. Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, et al. (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269: 496–512.

9. Handelsman J (2004) Metagenomics: Application of genomics to uncultured microorganisms. Microbiol Mol Biol Rev 68: 669–685.

10. Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, et al. (2004) Environmental genome shotgun sequencing of the Sargasso Sea. Science 304: 66–74.

11. Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, et al. (2004) Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 428: 37–43.

12. Gill SR, Pop M, Deboy RT, Eckburg PB, Turnbaugh PJ, et al. (2006) Metagenomic analysis of the human distal gut microbiome. Science 312: 1355–1359.

13. Garcia Martin H, Ivanova N, Kunin V, Warnecke F, Barry KW, et al. (2006) Metagenomic analysis of two enhanced biological phosphorus removal (EBPR) sludge communities. Nat Biotechnol 24: 1263–1269.

14. Olsen GJ, Larsen N, Woese CR (1991) The ribosomal RNA database project. Nucleic Acids Res 19: 2017–2021.

15. Cole JR, Chai B, Farris RJ, Wang Q, Kulam-Syed-Mohideen AS, et al. (2007) The ribosomal database project (RDP-II): Introducing myRDP space and quality controlled public data. Nucleic Acids Res 35: D169–D172.

16. Pace NR (1997) A molecular view of microbial diversity and the biosphere. Science 276: 734–740.

17. Hugenholtz P, Pitulle C, Hershberger KL, Pace NR (1998) Novel division level bacterial diversity in a Yellowstone hot spring. J Bacteriol 180: 366–376.

18. Sogin ML, Morrison HG, Huber JA, Welch DM, Huse SM, et al. (2006) Microbial diversity in the deep sea and the underexplored “rare biosphere”. Proc Natl Acad Sci U S A 103: 12115–12120.

19. Baker BJ, Tyson GW, Webb RI, Flanagan J, Hugenholtz P, et al. (2006) Lineages of acidophilic archaea revealed by community genomic analysis. Science 314: 1933–1935.

20. Angly FE, Felts B, Breitbart M, Salamon P, Edwards RA, et al. (2006) The marine viromes of four oceanic regions. PLoS Biol 4: e368. doi:10.1371/journal.pbio.0040368

21. Edwards RA, Rohwer F (2005) Viral metagenomics. Nat Rev Microbiol 3: 504–510.

22. Leadbetter JR (2003) Cultivation of recalcitrant microbes: Cells are alive, well and revealing their secrets in the 21st century laboratory. Curr Opin Microbiol 6: 274–281.

Special Section from March 2007 | Volume 5 | Issue 3 | e82

23. Perna NT, Plunkett G 3rd, Burland V, Mau B, Glasner JD, et al. (2001) Genome sequence of enterohaemorrhagic Escherichia coli O157:H7. Nature 409: 529–533.

24. Beja O, Aravind L, Koonin EV, Suzuki MT, Hadd A, et al. (2000) Bacterial rhodopsin: Evidence for a new type of phototrophy in the sea. Science 289: 1902–1906.

25. Yooseph S, Sutton G, Rusch DB, Halpern AL, Williamson SJ, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Expanding the universe of protein families. PLoS Biol 5: e16. doi:10.1371/journal.pbio.0050016

26. Wu D, Daugherty SC, Van Aken SE, Pai GH, Watkins KL, et al. (2006) Metabolic complementarity and genomics of the dual bacterial symbiosis of sharpshooters. PLoS Biol 4: e188. doi:10.1371/journal.pbio.0040188

27. Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Northwest Atlantic through Eastern Tropical Pacific. PLoS Biol 5: e77. doi:10.1371/journal.pbio.0050077

28. Johnson PL, Slatkin M (2006) Inference of population genetic parameters in metagenomics: A clean look at messy data. Genome Res 16: 1320–1327.

29. Koonin EV, Mushegian AR (1996) Complete genome sequences of cellular life forms: Glimpses of theoretical evolutionary genomics. Curr Opin Genet Dev 6: 757–762.

30. Mushegian AR, Koonin EV (1996) A minimal gene set for cellular life derived by comparison of complete bacterial genomes. Proc Natl Acad Sci U S A 93: 10268–10273.

31. Eisen JA (2001) Gastrogenomics. Nature 409: 463, 465–466.

32. Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, et al. (2005) Comparative metagenomics of microbial communities. Science 308: 554–557.

33. Edwards RA, Rodriguez-Brito B, Wegley L, Haynes M, Breitbart M, et al. (2006) Using pyrosequencing to shed light on deep mine microbial ecology. BMC Genomics 7: 57.

34. Rodriguez-Brito B, Rohwer F, Edwards RA (2006) An application of statistics to comparative metagenomics. BMC Bioinformatics 7: 162.

35. DeLong EF (2005) Microbial community genomics in the ocean. Nat Rev Microbiol 3: 459–469.

36. DeLong EF, Preston CM, Mincer T, Rich V, Hallam SJ, et al. (2006) Community genomics among stratified microbial assemblages in the ocean’s interior. Science 311: 496–503.

37. Worden AZ, Cuvelier ML, Bartlett DH (2006) In-depth analyses of marine microbial community genomics. Trends Microbiol 14: 331–336.

Community Page

Special Section from March 2007 | Volume 5 | Issue 3 | e75

Microbes are responsible for most of the chemical transformations that are crucial to sustaining life on Earth. Their ability to inhabit almost any environmental niche suggests that they possess an incredible diversity of physiological capabilities. However, we have little to no information on a majority of the millions of microbial species that are predicted to exist, mainly because of our inability to culture them in the laboratory.

A growing discipline called metagenomics allows us to study these uncultured organisms by deciphering their genetic information from DNA that is extracted directly from their environment, thus effectively bypassing the laboratory culture step. Metagenomics allows us to address the questions “who’s there?”, “what are they doing?”, and “how are they doing it?”, offering insights into the evolutionary history as well as previously unrecognized physiological abilities of uncultured communities.

Studies such as the J. Craig Venter Institute’s Global Ocean Sampling (GOS) expedition (in this issue) reveal a remarkable breadth and depth of microbial diversity in the oceans. To date, researchers have made significant but largely preliminary inroads into understanding the biogeography of microbial populations across ecosystems. We know even less about the dynamic physiological processes and complex interactions that impact global carbon cycles and ocean productivity. Marine microbes are thought to act as part of the biological conduit that transports carbon dioxide from the surface to the deep oceanic realms. By removing carbon from the atmosphere and sequestering it (in the form of organic matter), marine microorganisms may significantly affect global climate. Although we now have numerous global and real-time methods to measure physical and chemical parameters within the ocean, few methods or concepts have been developed to measure important microbial processes on a global scale. Even if the technology to make such measurements existed, we would presently not know what to measure or how to interpret those measurements.

We need a systematic way to explore the structure and function of ocean ecosystems, and their impact on global carbon processing and climate. Metagenomics has the potential to shed light on the genetic controls of these processes by investigating the key players, their roles, and community compositions that may change as a function of time, climate, nutrients, carbon dioxide, and anthropogenic factors. These studies include a substantial informatics component, requiring researchers to take on complex computational and mathematical challenges. Nonetheless, microbiologists have been quick to seize upon this modern technique, resulting in a deluge of sequence data, and an ever-widening gap between the rates of collecting data and interpreting it.

The Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis (CAMERA) project [1] is an important first step in attempting to bridge these gaps and in developing global methods for monitoring microbial communities in the ocean and their response to environmental changes. The aim is to create a rich, distinctive data repository and bioinformatics tools resource that will address many of the unique challenges of metagenomics and enable researchers to unravel the biology of environmental microorganisms (Figure 1). CAMERA’s database includes environmental metagenomic and genomic sequence data, associated environmental parameters (“metadata”), pre-computed search results, and software tools to support powerful cross-analysis of environmental samples.

The Community Page is a forum for organizations and societies to highlight their efforts to enhance the dissemination and value of scientific knowledge.

CAMERA: A Community Resource for Metagenomics

Rekha Seshadri*, Saul A. Kravitz, Larry Smarr, Paul Gilna, Marvin Frazier

Citation: Seshadri R, Kravitz SA, Smarr L, Gilna P, Frazier M (2007) CAMERA: A community resource for metagenomics. PLoS Biol 5(3): e75. doi:10.1371/journal.pbio.0050075

Copyright: © 2007 Seshadri et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abbreviations: CAMERA, Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis; GOS, Global Ocean Sampling

Rekha Seshadri, Saul A. Kravitz, and Marvin Frazier are at the J. Craig Venter Institute (JCVI) in Rockville, Maryland, United States of America. Larry Smarr and Paul Gilna are at the California Institute for Telecommunications and Information Technology (Calit2), a University of California San Diego (UCSD)/University of California Irvine partnership, La Jolla, California, United States of America. Larry Smarr is also the Harry E. Gruber Professor of Computer Science and Engineering at UCSD, La Jolla, California, United States of America. CAMERA is being developed by Calit2 at UCSD in collaboration with the JCVI, UCSD’s Center for Earth Observations and Applications (anchored by the Scripps Institution of Oceanography), the San Diego Supercomputer Center, and the University of California Davis.

* To whom correspondence should be addressed. E-mail: [email protected]

This article is part of the Oceanic Metagenomics collection in PLoS Biology. The full collection is available online at http://collections.plos.org/plosbiology/gos-2007.php.

The initial release will include data and tools associated with the companion set of GOS expedition publications [2–4]; metagenome data from the Hawaii Ocean Time Series Station ALOHA [5] and marine viromes from four different oceanic regions [6]; standard nonredundant sequence databases (e.g., nrnt for nucleotides and nraa for amino acids [7]); and collections of microbial genome sequences, including a set of 155 marine microbial genomes funded by the Gordon and Betty Moore Foundation. The focal point for the CAMERA project is its Web site: http://camera.calit2.net. We invite the research community to submit its metagenomics data to CAMERA, and are establishing mechanisms to streamline this process. Here we describe some of the key challenges and features of the CAMERA project.

Accessibility of Metadata

Existing data repositories provide limited support for metadata and metadata-based queries—including any supplemental information for the sequence data, such as pH and temperature of water at the collection site—and therefore these metadata go underutilized by the research community. CAMERA will integrate sequence data with all available, relevant metadata, including physical information (e.g., temperature and sample method), chemical information (e.g., salinity and pH), temporal information, geospatial information, methodology and instrumentation used for data collection, and satellite images of the collection site. These contextual data allow researchers to derive correlations between deciphered ecology and the environmental conditions that may favor one community structure over another. One can envision a future where metadata from satellites and weather stations, and other physicochemical data, can be used to help interpret and inform scientists on how these factors affect microbial processes as well as community composition. CAMERA is working with other groups (e.g., Genome Standards Consortium) to establish standards for the information content and format of metagenomic data and metadata submissions.
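As a rough illustration of what a metadata-based query could look like, the sketch below attaches a few of the contextual fields named above to each sample and filters on them. The `Sample` record, its field names, the `select_samples` helper, and every example value are hypothetical; none of them is taken from CAMERA's actual schema or interface.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Sample:
    """Hypothetical sample record; not CAMERA's actual schema."""
    sample_id: str
    country_of_origin: str         # kept with every record for attribution
    temperature_c: float           # physical metadata
    salinity_psu: Optional[float]  # chemical metadata (None if not recorded)
    collected_on: str              # temporal metadata, ISO 8601 date


def select_samples(samples: Iterable[Sample],
                   min_temp_c: float,
                   max_temp_c: float,
                   max_salinity_psu: Optional[float] = None) -> List[Sample]:
    """Return the samples whose metadata fall inside the requested ranges."""
    selected = []
    for s in samples:
        if not (min_temp_c <= s.temperature_c <= max_temp_c):
            continue
        if max_salinity_psu is not None:
            # A sample with no salinity reading cannot satisfy a salinity filter.
            if s.salinity_psu is None or s.salinity_psu > max_salinity_psu:
                continue
        selected.append(s)
    return selected


# Invented example values, for illustration only.
EXAMPLE_SAMPLES = [
    Sample("site-A", "Country X", 20.5, 36.6, "2003-08-08"),
    Sample("site-B", "Country Y", 28.9, 33.0, "2004-01-15"),
    Sample("site-C", "Country Y", 27.4, None, "2004-02-02"),
]

warm = select_samples(EXAMPLE_SAMPLES, min_temp_c=25.0, max_temp_c=30.0)
warm_low_salinity = select_samples(EXAMPLE_SAMPLES, 25.0, 30.0,
                                   max_salinity_psu=34.0)
```

Keeping the metadata in the same record as the sample identifier is what makes this kind of filter a one-step operation; a repository that stores sequence data and context separately would need a join first.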

New-Generation Bioinformatics Tools

Analysis and comparison of complex metagenomic data are driving the development of a new class of bioinformatics and visualization software. CAMERA will integrate these tools with its database, couple them with large-scale compute resources, and make them widely available to the research community. Initially, CAMERA will support analytical tools used for analyses in the GOS publications [2–6]. An example is shown in Figure 2: a subset of metagenome sequence reads from GOS environmental samples is compared to a reference genome sequence (Synechococcus spp.) using BLASTN. The results and underlying metadata are displayed through an interactive graphical viewer, which helps users quickly identify sequence reads that are similar to the reference genome sequence, and potentially identify metabolic similarities between microbes in environmental samples and a reference microbe. A detailed description of this tool and its applications is provided in the GOS companion paper by Rusch et al. [2]. CAMERA will work closely with the community to identify and incorporate additional tools and workflows.

Large-Scale, Robust, and Expandable Cyberinfrastructure

The sheer size of metagenomics datasets requires terascale computation and storage facilities. CAMERA is building a state-of-the-art computational infrastructure to provide high-performance networking access and grid-based computing (applying the resources of many computers in a network to a single problem at the same time), and to support new ways of visualizing and interacting with the data. The distributed architecture of the CAMERA computational engine will be based on the National Science Foundation–funded OptIPuter project [8,9], which allows for use of dedicated 1- or 10-Gbps optical fiber links between remote user laboratory clusters and the CAMERA compute complex. The data server complex itself will contain a large amount of rotating storage (ultimately several tens of terabytes replicated) and a large computational cluster (upwards of a thousand processors). It will be augmented on demand by a scalable back end provided by the recently upgraded National Science Foundation TeraGrid.

Recognition of the Sources of Samples

The Convention on Biological Diversity grants countries certain rights over their genetic resources, including, for example, metagenomic sequence data of marine microbes taken from a country’s territorial waters. Many countries require, at minimum, that databases explicitly identify the country of origin of the DNA. Rules vary by country, and it is not a simple task to find out what might be required. International harmonization of these rules is currently being debated by the over 150 countries that are party to the Convention on Biological Diversity. Agreements about the use of genetic resources are negotiated on a case-by-case basis with each researcher who wishes to sample within a country’s “exclusive economic zone,” typically 200 miles from its shoreline. Some of these “memoranda of understanding” impose additional requirements on the researchers. For example, the J. Craig Venter Institute’s agreement with Australia requires us to “use reasonable effort to notify Australia as soon as possible of any inquiries for commercial purposes.”

Current databases do not allow the original investigators to inform others about the details of an agreement, thus creating a significant roadblock to both the collection and public release of metagenomics data. To address this issue, CAMERA data will only be made available to users who register by supplying a suitable E-mail address and who acknowledge the potential restriction on commercial use by countries from which the data were collected. To further comply with the Convention on Biological Diversity, all data objects served by CAMERA will possess a mapping to the country of origin of the underlying DNA sample.

doi:10.1371/journal.pbio.0050075.g001

Figure 1. Schematic of Intended Core Functions of the CAMERA Project

CBD, Convention on Biological Diversity.

doi:10.1371/journal.pbio.0050075.g002

Figure 2. CAMERA Fragment Recruitment Viewer

This tool graphically displays the results of a BLASTN sequence comparison of an available microbial genome against selected sequence read datasets. The example shown displays the abundance and distribution of Synechococcus spp. genome sequence in the selected sampling sites. The Synechococcus spp. genome coordinates are shown on the x-axis, while the y-axis shows the percent identity scores of the alignment to the selected Sargasso Sea and GOS sequence reads. The viewer incorporates metadata associated with the reads, allowing a user to quickly identify data of interest for further examination. The utility of the plot is to examine the biogeography and genomic variation of abundant microbes when a close reference genome exists.
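The plot described in this caption can be approximated from tabular BLASTN hits. The sketch below is a minimal illustration, not the CAMERA viewer: it assumes the hits are available as the twelve tab-separated columns that BLAST+ emits by default with `-outfmt 6` (query id, subject id, percent identity, alignment length, mismatches, gap opens, query start and end, subject start and end, e-value, bit score), keeps the best-scoring hit per read, and returns one point per read ready to draw with reference coordinates on the x-axis and percent identity on the y-axis. The `Hit` type, the `recruitment_points` function, and the example lines are invented for illustration.

```python
from typing import Dict, Iterable, List, NamedTuple


class Hit(NamedTuple):
    read_id: str
    ref_start: int           # leftmost reference coordinate of the alignment
    ref_end: int             # rightmost reference coordinate of the alignment
    percent_identity: float  # y-axis value in a recruitment plot
    bitscore: float


def recruitment_points(blast_lines: Iterable[str],
                       min_identity: float = 0.0) -> List[Hit]:
    """Keep the best-scoring hit per read and sort by reference position."""
    best: Dict[str, Hit] = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        read_id = fields[0]
        pident = float(fields[2])
        sstart, send = int(fields[8]), int(fields[9])
        bitscore = float(fields[11])
        if pident < min_identity:
            continue
        # Minus-strand hits report sstart > send, so normalize the order.
        hit = Hit(read_id, min(sstart, send), max(sstart, send), pident, bitscore)
        if read_id not in best or bitscore > best[read_id].bitscore:
            best[read_id] = hit
    return sorted(best.values(), key=lambda h: h.ref_start)


# Invented hits: read1 aligns twice, read2 aligns on the minus strand.
EXAMPLE_HITS = [
    "read1\tref\t98.50\t800\t12\t0\t1\t800\t1000\t1799\t0.0\t1400",
    "read1\tref\t80.00\t300\t60\t0\t1\t300\t5000\t5299\t1e-50\t200",
    "read2\tref\t75.25\t600\t148\t1\t1\t600\t2600\t2001\t1e-90\t350",
]

points = recruitment_points(EXAMPLE_HITS)
```

Each returned `Hit` can be drawn as a horizontal segment from `ref_start` to `ref_end` at height `percent_identity`; coloring the segments by sample metadata is what turns the plot into a view of biogeography.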

Outreach and Training

Since the ultimate success of CAMERA will depend on the broader research community’s ability to make use of the novel cyberinfrastructure, a series of on-site and Web-based training programs will be provided to keep users apprised of CAMERA’s functionalities and to support integration of CAMERA’s service-oriented architecture into their computational fabrics. Finally, we envision interacting with the community on several fronts, including standardization of ontology, metadata, nomenclature, and tools, and incorporation or federation of existing tools and resources with CAMERA.

We believe that the data and community cyberinfrastructure provided by CAMERA will help researchers to advance understanding of the codependence or feedback between microbial communities and biogeochemical processes in oceans over time, and of how perturbations in the environment cause compositional changes (including extinction). Eventually, the expanded global environmental metagenomics datasets will enable better monitoring of environmental change and the processes that control climate. Systematic and routine monitoring of genomic signatures of global microbial populations and processes overlaid with meteorological information and other metadata may help researchers explain past shifts in global climate as well as predict future changes. This knowledge may someday guide decisions about acceptable atmospheric levels of greenhouse gases, or guide strategies to increase sequestration of atmospheric carbon dioxide by changing ocean microbial compositions, in order to reverse the effects of global warming.

Acknowledgments

The authors benefited from many discussions with members of the CAMERA team. We wish to thank Robert Friedman, Michael Press, Jasmine Pollard, and Matthew LaPointe at the J. Craig Venter Institute for their assistance in preparing the manuscript.

Funding. The authors acknowledge funding from the Gordon and Betty Moore Foundation to the California Institute for Telecommunications and Information Technology at the University of California, San Diego, and from National Science Foundation OptIPuter grant SCI-0225642.

References

1. Smarr L (2006 March 21) The ocean of life: Creating a community cyberinfrastructure for advanced marine microbial ecology research and analysis (a.k.a. CAMERA). Friday Harbor (Washington): Strategic News Service.

2. Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Northwest Atlantic through eastern tropical Pacific. PLoS Biol 5: e77. doi:10.1371/journal.pbio.0050077

3. Yooseph S, Sutton G, Rusch DB, Halpern AL, Williamson SJ, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Expanding the universe of protein families. PLoS Biol 5: e16. doi:10.1371/journal.pbio.0050016

4. Kannan N, Taylor SS, Zhai Y, Venter JC, Manning G (2007) Structural and functional diversity of the microbial kinome. PLoS Biol 5: e17. doi:10.1371/journal.pbio.0050017

5. DeLong EF, Preston CM, Mincer T, Rich V, Hallam SJ, et al. (2006) Community genomics among stratified microbial assemblages in the ocean’s interior. Science 311: 496–503.

6. Angly FE, Felts B, Breitbart M, Salamon P, Edwards RA, et al. (2006) The marine viromes of four oceanic regions. PLoS Biol 4: e368. doi:10.1371/journal.pbio.0040368

7. Pruitt KD, Tatusova T, Maglott DR (2005) NCBI Reference Sequence (RefSeq): A curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 33: D501–D504.

8. Smarr L, Chien AA, DeFanti T, Leigh J, Papadopoulos PM (2003) The OptIPuter. Commun ACM 46: 58–67.

9. Taesombut N, Uyeda F, Chien AA, Smarr L, DeFanti T, et al. (2006) The OptIPuter: High-performance, QoS-guaranteed network service for emerging e-science applications. IEEE Commun 4: 38–45.

The Sorcerer II Global Ocean Sampling Expedition: Northwest Atlantic through Eastern Tropical Pacific

Douglas B. Rusch¹*, Aaron L. Halpern¹, Granger Sutton¹, Karla B. Heidelberg¹,², Shannon Williamson¹, Shibu Yooseph¹, Dongying Wu¹,³, Jonathan A. Eisen¹,³, Jeff M. Hoffman¹, Karin Remington¹,⁴, Karen Beeson¹, Bao Tran¹, Hamilton Smith¹, Holly Baden-Tillson¹, Clare Stewart¹, Joyce Thorpe¹, Jason Freeman¹, Cynthia Andrews-Pfannkoch¹, Joseph E. Venter¹, Kelvin Li¹, Saul Kravitz¹, John F. Heidelberg¹,², Terry Utterback¹, Yu-Hui Rogers¹, Luisa I. Falcon⁵, Valeria Souza⁵, German Bonilla-Rosso⁵, Luis E. Eguiarte⁵, David M. Karl⁶, Shubha Sathyendranath⁷, Trevor Platt⁷, Eldredge Bermingham⁸, Victor Gallardo⁹, Giselle Tamayo-Castillo¹⁰, Michael R. Ferrari¹¹, Robert L. Strausberg¹, Kenneth Nealson¹,¹², Robert Friedman¹, Marvin Frazier¹, J. Craig Venter¹

1 J. Craig Venter Institute, Rockville, Maryland, United States of America, 2 Department of Biological Sciences, University of Southern California, Avalon, California, United States of America, 3 Genome Center, University of California Davis, Davis, California, United States of America, 4 Your Genome, Your World, Rockville, Maryland, United States of America, 5 Departamento de Ecología Evolutiva, Instituto de Ecología, Universidad Nacional Autónoma de México, Mexico City, Mexico, 6 Department of Oceanography, University of Hawaii, Honolulu, Hawaii, United States of America, 7 Bedford Institute of Oceanography, Dartmouth, Nova Scotia, Canada, 8 Smithsonian Tropical Research Institute, Balboa, Ancón, Republic of Panama, 9 Departamento de Oceanografía, Universidad de Concepción, Concepción, Chile, 10 Escuela de Química, Universidad de Costa Rica, San Pedro, Costa Rica, 11 Department of Environmental Sciences, Rutgers University, New Brunswick, New Jersey, United States of America, 12 Department of Earth Sciences, University of Southern California, Los Angeles, California, United States of America

The world’s oceans contain a complex mixture of micro-organisms that are, for the most part, uncharacterized both genetically and biochemically. We report here a metagenomic study of the marine planktonic microbiota in which surface (mostly marine) water samples were analyzed as part of the Sorcerer II Global Ocean Sampling expedition. These samples, collected across a several-thousand km transect from the North Atlantic through the Panama Canal and ending in the South Pacific, yielded an extensive dataset consisting of 7.7 million sequencing reads (6.3 billion bp). Though a few major microbial clades dominate the planktonic marine niche, the dataset contains great diversity with 85% of the assembled sequence and 57% of the unassembled data being unique at a 98% sequence identity cutoff. Using the metadata associated with each sample and sequencing library, we developed new comparative genomic and assembly methods. One comparative genomic method, termed “fragment recruitment,” addressed questions of genome structure, evolution, and taxonomic or phylogenetic diversity, as well as the biochemical diversity of genes and gene families. A second method, termed “extreme assembly,” made possible the assembly and reconstruction of large segments of abundant but clearly nonclonal organisms. Within all abundant populations analyzed, we found extensive intra-ribotype diversity in several forms: (1) extensive sequence variation within orthologous regions throughout a given genome; despite coverage of individual ribotypes approaching 500-fold, most individual sequencing reads are unique; (2) numerous changes in gene content, some with direct adaptive implications; and (3) hypervariable genomic islands that are too variable to assemble. The intra-ribotype diversity is organized into genetically isolated populations that have overlapping but independent distributions, implying distinct environmental preference. We present novel methods for measuring the genomic similarity between metagenomic samples and show how they may be grouped into several community types. Specific functional adaptations can be identified both within individual ribotypes and across the entire community, including proteorhodopsin spectral tuning and the presence or absence of the phosphate-binding gene PstS.

Citation: Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Northwest Atlantic through eastern tropical Pacific. PLoS Biol 5(3): e77. doi:10.1371/journal.pbio.0050077

Academic Editor: Nancy A. Moran, University of Arizona, United States of America

Received July 14, 2006; Accepted January 16, 2007; Published March 13, 2007

Copyright: © 2007 Rusch et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abbreviations: CAMERA, Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis; GOS, Global Ocean Sampling; NCBI, National Center for Biotechnology Information

* To whom correspondence should be addressed. E-mail: [email protected]

This article is part of the Global Ocean Sampling collection in PLoS Biology. The full collection is available online at http://collections.plos.org/plosbiology/gos-2007.php.

Introduction

The concept of microbial diversity is not well defined. It can either refer to the genetic (taxonomic or phylogenetic) diversity as commonly measured by molecular genetics methods, or to the biochemical (physiological) diversity measured in the laboratory with pure or mixed cultures. However, we know surprisingly little about either the genetic or biochemical diversity of the microbial world [1], in part because so few microbes have been grown under laboratory conditions [2,3], and also because it is likely that there are immense numbers of low abundance ribotypes that have not been detected using molecular methods [4]. Our understanding of microbial physiological and biochemical diversity has come from studying the less than 1% of organisms that can be maintained in enrichments or cultivated, while our understanding of phylogenetic diversity has come from the application of molecular techniques that are limited in terms of identifying low-abundance members of the communities.

Historically, there was little distinction between geneticand biochemical diversity because our understanding ofgenetic diversity was based on the study of cultivatedmicrobes. Biochemical diversity, along with a few morpho-logical features, was used to establish genetic diversity via anapproach called numerical taxonomy [5,6]. In recent years thesituation has dramatically changed. The determination ofgenetic diversity has relied almost entirely on the use of geneamplification via PCR to conduct taxonomic environmentalgene surveys. This approach requires the presence of slowlyevolving, highly conserved genes that are found in otherwisevery diverse organisms. For example, the gene encoding thesmall ribosomal subunit RNA, known as 16S, based onsedimentation coefficient, is most often used for distinguish-ing bacterial and archaeal species [7–10]. The 16S rRNAsequences are highly conserved and can be used as aphylogenetic marker to classify organisms and place themin evolutionary context. Organisms whose 16S sequences areat least 97% identical are commonly considered to be the

same ribotype [11], otherwise referred to as species, opera-tional taxonomic units, or phylotypes.Although rRNA-based analysis has revolutionized our view

of genetic diversity, and has allowed the analysis of a largepart of the uncultivated majority, it has been less useful inpredicting biochemical diversity. Furthermore, the relation-ship between genetic and biochemical diversity, even forcultivated microbes, is not always predictable or clear. Forinstance, organisms that have very similar ribotypes (97% orgreater homology) may have vast differences in physiology,biochemistry, and genome content. For example, the genecomplement of Escherichia coli O157:H7 was found to besubstantially different from the K12 strain of the same species[12].In this paper, we report the results of the first phase of the

Sorcerer II Global Ocean Sampling (GOS) expedition, ametagenomic study designed to address questions related togenetic and biochemical microbial diversity. This survey wasinspired by the British Challenger expedition that took placefrom 1872–1876, in which the diversity of macroscopicmarine life was documented from dredged bottom samplesapproximately every 200 miles on a circumnavigation [13–15].Through the substantial dataset described here, we identified60 highly abundant ribotypes associated with the open oceanand aquatic samples. Despite this relative lack of diversity inribotype content, we confirm and expand upon previousobservations that there is tremendous within-ribotype diver-sity in marine microbial populations [4,7,8,16,17]. Newtechniques and tools were developed to make use of thesampling and sequencing metadata. These tools include: (1)the fragment recruitment tool for performing and visualizingcomparative genomic analyses when a reference sequence isavailable; (2) new assembly techniques that use metadata toproduce assemblies for uncultivated abundant microbial taxa;and (3) a whole metagenome comparison tool to compareentire samples at arbitrary degrees of genetic divergence.Although there is tremendous diversity within cultivated anduncultivated microbes alike, this diversity is organized intophylogenetically distinct groups we refer to as subtypes.Subtypes can occupy similar environments yet remain

genetically isolated from each other, suggesting that they areadapted for different environmental conditions or roleswithin the community. The variation between and withinsubtypes consists primarily of nucleotide polymorphisms butincludes numerous small insertions, deletions, and hyper-variable segments. Examination of the GOS data in theseterms sheds light on patterns of evolution and also suggestsapproaches towards improving the assembly of complexmetagenomic datasets. At least some of this variation can beassociated with functional characters that are a directresponse to the environment. More than 6.1 million proteins,including thousands of new protein families, have beenannotated from this dataset (described in the accompanyingpaper [18]). In combination, these papers bring us closer toreconciling the genetic and biochemical disconnect and tounderstanding the marine microbial community.We describe a metagenomic dataset generated from the

Sorcerer II expedition. The GOS dataset, which includes andextends our previously published Sargasso Sea dataset [19],now encompasses a total of 41 aquatic, largely marinelocations, constituting the largest metagenomic dataset yetproduced with a total of ;7.7 million sequencing reads. In

Author Summary

Marine microbes remain elusive and mysterious, even though theyare the most abundant life form in the ocean, form the base of themarine food web, and drive energy and nutrient cycling. We knowso little about the vast majority of microbes because only a smallpercentage can be cultivated and studied in the lab. Here we reporton the Global Ocean Sampling expedition, an environmentalmetagenomics project that aims to shed light on the role of marinemicrobes by sequencing their DNA without first needing to isolateindividual organisms. A total of 41 different samples were takenfrom a wide variety of aquatic habitats collected over 8,000 km. Theresulting 7.7 million sequencing reads provide an unprecedentedlook at the incredible diversity and heterogeneity in naturallyoccurring microbial populations. We have developed new bioinfor-matic methods to reconstitute large portions of both cultured anduncultured microbial genomes. Organism diversity is analyzed inrelation to sampling locations and environmental pressures. Takentogether, these data and analyses serve as a foundation for greatlyexpanding our understanding of individual microbial lineages andtheir evolution, the nature of marine microbial communities, andhow they are impacted by and impact our world.

PLoS Biology | www.plosbiology.org | S23 Special Section from March 2007 | Volume 5 | Issue 3 | e770399

Sorcerer II GOS Expedition


the pilot Sargasso Sea study, 200 l surface seawater was filtered to isolate microorganisms for metagenomic analysis. DNA was isolated from the collected organisms, and genome shotgun sequencing methods were used to identify more than 1.2 million new genes, providing evidence for substantial microbial taxonomic diversity [19]. Several hundred new and diverse examples of the proteorhodopsin family of light-harvesting genes were identified, documenting their extensive abundance and pointing to a possible important role in energy metabolism under low-nutrient conditions. However, substantial sequence diversity resulted in only limited genome assembly. These results generated many additional questions: would the same organisms exist everywhere in the ocean, leading to improved assembly as sequence coverage increased; what was the global extent of gene and gene family diversity, and can we begin to exhaust it with a large but achievable amount of sequencing; how do regions of the ocean differ from one another; and how are different environmental pressures reflected in organisms and communities? In this paper we attempt to address these issues.

Results

Sampling and the Metagenomic Dataset

Microbial samples were collected as part of the Sorcerer II expedition between August 8, 2003, and May 22, 2004, by the S/V Sorcerer II, a 32-m sailing sloop modified for marine research. Most specimens were collected from surface water marine environments at approximately 320-km (200-mile) intervals. In all, 44 samples were obtained from 41 sites (Figure 1), covering a wide range of distinct surface marine environments as well as a few nonmarine aquatic samples for contrast (Table 1).

Several size fractions were isolated for every site (see Materials and Methods). Total DNA was extracted from one or more fractions, mostly from the 0.1–0.8-μm size range. This fraction is dominated by bacteria, whose compact genomes are particularly suitable for shotgun sequencing. Random-insert clone libraries were constructed. Depending on the uniqueness of each sampling site and initial estimates of the genetic diversity, between 44,000 and 420,000 clones per sample were end-sequenced to generate mated sequencing reads. In all, the combined dataset includes 6.25 Gbp of sequence data from 41 different locations. Many of the clone libraries were constructed with a small insert size (<2 kbp) to maximize cloning efficiency. As this often resulted in mated sequencing reads that overlapped one another, overlapping mated reads were combined, yielding a total of ~6.4 M contiguous sequences, totaling ~5.9 Gbp of nonredundant sequence. Taken together, this is the largest collection of metagenomic sequences to date, providing more than a 5-fold increase over the dataset produced from the Sargasso Sea pilot study [19] and more than a 90-fold increase over the other large marine metagenomic dataset [20].
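The step of combining overlapping mated reads can be illustrated with a minimal sketch. It is not the procedure used in the study (that is described in Materials and Methods): it assumes error-free reads, requires an exact suffix-prefix match, and the function names and the `min_overlap` default are hypothetical.

```python
# Illustrative only: merge two mated reads from a short insert when the
# end of the forward read exactly overlaps the start of the reverse
# read's reverse complement. Real pipelines tolerate mismatches.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_mates(forward: str, reverse: str, min_overlap: int = 20):
    """Return the merged sequence, or None if no exact overlap is found.

    `reverse` is given as sequenced (from the opposite strand), so it is
    reverse-complemented before the overlap search. The longest exact
    overlap of at least `min_overlap` bases (a hypothetical cutoff) wins.
    """
    mate = reverse_complement(reverse)
    longest = min(len(forward), len(mate))
    for k in range(longest, min_overlap - 1, -1):
        if forward[-k:] == mate[:k]:
            return forward + mate[k:]
    return None
```

A pair that merges contributes one contiguous sequence instead of two reads, which is why the count of contiguous sequences (~6.4 M) is lower than the number of reads (~7.7 M).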

Assembly

Assembling genomic data into larger contigs and scaffolds, especially metagenomic data, can be extremely valuable, as it places individual sequencing reads into a greater genomic context. A largely contiguous sequence links genes into operons, but also permits the investigation of larger biochemical and/or physiological pathways, and also connects otherwise-anonymous sequences with highly studied ‘‘taxo-

Figure 1. Sampling Sites

Microbial populations were sampled from locations in the order shown. Samples were collected at approximately 200-mile (320-km) intervals along the eastern North American coast through the Gulf of Mexico into the equatorial Pacific. Samples 00 and 01 identify sets of sites sampled as part of the Sargasso Sea pilot study [19]. Samples 27 through 36 were sampled off the Galapagos Islands (see inset). Sites shown in gray were not analyzed as part of this study.
doi:10.1371/journal.pbio.0050077.g001


Table 1. Sampling Locations and Environmental Data

ID | Sample Location | Country | Date, mm/dd/yy | Time | Location | Sample Depth, m | Water Depth, m | T (°C)^a | S^b (ppt) | Size Fraction (μm) | Habitat Type | Chl a, Sample Month (Annual ± SE), mg/m³ | Good Sequences
GS00a | Sargasso Stations 13 and 11 | Bermuda (UK) | 02/26/03 | 3:00 / 10:10 | 31°32′6″N; 63°35′42″W / 31°10′50″N; 64°19′27″W | 5.0 | >4,200 | 20.0 / 20.5 | 36.6 / 36.7 | 0.1–0.8 | Open ocean | 0.17 (0.09 ± 0.02) | 644,551
GS00b | Sargasso Stations 13 and 11 | Bermuda (UK) | 02/26/03 | 3:35 / 10:43 | 31°32′10″N; 63°35′70″W / 31°10′50″N; 64°19′27″W | 5.0 | >4,200 | 20.0 / 20.5 | 36.6 / 36.7 | 0.22–0.8 | Open ocean | 0.17 (0.09 ± 0.02) | 317,180
GS00c | Sargasso Stations 3 | Bermuda (UK) | 02/25/03 | 13:00 | 32°09′30″N; 64°00′36″W | 5.0 | >4,200 | 19.8 | 36.7 | 0.22–0.8 | Open ocean | 0.17 (0.09 ± 0.02) | 368,835
GS00d | Sargasso Stations 13 | Bermuda (UK) | 02/25/03 | 17:00 | 31°32′6″N; 63°35′42″W | 5.0 | >4,200 | 20.0 | 36.6 | 0.22–0.8 | Open ocean | 0.17 (0.09 ± 0.02) | 332,240
GS01a | Hydrostation S | Bermuda (UK) | 05/15/03 | 11:40 | 32°10′00″N; 64°30′00″W | 5.0 | >4,200 | 22.9 | 36.7 | 3.0–20.0 | Open ocean | 0.10 (0.10 ± 0.01) | 142,352
GS01b | Hydrostation S | Bermuda (UK) | 05/15/03 | 11:40 | 32°10′00″N; 64°30′00″W | 5.0 | >4,201 | 22.9 | 36.7 | 0.8–3.0 | Open ocean | 0.10 (0.10 ± 0.01) | 90,905
GS01c | Hydrostation S | Bermuda (UK) | 05/15/03 | 11:40 | 32°10′00″N; 64°30′00″W | 5.0 | >4,202 | 22.9 | 36.7 | 0.1–0.8 | Open ocean | 0.1 (0.1 ± 0.01) | 92,351
GS02 | Gulf of Maine | USA | 08/21/03 | 6:32 | 42°30′11″N; 67°14′24″W | 1.0 | 106 | 18.2 | 29.2 | 0.1–0.8 | Coastal | 1.4 (1.12 ± 0.19) | 121,590
GS03 | Browns Bank, Gulf of Maine | Canada | 08/21/03 | 11:50 | 42°51′10″N; 66°13′2″W | 1.0 | 119 | 11.7 | 29.9 | 0.1–0.8 | Coastal | 1.4 (1.12 ± 0.19) | 61,605
GS04 | Outside Halifax, Nova Scotia | Canada | 08/22/03 | 5:25 | 44°8′14″N; 63°38′40″W | 2.0 | 142 | 17.3 | 28.3 | 0.1–0.8 | Coastal | 0.4 (0.78 ± 0.17) | 52,959
GS05 | Bedford Basin, Nova Scotia | Canada | 08/22/03 | 16:21 | 44°41′25″N; 63°38′14″W | 1.0 | 64 | 15.0 | 30.2 | 0.1–0.8 | Embayment | 6 (6.76 ± 0.98) | 61,131
GS06 | Bay of Fundy, Nova Scotia | Canada | 08/23/03 | 10:47 | 45°6′42″N; 64°56′48″W | 1.0 | 11 | 11.2 | | 0.1–0.8 | Estuary | 2.8 (1.87 ± 0.18) | 59,679
GS07 | Northern Gulf of Maine | Canada | 08/25/03 | 8:25 | 43°37′56″N; 66°50′50″W | 1.0 | 139 | 17.9 | 31.7^c | 0.1–0.8 | Coastal | 1.4 (1.12 ± 0.19) | 50,980
GS08 | Newport Harbor, RI | USA | 11/16/03 | 16:45 | 41°29′9″N; 71°21′4″W | 1.0 | 12 | 9.4 | 26.5^c | 0.1–0.8 | Coastal | 2.2 (1.59 ± 0.17) | 129,655
GS09 | Block Island, NY | USA | 11/17/03 | 10:30 | 41°5′28″N; 71°36′8″W | 1.0 | 32 | 11.0 | 31.0^c | 0.1–0.8 | Coastal | 4.0 (2.72 ± 0.24) | 79,303
GS10 | Cape May, NJ | USA | 11/18/03 | 4:30 | 38°56′24″N; 74°41′6″W | 1.0 | 10 | 12.0 | 31.0^c | 0.1–0.8 | Coastal | 2.0 (2.75 ± 0.33) | 78,304
GS11 | Delaware Bay, NJ | USA | 11/18/03 | 11:30 | 39°25′4″N; 75°30′15″W | 1.0 | 8 | 11.0 | | 0.1–0.8 | Estuary | 4.8 (9.23 ± 1.02) | 124,435
GS12 | Chesapeake Bay, MD | USA | 12/18/03 | 11:32 | 38°56′49″N; 76°25′2″W | 1.0 | 25 | 3.2 | 3.47^c | 0.1–0.8 | Estuary | 21.0 (15.0 ± 1.01) | 126,162
GS13 | Off Nags Head, NC | USA | 12/19/03 | 6:28 | 36°0′14″N; 75°23′41″W | 1.0 | 20 | 9.3 | | 0.1–0.8 | Coastal | 3.0 (2.24 ± 0.25) | 138,033
GS14 | South of Charleston, SC | USA | 12/20/03 | 17:12 | 32°30′25″N; 79°15′50″W | 1.0 | 31 | 18.6 | | 0.1–0.8 | Coastal | 1.70 (1.92 ± 0.25) | 128,885
GS15 | Off Key West, FL | USA | 01/08/04 | 6:25 | 24°29′18″N; 83°4′12″W | 2.0 | 47 | 25.3 | 36.0 | 0.1–0.8 | Coastal | 0.2 (0.27 ± 0.09) | 127,362
GS16 | Gulf of Mexico | USA | 01/08/04 | 14:15 | 24°10′29″N; 84°20′40″W | 2.0 | 3,333 | 26.4 | 35.8 | 0.1–0.8 | Coastal sea | 0.16 (0.11 ± 0.01) | 127,122
GS17 | Yucatan Channel | Mexico | 01/09/04 | 13:47 | 20°31′21″N; 85°24′49″W | 2.0 | 4,513 | 27.0 | 35.8 | 0.1–0.8 | Open ocean | 0.13 (0.09 ± 0.01) | 257,581
GS18 | Rosario Bank | Honduras | 01/10/04 | 8:12 | 18°2′12″N; 83°47′5″W | 2.0 | 4,470 | 27.4 | 35.4 | 0.1–0.8 | Open ocean | 0.14 (0.09 ± 0.01) | 142,743
GS19 | Northeast of Colon | Panama | 01/12/04 | 9:03 | 10°42′59″N; 80°15′16″W | 2.0 | 3,336 | 27.7 | 35.4 | 0.1–0.8 | Coastal | 0.23 (0.15 ± 0.02) | 135,325
GS20 | Lake Gatun | Panama | 01/15/04 | 10:24 | 9°9′52″N; 79°50′10″W | 2.0 | 4 | 28.5 | 0.06 | 0.1–0.8 | Fresh water | | 296,355
GS21 | Gulf of Panama | Panama | 01/19/04 | 16:48 | 8°7′45″N; 79°41′28″W | 2.0 | 76 | 27.6 | 30.7 | 0.1–0.8 | Coastal | 0.50 (0.73 ± 0.22) | 131,798
GS22 | 250 miles from Panama City | Panama | 01/20/04 | 16:39 | 6°29′34″N; 82°54′14″W | 2.0 | 2,431 | 29.3 | 32.3 | 0.1–0.8 | Open ocean | 0.33 (0.28 ± 0.02) | 121,662
GS23 | 30 miles from Cocos Island | Costa Rica | 01/21/04 | 15:00 | 5°38′24″N; 86°33′55″W | 2.0 | 1,139 | 28.7 | 32.6 | 0.1–0.8 | Open ocean | 0.07 (0.19 ± 0.02) | 133,051
GS25 | Dirty Rock, Cocos Island | Costa Rica | 01/28/04 | 10:51 | 5°33′10″N; 87°5′16″W | 1.1 | 30 | 28.3 | 31.4 | 0.8–3.0 | Fringing reef | 0.11 (0.19 ± 0.01) | 120,671
GS26 | 134 miles NE of Galapagos | Ecuador | 02/01/04 | 16:16 | 1°15′51″N; 90°17′42″W | 2.0 | 2,376 | 27.8 | 32.6 | 0.1–0.8 | Open ocean | 0.22 (0.28 ± 0.02) | 102,708
GS27 | Devil’s Crown, Floreana | Ecuador | 02/04/04 | 11:41 | 1°12′58″S; 90°25′22″W | 2.0 | 2.3 | 25.5 | 34.9 | 0.1–0.8 | Coastal | 0.40 (0.38 ± 0.03) | 222,080
GS28 | Coastal Floreana | Ecuador | 02/04/04 | 15:47 | 1°13′1″S; 90°19′11″W | 2.0 | 156 | 25.0^c | | 0.1–0.8 | Coastal | 0.35 (0.35 ± 0.02) | 189,052
GS29 | North James Bay, Santigo | Ecuador | 02/08/04 | 18:03 | 0°12′0″S; 90°50′7″W | 2.0 | 12 | 26.2 | 34.5 | 0.1–0.8 | Coastal | 0.40 (0.39 ± 0.03) | 131,529
GS30 | Warm seep, Roca Redonda | Ecuador | 02/09/04 | 11:42 | 0°16′20″N; 91°38′0″W | 19.0 | 19 | 26.9 | | 0.1–0.8 | Warm seep | | 359,152
GS31 | Upwelling, Fernandina | Ecuador | 02/10/04 | 14:43 | 0°18′4″S; 91°39′6″W | 12.0 | 19 | 18.6 | | 0.1–0.8 | Coastal upwelling | 0.35 (0.39 ± 0.03) | 436,401
GS32 | Mangrove, Isabella | Ecuador | 02/11/04 | 11:30 | 0°35′38″S; 91°4′10″W | 0.3 | 0.67 | 25.4 | | 0.1–0.8 | Mangrove | | 148,018
GS33 | Punta Cormorant Lagoon, Floreana | Ecuador | 02/19/04 | 13:35 | 1°13′42″S; 90°25′45″W | 0.2 | 0.33 | 37.6 | 46^c | 0.1–0.8 | Hypersaline | | 692,255
GS34 | North Seamore | Ecuador | 02/19/04 | 17:06 | 0°22′59″S; 90°16′47″W | 2.0 | 35 | 27.5 | | 0.1–0.8 | Coastal | 0.36 (0.35 ± 0.02) | 134,347
GS35 | Wolf Island | Ecuador | 03/01/04 | 16:44 | 1°23′21″N; 91°49′1″W | 2.0 | 71 | 21.8 | 34.5 | 0.1–0.8 | Coastal | 0.28 (0.31 ± 0.02) | 140,814
GS36 | Cabo Marshall, Isabella | Ecuador | 03/02/04 | 12:52 | 0°1′15″S; 91°11′52″W | 2.0 | 67 | 25.8 | 34.6 | 0.1–0.8 | Coastal | 0.65 (0.45 ± 0.05) | 77,538
GS37 | Equatorial Pacific TAO Buoy | International | 03/17/04 | 16:38 | 1°58′26″S; 95°0′53″W | 2.0 | 3,334 | 28.8 | | 0.1–0.8 | Open ocean | 0.21 (0.24 ± 0.02) | 65,670
GS47 | 201 miles from French Polynesia | International | 03/28/04 | 15:25 | 10°7′53″S; 135°26′58″W | 30.0 | 2,400 | 28.6 | 37.3 | 0.1–0.8 | Open ocean | | 66,023
GS51 | Rangirora Atoll | French Polynesia | 05/22/04 | 7:04 | 15°8′37″S; 147°26′6″W | 1.0 | 10 | 27.3 | 34.2 | 0.1–0.8 | Coral reef atoll | | 128,982
Total | | | | | | | | | | | | | 7,697,926

^a Temperature.
^b Salinity.
^c Measurements were acquired from nearby vessels and/or research stations.
doi:10.1371/journal.pbio.0050077.t001


nomic markers’’ such as 16S or recA, thus clearly identifying the taxonomic group with which they are associated. The primary assembly of the combined GOS dataset was performed using the Celera Assembler [21] with modifications as previously described [19] and as given in Materials and Methods. The assembly was performed with quite stringent criteria, beginning with an overlap cutoff of 98% identity to reduce the potential for artifacts (e.g., chimeric assemblies or consensus sequences diverging substantially from the genome of any given cell). This assembly was the substrate for annotation (see the accompanying paper by Yooseph et al. [18]).

The degree of assembly of a metagenomic sample provides an indication of the diversity of the sample. A few substantial assemblies notwithstanding, the primary assembly was strikingly fragmented (Table 2). Only 9% of sequencing reads went into scaffolds longer than 10 kbp. A majority (53%) of the sequencing reads remained unassembled singletons. Scaffolds containing more than 50 kb of consensus sequence totaled 20.7 Mbp; of these, >75% were produced from a single Sargasso Sea sample and correspond to the Burkholderia or Shewanella assemblies described previously [19]. These results highlight the unusual abundance of these two organisms in a single sample, which significantly affected our expectations regarding the current dataset. Given the large size of the combined dataset and the substantial amount of sequencing performed on individual filters, the overall lack of assembly provides evidence of a high degree of diversity in surface planktonic communities. To put this in context, suppose there were a clonal organism that made up 1% of our data, or ~60 Mbp. Even a genome of 10 Mbp (enormous by bacterial standards) would be covered ~6-fold. Such data might theoretically assemble with an average contig approaching 50 kb [22]. While real assemblies generally fall short of theory for various reasons, Shewanella data make up <1% of the total GOS dataset, and yet most of the relevant reads assemble into scaffolds >50 kb. Thus, with few scaffolds of significant length, we could conclude that there are very few clonal organisms present at even 1% in the GOS dataset.

To investigate the nature of the implied diversity and to see whether greater assembly could be achieved, we explored several alternative approaches. Breaks in the primary assembly resulted from two factors: incomplete sequence coverage and conflicts in the data. Conflicts can break assemblies when there is no consistent way to chain together all overlapping sequencing reads. As it was possible that there would be fewer conflicts within a single sample (i.e., that diversity within a single sample would be lower), assemblies were attempted with individual samples. However, the results did not show any systematic improvements even in those samples with greater coverage (unpublished data). Upon manual inspection, most assembly-breaking conflicts were found to be local in nature. These observations suggested that reducing the degree of sequence identity required for assembly could ameliorate both factors limiting assembly: effective coverage would increase and many minor conflicts would be resolved.

Accordingly, we produced a series of assemblies based on 98%, 94%, 90%, 85%, and 80% identity overlaps for two subsets of the GOS dataset, again using the Celera Assembler. Assembly lengths increased as the overlap cutoff decreased from 98% to 94% to 90%, and then leveled off or even dropped as stringency was reduced below 90% (Table 3). Although larger assemblies could be generated using lower identity overlaps, significant numbers of overlaps satisfying the chosen percent identity cutoff still went unused in each assembly. This is consistent with a high rate of conflicting overlaps and in turn diagnostic of significant polymorphism.

In mammalian sequencing projects the use of larger insert libraries is critical to producing larger assemblies because of their ability to span repeats or local polymorphic regions [23]. The shotgun sequencing libraries from the GOS filters were typically constructed from inserts shorter than 2 kb. Longer plasmid libraries were attempted but were much less stable. We obtained paired-end sequences from 21,419 fosmid clones (average insert size, 36 kb; [24,25]) from the 0.1-micron fraction of GS-33. The effect of these long mate pairs on the GS-33 assembly was quite dramatic, particularly at high stringency (e.g., improving the largest scaffold from 70 kb to 1,247 kb and the largest contig from 70 kb to 427 kb). At least for GS-33 this suggests that many of the polymorphisms affect small, localized regions of the genome that can be spanned using larger inserts. This degree of improvement may be greater than what could be expected in general, as the diversity of GS-33 is by far the lowest of any of the currently

Table 2. Summary Assembly Statistics

Category | Statistic | Value
Assembly inputs | Number of reads used for assembly | 7,697,926
Assembly inputs | Total read length (bp) | 6,325,208,303
Assembly inputs | Number of ‘‘intigs’’ used for assembly^a | 6,389,523
Assembly inputs | Total intig length (bp)^a | 5,883,982,712
Assembly outputs | Number of assemblies^b | 3,081,849
Assembly outputs | Total assembled consensus length (bp) | 4,460,027,783
Assembly outputs | Percentage of unassembled reads | 53%
Assembly outputs | Percentage of assembly at >1× coverage | 15.3%
Assembly outputs | Base pairs in contigs ≥ 10 kb | 39,427,102
Assembly outputs | Base pairs in contigs ≥ 50 kb | 15,723,513
Assembly outputs | Base pairs in scaffolds ≥ 2.5 kb (consensus bases) | 458,196,599
Assembly outputs | Base pairs in scaffolds ≥ 5 kb (consensus bases) | 138,137,150
Assembly outputs | Base pairs in scaffolds ≥ 10 kb (consensus bases) | 65,238,481
Assembly outputs | Base pairs in scaffolds ≥ 50 kb (consensus bases) | 20,738,836
Assembly outputs | Base pairs in scaffolds ≥ 100 kb (consensus bases) | 16,005,244
Assembly outputs | Base pairs in scaffolds ≥ 300 kb (consensus bases) | 8,805,668
Assembly outputs | Percentage of assembly in scaffolds ≥ 10 kb^c | 1.5%
Assembly outputs | Length of longest contig (bp) | 977,960
Assembly outputs | Length of longest scaffold (bp) | 2,097,794
Assembly outputs | N1^d assembly bp | 15,915
Assembly outputs | N10^d assembly bp | 2,533
Assembly outputs | N50^d assembly bp | 1,611
Assembly outputs | N1^d contig bp | 8,994
Assembly outputs | N10^d contig bp | 2,447
Assembly outputs | N50^d contig bp | (single reads)

^a Intigs are overlapping mated reads that have been collapsed into a single sequence as input into the assembler.
^b Assemblies refers to the total number of scaffolds, pairs of mated nonoverlapping singletons, and singleton unmated reads.
^c For comparison purposes, 10 kb is the average contig size predicted for 4.1× coverage for an idealized shotgun assembly of a repeat-free, clonal genome [22].
^d N1 indicates the length of the next largest assembled sequence or contig such that 1% of the sequence data falls into longer assemblies or contigs. N10 and N50 indicate that 10% and 50% of the data fell into larger assemblies or contigs.
doi:10.1371/journal.pbio.0050077.t002
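The N1, N10, and N50 statistics defined in footnote d of Table 2 can be computed along the following lines. This is a sketch using the common convention (the length of the sequence at which the running total, taken from longest to shortest, first reaches the given percentage of all bases); the footnote's wording ("next largest") may differ from it by one sequence. The function name is hypothetical.

```python
def n_statistic(lengths, percent):
    """Return the N<percent> length for a collection of sequence lengths.

    Sequences are visited from longest to shortest; the result is the
    length of the sequence at which the cumulative length first reaches
    `percent` of the total. Common convention, assumed here.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * percent / 100.0
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    return ordered[-1]  # only reached through floating-point rounding
```

Under this definition a low N50 next to a much higher N1, as in Table 2, indicates that a few long assemblies coexist with a large mass of very short ones.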


sequenced GOS samples, yet it clearly indicates the utility of including larger insert libraries for assembly.
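The coverage argument in this section (a clonal organism at 1% of ~5.9 Gbp, with a 10-Mbp genome, would be covered about 6-fold) can be checked with a short calculation. The sketch below also evaluates the classical Lander-Waterman expectation for contig length under an idealized model; whether reference [22] uses exactly this formulation is an assumption, and the read length (~800 bp, roughly the total read length divided by the number of reads in Table 2) and minimum detectable overlap (40 bp) are illustrative values, not parameters from the study.

```python
import math


def expected_coverage(total_bases, fraction, genome_size):
    """Fold coverage of a genome that accounts for `fraction` of the data."""
    return total_bases * fraction / genome_size


def lander_waterman_contig_length(read_length, coverage, min_overlap):
    """Expected contig length (bp) in the idealized Lander-Waterman model.

    sigma is the fraction of a read that must NOT be shared for two reads
    to be joined, i.e. 1 - min_overlap / read_length. The model assumes a
    repeat-free, clonal genome and uniformly random read starts.
    """
    sigma = 1.0 - min_overlap / read_length
    grown = (math.exp(coverage * sigma) - 1.0) / coverage
    return read_length * (grown + (1.0 - sigma))


if __name__ == "__main__":
    c = expected_coverage(5.9e9, 0.01, 10e6)  # 1% of the data, 10-Mbp genome
    print(f"coverage: {c:.1f}x")
    print(f"expected contig: {lander_waterman_contig_length(800, c, 40):,.0f} bp")
```

With these assumed inputs the model gives contigs of a few tens of kilobases at ~6× coverage and roughly 10 kb at 4.1× coverage, the same order of magnitude as the figures quoted in the text and in footnote c of Table 2.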

Fragment Recruitment

In the absence of substantial assembly, direct comparison of the GOS sequencing data to the genomes of sequenced microbes is an alternative way of providing context, and also allows for exploration of genetic variation and diversity. A large and growing set of microbial genomes is available from the National Center for Biotechnology Information (NCBI; http://www.ncbi.nlm.nih.gov). At the time of this study, we used 334 finished and 250 draft microbial genomes as references for comparison with the GOS sequencing reads. Comparisons were carried out in nucleotide-space using the sequence alignment tool BLAST [26]. BLAST parameters were designed to be extremely lenient so as to detect even distant similarities (as low as 55% identity). A large proportion of the GOS reads, 70% in all, aligned to one or more genomes under these conditions. However, many of the alignments were of low identity and used only a portion of the entire read. Such low-quality hits may reflect distant evolutionary relationships, and therefore less information is gained based on the context of the alignment. More stringent criteria could be imposed requiring that the reads be aligned over nearly their entire length without any large gaps. Using this stringent criterion only about 30% of the reads aligned to any of the 584 reference genomes. We refer to these fully aligned reads as ‘‘recruited reads.’’ Recruited reads are far more likely to be from microbes closely related to the reference sequence (same species) than are partial alignments. Despite the large number of microbial genomes currently available, including a large number of marine microbes, these results indicate that a substantial majority of GOS reads cannot be specifically related to available microbial genomes.

The amount and distribution of reads recruited to any given genome provide an indication of the abundance of closely related organisms. Only genomes from the five bacterial genera Prochlorococcus, Synechococcus, Pelagibacter, Shewanella, and Burkholderia yielded substantial and uniform recruitment of GOS fragments over most of a reference genome (Table 4). These genera include multiple reference genomes, and we observed significant differences in recruitment patterns even between organisms belonging to the same species (Figure 2A–2I). Three genera, Pelagibacter (Figure 2A), Prochlorococcus (Figure 2B–2F), and Synechococcus (Figure 2G–2I), were found abundantly in a wide range of samples and together accounted for roughly 50% of all the recruited reads (though only ~15% of all GOS sequencing reads). By contrast, although every genome tested recruited some GOS reads, most recruited only a small number, and these reads clustered at lower identity to locations corresponding to large highly conserved genes (for typical examples see Figure 2E–2F). We refer to this pattern as nonspecific recruitment as it reflects taxonomically nonspecific signals, with the reads in

Table 4. Microbial Genera that Recruited the Bulk of the GOS Reads

Genus | Read Count: All Reads | Read Count: 80%+^a | Read Count: 90%+^b | Best Strain: All Reads | Best Strain: 80%+^a | Best Strain: 90%+^b
Pelagibacter | 922,677 | 195,539 | 36,965 | HTCC1062 | HTCC1062 | HTCC1062
Prochlorococcus | 208,999 | 159,102 | 84,325 | MIT9312 | MIT9312 | MIT9312
Synechococcus | 60,650 | 26,365 | 21,594 | CC9902 | RS9917 | RS9917
Burkholderia | 151,123 | 108,610 | 93,081 | 383 | 383 | 383
Shewanella | 59,086 | 34,138 | 27,693 | MR-1 | MR-1 | MR-1
Remaining | 43,244 | 2,367 | 564 | Buchnera aphidicola Str. Sg | Buchnera aphidicola Str. APS | Alteromonas macleodii

^a Reads aligned at or above 80% identity over the entire length of the read.
^b Reads aligned at or above 90% identity over the entire length of the read.
doi:10.1371/journal.pbio.0050077.t004

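Table 3. Evaluation of Alternative Assembly Methods

Dataset | Type | Percent Identity | Base Pairs in 10 k Contigs | Base Pairs in 100 k Contigs
GS33 plasmids | WGS^a | 98 | 13,669,678 | 0
GS33 plasmids | WGS^a | 94 | 19,536,324 | 2,749,543
GS33 plasmids | WGS^a | 90 | 20,996,826 | 3,729,765
GS33 plasmids | WGS^a | 85 | 20,327,989 | 3,505,324
GS33 plasmids | WGS^a | 80 | 19,245,637 | 4,195,959
GS33 plasmids | E-asm^b | 98 | 22,000,579 | 5,604,857
GS33 plasmids | E-asm^b | 94 | 22,781,462 | 7,302,801
GS33 plasmids | E-asm^b | 90 | 22,702,764 | 7,600,441
GS33 plasmids | E-asm^b | 85 | 22,570,933 | 7,937,079
GS33 plasmids | E-asm^b | 80 | 20,335,558 | 4,779,684
GS33 with fosmids | WGS^a | 98 | 15,031,557 | 1,306,992
GS33 with fosmids | WGS^a | 94 | 22,310,335 | 4,449,710
GS33 with fosmids | WGS^a | 90 | 22,944,278 | 5,585,959
GS33 with fosmids | WGS^a | 85 | 22,251,738 | 5,485,013
GS33 with fosmids | WGS^a | 80 | 21,088,975 | 5,684,925
GS17,18,23,26 | WGS^a | 98 | 185,058 | 0
GS17,18,23,26 | WGS^a | 94 | 5,422,366 | 213,755
GS17,18,23,26 | WGS^a | 90 | 10,694,783 | 373,822
GS17,18,23,26 | WGS^a | 85 | 11,514,421 | 800,290
GS17,18,23,26 | WGS^a | 80 | 9,004,221 | 879,401
GS17,18,23,26 | E-asm^b | 98 | 2,047,524 | 0
GS17,18,23,26 | E-asm^b | 94 | 10,668,547 | 1,184,881
GS17,18,23,26 | E-asm^b | 90 | 15,215,981 | 2,634,227
GS17,18,23,26 | E-asm^b | 85 | 15,786,515 | 3,132,152
GS17,18,23,26 | E-asm^b | 80 | 13,767,929 | 2,942,160
Combined GOS | WGS^a | 98 | 39,427,102 | 11,488,828
Combined GOS | WGS^a | 94 | 98,887,937 | 12,376,236
Combined GOS | E-asm^b | 98 | 91,526,091 | 16,444,304
Combined GOS | E-asm^b | 94 | 163,612,717 | 25,564,163
Combined GOS | E-asm^b | 90 | 186,614,813 | 28,752,198
Combined GOS | E-asm^b | 85 | 181,887,218 | 27,154,335
Combined GOS | E-asm^b | 80 | 161,160,091 | 23,794,832

^a Whole-genome shotgun (WGS) assembly performed with the Celera Assembler.
^b Assemblies performed using extreme assembly approach (E-asm).
doi:10.1371/journal.pbio.0050077.t003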


Figure 2. Fragment Recruitment Plots

The horizontal axis of each panel corresponds to a 100-kb segment of genomic sequence from the indicated reference microbial genome. The vertical axis indicates the sequence identity of an alignment between a GOS sequence and the reference genomic sequence. The identity ranges from 100% (top) to 50% (bottom). Individual GOS sequencing reads were colored to reflect the sample from which they were isolated. Geographically nearby samples have similar colors (see Poster S1 for key). Each organism shows a distinct pattern of recruitment reflecting its origin and relationship to the environmental data collected during the course of this study.
(A) P. ubique HTCC1062 recruits the greatest density of GOS sequences of any genome examined to date. The GOS sequences show geographic


question often recruiting to distantly related sets of genomes. Most microbial genomes, including many of the marine microbes (e.g., the ubiquitous genus Vibrio), demonstrated this nonspecific pattern of recruitment.
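The distinction between lenient alignments and ‘‘recruited reads’’ can be sketched as a filter over alignment records. The record layout, the 90% aligned-fraction cutoff, the 25-base gap limit, and the choice to keep only the highest-identity hit per read are all assumptions made for illustration; the text states only that recruited reads align over nearly their entire length without any large gaps.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Hit:
    """One read-to-reference alignment (hypothetical record layout)."""
    read_id: str
    reference: str
    read_length: int         # full length of the read, in bases
    aligned_length: int      # read bases covered by the alignment
    percent_identity: float
    largest_gap: int         # longest single gap in the alignment, in bases


# Hypothetical thresholds standing in for "nearly the entire length"
# and "no large gaps"; the study's actual values are not given here.
MIN_ALIGNED_FRACTION = 0.90
MAX_GAP = 25


def is_recruited(hit: Hit) -> bool:
    """True if the alignment spans almost the whole read with no large gap."""
    covered = hit.aligned_length / hit.read_length
    return covered >= MIN_ALIGNED_FRACTION and hit.largest_gap <= MAX_GAP


def recruit(hits: Iterable[Hit]) -> Dict[str, Hit]:
    """Keep, for each read, its highest-identity hit that passes the filter."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if not is_recruited(hit):
            continue
        current = best.get(hit.read_id)
        if current is None or hit.percent_identity > current.percent_identity:
            best[hit.read_id] = hit
    return best
```

A read with only a short, low-identity match to a conserved gene fails the aligned-fraction test and is left out, which mirrors the drop from 70% of reads aligning under lenient settings to about 30% being recruited.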

The relationship between the similarity of an individual sequencing read to a given genome and the sample from which the read was isolated can provide insight into the structure, evolution, and geographic distribution of microbial populations. These relationships were assessed by constructing a ‘‘percent identity plot’’ [27] in which the alignment of a read to a reference sequence is shown as a bar whose horizontal position indicates location on the reference and whose vertical position indicates the percent identity of the alignment. We colored the plotted reads according to the samples to which they belonged, thus indirectly representing various forms of metadata (geographic, environmental, and laboratory variables). We refer to these plots that incorporate metadata as fragment recruitment plots. Fragment recruitment plots of GOS sequences recruited to the entire genomes of Pelagibacter ubique HTCC1062, Prochlorococcus marinus MIT9312, and Synechococcus WH8102 are presented in Poster S1.
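The geometry of a fragment recruitment plot can be sketched without any plotting library by binning alignments into one coarse grid per sample, with reference position on the horizontal axis and percent identity on the vertical axis (100% at the top, 50% at the bottom, as in Figure 2). The tuple layout, grid dimensions, and function name below are assumptions for illustration; the published plots draw each read as an individual colored bar.

```python
def recruitment_grids(alignments, ref_length, n_cols=50, n_rows=10,
                      min_identity=50.0, max_identity=100.0):
    """Bin alignments into one position-by-identity count grid per sample.

    `alignments` is an iterable of (start, end, identity, sample) tuples
    with 0-based, half-open reference coordinates. Row 0 is the highest
    identity bin. Returns {sample: grid}, where grid[row][col] counts the
    alignments covering that position bin at that identity level.
    """
    col_width = ref_length / n_cols
    row_height = (max_identity - min_identity) / n_rows
    grids = {}
    for start, end, identity, sample in alignments:
        if not (min_identity <= identity <= max_identity):
            continue  # outside the plotted identity range
        grid = grids.setdefault(
            sample, [[0] * n_cols for _ in range(n_rows)])
        row = min(int((max_identity - identity) / row_height), n_rows - 1)
        first = max(int(start / col_width), 0)
        last = min(int((end - 1) / col_width), n_cols - 1)
        for col in range(first, last + 1):
            grid[row][col] += 1
    return grids
```

Keeping a separate grid per sample plays the role of the color coding: a band that appears in the grids of only some samples corresponds to a band drawn in only some colors.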

Within-Ribotype Population Structure and Variation

Characteristic patterns of recruitment emerged from each of these abundant marine microbes consisting of horizontal bands made up of large numbers of GOS reads. These bands seem constrained to a relatively narrow range of identities that tile continuously (or at least uniformly, in the case when abundance/coverage is lower) along ~90% of the reference sequence. The uninterrupted tiling indicates that environmental genomes are largely syntenic with the reference genomes. Multiple bands, distinguished by degree of similarity to the reference and by sample makeup, may arise on a single reference (Poster S1D and S1F). Each of these bands appears to represent a distinct, closely related population we refer to as a subtype. In some cases, an abundant subtype is highly similar to the reference genome, as is the case for P. marinus MIT9312 (Poster S1) and Synechococcus RS9917 (unpublished data). P. ubique HTCC1062 and other Synechococcus strains like WH8102 show more complicated banding patterns (Poster S1D and S1F) because of the presence of multiple subtypes that produce complex, often overlapping bands in the plots. Though the recruitment patterns can be quite complex, they are also remarkably consistent over much of the reference genome. In these more complicated recruitment plots, such as the one for P. ubique HTCC1062, individual bands can show sudden shifts in identity or disappear altogether, producing a gap in recruitment that appears to be specific to that band (see P. ubique recruitment plots on Poster S1B and S1E, and specifically between 130–140 kb). Finally, phylogenetic analysis indicates that separate bands are indeed evolutionarily distinct at randomly selected locations along the genome.

The amount of sequence variation within a given band cannot be reliably determined from the fragment recruitment plots themselves. To examine this variation, we produced multiple sequence alignments and phylogenies of reads that recruited to several randomly chosen intervals along given reference genomes to show that there can be considerable within-subtype variation (Figure 3A–3B). For example, within the primary band found in recruitment plots to P. marinus MIT9312, individual pairs of overlapping reads typically differ by 3%–5% on average at the nucleotide level (depending on exact location in the genome). Very few reads that recruited to MIT9312 have perfect (mismatch-free) overlaps with any other read or to MIT9312, despite ~100-fold coverage. While many of these differences are silent (i.e., do not change amino acid sequences), there is still considerable variation at the protein level (unpublished data). The amount of variation within subtypes is so great that it is likely that no two sequenced cells contained identical genomes.
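The pairwise nucleotide difference between overlapping reads, quoted above as 3%–5% for the main MIT9312 band, can be measured from a multiple sequence alignment roughly as follows. This is a sketch, not the analysis performed in the study: it assumes each read is supplied as one row of the alignment padded with `-` where it does not cover a column, and it ignores every column in which either read has a gap. The function names and the `min_shared` parameter are hypothetical.

```python
from itertools import combinations

GAP = "-"


def pairwise_difference(a, b):
    """Return (mismatches, compared) over columns where both rows have a base."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    mismatches = compared = 0
    for x, y in zip(a, b):
        if x == GAP or y == GAP:
            continue  # column not covered by both reads
        compared += 1
        if x != y:
            mismatches += 1
    return mismatches, compared


def mean_pairwise_difference(rows, min_shared=1):
    """Average fraction of differing bases over all sufficiently overlapping pairs.

    Pairs sharing fewer than `min_shared` columns are skipped. Returns None
    when no pair qualifies.
    """
    fractions = []
    for a, b in combinations(rows, 2):
        mismatches, compared = pairwise_difference(a, b)
        if compared >= min_shared:
            fractions.append(mismatches / compared)
    return sum(fractions) / len(fractions) if fractions else None
```

Averaging per pair rather than pooling all columns keeps short overlaps from being swamped by long ones, which is one reasonable reading of "individual pairs of overlapping reads typically differ".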

stratification into bands, with sequences from temperate water samples off the North American coast having the highest identity (yellow to yellow-green colors). At lower identity, sequences from all the marine environments could be aligned to HTCC1062.
(B) P. marinus MIT9312 recruits a large number of GOS sequences into a single band that zigzags between 85%–95% identity on average. These sequences are largely derived from warm water samples in the Gulf of Mexico and eastern Pacific (green to greenish-blue reads).
(C) P. marinus MED4 recruits largely the same set of reads as MIT9312 (B) though the sequences that form the zigzag recruit at a substantially lower identity. A small number of sequences from the Sargasso Sea samples (red) are found at high identity.
(D) P. marinus NATL2A recruits far fewer sequences than any of the preceding panels. Like MED4, a small number of high-identity sequences were recruited from the Sargasso samples.
(E) P. marinus MIT9313 is a deep-water low-light–adapted strain of Prochlorococcus. GOS sequences were recruited almost exclusively at low identity in vertical stacks that correspond to the locations of conserved genes. On the left side of this panel is a very distinctive pattern of recruitment that corresponds to the highly conserved 16S and 23S rRNA gene operon.
(F) P. marinus CCMP1375, another deep-water low-light–adapted strain, does not recruit GOS sequences at high identity. Only stacks of sequences are seen corresponding to the location of conserved genes.
(G) Synechococcus WH8102 recruits a modest number of high-identity sequences primarily from the Sargasso Sea samples. A large number of moderate identity matches from the Pacific and hypersaline lagoon (GS33) samples are also visible.
(H) Synechococcus CC9605 recruits largely the same sequences as does Synechococcus WH8102, but was isolated from Pacific waters. GOS sequences from some of the Pacific samples recruit at high identity, while sequences from the Sargasso and hypersaline lagoon (bluish-purple) were recruited at moderate identities.
(I) Synechococcus CC9902 is distantly related to either of the preceding Synechococcus strains. While this strain also recruits largely the same sequences as the WH8102 and CC9605 strains, they recruit at significantly lower identity.
(J–O) Fragment recruitment plots to extreme assemblies seeded with phylogenetically informative sequences. Using this approach it is not only possible to assemble contigs with strong similarities to known genomes but to identify contigs from previously uncultured genomes. In each case a 100-kb segment from an extreme assembly is shown. Each plot shows a distinct pattern of recruitment that distinguishes the panels from each other.
(J) Seeded from a Prochlorococcus marinus-related sequence, this contig recruits a broad swath of GOS sequences that correspond to the GOS sequences that form the zigzag on P. marinus MIT9312 recruitment plots (see [B] or Poster S1 for comparison).
(K–L) Seeded from SAR11 clones, these contigs show significant synteny to the known P. ubique HTCC1062 genome. (K) is strikingly similar to previous recruitment plots to the HTCC1062 genome (see [A] or Poster S1). In contrast, (L) identifies a different strain that recruits high-identity GOS sequences primarily from the Sargasso Sea samples (red).
(M–O) These three panels show recruitment plots to contigs belonging to the uncultured Actinobacter, Roseobacter, and SAR86 lineages.
doi:10.1371/journal.pbio.0050077.g002


Identifying Genomic Structural Variation with Metagenomic Data

Variation in genome structure in the form of rearrangements, duplications, insertions, or deletions of stretches of DNA can also be explored via fragment recruitment. The use of mated sequencing reads (pairs of reads from opposite ends of a clone insert) provides a powerful tool for assessing structural differences between the reference and the environmental sequences. The cloning and sequencing process determines the orientation and approximate distance between two mated sequencing reads. Genomic structural variation can be inferred when these are at odds with the way in which the reads are recruited to a reference sequence. Relative location and orientation of mated sequences provide a form of metadata that can be used to color-code a fragment recruitment plot (Figure 4). This makes it possible to visually identify and classify structural differences and similarities between the reference and the environmental sequences (Figure 5). For the abundant marine microbes, a high proportion of mated reads in the “good” category (i.e., in the proper orientation and at the correct distance) show that synteny is conserved for a large portion of the microbial population. The strongest signals of structural differences typically reflect a variant specific to the reference genome and not found in the environmental data. In conjunction with the requirement that reads be recruited over their entire length without interruption, recruitment plots result in pronounced recruitment gaps at locations where there is a break in synteny. Other rearrangements can be partially present or penetrant in the environmental data and thus may not generate obvious recruitment gaps. However, given sufficient coverage, breaks in synteny should be clearly identifiable using the recruitment metadata based on the presence of “missing” mates (i.e., the mated sequencing read that was recruited but whose mate failed to recruit; Figure 4). The ratio of missing mates to “good” mates determines how penetrant the rearrangement is in the environmental population.

In theory, all genome structure variations that are large enough to prevent recruitment can be detected, and all such rearrangements will be associated with missing mates. Depending on the type of rearrangement present, other recruitment metadata categories will be present near the rearrangements’ endpoints. This makes it possible to distinguish among insertions, deletions, translocations, inversions, and inverted translocations directly from the recruitment plots. Examples of the patterns associated with different rearrangements are presented in Figure 5. This provides a rapid and easy visual method for exploring structural variation between natural populations and sequenced representatives (Poster S1A and S1B).
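The relationship between missing and “good” mates can be made concrete with a small calculation. The sketch below is illustrative only: the function name is hypothetical, and it reports the fraction missing / (missing + good) rather than the raw ratio described in the text, which is an assumption made here so that the value stays between 0 and 1.

```python
def rearrangement_penetrance(missing_mates: int, good_mates: int) -> float:
    """Estimate how penetrant a break in synteny is at one site.

    Assumption (not from the paper): penetrance is reported as the
    fraction of informative mate pairs whose mate failed to recruit,
    i.e. missing / (missing + good).
    """
    if missing_mates < 0 or good_mates < 0:
        raise ValueError("counts must be non-negative")
    total = missing_mates + good_mates
    if total == 0:
        raise ValueError("no informative mate pairs at this site")
    return missing_mates / total


# Hypothetical counts: 12 reads with missing mates and 36 good pairs
# flanking a candidate breakpoint.
print(rearrangement_penetrance(12, 36))
```

A value near 1 would correspond to a variant specific to the reference genome, whereas an intermediate value would suggest a rearrangement present in only part of the environmental population.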

Genomic Structural Variation in Abundant Marine Microbes

Variation in genome structure potentially results in functional differences. Of particular interest are those differences between sequenced (reference) microbes and environmental populations. These differences can indicate how representative a cultivated microbe might be and shed light on the evolutionary forces driving change in microbial populations. Fragment recruitment in conjunction with the mate metadata helped us to identify both the consistent and the rare structural differences between the genomes of microbial populations in the GOS data and their closest sequenced relatives. Our analysis has thus far been confined to the three microbial genera that were widespread in the GOS dataset as represented by the finished genomes of P. marinus MIT9312, P. ubique HTCC1062, and to a lesser extent Synechococcus WH8102. Each of these genomes is characterized by large and small segments where little or no fragment recruitment took place. We refer to these segments as “gaps.” These gaps

Figure 3. Population Structure and Variation as Revealed by Phylogeny

Phylogenies were produced using neighbor-joining. There is significant within-clade variation as well as an absence of strong geographic structure to variants of SAR11 (P. ubique HTCC1062) and P. marinus MIT9312. Similar reads are not necessarily from similar locations, and reads from similar locations are not necessarily similar.
(A) Geographic distribution of SAR11 proteorhodopsin variants. Keys to coloration: blue, Pacific; pink, Atlantic.
(B) Geographic distribution of Prochlorococcus variants. Keys to coloration: blue, Pacific; pink, Atlantic.
(C) Origins of spectral tuning of SAR11 proteorhodopsins. Reads are colored according to whether they contain the L (green) or Q (blue) variant at the spectral tuning residue described in the text. The selection of tuning residue is lineage restricted, but each variant must have arisen on two separate occasions.
doi:10.1371/journal.pbio.0050077.g003

Figure 4. Categories of Recruitment Metadata

The recruitment metadata distinguishes eight different general categories based on the relative placement of paired end sequencing reads (mated reads) when recruited to a reference sequence in comparison to their known orientation and separation on the clone from which they were derived. Assuming orientation is correct, two mated reads can be recruited closer together, further apart, or within expected distances given the size of the clone from which the sequences were derived. These sequences are categorized as “short,” “long,” or “good,” respectively. Alternatively, the mated reads may be recruited in a mis-oriented fashion, which trumps issues of separation. These reads can be categorized as “normal,” “anti-normal,” or “outie.” In addition, there are two other categories. “No mate” indicates that no mated read was available for recruitment, possibly due to sequencing error. Perhaps most useful of any of the recruitment categories, “missing” mates indicate that while a mated sequence was available, it was not recruited to the reference. “Missing” mates identify breaks in synteny between the environmental data and the reference sequence.
doi:10.1371/journal.pbio.0050077.g004
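The categories in Figure 4 amount to a small decision procedure, sketched below under stated assumptions. The `Placement` type, the function name, and the insert-size bounds are hypothetical. The sketch assumes that a correctly oriented pair has the upstream read on the forward strand and the downstream read on the reverse strand, which the caption does not state. The three mis-oriented categories (“normal,” “anti-normal,” “outie”) are collapsed into a single label because their exact definitions are not given in this section.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Placement:
    start: int      # leftmost reference coordinate of the recruited read
    end: int        # rightmost reference coordinate of the recruited read
    forward: bool   # True if the read aligned to the forward strand


def classify_mate(read: Placement, mate: Optional[Placement],
                  mate_exists: bool, min_insert: int, max_insert: int) -> str:
    """Assign a coarse recruitment-metadata category to a recruited read."""
    if not mate_exists:
        return "no mate"          # no mated read was available at all
    if mate is None:
        return "missing"          # mate exists but did not recruit
    left, right = sorted((read, mate), key=lambda p: p.start)
    # Assumed proper orientation: reads point toward each other.
    if not (left.forward and not right.forward):
        return "misoriented"      # orientation trumps separation
    span = right.end - left.start
    if span < min_insert:
        return "short"
    if span > max_insert:
        return "long"
    return "good"


# Hypothetical clone library with inserts expected between 3 and 5 kb.
read = Placement(100, 900, True)
print(classify_mate(read, Placement(3500, 4300, False), True, 3000, 5000))
```

Checking orientation before separation mirrors the caption's statement that mis-orientation takes precedence over distance.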


represent reference-specific differences that are not found in the environmental populations rather than a cloning bias that identifies genes or gene segments that are toxic or unclonable in E. coli. The presence of missing mates flanking these gaps indicates that the associated clones do exist, and therefore that cloning issues are not a viable explanation for the absence of recruited reads. Although the reference-specific differences are quite apparent due to the recruitment gaps they generate, there are also sporadic rearrangements associated with single clones, mostly resulting from small insertions or deletions.
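Recruitment gaps of the kind described here can be located by scanning the reference for stretches that no recruited read covers. The following is a minimal sketch, not the procedure used in the study: the function name, the half-open interval convention, and the `min_gap` threshold are assumptions, and the reference is treated as linear rather than circular for simplicity.

```python
def recruitment_gaps(intervals, genome_length, min_gap=1):
    """Return half-open (start, end) reference intervals with no recruited read.

    intervals: iterable of (start, end) reference coordinates of recruited reads.
    min_gap:   smallest uncovered stretch worth reporting (must be >= 1).
    """
    if min_gap < 1:
        raise ValueError("min_gap must be at least 1")
    gaps = []
    covered_to = 0  # everything left of this coordinate has been accounted for
    for start, end in sorted(intervals):
        if start - covered_to >= min_gap:
            gaps.append((covered_to, start))
        covered_to = max(covered_to, end)
    if genome_length - covered_to >= min_gap:
        gaps.append((covered_to, genome_length))
    return gaps


# Hypothetical read placements on a 1,000-bp reference.
reads = [(0, 100), (50, 200), (500, 700)]
print(recruitment_gaps(reads, 1000, min_gap=100))
```

In practice, reads with missing mates flanking each reported interval would then be examined, as the text describes, to rule out cloning bias.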

Careful examination of the unrecruited mates of the reads flanking the gaps allowed us to identify, characterize, and quantify specific differences between the reference genomes and their environmental relatives. The results of this analysis for P. ubique and P. marinus are summarized in Table 5. With few exceptions, small gaps resulted from the insertion or deletion of only a few genes. Many of the genes associated with these small insertions and deletions have no annotated function. In some cases the insertions display a degree of variability such that different sets of genes are found at these locations within a portion of the population. In contrast, many of the larger gaps are extremely variable to the extent that every clone contains a completely unrelated or highly divergent sequence when compared to the reference or to other clones associated with that gap. These segments are hypervariable and change much more rapidly than would be expected given the variation in the rest of the genome. Sites containing a hypervariable segment nearly always contained some insert. We identified two exceptions, both associated with P. ubique. The first is located at approximately the 166-kb position in the P. ubique HTCC1062 genome. Though no large gap is present, the mated reads indicate that a highly variable insert is often present. The second is a gap on HTCC1062 that appears between 50 and 90 kb. This gap appears to be less variable than other hypervariable segments and is occasionally absent based on the large numbers of flanking long mated reads (Poster S1A). Interestingly, the long mated reads around this gap seem to be disproportionately from the Sargasso Sea samples, suggesting that this segment may be linked to geographic and/or environmental factors. Thus, hypervariable segments are highly variable even within the same sample, can on occasion be unoccupied, and the variation, or lack thereof, can be sample dependent.

Hypervariable segments have been seen previously in a wide range of microbes, including P. marinus [28], but their precise source and functional role, especially in an environmental context, remain a matter of ongoing research. For clues to these issues we examined the genes associated with the missing mates flanking these segments and the nucleotide composition of the gapped sequences in the reference genomes. In some rare cases the genes identified on reads that should have recruited within a hypervariable gap were highly similar to known viral genes. For example, a viral integrase was associated with the P. ubique HTCC1062 hypervariable gap between 516 and 561 kb. However, in the majority of cases the genes associated with these gaps were uncharacterized, either bearing no similarity to known genes or resembling genes of unknown function. If these genes were indeed acquired through horizontal transfer, then we might expect that they would have obvious compositional biases. Oligonucleotide frequencies along the P. ubique HTCC1062 and Synechococcus WH8102 genomes are quite different in the large recruitment gaps in comparison to the well-represented portions of the genome (Poster S1). Surprisingly, this was less true for P. marinus MIT9312, where the gaps have been linked to phage activity [28]. These results suggest that these hypervariable segments of the genome are widespread among marine microbial populations, and that they are the product of horizontal transfer events perhaps mediated by phage or transposable elements. These results are consistent with and expand upon the hypothesis put forward by Coleman et al. [28] suggesting that these segments are phage mediated, and conflict with initial claims that the HTCC1062 genome was devoid of genes acquired by horizontal transfer [29].
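A compositional comparison of this kind can be sketched as a sliding-window oligonucleotide profile. The code below is only an illustration of the general idea: the choice of tetranucleotides (k = 4), the Euclidean distance, and all function names are assumptions, since this section does not specify how the oligonucleotide frequencies shown in Poster S1 were computed.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_profile(seq, k=4):
    """Relative frequencies of all 4**k k-mers over A, C, G, T."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)  # k-mers with ambiguous bases are ignored
    if total == 0:
        raise ValueError("sequence contains no unambiguous k-mers")
    return [counts[m] / total for m in kmers]


def profile_distance(a, b):
    """Euclidean distance between two k-mer frequency profiles."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def window_distances(genome, window, step, k=4):
    """Distance of each window's profile from the whole-genome profile."""
    background = kmer_profile(genome, k)
    return [
        (start, profile_distance(kmer_profile(genome[start:start + window], k),
                                 background))
        for start in range(0, len(genome) - window + 1, step)
    ]


# Toy sequence with a compositionally atypical block in the middle.
toy = "ACGT" * 100 + "A" * 100 + "ACGT" * 100
outlier = max(window_distances(toy, window=100, step=100), key=lambda t: t[1])
print(outlier[0])
```

Windows whose distance from the genome-wide background is unusually large would be candidates for horizontally acquired sequence, in the sense discussed above.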

Figure 5. Fragment Recruitment at Sites of Rearrangements

Environmental sequences recruited near breaks in synteny have characteristic patterns of recruitment metadata. Indeed, each of five basic rearrangements (i.e., insertion, deletion, translocation, inversion, and inverted translocation) produced a distinct pattern when examining the recruitment metadata. Here, example recruitment plots for each type of rearrangement have been artificially generated. The “good” and “no mate” categories have been suppressed. In each case, breaks in synteny are marked by the presence of stacks of “missing” mate reads. The presence or absence of other categories distinguishes each type of rearrangement from the others.
doi:10.1371/journal.pbio.0050077.g005


Table 5. Atypical Segments in P. marinus MIT9312 and P. ubique HTCC1062 (SAR11)

| Reference Genome | Begin (a) | End (b) | Size, bp | Type of Variant (c) | Description |
|---|---|---|---|---|---|
| MIT9312 | 36,401 | 38,311 | 2,132 | Variable deletion | 12 out of 66 clones support simple deletion. Remaining clones show considerable sequence variation amongst themselves. |
| MIT9312 | 124,448 | 125,219 | 771 | Variable insertion | Associated with ASN tRNA gene; in the environment, half the reads identify pair of small inserts with no similarity to known genes; the other half point to small and large inserts of undetermined nature. |
| MIT9312 | 233,826 | 233,910 | 84 | Insertion | All clones support small insert (270 bp) with no clear sequence similarity to known genes or sequences. |
| MIT9312 | 243,296 | 245,115 | 424 | Variable deletion | 24 of 42 clones support simple deletion of hypothetical protein. 13 support slightly larger deletion. Remaining not clearly resolved. |
| MIT9312 | 296,818 | 300,888 | 4,344 | Variable deletion | 44 clones support deletion of 3,070 bp segment containing 4 genes (3 hypothetical; 1 carbamoyltransferase). 17 clones support alternative sequences with little or no similarity to each other. |
| MIT9312 | 342,404 | 342,662 | 326 | Variable insert | In the environment there is a 93% chance of finding deoxyribodipyrimidine photolyase and a 7% chance of finding an ABC type Fe3+ siderophore transport system permease component. |
| MIT9312 | 345,933 | 365,351 | 19,418 | Hypervariable | Very limited similarity among clones indicates that this is a hypervariable segment. Note that it is closely associated with a site-specific integrase/recombinase. |
| MIT9312 | 551,347 | 552,025 | 678 | Deletion | Two small deletions within a hypothetical protein. |
| MIT9312 | 617,914 | 621,556 | 3,642 | Hypervariable | About 50% of the clones support small deletions among several hypothetical proteins. Remaining support significant variability suggesting hypervariable segment. At least two of the missed mates contain integrase-like genes. |
| MIT9312 | 646,340 | 652,375 | 6,035 | Deletion/Hypervariable | Majority of clones support simple deletion of eight hypothetical and hypothetical genes. Small number of clones indicate this may be hypervariable as well. |
| MIT9312 | 655,241 | 655,800 | 559 | Deletion | Small deletion between hypothetical proteins. |
| MIT9312 | 665,000 | 678,000 | 12,000 | Deletions | Complex set of deletions and replacements that vary with geographic location. |
| MIT9312 | 665,824 | 666,380 | 556 | Insertion | Small inserted hypothetical protein. |
| MIT9312 | 670,747 | 671,933 | 1,186 | Deletion | Deletes hypothetical gene. |
| MIT9312 | 736,266 | 736,289 | 23 | Insertion | All clones support insertion of fructose-bisphosphate aldolase and fructose-1,6-bisphosphate aldolase. |
| MIT9312 | 762,156 | 762,717 | 561 | Deletion | Small deletion between hypothetical proteins. |
| MIT9312 | 779,006 | 779,309 | 303 | Insertion | Small hypothetical protein inserted. |
| MIT9312 | 874,349 | 874,913 | 564 | Insertion | Small insertion including gene with similarity to RNA-dependent RNA-polymerase. |
| MIT9312 | 943,389 | 946,997 | 3,608 | Variable | Several small changes. |
| MIT9312 | 1,043,129 | 1,131,874 | 88,745 | Hypervariable | |
| MIT9312 | 1,140,922 | 1,141,412 | 490 | Insertion | Small insertion. |
| MIT9312 | 1,144,307 | 1,144,790 | 483 | Insertion | Small insertion of several genes. |
| MIT9312 | 1,155,123 | 1,156,440 | 1,317 | Deletion | Deletes high-light–inducible protein. |
| MIT9312 | 1,172,609 | 1,177,292 | 4,683 | Variable | Several genes have been replaced or deleted. |
| MIT9312 | 1,202,643 | 1,274,335 | 71,692 | Hypervariable | |
| MIT9312 | 1,288,481 | 1,290,367 | 1,886 | Variable deletion | Deletes polysaccharide export-related periplasmic protein (28 out of 55). Other deletions are variable and may include replacement with alternate sequences. |
| MIT9312 | 1,323,606 | 1,324,523 | 917 | Deletion | Environmental sequences lack a small hypothetical protein. |
| MIT9312 | 1,369,637 | 1,369,996 | 359 | Insertion | NAD-dependent DNA ligase absent from MIT9312; has possible paralog. |
| MIT9312 | 1,381,273 | 1,382,049 | 776 | Deletion | Small insert in MIT9312 not present in environment. |
| MIT9312 | 1,384,664 | 1,385,110 | 446 | Deletion | Deletes delta(12)-fatty acid dehydrogenase and replaces gene with small (~100 bp) sequence. There is some variation in the exact location and replacement sequence. |
| MIT9312 | 1,388,430 | 1,389,718 | 1,288 | Replacement | Segment between two high-light–inducible proteins swapped for different sequence with no similarity. |
| MIT9312 | 1,392,865 | 1,392,976 | 111 | Replacement | Small replacement deletes hypothetical gene and replaces it with small, unknown sequence. There is some variation in the precise boundaries of the deletion and in the replacement sequences. |
| MIT9312 | 1,397,696 | 1,420,005 | 22,309 | Hypervariable | |
| MIT9312 | 1,486,145 | 1,487,971 | 231 | Variable | Small insertion of dolichyl-phosphate-mannose-protein mannosyltransferase; alternately deletes glycosyl transferase (8 out of 56). |
| MIT9312 | 1,519,810 | 1,520,860 | 1,050 | Variable deletion | Deletes a single hypothetical protein; about half of the deletions contain variable sequences of unknown origin. |
| MIT9312 | 1,568,049 | 1,569,121 | 1,072 | Replacement | Typically 928-bp portion of MIT9312 replaced by 175-bp stretch in environment; some small amount of variation in environmental replacement sequence (11 out of 51). |
| HTCC1062 | 50,555 | 93,942 | 43,387 | Variable deletion | Low recruitment segment containing many hypothetical, transporter, and secretion genes. |
| HTCC1062 | 146,074 | 146,415 | 341 | Deletion | Often deleted segment containing DoxD-like and ferredoxin dependent glutamate synthase peptide. |
| HTCC1062 | 166,600 | 166,700 | 100 | Variable replacement | GOS sequences indicate that variable blocks of genes are frequently inserted here. |
| HTCC1062 | 308,720 | 309,633 | 913 | Deletion | Potential sulfotransferase domain deleted. |
| HTCC1062 | 339,545 | 339,951 | 406 | Deletion | Deletes a predicted O-linked N-acetylglucosamine transferase. |
| HTCC1062 | 385,348 | 386,224 | 876 | Deletion | SAM-dependent methyltransferase deleted. |


Though insertions and deletions accounted for many of the obvious regions of structural variation, we also looked for rearrangements. The high levels of local synteny associated with P. ubique and P. marinus suggested that large-scale rearrangements were rare in these populations. To investigate this hypothesis we used the recruitment data to examine how frequently rearrangements besides insertions and deletions could be identified. We looked for rearrangements consisting of large (greater than 50 kb) inversions and translocations associated with P. marinus; however, we did not identify any such rearrangements that consistently distinguished environmental populations from sequenced cultivars. Rare inversions and translocations were identified in the dominant subtype associated with MIT9312 (Table 6). Based on the amount of sequence that contributed to the analysis, we estimate that one inversion or translocation will be observed for every 2.6 Mbp of sequence examined (less than once per P. marinus genome).

A further observation concerns the uniformity along a genome of the evolutionary history among and within subtypes. For instance, the similarity between GOS reads and P. marinus MIT9312 is typically 85%–95%, while the similarity between MIT9312 and P. marinus MED4 is generally ~10% lower. However, there are several instances where the divergence of MIT9312 and MED4 abruptly decreases to no more than that between the GOS sequences and MIT9312 (Poster S1G). These results are consistent either with horizontal transfer (recombination) or with inhomogeneous

Table 6. Six Large-Scale Translocations and Inversions Were Identified in the Abundant P. marinus Subtype

| Group | Low Genome Begin | Low Genome End | High Genome Begin | High Genome End | Read ID | Low Read Begin | Low Read Breakpoint | Low Read Strand | High Read Breakpoint | High Read End | High Read Strand | Read Length | Inversion | Sample |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 34,428 | 34,904 | 1,536,417 | 1,536,774 | 1092255385627 | 0 | 476 | 1 | 477 | 834 | 1 | 834 | No | 15 |
| 2 | 607,778 | 608,467 | 1,131,368 | 1,131,621 | 1093017685727 | 270 | 959 | 0 | 0 | 251 | 1 | 959 | Yes | 18 |
| 3 | 618,997 | 619,375 | 1,172,217 | 1,172,728 | 1092963065572 | 19 | 397 | 1 | 425 | 939 | 1 | 939 | No | 17 |
| 3 | 618,997 | 619,372 | 1,172,203 | 1,172,728 | 1095433012642 | 0 | 375 | 1 | 403 | 931 | 1 | 941 | No | 26 |
| 3 | 618,997 | 619,251 | 1,172,216 | 1,172,728 | 1091140913752 | 2 | 256 | 1 | 284 | 799 | 1 | 799 | No | 19 |
| 3 | 618,997 | 619,331 | 1,172,280 | 1,172,728 | 1641121 | 2 | 337 | 1 | 365 | 816 | 1 | 816 | No | 00d |
| 3 | 619,007 | 619,375 | 1,172,223 | 1,172,728 | 1092963490951 | 19 | 387 | 1 | 425 | 933 | 1 | 933 | No | 17 |
| 4 | 652,933 | 653,483 | 1,369,774 | 1,370,077 | 200560 | 325 | 875 | 0 | 0 | 303 | 1 | 875 | Yes | 00a |
| 4 | 652,979 | 653,350 | 1,369,774 | 1,370,200 | 1492647 | 0 | 371 | 1 | 439 | 865 | 0 | 867 | Yes | 00d |
| 4 | 652,979 | 653,496 | 1,369,774 | 1,370,086 | 1092256128910 | 10 | 527 | 1 | 595 | 905 | 0 | 906 | Yes | 15 |
| 4 | 652,979 | 653,496 | 1,369,774 | 1,370,061 | 1092405979387 | 14 | 531 | 1 | 599 | 886 | 0 | 886 | Yes | 25 |
| 4 | 652,979 | 653,353 | 1,369,774 | 1,370,132 | 1093017637883 | 2 | 376 | 1 | 444 | 802 | 0 | 802 | Yes | 18 |
| 4 | 652,980 | 653,496 | 1,369,774 | 1,370,111 | 1092343389654 | 6 | 522 | 1 | 591 | 928 | 0 | 928 | Yes | 25 |
| 5 | 1,049,484 | 1,049,793 | 1,172,301 | 1,172,782 | 1400802 | 518 | 827 | 0 | 1 | 483 | 0 | 827 | No | 00d |
| 6 | 1,219,993 | 1,220,519 | 1,485,834 | 1,486,222 | 1092256207885 | 2 | 528 | 1 | 532 | 919 | 1 | 919 | No | 17 |
| 6 | 1,219,993 | 1,220,496 | 1,485,925 | 1,486,222 | 380485 | 300 | 803 | 0 | 0 | 296 | 0 | 803 | No | 00a |
| 6 | 1,219,993 | 1,220,477 | 1,485,782 | 1,486,204 | 1484583 | 0 | 484 | 1 | 506 | 927 | 1 | 927 | No | 00d |
| 6 | 1,220,005 | 1,220,388 | 1,485,733 | 1,486,216 | 1682914 | 507 | 890 | 0 | 2 | 484 | 0 | 913 | No | 00d |

doi:10.1371/journal.pbio.0050077.t006
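One regularity in Table 6 can be checked directly: in every row, the Inversion column reads “Yes” exactly when the two halves of the read align to different strands. The sketch below encodes the distinct strand combinations observed for each group and tests that rule. The interpretation of the 0/1 strand codes is not defined in this section, so the code relies only on whether the two values differ. The tuple layout and function name are illustrative choices.

```python
# (group, low_read_strand, high_read_strand, inversion_reported)
# Distinct combinations transcribed from Table 6.
rows = [
    (1, 1, 1, "No"),
    (2, 0, 1, "Yes"),
    (3, 1, 1, "No"),
    (4, 0, 1, "Yes"),
    (4, 1, 0, "Yes"),
    (5, 0, 0, "No"),
    (6, 1, 1, "No"),
    (6, 0, 0, "No"),
]


def implies_inversion(low_strand: int, high_strand: int) -> bool:
    """A split read suggests an inversion when its two halves align to
    opposite strands of the reference."""
    return low_strand != high_strand


for group, low, high, reported in rows:
    predicted = "Yes" if implies_inversion(low, high) else "No"
    print(group, predicted, reported)
```

Rows where both halves share a strand but map far apart on the genome correspond to the translocations in the table.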

Table 5. Continued.

| Reference Genome | Begin (a) | End (b) | Size, bp | Type of Variant (c) | Description |
|---|---|---|---|---|---|
| HTCC1062 | 441,074 | 441,152 | 78 | Deletion | Deletes possible methyltransferase FkbM. |
| HTCC1062 | 516,041 | 561,604 | 45,563 | Variable replacement | Several high identity “missed” mates match phage genes, including an integrase. |
| HTCC1062 | 660,413 | 660,978 | 565 | Deletion | Deletes hypothetical protein. |
| HTCC1062 | 675,141 | 676,399 | 1,258 | Deletion | Deletes hypothetical protein. |
| HTCC1062 | 766,022 | 768,263 | 2,241 | Deletion | Deletes four hypothetical proteins. |
| HTCC1062 | 814,015 | 816,386 | 2,371 | Deletion | Deletes steroid monoxygenase and short-chain dehydrogenase. |
| HTCC1062 | 893,450 | 922,604 | 29,154 | Deletion | Deletes large segment including a 7317-aa hypothetical gene. |
| HTCC1062 | 941,203 | 942,403 | 1,200 | Deletion | Deletes hypothetical protein. |
| HTCC1062 | 991,461 | 997,861 | 6,400 | Deletion | Deletes several hypotheticals and mix of other genes; adjacent to recombinase. |
| HTCC1062 | 1,117,126 | 1,143,483 | 26,357 | Deletion | Large number of hypothetical proteins and transporters are deleted. |
| HTCC1062 | 1,160,908 | 1,166,589 | 5,681 | Deletion | Deletes a small cluster of peptides with various functions. |
| HTCC1062 | 1,188,887 | 1,189,231 | 344 | Deletion | Deletes portion of winged helix DNA-binding protein but inserts sequences with similarity to gi\|71082757 sodium bile symporter family protein found in large gap between 50,555 and 93,942. |

(a) Begin indicates the approximate bp position which marks the beginning of the gap in recruitment.
(b) End indicates the approximate bp position which marks the ending of the gap in recruitment.
(c) The type of change indicates what would have to happen to the reference genome to produce the sequences seen in the environment (e.g., a deletion indicates that the indicated portion of the reference would have to be deleted to generate the variant(s) seen in the environment).
doi:10.1371/journal.pbio.0050077.t005


selectional pressures. Similar patterns are present in the two high-identity subtypes seen on the P. ubique HTCC1062 genome (Poster S1D). Other regions show local increases in similarity between MIT9312 and the dominant subtype that are not reflected in the MIT9312/MED4 divergence (e.g., near positions 50 kb, 288 kb, 730 kb, 850 kb, and 954 kb on MIT9312; also see Poster S1G). These latter regions might reflect either regions of homogenizing recombination or regions of higher levels of purifying selection. However, the lengths of the intervals (several are 10 kb or more) are longer than any single gene and correspond to genes that are not extremely conserved over greater taxonomic distances (in contrast to the ribosomal RNA operon). Equally, if widespread horizontal transfer of an advantageous segment explains these intervals, the transfers occurred long enough ago for appreciable variation to accumulate (unpublished data).

Extreme Assembly of Uncultivated Populations

The analyses described above have been confined to those organisms with representatives in culture and for which genomes were readily available. Producing assemblies for other abundant but uncultivated microbial genera would provide valuable physiological and biochemical information that could eventually lead to the cultivation of these organisms, help elucidate their role in the marine community, and allow similar analyses of their evolution and variation such as those performed on sequenced organisms. Previous assembly efforts and the fragment recruitment plots showed that there is considerable and in many cases conflicting variation among related organisms. Such variation is known to disrupt whole-genome assemblers. This led us to try an assembly approach that aggressively resolves conflicts. We call this approach “extreme assembly” (see Materials and Methods). This approach currently does not make use of mate-pairing data and therefore produces only contigs, not scaffolded sequences. Using this approach, contigs as large as 900 kb could be aligned almost in their entirety to the P. marinus MIT9312 and P. ubique HTCC1062 genomes (Figure 2J–2L). Consistent patterns of fragment recruitment (see below) generally provided evidence of the correctness of contigs belonging to otherwise-unsequenced organisms. Accordingly, large contigs from these alternate assemblies were used to investigate genetic and geographic population structure, as described below. However, the more aggressive assemblies demonstrably suffered from higher rates of assembly artifacts, including chimerism and false consensus sequences (Figure 6). Thus, the more stringent primary assembly was employed for most assembly-based analyses, as manual curation was not practical.

As just noted, many of the large contigs produced by the more aggressive assembly methods described above did not align to any great degree with known genomes. Some could be tentatively classified based on contained 16S sequences, but the potential for computationally generated chimerism within the rRNA operon is sufficiently high that inspection of the assembly or other means of confirming such classifications is essential. An alternative to an unguided assembly that facilitates the association of assemblies with known organisms is to start from seed fragments that can be identified as belonging to a particular taxonomic group. We employed fragments outside the ribosomal RNA operon that were mated to a 16S-containing read, limiting extension to the direction away from the 16S operon. This produced contigs of 100 kb or more for several of the ribotypes that were abundant in the GOS dataset. When evaluated via fragment recruitment (Figure 2M–2O), these assemblies revealed patterns analogous to those seen for the sequenced genomes described above: multiple subtypes could be distinguished along the assembly, differing in similarity to the reference sequence and sample distribution, with occasional gaps. Hypervariable segments by definition were not represented in these assemblies, but they may help explain the termination of the extreme assemblies for P. marinus and SAR11 and provide a plausible explanation for termination of assemblies of the other deeply sampled populations as well.

This directed approach to assembly can also be used to investigate variation within a group of related organisms (e.g., a 16S ribotype). We explored the potential to assemble

Figure 6. Examples of Chimeric Extreme Assemblies

(A) Fragment recruitment to an extreme assembly contig indicates the assembly is chimeric between two organisms, based on dramatic shifts in density of recruitment, level of conservation, and sample distribution.
(B) Fragment recruitment to a SAR11-related extreme assembly. Changes in color, density, and vertical location toward the top of the figure indicate transitions among multiple subtypes of SAR11.
doi:10.1371/journal.pbio.0050077.g006


distinct subtypes of SAR11 by repeatedly seeding extreme assembly with fragments mated to a SAR11-like 16S sequence. Figure 7 compares the first 20 kb from each of 24 independent assemblies. Eighteen of these segments could be aligned full-length to a portion of the HTCC1062 genome just upstream of 16S, while six appeared to reflect rearrangements relative to HTCC1062. The rearranged segments were associated with more divergent 16S sequences (8%–14% diverged from the 16S of HTCC1062), while those without rearrangements corresponded to less divergent 16S (averaging less than 3% different from HTCC1062). In each segment, many reads were recruited above 90% identity, but different samples dominated different assemblies. Phylogenetic trees support the inference of evolutionarily distinct subtypes with distinctive sample distributions (Figure 8).

Taxonomic Diversity

Environmental surveys provide a cultivation-independent means to examine the diversity and complexity of an environmental sample and serve as a basis to compare the populations between different samples. Typically, these surveys use PCR to amplify ubiquitous but slowly evolving genes such as the 16S rRNA or recA genes. These in turn can be used to distinguish microbial populations. Since PCR can introduce various biases, we identified 16S genes directly from the primary GOS assembly. In total, 4,125 distinct full-length or partial 16S were identified. Clustering of these sequences at 97% identity gave a total of 811 distinct ribotypes. Nearly half (48%) of the GOS ribotypes and 88% of the GOS 16S sequences were assigned to ribotypes previously deposited in public databases. That is, more than half the ribotypes in the GOS dataset were found to be novel at what is typically considered the species level [30]. The overall taxonomic distribution of the GOS ribotypes sampled by shotgun sequencing is consistent with previously published PCR-based studies of marine environments (Table 7) [31]. A smaller amount (16%) of GOS ribotypes and 3.4% of the GOS 16S sequences diverged by more than 10% from any publicly available 16S sequence, thus being novel to at least the family level.

A census of microbial ribotypes allows us to identify the abundant microbial lineages and estimate their contribution

Figure 7. Fragment Recruitment Plots to 20-kb Segments of SAR11-Like Contigs Show That Many SAR11 Subtypes, with Distinct Distributions, Can Be Separated by Extreme Assembly

Each segment is constructed of a unique set of GOS sequencing reads (i.e., no read was used in more than one segment). Segments are arbitrarily labeled (A–X) for reference in Figure 8.
doi:10.1371/journal.pbio.0050077.g007


Figure 8. Phylogeny of GOS Reads Aligning to P. ubique HTCC1062 Upstream of 16S Gene Indicates That the Extreme Assemblies in Figure 7 Correspond to Monophyletic Subtypes

Coloring of branches indicates that the corresponding reads align at >90% identity to the extreme assembly segments shown in Figure 7; colored labels (A–X) correspond to the labels in Figure 7, indicating the segment or segments to which reads aligned.
doi:10.1371/journal.pbio.0050077.g008


to the GOS dataset. Of the 811 ribotypes, 60 contain more than 8-fold coverage of the 16S gene (Table 8); jointly, these 60 ribotypes accounted for 73% of all the 16S sequence data. All but one of the 60 have been detected previously, yet only a few are represented by close relatives with complete or nearly complete genome sequencing projects (see Fragment Recruitment for further details). Several other abundant 16S sequences belong to well-known environmental ribotypes that do not have cultivated representatives (e.g., SAR86, Roseobacter NAC-1–2, and branches of SAR11 other than those containing P. ubique). Interestingly, archaea are nearly absent from the list of dominant organisms in these near-surface samples.
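The census described above can be illustrated with a minimal sketch. It assumes that each 16S fragment has already been assigned to a ribotype and that depth of coverage is approximated as aligned bases divided by an assumed 16S gene length of about 1,500 bp; the function name, the input layout, and the toy values are hypothetical and are not taken from the GOS pipeline.

```python
from collections import defaultdict

GENE_LENGTH = 1500  # assumed approximate length of a 16S rRNA gene, in bp


def coverage_census(fragments, min_depth=8.0, gene_length=GENE_LENGTH):
    """Summarize 16S coverage per ribotype.

    fragments: iterable of (ribotype, aligned_length_in_bp) pairs.
    Returns (depth per ribotype, ribotypes above min_depth,
    share of all 16S bases held by those ribotypes).
    """
    bases = defaultdict(int)
    for ribotype, length in fragments:
        bases[ribotype] += length

    # Depth of coverage: total aligned bases divided by the gene length.
    depth = {r: b / gene_length for r, b in bases.items()}
    abundant = {r: d for r, d in depth.items() if d > min_depth}

    total = sum(bases.values())
    share = sum(bases[r] for r in abundant) / total if total else 0.0
    return depth, abundant, share


# Toy input: one well-covered ribotype and one rare ribotype.
toy = [("ribotype_common", 1500)] * 10 + [("ribotype_rare", 750)] * 2
depth, abundant, share = coverage_census(toy)
```

With real data, the same tally would be expected to reproduce the kind of summary reported in Table 8, where a small number of ribotypes account for most of the 16S sequence.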

The distribution of these ribotypes reveals distinct microbial communities (Figure 9 and Table 8). Only a handful of the ribotypes appear to be ubiquitously abundant; these are dominated by relatives of SAR11 and SAR86. Many of the ribotypes that are dominant in one or more samples appear to reside in one of three separable marine surface habitats. For example, several SAR11, SAR86, and alpha Proteobacteria, as well as an Acidimicrobidae group, are widespread in the surface waters, while a second niche delineated by tropical samples contains several different SAR86, Synechococcus and Prochlorococcus (both cyanobacterial groups), and a Rhodospirillaceae group. Other ribotypes related to Roseobacter RCA, SAR11, and gamma Proteobacteria are abundant in the temperate samples but were not observed in the tropical or Sargasso samples. Not surprisingly, samples taken from nonmarine environments (GS33, GS20, GS32), estuaries (GS11, GS12), and larger-sized fraction filters (GS01a, GS01b, GS25) have distinguishing ribotypes. Furthermore, as the complete genomes of these dominant members are obtained, the capabilities responsible for their abundances may well lend insight into the community metabolism in various oceanic niches.

Sample Comparisons

The most common approach for comparing the microbial community composition across samples has been to examine the ribotypes present as indicated by 16S rRNA genes or by analyzing the less-conserved ITS located between the 16S and 23S gene sequences [7,8,16,17]. However, a clear observation emerging from the fragment recruitment views was that the reference ribotypes recruit multiple subtypes, and that these subtypes were distributed unequally among samples (Figures 2, 7, 8; Poster S1D, S1F, and S1I).

We developed a method to assess the genetic similarity between two samples that potentially makes use of all portions of a genome, not just the 16S rRNA region. This similarity measure is assembly independent; under certain circumstances, it is equivalent to an estimate of the fraction of sequence from one sample that could be considered to be in the other sample. Whole-metagenomic similarities were computed for all pairs of samples. Results are presented for comparisons at ≥98% and 90% identity. No universal cutoff consistently divides sequences into natural subsets, but the 98% identity cutoff provides a relatively high degree of resolution, while the 90% cutoff appears to be a reasonable heuristic for defining subtypes. For instance, a 90% cutoff treats most of the reads specifically recruited to P. marinus MIT9312 as similar (those more similar to MED4 notably excepted), while reasonably separating clades of SAR11 (Figures 7 and 8). Reads with no qualifying overlap alignment to any other read in a pair of samples are uninformative for this analysis, as they correspond to lineages that were so lightly sequenced that their presence in one sample and absence in another may be a matter of chance. For the 90% cutoff, 38% of the sequence reads contributed to the analysis. The resulting similarities reveal clear and consistent groupings of samples, as well as the outlier status of certain samples (Figures 10 and 11).

The broadest contrast was between samples that could be loosely labeled “tropical” (including samples from the Sargasso Sea [GS00b, GS00c, GS00d] and samples that are temperate by the formal definition but under the influence of the Gulf Stream [GS14, GS15]) and “temperate.” Further subgroups can be identified within each of these categories, as indicated in Figures 10 and 11. In some cases, these groupings were composed of samples taken from different ocean basins during different legs of the expedition. A few pairs of samples with strikingly high similarity were observed, including GS17 and GS18, GS23 and GS26, GS27 and GS28, and GS00b and GS00d. In each case, these pairs of samples were collected from consecutive or nearly consecutive samples. However, the same could be said of many other pairs of samples that do not show this same degree of similarity. Indeed, geographically and temporally separated samples taken in the Atlantic (GS17, GS18) and Pacific (GS23, GS26) during separate legs of the expedition are more similar to one another than were most pairs of consecutive samples. The samples with least similarity to any other sample were from unique habitats. Thus, similarity cannot be attributed to geographic separation alone.

The groupings described above can be reconstructed from taxonomically distinct subsets of the data. Specifically, the major groups of samples visible in Figure 10 were reproduced when sample similarities were determined based only on fragments recruiting to P. ubique HTCC1062 (unpublished data). Likewise, the same groupings were observed when the fragments recruiting to either HTCC1062 or P. marinus MIT9312, or both, were excluded from the calculations (unpublished data). Thus, the factors influencing sample similarities do not appear to rely solely on the most abundant

Table 7. Taxonomic Makeup of GOS Samples Based on 16S Data from Shotgun Sequencing

| Phylum or Class | Fraction^a |
|---|---|
| Alpha Proteobacteria | 0.32 |
| Unclassified Proteobacteria | 0.155 |
| Gamma Proteobacteria | 0.132 |
| Bacteroidetes | 0.13 |
| Cyanobacteria | 0.079 |
| Firmicutes | 0.075 |
| Actinobacteria | 0.046 |
| Marine Group A | 0.022 |
| Beta Proteobacteria | 0.017 |
| OP11 | 0.008 |
| Unclassified Bacteria | 0.008 |
| Delta Proteobacteria | 0.005 |
| Planctomycetes | 0.002 |
| Epsilon Proteobacteria | 0.001 |

^a Values shown are averages over all samples.
doi:10.1371/journal.pbio.0050077.t007


Table 8. Most Abundant Ribotypes (97% Identity Clusters)

| Ribotype Classification^a | Depth of Coverage^b | Range | Number of Matching GenBank Entries^c |
|---|---|---|---|
| SAR11 Surface 1 | 581 | Widespread | 100+ |
| SAR11 Surface 2 | 182 | Sargasso and GS31^d | 100+ |
| Burkholderia | 139 | 00a | 100+ |
| Acidomicrobidae type a | 133 | Tropical and Sargasso^d | 77 |
| Prochlorococcus | 112 | Tropical and Sargasso | 76 |
| SAR11 Surface 3 | 109 | Widespread | 63 |
| SAR86-like type a | 108 | Widespread | 88 |
| Shewanella | 80 | 00a | 49 |
| Synechococcus | 59 | Tropical^d | 100+ |
| Rhodospirillaceae | 50 | Tropical and Sargasso^d | 15 |
| SAR86-like type b | 47 | Hypersaline pond^d | 24 |
| SAR86-like type c | 47 | Tropical and Sargasso | 75 |
| Chlorobi-like | 40 | Hypersaline pond | 1 |
| Alpha Proteobacteria type a | 38 | Widespread | 10 |
| Roseobacter type a | 35 | Tropical and Sargasso^d | 42 |
| Cellulomonadaceae type a | 34 | Hypersaline pond | 12 |
| SAR86-like type d | 28 | Widespread | 30 |
| Alpha Proteobacteria type b | 27 | Widespread | 43 |
| SAR86-like type e | 26 | Widespread | 71 |
| Cytophaga type a | 24 | Tropical and Sargasso | 21 |
| Bacteroidetes type a | 23 | Widespread | 17 |
| Bdellovibrionales type a | 21 | Tropical and Sargasso | 36 |
| Acidomicrobidae type b | 21 | Temperate | 41 |
| SAR116-like | 21 | Widespread | 28 |
| Marine Group A type a | 20 | Tropical and Sargasso^d | 9 |
| Remotely SAR11-like type a | 19 | Sargasso and tropical | 12 |
| Frankineae type a | 19 | Fresh and estuary | 6 |
| Frankineae type b | 18 | Hypersaline pond^d | 71 |
| SAR86-like type f | 18 | Tropical | 15 |
| SAR86-like type g | 17 | Tropical | 13 |
| Remotely SAR11-like type b | 16 | Tropical and Sargasso | 4 |
| Gamma Proteobacteria type a | 16 | Sargasso and GS14 | 18 |
| Microbacteriaceae | 15 | Hypersaline pond | 1 |
| SAR102/122-like | 15 | Tropical and Sargasso | 15 |
| SAR86-like type h | 15 | Tropical and Sargasso | 27 |
| Bacteroidetes type b | 14 | Tropical^d | 14 |
| Remotely SAR11-like type c | 14 | Tropical and Sargasso | 18 |
| SAR86-like type i | 14 | Sargasso^d | 13 |
| Rhodobium-like type a | 14 | Sargasso and GS31^d | 9 |
| Marine Group A type b | 13 | Tropical and Sargasso^d | 28 |
| Gamma Proteobacteria type b | 12 | S. temperate and GS31 | 22 |
| Oceanospirillaceae | 12 | Mangrove | 1 |
| Gamma Proteobacteria type c | 12 | Widespread | 14 |
| SAR11-like type a | 12 | Temperate | 17 |
| SAR11-like type b | 11 | Fresh and estuary | 11 |
| SAR11-like type c | 11 | Sargasso and GS31^d | 12 |
| SAR86-like type j | 11 | Tropical and Sargasso | 1 |
| Roseobacter Algicola | 11 | Widespread | 20 |
| Remotely SAR102/122-like | 11 | Hypersaline pond | 0 |
| Frankineae type c | 10 | Fresh and estuary | 11 |
| Rhodobium-like type b | 10 | Widespread | 9 |
| Roseobacter RCA | 10 | Temperate | 39 |
| Remotely SAR11-like type d | 10 | Tropical | 32 |
| Acidobacteria | 9 | Fresh | 4 |
| Remotely SAR11-like type e | 9 | Tropical and Sargasso | 8 |
| Frankineae type d | 9 | Estuary and fresh | 89 |
| SAR11-like type d | 9 | Sargasso and GS31 | 4 |
| Methylophilus | 9 | Temperate^d | 41 |
| Archaea C1 C1a | 8 | Sargasso and GS31 | 2 |
| Cytophaga type b | 8 | Widespread | 9 |

^a Taxonomic classifications based on Hugenholtz ARB database. Labels indicate the most specific taxonomic assignment that could be confidently assigned to each ribotype. “Type a,” “type b,” etc., used to arbitrarily discriminate separate 97% ribotypes that would otherwise be given the same name.
^b Note that the 16S rRNA gene can be multicopy.
^c Matching GenBank entries required full-length matches at ≥98% identity.
^d Less than 1× coverage outside described range.
doi:10.1371/journal.pbio.0050077.t008


organisms but rather are reflected in multiple microbial lineages.
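As a rough illustration of the sample comparison described in this section, the sketch below scores a pair of samples by the share of informative reads that have a qualifying overlap in the other sample, then groups samples by average-linkage clustering. It is a simplified stand-in under stated assumptions, not the similarity measure used in the study: the function names, the precomputed set of overlapping read pairs, and the toy sample labels and similarity values are all hypothetical.

```python
from itertools import combinations


def sample_similarity(reads_a, reads_b, overlaps):
    """Toy similarity: share of informative reads with a partner in the other sample.

    overlaps is a set of frozenset({read_x, read_y}) pairs that align at or
    above the chosen identity cutoff (for example 90% or 98%).
    """
    def has_partner(read, candidates):
        return any(frozenset((read, c)) in overlaps for c in candidates if c != read)

    shared = informative = 0
    for own, other in ((reads_a, reads_b), (reads_b, reads_a)):
        for read in own:
            in_other = has_partner(read, other)
            # Reads with no qualifying overlap anywhere in the pair are skipped.
            if in_other or has_partner(read, own):
                informative += 1
                shared += in_other
    return shared / informative if informative else 0.0


def average_linkage(names, sim):
    """Agglomerative clustering on similarities; returns a nested-tuple tree.

    sim maps frozenset({a, b}) to the similarity between samples a and b.
    """
    clusters = [(name, [name]) for name in names]  # (subtree, member samples)
    while len(clusters) > 1:
        def mean_sim(pair):
            members_a, members_b = clusters[pair[0]][1], clusters[pair[1]][1]
            total = sum(sim[frozenset((x, y))] for x in members_a for y in members_b)
            return total / (len(members_a) * len(members_b))

        # Merge the two clusters with the highest mean pairwise similarity.
        i, j = max(combinations(range(len(clusters)), 2), key=mean_sim)
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Hypothetical similarities between three invented samples.
names = ["trop1", "trop2", "temp1"]
sim = {
    frozenset(("trop1", "trop2")): 0.9,
    frozenset(("trop1", "temp1")): 0.2,
    frozenset(("trop2", "temp1")): 0.1,
}
tree = average_linkage(names, sim)
```

The nested tuple plays the role of the dendrogram in Figure 10A, and the pairwise values correspond to cells of the matrix in Figure 10B. A production analysis would more likely rely on an established clustering library than on this quadratic toy loop.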

It is tempting to view the groups of similar samples as constituting community types. Sample similarities based on genomic sequences correlated significantly with differences in the environmental parameters (Table 1), particularly water temperature and salinity (unpublished data). Samples that are very similar to each other had relatively small differences in temperature and salinity. However, not all samples that had similar temperature and salinity had high community similarities. Water depth, primary productivity, fresh water input, proximity to land, and filter size appeared consistent with the observed groupings. Other factors such as nutrients and light for phototrophs and fixed carbon/energy for chemotrophs may ultimately prove better predictors, but these results demonstrate the potential of using metagenomic data to tease out such relationships.

Examining the groupings in Figure 11 in light of habitat and physical characteristics, the following may be observed. The first two samples, a hypersaline pond in the Galapagos Islands (GS33) and the freshwater Lake Gatun in the Panama Canal (GS20), are quite distinct from the rest. Salinity, both higher and lower than in the remaining coastal and ocean samples, is the simplest explanation.

Figure 9. Presence and Abundance of Dominant Ribotypes

The relative abundance of various ribotypes (rows) in each filter (columns) is represented by the area of the corresponding spot (if any). The listed ribotypes each satisfied the following criteria in at least one filter: the ribotype was among the five most abundant ribotypes detected in the shotgun data, and was represented by at least three sequencing reads. Relative abundance is based on the total number of 16S sequences in a given filter. Order and grouping of filters is based on the clustering of genomic similarity shown in Figure 11. Ribotype order was determined based on similarity of sample distribution. A marked contrast between temperate and tropical groups is visible. Estuarine samples GS11 and GS12 contained a mix of ribotypes seen in freshwater and temperate marine samples, while samples from nonmarine habitats or larger filter sizes were pronounced outliers. The presence of large amounts of Burkholderia and Shewanella in one Sargasso Sea sample (GS00a) makes this sample look much less like other Sargasso and tropical marine samples than it otherwise would. Note that 16S is not a measure of cell abundance since 16S genes can be multicopy.
doi:10.1371/journal.pbio.0050077.g009


Twelve samples form a strong temperate cluster as seen in the similarity matrix of Figure 11 as a darker square bounded by GS06 and GS12. Embedded within the temperate cluster are three subclusters. The first subcluster includes five samples from Nova Scotia through the Gulf of Maine. This is followed by a subcluster of four samples between Rhode Island and North Carolina. The northern subcluster was sampled in August, the southern subcluster in November and

Figure 10. Similarity between Samples in Terms of Shared Genomic Content

Genomic similarity, as described in the text, is an estimate of the amount of the genetic material in two filters that is “the same” at a given percent identity cutoff: not the amount of sequence in common in a finite dataset, but rather in the total set of organisms present on each filter. Similarities are shown for 98% identity.
(A) Hierarchical clustering of samples based on pairwise similarities.
(B) Pairwise similarities between samples, represented as a symmetric matrix of grayscale intensities; a darker cell in the matrix indicates greater similarity between the samples corresponding to the row and column, with row and column ordering as in (A). Groupings of similar filters appear as subtrees in (A) and as squares consisting of two or more adjacent rows and columns with darker shading. Colored bars highlight groups of samples described in the text; labels are approximate characterizations rather than being strictly true of every sample in a group.
doi:10.1371/journal.pbio.0050077.g010


December. Though all samples were collected in the top few meters, the southern samples were in shallower waters, 10 to 30 m deep, whereas most of the northern samples were in waters greater than 100 m deep. Monthly average estimates of chlorophyll a concentrations were typically higher in the southern samples as well (Table 1). All of these factors (temperature, system primary production, and depth of the sampled water body) likely contribute to the differences in microbial community composition that result in the two well-defined clusters. The final temperate subgroup includes two estuaries, Chesapeake Bay (GS12) and Delaware Bay (GS11), distinguished by their lower salinity and higher productivity. However, GS11 is markedly similar not only to GS12 but also to coastal samples, whereas the latter appears much more unique. Interestingly, the Bay of Fundy estuary sample (GS06) clearly did not group with the two other estuaries, but rather with the northern subgroup, perhaps reflecting differences in the rate or degree of mixing at the sampling site.

Continuing to the right and downward in Figure 11, one can see a large cluster of 25 samples from the tropics and Sargasso Sea, bounded by GS47 and GS00b. This can be further subdivided into several subclusters. The first subcluster (a square bounded by GS47 and GS14) includes 14 samples, about half of which were from the Galapagos. The second distinct subcluster (a square bounded by GS16 and GS26) includes seven samples from Key West, Florida, in the Atlantic Ocean to a sample close to the Galapagos Islands in the Pacific Ocean. Loosely associated with this subcluster is a sample from a larger filter size taken en route to the Galapagos (GS25). The remaining samples group weakly with the tropical cluster. GS32 was taken in a coastal mangrove in the Galapagos. The thick organic sediment at a depth of less than a meter is the likely cause for it being unlike the other samples. Sample 00a was from the Sargasso Sea and contained a large fraction of sequence reads from apparently clonal Burkholderia and Shewanella species that are atypical. When this

Figure 11. Sample Similarity at 90% Identity

Similarity between samples in terms of shared genomic content similar to Figure 10, except that the plots were done using a 90% identity cutoff that has proven reasonable for separating some moderately diverged subtypes.
doi:10.1371/journal.pbio.0050077.g011


Table 9. Relative Abundance of TIGRFAMs Associated with a Specific Sample

TIGRFAM | Number of Peptides | Sample^a | Relative Abundance^b | Major Category | Minor Category | Description

TIGR01526 131 GS01a 3.4 Nicotinamide-nucleotide adenylyltransferase

TIGR00661 214 GS01a 2.6 Hypothetical Conserved Conserved hypothetical protein

TIGR01833 135 GS01a 2 Hydroxymethylglutaryl-CoA synthase

TIGR01408 267 GS01b 4.5 Ubiquitin-activating enzyme E1

TIGR01678 144 GS01b 2.6 Sugar 1,4-lactone oxidases

TIGR01879 758 GS01b 2.5 Amidase, hydantoinase/carbamoylase family

TIGR00890 131 GS01b 2.4 Transport Carbohydrates, Oxalate/Formate Antiporter

TIGR01767 112 GS01b 2.4 5-methylthioribose kinase

TIGR00101 495 GS01b 2.3 Central Nitrogen Urease accessory protein UreG

TIGR01659 186 GS01b 2.2 Sex-lethal family splicing factor

TIGR00313 455 GS01b 2.1 Biosynthesis Heme, Cobyric acid synthase CobQ

TIGR00317 306 GS01b 2.1 Biosynthesis Heme, Cobalamin 5'-phosphate synthase

TIGR00601 186 GS01b 2.1 DNA DNA UV excision repair protein Rad23

TIGR01792 904 GS01b 2.1 Central Nitrogen Urease, alpha subunit

TIGR00749 483 GS01b 2 Energy Glycolysis/gluconeogenesis Glucokinase

TIGR01001 485 GS01b 2 Amino Aspartate Homoserine O-succinyltransferase

TIGR02238 165 GS32 5.1 Meiotic recombinase Dmc1

TIGR02239 159 GS32 3.5 DNA repair protein RAD51

TIGR02232 140 GS32 2.6 Myxococcus cysteine-rich repeat

TIGR02153 212 GS32 2.5 Protein tRNA Glutamyl-tRNA(Gln) amidotransferase, subunit D

TIGR00519 289 GS32 2.4 L-asparaginases, type I

TIGR00288 248 GS32 2 Hypothetical Conserved Conserved hypothetical protein TIGR00288

TIGR01681 136 GS32 2 HAD-superfamily phosphatase, subfamily IIIC

TIGR02236 143 GS32 2 DNA DNA DNA repair and recombination protein RadA

TIGR00028 110 GS33 14.7 Mycobacterium tuberculosis PIN domain family

TIGR01550 131 GS33 9.1 Unknown General Death-on-curing family protein

TIGR01552 200 GS33 5.3 Mobile Other Prevent-host-death family protein

TIGR00143 151 GS33 4.6 Protein Protein [NiFe] hydrogenase maturation protein HypF

TIGR01641 131 GS33 4.3 Mobile Prophage Phage putative head morphogenesis protein, SPP1 gp7 family

TIGR01710 1,926 GS33 3.9 Cellular Pathogenesis General secretion pathway protein G

TIGR01539 251 GS33 3.6 Mobile Prophage Phage portal protein, lambda family

TIGR00016 217 GS33 3.5 Energy Fermentation Acetate kinase

TIGR00942 117 GS33 3.5 Transport Cations Multicomponent Na+:H+ antiporter

TIGR01836 174 GS33 3.3 Fatty Biosynthesis Poly(R)-hydroxyalkanoic acid synthase, class III, PhaC subunit

TIGR02110 135 GS33 3.2 Biosynthesis Other Coenzyme PQQ biosynthesis protein PqqF

TIGR01106 902 GS33 3.1 Energy ATP-proton Na,H/K antiporter P-type ATPase, alpha subunit

TIGR01497 726 GS33 3.1 Transport Cations K+-transporting ATPase, B subunit

TIGR01524 603 GS33 3.1 Transport Cations Magnesium-translocating P-type ATPase

TIGR02140 166 GS33 3.1 Transport Anions Sulfate ABC transporter, permease protein CysW

TIGR01409 122 GS33 3 Tat (twin-arginine translocation) pathway signal sequence

TIGR02195 282 GS33 3 Cell Biosynthesis Lipopolysaccharide heptosyltransferase II

TIGR01522 696 GS33 2.9 Calcium-transporting P-type ATPase, PMR1-type

TIGR01523 766 GS33 2.9 Potassium/sodium efflux P-type ATPase, fungal-type

TIGR00202 228 GS33 2.8 Regulatory RNA Carbon storage regulator

TIGR01116 798 GS33 2.8 Transport Cations Calcium-translocating P-type ATPase, SERCA-type

TIGR02094 204 GS33 2.8 Alpha-glucan phosphorylases

TIGR02051 296 GS33 2.7 Regulatory DNA Hg(II)-responsive transcriptional regulator

TIGR01005 259 GS33 2.6 Transport Carbohydrates, Exopolysaccharide transport protein family

TIGR01222 129 GS33 2.6 Cellular Cell Septum site-determining protein MinC

TIGR01334 157 GS33 2.6 Unknown General modD protein

TIGR01554 152 GS33 2.6 Mobile Prophage Phage major capsid protein, HK97 family

TIGR00640 1,059 GS33 2.5 Methylmalonyl-CoA mutase C-terminal domain

TIGR01708 801 GS33 2.5 Cellular Pathogenesis General secretion pathway protein H

TIGR02018 373 GS33 2.5 Regulatory DNA Histidine utilization repressor

TIGR00554 182 GS33 2.4 Biosynthesis Pantothenate Pantothenate kinase

TIGR01202 152 GS33 2.4 Biosynthesis Chlorophyll Chlorophyll synthesis pathway, bchC

TIGR01583 102 GS33 2.4 Energy Electron Formate dehydrogenase, gamma subunit

TIGR02092 241 GS33 2.4 Energy Biosynthesis Glucose-1-phosphate adenylyltransferase, GlgD subunit

TIGR00052 263 GS33 2.3 Hypothetical Conserved Conserved hypothetical protein TIGR00052

TIGR00824 166 GS33 2.3 Signal PTS PTS system, mannose/fructose/sorbose family, IIA component

TIGR01003 321 GS33 2.3 Transport Carbohydrates, Phosphocarrier, HPr family

TIGR01439 191 GS33 2.3 Regulatory DNA Transcriptional regulator, AbrB family

TIGR01457 220 GS33 2.3 HAD-superfamily subfamily IIA hydrolase, TIGR01457

TIGR01517 684 GS33 2.3 Calcium-translocating P-type ATPase, PMCA-type

TIGR02028 632 GS33 2.3 Biosynthesis Chlorophyll Geranylgeranyl reductase

TIGR00452 217 GS33 2.2 Unknown Enzymes Methyltransferase, putative

TIGR00609 2,555 GS33 2.2 DNA DNA Exodeoxyribonuclease V, beta subunit

TIGR00876 265 GS33 2.2 Energy Pentose Transaldolase


sample is reanalyzed to exclude reads identified as belonging to these two groups, sample GS00a groups loosely with GS00b, GS00c, and GS00d (unpublished data). Finally, three subsamples from a single Sargasso sample (GS01a, GS01b, GS01c) group together, despite representing three distinct size fractions (3.0–20, 0.8–3.0, and 0.1–0.8 µm, respectively; Table 1).

The complete set of sample similarities is more complex than described above, and indeed is more complex than can be captured by a hierarchical clustering. For instance, the southern temperate samples are appreciably more similar to the tropical cluster than are the northern temperate samples. GS22 appears to constitute a mix of tropical types, showing strong similarity not only to the GS47–GS14 subcluster to which it was assigned, but also to the other tropical samples.

These results may be compared to the more traditional view of community structure afforded by 16S sequences (Figure 9). Some of the same groupings of samples are visible using both analyses. Several ribotypes recapitulated the temperate/tropical clustering described above. Others were restricted to the single instances of nonmarine habitats. Several of the most abundant organisms from the coastal mangrove, hypersaline lagoon, and freshwater lake were found exclusively in these respective samples. However, while several ribotypes recapitulated the temperate/tropical distinction revealed by the genomic sequence, others crosscut it. A few dominant 16S ribotypes, related to SAR11, SAR86, and SAR116, were found in every marine sample. The brackish waters from two mid-Atlantic estuaries (GS11 and GS12) contained a mixture of otherwise exclusively marine and freshwater ribotypes; similarity of these sites to the freshwater sample (GS20) was minimal at the metagenomic level, while the greater similarity of GS11 to coastal samples visible at the metagenomic level was not readily visible here. A fuller comparison of metagenome-based measurements of diversity based on a large dataset of PCR-derived 16S sequences will be presented in another paper (in preparation).

Variation in Gene Abundance

Differences in gene content between samples can identify functions that reflect the lifestyles of the community in the context of its local environment [20,32]. We examined the relative abundance of genes belonging to specific functional categories in the distinct GOS samples. Genes were binned into functional categories using TIGRFAM hidden Markov models [18], which are well annotated and manually curated [33].

The results can be filtered in various ways to highlight genes associated with specific environments. One catalog of possible interest is genes that were predominantly found in a single sample. We identified 95 TIGRFAMs that annotated large sets of genes (100 or more) that were significantly more frequent (greater than 2-fold) in one sample than in any other sample (Table 9). Not surprisingly, this approach disproportionately singles out genes from the samples collected on larger filters (GS01a, GS01b, and GS25) and from the nonmarine environments, particularly the hypersaline pond (sample GS33). Another contrast might be between the

Table 9. Continued.

TIGRFAM | Number of Peptides | Sample^a | Relative Abundance^b | Major Category | Minor Category | Description

TIGR00996 302 GS33 2.2 Cellular Pathogenesis Virulence factor Mce family protein

TIGR01254 419 GS33 2.2 Transport Other ABC transporter periplasmic binding protein, thiB subfamily

TIGR01278 508 GS33 2.2 Biosynthesis Chlorophyll Light-independent protochlorophyllide reductase, B subunit

TIGR01512 1,820 GS33 2.2 Transport Cations Cadmium-translocating P-type ATPase

TIGR01525 2,037 GS33 2.2 Heavy metal translocating P-type ATPase

TIGR01543 418 GS33 2.2 Protein Other Phage prohead protease, HK97 family

TIGR01857 1,495 GS33 2.2 Purines Purine Phosphoribosylformylglycinamidine synthase

TIGR02015 183 GS33 2.2 Energy Photosynthesis Chlorophyllide reductase subunit Y

TIGR02072 1,052 GS33 2.2 Biosynthesis Biotin Biotin biosynthesis protein BioC

TIGR02099 273 GS33 2.2 Hypothetical Conserved Conserved hypothetical protein TIGR02099

TIGR00203 168 GS33 2.1 Energy Electron Cytochrome d ubiquinol oxidase, subunit II

TIGR00218 141 GS33 2.1 Energy Sugars Mannose-6-phosphate isomerase, class I

TIGR00915 4,834 GS33 2.1 Transport Other Transporter, hydrophobe/amphiphile efflux-1 (HAE1) family

TIGR01315 319 GS33 2.1 FGGY-family pentulose kinase

TIGR01330 253 GS33 2.1 3'(2'),5'-bisphosphate nucleotidase

TIGR01508 848 GS33 2.1 Diaminohydroxyphosphoribosylaminopyrimidine reductase

TIGR01511 1,928 GS33 2.1 Transport Cations Copper-translocating P-type ATPase

TIGR01764 387 GS33 2.1 Unknown General DNA binding domain, excisionase family

TIGR02014 298 GS33 2.1 Energy Photosynthesis Chlorophyllide reductase subunit Z

TIGR02047 222 GS33 2.1 Cd(II)/Pb(II)-responsive transcriptional regulator

TIGR00586 990 GS33 2 DNA DNA Mutator mutT protein

TIGR00853 114 GS33 2 Signal PTS PTS system, lactose/cellobiose family IIB component

TIGR00937 364 GS33 2 Transport Anions Chromate transporter, chromate ion transporter (CHR) family

TIGR01030 221 GS33 2 Protein Ribosomal Ribosomal protein L34

TIGR01214 2,834 GS33 2 Cell Biosynthesis dTDP-4-dehydrorhamnose reductase

TIGR01698 608 GS33 2 Purine nucleotide phosphorylase

TIGR02190 465 GS33 2 Glutaredoxin-family domain

^a Reads associated with Shewanella and Burkholderia have been excluded.
^b TIGRFAM is this many times more abundant than in the next most abundant sample.
doi:10.1371/journal.pbio.0050077.t009


temperate and tropical clusters (Figures 10 and 11). We identified 32 proteins that were more than 2-fold more frequent in one or the other group (Table 10). The presence of various Prochlorococcus-associated genes in this list highlights some of the potential challenges with this sort of approach. Overrepresentation may reflect: a direct response to particular environmental pressures (as the excess of salt transporters plausibly does in the hypersaline pond); a lineage-restricted difference in functional repertoire (as exemplified by the excess of photosynthesis genes in samples containing Prochlorococcus); or a more incidental “hitchhiking” of a protein found in a single organism that happens to be present.
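The single-sample filter behind Table 9 (families annotating 100 or more peptides that are more than 2-fold more frequent in one sample than in the next most abundant sample) can be sketched as follows. This is a minimal illustration that assumes frequency means peptide count divided by the total number of peptides in a sample; the function name, the input layout, and the toy family and sample labels are hypothetical, and the normalization actually used in the study may differ.

```python
def sample_specific_families(counts, totals, min_peptides=100, min_fold=2.0):
    """Find families enriched in one sample relative to every other sample.

    counts[family][sample] = peptides matching the family in that sample.
    totals[sample] = all peptides in that sample (assumed normalization).
    Requires at least two samples in totals.
    Returns a list of (family, top sample, fold over the next most abundant sample).
    """
    hits = []
    for family, per_sample in counts.items():
        # Keep only families that annotate a large set of peptides.
        if sum(per_sample.values()) < min_peptides:
            continue
        # Rank samples by the family's relative frequency, highest first.
        freq = sorted(
            ((per_sample.get(s, 0) / totals[s], s) for s in totals),
            reverse=True,
        )
        (top, sample), (runner_up, _) = freq[0], freq[1]
        fold = top / runner_up if runner_up else float("inf")
        if top > 0 and fold > min_fold:
            hits.append((family, sample, round(fold, 1)))
    return hits


# Invented counts for three families across three equally sized samples.
counts = {
    "famA": {"s1": 90, "s2": 30, "s3": 10},   # enriched in s1
    "famB": {"s1": 50, "s2": 50, "s3": 50},   # evenly spread
    "famC": {"s1": 9, "s2": 1, "s3": 0},      # too few peptides overall
}
totals = {"s1": 1000, "s2": 1000, "s3": 1000}
hits = sample_specific_families(counts, totals)
```

As the surrounding text cautions, a family that passes this filter may reflect environmental pressure, a lineage-restricted repertoire, or incidental hitchhiking, so the fold value alone does not establish functional adaptation.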

We explored whether clearer and more informative differences could be discovered between communities by focusing on groups of samples that are highly similar in overall taxonomic/genetic content. Two pairs of samples provide a particularly nice illustration of this approach. Samples GS17 and GS18 from the western Caribbean Sea and samples GS23 and GS26 from the eastern Pacific Ocean were all very similar based on the presence of abundant ribotypes and overall similarity in genetic content (Figures 9–11). Despite these similarities, several genes are found to be up to seven times more common in the pair of Caribbean samples than the Pacific pair (Table 11). No genes are more than 2-fold higher in the Pacific than the Caribbean pair of samples. Several of the most differentially abundant genes are related to phosphate transport and utilization. It is very plausible that this is a reflection of a functional adaptation: these differences correlate well with measured differences in phosphate abundance between the Atlantic and eastern Pacific samples [34,35], and phosphate abundance plays a critical role in microbial growth [36,37]. Indeed, the ability to acquire phosphate, especially under conditions where it is limited, is thought to determine the relative fitness of Prochlorococcus strains [38].

The single greatest difference between GS17 and GS18 on the one hand and GS23 and GS26 on the other was attributed to a set of genes annotated by the hidden Markov model TIGR02136 as a phosphate-binding protein (PstS). This TIGRFAM identified a single gene in both P. marinus MIT9312 and P. ubique HTCC1062. In P. marinus MIT9312, this gene is located at 672 kb lying roughly in the middle of a 15-kb segment of the genome that recruits almost no GOS sequences from the Pacific sampling sites (Poster S1H). In P. ubique HTCC1062, the PstS gene is found at 1,133 kb in a 5-kb segment that also recruited far fewer GOS sequences from all the Pacific samples except for GS51 (Poster S1E). These genomic segments differ structurally among isolates but they are no more variable than the flanking regions, and thus are

Table 10. Relative Abundance of TIGRFAM Matches in Temperate and Tropical Waters

TIGRFAM | Number of Peptides | Sample(s) | Relative Abundance^a | Major Category | Minor Category | Description

TIGR01153 729 GS15–GS19 32.7 Energy Photosynthesis Photosystem II 44 kDa subunit reaction center protein

TIGR02093 673 GS15–GS19 29.6 Energy Biosynthesis Glycogen/starch/alpha-glucan phosphorylases

TIGR01335 813 GS15–GS19 26.6 Energy Photosynthesis Photosystem I core protein PsaA

TIGR01336 806 GS15–GS19 26.5 Energy Photosynthesis Photosystem I core protein PsaB

TIGR00975 648 GS15–GS19 11 Transport Anions Phosphate ABC transporter, phosphate-binding protein

TIGR00297 261 GS15–GS19 8.6 Hypothetical Conserved Conserved hypothetical protein TIGR00297

TIGR00992 302 GS15–GS19 8.5 Transport Amino Chloroplast envelope protein translocase, IAP75 family

TIGR02030 560 GS15–GS19 7.5 And Chlorophyll Magnesium chelatase ATPase subunit I

TIGR02041 359 GS15–GS19 6 Central Sulfur Sulfite reductase (NADPH) hemoprotein, beta-component

TIGR01151 2,095 GS15–GS19 4.7 Energy Photosynthesis Photosystem q(b) protein

TIGR01152 1,865 GS15–GS19 4.7 Energy Photosynthesis Photosystem II D2 protein (photosystem q(a) protein)

TIGR02031 800 GS15–GS19 4.2 Biosynthesis Chlorophyll Magnesium chelatase ATPase subunit D

TIGR01790 629 GS15–GS19 4 Lycopene cyclase family protein

TIGR02100 512 GS15–GS19 4 Energy Biosynthesis Glycogen debranching enzyme GlgX

TIGR00073 284 GS15–GS19 3.4 Protein Protein Hydrogenase accessory protein HypB

TIGR00159 497 GS15–GS19 3 Hypothetical Conserved Conserved hypothetical protein TIGR00159

TIGR01515 594 GS15–GS19 3 Energy Biosynthesis 1,4-alpha-glucan branching enzyme

TIGR00217 601 GS15–GS19 2.7 Energy Biosynthesis 4-alpha-glucanotransferase

TIGR01486 505 GS15–GS19 2.7 Mannosyl-3-phosphoglycerate phosphatase family

TIGR01098 720 GS15–GS19 2.6 Transport Carbohydrates Phosphonate ABC transporter, periplasmic phosphonate-binding protein

TIGR00101 495 GS15–GS19 2.5 Central Nitrogen Urease accessory protein UreG

TIGR01273 567 GS15–GS19 2.4 Central Polyamine Arginine decarboxylase

TIGR01470 179 GS5–GS10 25.7 Biosynthesis Heme Siroheme synthase, N-terminal domain

TIGR00361 374 GS5–GS10 12.1 Cellular DNA DNA internalization-related competence protein ComEC/Rec2

TIGR01537 333 GS5–GS10 6.2 Mobile Prophage Phage portal protein, HK97 family

TIGR00201 291 GS5–GS10 6 Cellular DNA comF family protein

TIGR00879 420 GS5–GS10 5 Transport Carbohydrates Sugar transporter

TIGR02018 373 GS5–GS10 4.1 Regulatory DNA Histidine utilization repressor

TIGR02183 294 GS5–GS10 4.1 Glutaredoxin, GrxA family

TIGR00427 602 GS5–GS10 4 Hypothetical Conserved Conserved hypothetical protein TIGR00427

TIGR01109 219 GS5–GS10 3.6 Energy Other Sodium ion-translocating decarboxylase, beta subunit

TIGR01262 840 GS5–GS10 2.8 Energy Amino Maleylacetoacetate isomerase

aAverage abundance of TIGRFAM is that many times more abundant the average abundance in the given samples than in the other set of samples (in this case, GS15–GS19 werecompared with GS5–GS10).doi:10.1371/journal.pbio.0050077.t010


not hypervariable in the sense used previously (unpublished data). Nor are they particularly conserved when present, indicating that they are not the result of a recent lateral transfer. Phylogenetic analyses outside these segments did not produce any evidence of a Pacific versus Caribbean clade of either Prochlorococcus or SAR11 (Figure 3A–3B). The presence or absence of phosphate transporters is not limited to these two types of organisms. The number of phosphate transporters that were found in the Caribbean far exceeds the number that can be attributed to HTCC1062- and MIT9312-like organisms. However, these results indicate that within individual strains or subtypes the ability to acquire phosphate (in one or more of its forms) can vary without detectable differences in the surrounding genomic sequences.
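As a rough illustration of how a relative-abundance value like those reported in Tables 10 and 11 could be computed, the sketch below averages a per-sample normalized abundance over each group of samples and takes the ratio of the two means. The abundance values, the function name `relative_abundance`, and the assumption that normalization has already been done upstream are hypothetical and chosen only for illustration; they are not taken from the study.

```python
from statistics import mean

def relative_abundance(abundance, group_a, group_b):
    """Ratio of the mean abundance in group_a to the mean abundance in group_b.

    abundance: dict mapping sample ID -> normalized abundance of one TIGRFAM
               (hypothetical values; normalization is assumed to happen upstream).
    Returns float('inf') when the family is absent from every group_b sample.
    """
    mean_a = mean(abundance.get(s, 0.0) for s in group_a)
    mean_b = mean(abundance.get(s, 0.0) for s in group_b)
    if mean_b == 0:
        return float("inf")
    return mean_a / mean_b

# Made-up normalized abundances for a single gene family (illustration only).
example = {"GS17": 8.0, "GS18": 6.0, "GS23": 1.5, "GS26": 0.5}
caribbean = ["GS17", "GS18"]
pacific = ["GS23", "GS26"]
fold = relative_abundance(example, caribbean, pacific)
print(f"Caribbean/Pacific fold difference: {fold:.1f}")
```

Comparing group means rather than individual samples keeps one unusual sample from dominating the ratio, which seems in keeping with the paired-sample design described above.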

Biogeographic Distribution of Proteorhodopsin Variants

Variation in gene content is only one aspect of the tremendous diversity in the GOS data. The functional significance of all the polymorphic differences between homologous proteins remains largely unknown. To look for functional differences, we analyzed members of the proteorhodopsin gene family. Proteorhodopsins are fast, light-driven proton pumps for which considerable functional information is available though their biological role remains unknown. Proteorhodopsins were highly abundant in the Sargasso Sea samples [19] and continue to be highly abundant and evenly distributed (relative to recA abundance) in all the GOS samples. A total of 2,674 putative proteorhodopsin genes were identified in the GOS dataset. Although many of the sequences are fragmentary, 1,874 of these genes contain the residue that is primarily responsible for tuning the light-absorbing properties of the protein [39–41], and these properties have been shown to be selected for under different environmental conditions [42]. Variation at this residue is strongly correlated with sample of origin (Figure 12). The leucine (L) or green-tuned variant was highly abundant in the North Atlantic samples and in the nonmarine environments like the fresh water sample from Lake Gatun (GS20). The glutamine (Q) or blue-tuned variant dominated in the remaining mostly open ocean samples.
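The per-sample tally behind a plot like Figure 12 can be sketched as follows. The input format (a list of sample ID and tuning-residue pairs), the function name, and the example calls are assumptions made for illustration; they do not reproduce the pipeline or the data used in the study.

```python
from collections import Counter, defaultdict

def variant_percentages(calls):
    """Percentage of each tuning-residue variant per sample.

    calls: iterable of (sample_id, residue) pairs, one per proteorhodopsin
           gene whose spectral tuning residue could be read (hypothetical input).
    Returns {sample_id: {residue: percent}}.
    """
    counts = defaultdict(Counter)
    for sample_id, residue in calls:
        counts[sample_id][residue] += 1
    result = {}
    for sample_id, counter in counts.items():
        total = sum(counter.values())
        result[sample_id] = {r: 100.0 * n / total for r, n in counter.items()}
    return result

# Made-up calls: L = green-tuned, Q = blue-tuned (illustration only).
calls = [("GS05", "L"), ("GS05", "L"), ("GS05", "L"), ("GS05", "Q"),
         ("GS17", "Q"), ("GS17", "Q"), ("GS17", "Q"), ("GS17", "L")]
percentages = variant_percentages(calls)
for sample_id in sorted(percentages):
    print(sample_id, percentages[sample_id])
```

Only genes in which the tuning residue is actually covered contribute to the totals, mirroring the restriction to the 1,874 genes that contain that residue.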

Given our limited understanding of the biological role for proteorhodopsin, the reason for this differential distribution is not immediately clear. In coastal waters where nutrients are more abundant, phytoplankton is dominant. Phytoplankton absorbs primarily in the blue and red spectra; consequently, the water appears green [43]. Conversely, in the open ocean nutrients are rare and phytoplanktonic biomass is low, so waters appear blue because in the absence of impurities the red wavelengths are absorbed preferentially [44]. It may be that proteorhodopsin-carrying microbes have simply adapted to take advantage of the most abundant wavelengths of light in these systems.

Proteorhodopsins encoded on reads that were recruited to P. ubique HTCC1062 account for a fraction (~25%) of all the proteorhodopsin-associated reads, suggesting that the remainder must be associated with a variety of marine microbial taxa (see also [45–47]). Phylogenetic analysis of the SAR11-associated proteins revealed that each variant has arisen independently at least two times in the SAR11 lineage (Figure 3C). Consistent with other findings that proteorhodopsins are widely distributed throughout the microbial world [48], we conclude that multiple microbial lineages are responsible for proteorhodopsin spectral variation and that the abundance of a given variant reflects selective pressures rather than taxonomic effects. Similar mechanisms seem to be involved in the evolution and diversification of opsins that mediate color vision in vertebrates [49].

Discussion

Our results highlight the astounding diversity contained within microbial communities, as revealed through whole-genome shotgun sequencing carried out on a global scale. Much of this microbial diversity is organized around phylogenetically related, geographically dispersed populations we refer to as subtypes. In addition, there is tremendous variation within subtypes, both in the form of sequence variation and in hypervariable genomic islands. Our ability to make these observations derives not only from the large volumes of data but also from the development of new tools and techniques to filter and organize the information in manageable ways.

Variation and Diversity

Our data demonstrate to an unprecedented degree the nature and evolution of genetic variation below the species level. Variation can be analyzed in several ways, including observed differences in sequence, genomic structure, and

Table 11. Relative Abundance of TIGRFAM Matches in Atlantic and Pacific Open Ocean Waters

| TIGRFAM | Number of Peptides | Sample(s) | Relative Abundance (a) | Major Category | Minor Category | Description |
|---|---|---|---|---|---|---|
| TIGR02136 | 1,130 | GS17, GS18 | 7.2 | Transport | Anions | Phosphate-binding protein |
| TIGR00974 | 2,122 | GS17, GS18 | 3.5 | Transport | Anions | Phosphate ABC transporter, permease protein PstA |
| TIGR00975 | 648 | GS17, GS18 | 3.5 | Transport | Anions | Phosphate ABC transporter, phosphate-binding protein |
| TIGR02138 | 2,139 | GS17, GS18 | 3.4 | Transport | Anions | Phosphate ABC transporter, permease protein PstC |
| TIGR00206 | 459 | GS17, GS18 | 2.8 | Cellular | Chemotaxis | Flagellar M-ring protein FliF |
| TIGR01782 | 1,297 | GS17, GS18 | 2.4 | Transport | Unknown | TonB-dependent receptor |
| TIGR00642 | 862 | GS17, GS18 | 2.3 | Central | Other | Methylmalonyl-CoA mutase, small subunit |
| TIGR02135 | 899 | GS17, GS18 | 2.3 | Transport | Anions | Phosphate transport system regulatory protein PhoU |

(a) Relative abundance: the average abundance of the TIGRFAM in the given samples is at least that many times greater than its average abundance in the other set of samples (in this case, GS17–GS18 were compared to GS23 and GS26).
doi:10.1371/journal.pbio.0050077.t011


gene complement. The observed patterns of variation shed light on the mechanisms by which marine prokaryotes evolve. Gene synteny seems to be more highly conserved than the nucleotide and protein sequences. This variation is seen over essentially the entire genome in every abundant group of organisms sufficiently related for us to recognize a population by fragment recruitment. (These include, but are not limited to, the organisms shown in Figure 2 and Poster S1.) Notably, we found no evidence of widespread low-diversity organisms such as B. anthracis [50].

Phylogenetic trees and fragment recruitment plots (Figures 7 and 8) indicate that the variation within a species is not an unstructured swarm or cloud of variants all equally diverged from one another. Instead, there are clearly distinct subtypes, in terms of sequence similarity, gene content, and sample distribution. Similar findings have been shown for specific organisms, based on evaluation of one or a few loci [2,51–53]. These results rule out certain trivial models of population history and evolution for what is commonly considered a bacterial “species.” For instance, they argue against a recent explosive population growth from a single successful individual (selective sweep) [54]. Equally, they argue against a perfectly mixed population, suggesting instead some barriers to competition and exchange of genetic material.

In principle, this variation could reflect some combination of physical barriers (true biogeography), short-term stochastic effects, and/or functional differentiation. Given the confounding variables of geography, time, and environmental conditions in the current collection of samples, it is difficult to definitively separate these effects, but various observations argue for functional differentiation between subtypes (i.e., they constitute distinct ecotypes). First, individual subtypes may be found in a wide range of locations; P. ubique HTCC1062 was isolated in the Pacific Ocean off the coast of Oregon [55], but closely related sequences are relatively abundant in our samples taken in the Atlantic Ocean. Second, geography per se cannot fully explain differences in subtype distributions, as multiple subtypes are found simultaneously in a single sample. Third, the collection of samples in which a given subtype was found generally exhibits similar environmental conditions. A strong independent illustration of this comes from the correlation of temperature with the distribution of Prochlorococcus subtypes [56]. Fourth, the extensive variation within each subtype (i.e., the fact that subtypes are not clonal populations) indicates that it cannot be chance alone that makes genetically similar organisms have similar observed distributions.

Taken together, these results argue that subtype classification is more informative for categorizing microbial populations than classification using 16S-based ribotypes, or fingerprinting techniques based on length polymorphism, such as T-RFLPs [57] or ARISA [58]. For example, the grouping of such disparate microbial populations under the umbrella P. marinus dilutes the significance of the term “species.” Indeed, numerous papers have been devoted to comparing and contrasting the differences and variability in P. marinus isolates to better understand how this particularly abundant group of organisms has evolved and adapted within the dynamic marine environment [28,52,56,59–66]. Prior to the widespread use of marker-based phylogenetic approaches, microbial systematics relied on a wide range of variables to distinguish microbial populations [67]. Subtypes bring us back to these more comprehensive approaches since they reflect the influences of a wide range of factors in the context of an entire genome.

Figure 12. Distribution of Common Proteorhodopsin Variants across GOS Samples

The leucine (L) and methionine (M) variants absorb maximally in the green spectrum (Oded Beja, personal communication) while the glutamine (Q) variant absorbs maximally in the blue spectrum. The relative abundance of each variant is shown as a percentage (x-axis) per sample (y-axis). Total abundance for all variants in read equivalents normalized by the abundance of recA protein is shown on the right side of the y-axis. The L and Q variants show a nonrandom distribution. The L variant is abundant in temperate Atlantic waters close to the U.S. and Canadian coast. The Q variant is abundant in warmer waters further from land. The M variant is moderately abundant in a wide range of samples with no obvious geographic/environmental association.
doi:10.1371/journal.pbio.0050077.g012

Although subtypes are a salient feature of our data, variation within a ribotype does not stop at the level of subtypes. Variation within subtypes is so extensive that few GOS reads can be aligned at 100% identity to any other GOS read, despite the deep coverage of several taxonomic groups. Related findings have been shown for the ITS region in various organisms [2,51,52], and in a limited number of organisms for individual protein coding and intergenic regions [2,53,68]. High levels of diversity within the ribotype can be convincingly demonstrated in the 16S gene itself [69]. The applicability of these results over the entire genome was recently shown for P. marinus [28] using data from the Sargasso Sea samples taken as a pilot project for the expedition reported here [19]. We have definitively demonstrated the generality of these findings, greatly increased our understanding of the minimum number of variants of a given organism, and shown that these observations apply to the entire genome for a wide range of abundant taxonomic groups and across a wide range of geographic locations.

Average pairwise differences of several percent between overlapping P. marinus or SAR11 reads imply that this variation did not arise recently. If one uses substitution rates estimated for E. coli [70], one could conclude that on average any two P. marinus cells must have diverged millions of years ago. Mutational rates are notoriously variable and hard to estimate, and assumptions of molecular clocks are equally chancy, but clearly within-subtype variants have persisted side by side for quite some time. This raises a question related to the classic “Paradox of the Plankton”: how can so many similar organisms have coexisted for so long [71,72]? One explanation, which we favor, is that not only subtypes but also individual variants are sufficiently different phenotypically to prevent any one strain from completely replacing all others (discussed further below; see [71] for a recent theoretical treatment). An alternative is that recombination might prevent selective sweeps within ecotypes, as proposed by Cohan (reviewed in [73]).
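The order-of-magnitude reasoning about divergence times can be made explicit under a strict molecular clock: if each of two lineages accumulates substitutions at rate r per site per year, a pairwise divergence d corresponds to roughly t = d / (2r) years since their common ancestor. The sketch below encodes that relation. The rate in the example is a round placeholder and not the E. coli estimate from [70], and the strict-clock assumption is exactly the kind the text describes as chancy.

```python
def divergence_time_years(pairwise_divergence, rate_per_site_per_year):
    """Time since common ancestor under a strict molecular clock.

    pairwise_divergence: fraction of sites that differ between two sequences.
    rate_per_site_per_year: substitution rate along ONE lineage.
    Both lineages accumulate changes, hence the factor of 2.
    Ignores multiple substitutions at the same site (reasonable for small d).
    """
    if rate_per_site_per_year <= 0:
        raise ValueError("rate must be positive")
    return pairwise_divergence / (2.0 * rate_per_site_per_year)

# Placeholder inputs (assumptions for illustration, not values from the study):
d = 0.05       # 5% pairwise difference, i.e. "several percent"
rate = 1e-8    # hypothetical substitutions per site per year per lineage
t = divergence_time_years(d, rate)
print(f"Estimated divergence time: {t:,.0f} years")
```

Because the estimate scales inversely with the assumed rate, a tenfold uncertainty in the rate translates directly into a tenfold uncertainty in the divergence time.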

The Significance of Within-Subtype Variation

Given the apparent generality of subtypes and intra-subtype variation, it is important to understand if and how these subpopulations are functionally distinct. At the level of DNA sequence, a substantial fraction of substitutions are silent in terms of amino acid sequence, and others may be nonsynonymous but functionally neutral. However, two organisms that differ by 5% in their genetic sequence (e.g., 100,000 substitutions in 2 Mbp of shared sequence) will inevitably have at least minor functional differences such as in the optimal temperature or pH for the activity of some enzyme. At the level of gene content, the observation of hypervariable segments ([28] and here) implies that there is an additional dimension to functional variability. Hypervariable genomic islands with preferential insertion sites could potentially be associated with a wide range of functions, though to date they have been most closely examined for their role in pathogenicity (for a review, see [74]). However, given their apparent variability within even a single sampling site, it seems unlikely that these elements reflect a specific adaptive advantage to the local population. Identifying the source(s), diversity, and range of functionality associated with these islands by fully sequencing a large number of these segments and understanding how their individual abundances fluctuate should be quite informative.

Some might still argue that these differences must be moot for the purpose of understanding the role these organisms play in an ecosystem. Yet even small differences in optimal conditions may have profound effects. They may prevent any single genotype from being universally fittest, allowing and/or necessitating the coexistence of multiple variants [2,51,69]. Moreover, variation within subtype might afford a form of functional “buffering,” such that the population as a whole may be more stable in its ecosystem role than any one clone could be (see also [51]). That is, while any one strain of Prochlorococcus might thrive and provide energy input to the rest of the community at a limited range of temperatures, light conditions, etc., the ensemble might provide such inputs over a wider range of environmental conditions. In this way, microdiversity might provide system stability or robustness through functional redundancy and the “insurance effect” (reviewed in [75]). Thus, while the extent of microdiversity suggests that knowing the behavior of any one isolate in exquisite detail might not be as useful to reductionist modeling as one might hope, this buffering could afford a more stable ensemble behavior, facilitating the development and maintenance of an ecosystem and allowing for system-level modeling.

A direct equation of subtypes with ecotypes is tempting, but not entirely clear-cut. The correlation of PstS distribution with phosphate abundance suggests a functional adaptation, but within Prochlorococcus and SAR11 the presence or absence of PstS subdivides subtypes without apparent respect for phylogenetic structure. This contrasts markedly with the distribution of proteorhodopsin-tuning variants within SAR11, which, despite a few convergent substitutions, are strongly congruent with phylogeny. It is interesting to ask what distinguishes pressures or adaptations that respect (or that lead to) lineage splits from those that show little or no phylogenetic structuring. These two specific examples plausibly reflect two different mechanisms (i.e., convergent but independent mutation in proteorhodopsin genes and the acquisition by horizontal transfer of genes involved in phosphate uptake). Yet, we must wonder: given the evidence that proteorhodopsin has been transferred laterally [48], and that only a small number of mutations, in some circumstances even a single base-pair change, are required to switch between the blue-absorbing and green-absorbing forms [39,40], why should proteorhodopsin variants show any lineage restriction? Perhaps this relates to the modularity of the system in question: proteorhodopsin tuning may be part of a larger collection of synergistic adaptations that are collectively not easily evolved, acquired, or lost, while the PstS and surrounding genes may represent a functional unit that can be readily added and removed over relatively short evolutionary time scales. If so, perhaps subtypes are indeed ecotypes, but rapidly evolving characters can lead to phenotypes that crosscut or subdivide ecotypes.

Phage provide one possible mechanism for rapid evolution of microbial populations or strains, and have been found in abundance in this and other marine metagenomic datasets [18,20]. It has been proposed that hypervariable islands are phage mediated [28]. However, there are reasons to be cautious about invoking phage as an explanation for rapidly evolving characteristics. While we see variability of PstS and neighboring genes in both SAR11 and P. marinus populations, this variation does not seem to be linked to recent phage activity. Initially, the distribution of PstS seems similar to the variation associated with the hypervariable islands, which may be phage mediated [28]. Indeed, phosphate-regulating genes including PstS have been identified in phage genomes [64], presumably because enhanced phosphate acquisition is required during the replication portion of their life cycle. However, the regions containing the PstS genes in both SAR11 and Prochlorococcus do not behave in the same fashion as clearly hypervariable regions, being effectively bimorphic (modulo the level of sequence variation observed elsewhere in the genome), whereas clearly hypervariable regions are so diverse that nearly every sampled clone falling in such a region appears completely unrelated to every other. Nor do the other genes in PstS-containing regions appear to be phage associated. These observations suggest that differences in PstS presence or absence arose in the distant past, or that different mechanisms are at work. It seems likely that phage may mediate lateral transfer of PstS and other phosphate acquisition genes, but it is unclear whether these genes then can become fixed within the population. Phage require enhanced phosphate acquisition as part of their life cycle [64], so regulatory or functional differences in these genes may limit their suitability for being acquired by the host cell for its own purposes. The rate of phage-mediated horizontal transfer of genes may reflect a combination of the gene’s value to the host and to the agent mediating the transfer (e.g., phage), suggesting that PstS may have much greater immediate value than do proteorhodopsin genes and their variants.

In practical terms, these results highlight the limitations associated with marker-based analysis and the use of these approaches to infer the physiology of a particular microbial population. At the resolution used here, marker-based approaches are not always informative regarding differences in gene content (e.g., the PstS gene as well as neighboring genes), especially those associated with hypervariable segments. Though phosphate acquisition is known to vary within different strains of P. marinus [64,76], our results clearly show that this variability can happen within a single subtype (as represented by MIT9312), effectively identifying distinct ecotypes. Given the correct samples from the appropriate environments, other core genes might also show similar variation and allow us to more fully assess the reliability of reference genomes as indicators of physiological potential.

Tools and Techniques

Analysis of the GOS dataset has benefited from the development of new tools and techniques. Many of these approaches rely on fairly well-known techniques but have been modified to take greater advantage of the metadata.

The technique of fragment recruitment and the corresponding fragment recruitment plots have proven highly useful for examining the biogeography and genomic variation of abundant marine microbes when a close reference genome exists. Ultimately, this approach derives from the percent identity plots of PipMaker [27]. Similar approaches have been used to examine variation with respect to metagenomic datasets. For example, hypervariable segments and sequence variation have been visualized in P. marinus MIT9312 using the Sargasso Sea data [28] and in human gut microbes Bifidobacterium longum and Methanobrevibacter smithii [77].

Our primary advance associated with fragment recruitment plots is the incorporation of metadata associated with the isolation or production of the sequencing data. While simple in nature, the resulting plots can be extremely informative due to the volume of data being presented. Being able to present the sequence similarity and metadata visually allows a researcher to quickly identify interesting portions of the data for further examination. This is one of the first tools to make extensive use of the metadata collected during a metagenomic sequencing project. The use of sample and recruitment metadata is just the beginning. It is not difficult to imagine displaying other variations such as water temperature, salinity, phosphate abundance, and time of year with this approach. Even sample independent metadata such as phylogenetic information may produce informative views of the data. The usefulness of this and related approaches will only grow as the robust collection of metadata becomes routine and the variables that are most relevant to microbial communities are further elucidated.

The greatest limitation of fragment recruitment is the lack of appropriate reference sequences, particularly finished genomes. Using a series of modifications to the Celera Assembler referred to as “extreme assembly,” we have produced large assemblies for cultivated and uncultivated marine microbes. On its own, the extreme assembly approach would be excessively prone to producing chimeric sequences. However, when extreme assemblies are used as references for fragment recruitment, the metadata provides additional criteria to validate the sampling consistency along the length of the scaffold. Chimeric joins can be rapidly detected and avoided. This argues that future metagenomic assemblers could be specifically designed to make use of the metadata to produce more accurate assemblies, and that metagenomic assemblies will be improved by using data from multiple sources. Finding ways to represent the full diversity in these assemblies remains a pressing issue.

Extreme assembly can produce much larger assemblies but it is still limited by overall coverage. While many ribotypes are presumably present in sufficient quantities that reasonable assemblies of these genomes might be expected, this did not occur even for the most abundant organisms, including SAR11 and P. marinus. Many of the problems can be attributed to the diversity associated with the hypervariable segments where the effective coverage drops precipitously. If these are indeed commonplace in the microbial world, it is unlikely that complete genomes will be produced using the small insert libraries presented here. However, the ability to bin the larger sequences based on their coverage profiles across multiple samples, oligonucleotide frequency profiles, and phylogenetic markers suggests that large portions of a microbial genome can be reconstructed from the environmental data. This in turn should provide critical insights into the physiology and biochemistry of these microbial lineages that will inform culture techniques to allow cultivation of these recalcitrant organisms under laboratory conditions.
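One way the metadata check on extreme assemblies could work is sketched below: recruited reads are grouped into windows along a scaffold, and a window is flagged as a candidate chimeric join when its dominant source sample differs from that of the preceding window. The record layout, the window size, and the "dominant sample" criterion are all assumptions made for illustration; this section does not specify the validation procedure actually used.

```python
from collections import Counter, defaultdict
from typing import NamedTuple

class RecruitedRead(NamedTuple):
    """One read placed on a reference scaffold (hypothetical record layout)."""
    position: int      # start coordinate on the scaffold, in bp
    identity: float    # percent identity of the alignment
    sample: str        # sample of origin, taken from the read's metadata

def dominant_sample_by_window(reads, window=10_000):
    """Map window index -> most common sample among reads starting in it."""
    per_window = defaultdict(Counter)
    for read in reads:
        per_window[read.position // window][read.sample] += 1
    return {w: c.most_common(1)[0][0] for w, c in sorted(per_window.items())}

def candidate_chimeric_joins(reads, window=10_000):
    """Window indices where the dominant sample changes from the previous window."""
    dominant = dominant_sample_by_window(reads, window)
    windows = list(dominant)
    return [cur for prev, cur in zip(windows, windows[1:])
            if dominant[prev] != dominant[cur]]

# Made-up recruited reads (illustration only).
reads = [
    RecruitedRead(1_200, 97.0, "GS17"), RecruitedRead(4_800, 95.5, "GS17"),
    RecruitedRead(12_300, 96.2, "GS17"), RecruitedRead(18_900, 94.1, "GS18"),
    RecruitedRead(15_000, 96.0, "GS17"),
    RecruitedRead(21_000, 93.0, "GS05"), RecruitedRead(27_500, 92.4, "GS05"),
]
print(candidate_chimeric_joins(reads))
```

A real check would more plausibly compare full per-sample coverage profiles rather than a single dominant label, but the windowed comparison captures the idea of validating sampling consistency along the length of a scaffold.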

Not every technique described herein relies on metadata. The marker-less, overlap-based metagenomic comparison provides a quantitative approach to comparing the overall genetic similarity of two samples (Figures 10 and 11). In essence, genomic similarity acts as a proxy for community similarity. Marker-based approaches such as ARISA, including the use of 16S sequences described herein, can also be used to infer community similarity, though these approaches more aptly generate a census of the community members [51,69,78,79]. This census is biased to the extent that 16S genes can vary in copy number and relies on linkage of the marker gene to infer genome composition. While our metagenome comparison does not directly provide a census, the sensitivity can be tuned by restricting the identity of matches. This means that even subtype-level differences can be detected across samples. It would also identify the substantial gene content differences between the K12 and O157:H7 E. coli strains [12]. Such large-scale gene content differences have yet to be seen between closely related marine microbes, but may be a factor in other environments. Although the requisite amount of data will vary with the complexity of the environment or the degree of resolution required, we have found that 10,000 sequencing reads is sufficient to reliably measure the similarity of two surface water samples (unpublished data). This analysis may become a general tool for allocating sequencing resources by allowing a shallow survey of many samples followed by deep sequencing of a select number of “interesting” ones.
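To make the idea of tuning sensitivity by restricting match identity concrete, the sketch below scores two samples by the fraction of their reads that have at least one cross-sample overlap at or above an identity cutoff. It assumes the pairwise overlaps have already been computed by an aligner. The tuple layout, the function name, and the scoring formula are simplifications invented for illustration and should not be read as the similarity measure defined in the study's Materials and Methods.

```python
def sample_similarity(overlaps, n_reads_a, n_reads_b, min_identity=90.0):
    """Fraction of reads in two samples that overlap a read from the other sample.

    overlaps: iterable of (read_id_a, read_id_b, percent_identity) tuples for
              cross-sample read overlaps (hypothetical, precomputed elsewhere).
    n_reads_a, n_reads_b: total number of reads in each sample.
    min_identity: only overlaps at or above this percent identity count;
                  raising it makes the comparison sensitive to finer differences.
    """
    matched_a, matched_b = set(), set()
    for read_a, read_b, identity in overlaps:
        if identity >= min_identity:
            matched_a.add(read_a)
            matched_b.add(read_b)
    total = n_reads_a + n_reads_b
    return (len(matched_a) + len(matched_b)) / total if total else 0.0

# Made-up overlaps between two four-read samples (illustration only).
overlaps = [("a1", "b1", 99.0), ("a2", "b1", 92.0), ("a3", "b2", 85.0)]
loose = sample_similarity(overlaps, n_reads_a=4, n_reads_b=4, min_identity=80.0)
strict = sample_similarity(overlaps, n_reads_a=4, n_reads_b=4, min_identity=98.0)
print(loose, strict)
```

With a loose cutoff the two samples look broadly similar, while a strict cutoff keeps only near-identical overlaps, which is how a comparison of this kind could in principle separate subtype-level differences.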

The application of this technique for comparing samples along with detailed analysis of fragments recruiting to a given reference sequence can also help explicate differences among communities in gene content or sequence variation. For example, recent metagenomic studies have reported differences in abundance of various gene families or differing functional roles between samples. Some of these differences correspond to plausible differences in physiology and biochemistry, such as the relative overabundance of photosynthetic or light-responsive genes in surface water samples [20,32]. Other differences, however, are less obvious, such as the abundance of ribosomal proteins at 130 m or the abundance of transposase at 4,000 m [20]; some of these may reflect “taxonomic hitchhiking,” such that a sample rich in Archaea or Firmicutes or Cyanobacteria, etc., has an overrepresentation of genes more reflective of their recent evolutionary history than of a response to environmental conditions. Being able to control or account for these taxonomic effects is crucial to understanding how microbial populations have adapted to environmental conditions and how they may behave under changing conditions. The metagenomic comparison method described here provides a new tool to more accurately measure the impact of taxonomic effects.

In conclusion, this study reveals the wealth of biological information that is contained within large multi-sample environmental datasets. We have begun to quantify the amount and structure of the variation in natural microbial populations, while providing some information about how these factors are structured along phylogenetic and environmental factors. At the same time, many questions remain unanswered. For example, although microbial populations are structured and therefore genetically isolated, we do not understand the mechanisms that lead to this isolation. Their isolation seems contradictory given overwhelming evidence that horizontal gene transfer associated with hypervariable islands is a common phenomenon in marine microbial populations. Whatever the mechanism, the role and rate at which gene exchange occurs between populations will be crucial to understanding population structure within microbial communities and whether these communities are chance associations or necessary collections. The hypervariable islands could be a source for tremendous genetic innovation and novelty as evidenced by the rate of discovery of novel protein families in the GOS dataset [18]. However, it is not clear whether these entities are the main source of this novelty or whether this novelty resides in the vast numbers of rare microbes [4] that cannot be practically accessed using current metagenomic approaches. Altogether, this research reaffirms our growing wealth and complexity of data and paucity of understanding regarding the biological systems of the oceans.

Materials and Methods

Sampling sites. A more detailed description of the sampling sites provides additional context in which to understand the individual samples. The northernmost site (GS05) was at Compass Buoy in the highly eutrophic Bedford Basin, a marine embayment encircled by Halifax, Nova Scotia, that has a 15-y weekly record of biological, physical, and chemical monitoring (http://www.mar.dfo-mpo.gc.ca/science/ocean/BedfordBasin/index.htm). Other temperate sites included a coastal station sample near Nova Scotia (GS4), a station in the Bay of Fundy estuary at outgoing tide (GS06), and three Gulf of Maine stations (GS02, GS03, and GS07). These were followed by sampling coastal stations from the New England shelf region of the Middle Atlantic Bight (Newport Harbor through Delaware Bay; GS08–GS11). The Delaware Bay (GS11) was one of several estuary samples along the Global Expedition path. Estuaries are complex hydrodynamic environments that exhibit strong gradients in oxygen, nutrients, organic matter, and salinity and are heavily impacted by anthropogenic nutrients. The Chesapeake Bay (GS12) is the largest estuary in the United States and has microbial assemblages that are diverse mixtures of freshwater and marine-specific organisms [80]. GS13 was collected near Cape Hatteras, North Carolina, inside and north of the Gulf Stream, and GS14 was taken along the western boundary frontal waters of the Gulf Stream off the coast of Charleston, South Carolina. The vessel stopped at five additional stations as it transited through the Caribbean Sea (GS15–GS19) to the Panama Canal. In Panama, we sampled the freshwater Lake Gatun, which drains into the Panama Canal (GS20). The first of the eastern Pacific coastal stations (GS21, GS22, and GS23) were sampled on the way to Cocos Island (~500 km southwest of Costa Rica), followed by a coastal Cocos Island sample (GS25). Near the island, ocean currents diverge and nutrient-rich upwellings mix with warm surface waters to support a highly productive ecosystem. Cocos Island is distinctive in the eastern Pacific because it belongs to one of the first shallow undersea ridges in the region encountered by the easterly flowing North Equatorial Counter/Cross Current in the Far Eastern Pacific [81,82]. After departing Cocos Island, the vessel continued southwest to the Galapagos Islands, stopping for an open ocean station (GS26). An intensive sampling program was then conducted in the Galapagos. The Galapagos Archipelago straddles the equator 960 km west of mainland Ecuador in the eastern Pacific. These islands are in a hydrographically complex region due to their proximity to the Equatorial Front and other major oceanic currents and regional front systems [83]. The coastal and marine parts of the Galapagos Islands ecosystem harbor an array of distinctive habitats, processes, and endemic species. Several distinct zones were targeted including a shallow-water, warm seep (GS30), below the thermocline in an


upwelling zone (GS31), a coastal mangrove (GS32), and a hypersaline lagoon (GS33). The last stations were collected from open ocean sites (GS37 and GS47) and a coral reef atoll lagoon (GS51) in the immense South Pacific Gyre. The open ocean samples come from a region of lower nutrient concentrations where picoplankton are thought to represent the single most abundant and important factor for biogeochemical structuring and nutrient cycling [84–87]. In the atoll systems, ambient nutrients are higher, and bacteria are thought to constitute a large biomass that is one to three times as large as that of the phytoplankton [88–90].

Sample collection. A YSI (model 6600) multiparameter instrument (http://www.ysi.com) was deployed to determine physical characteristics of the water column, including salinity, temperature, pH, dissolved oxygen, and depth. Using sterilized equipment [91], 40–200 l of seawater, depending on the turbidity of the water, was pumped through a 20-µm nytex prefilter into a 250-l carboy. From this sample, two 20-ml subsamples were collected in acid-washed polyethylene bottles and frozen (−20 °C) for nutrient and particle analysis. At each station the biological material was size fractionated into individual “samples” by serial filtration through 20-µm, 3-µm, 0.8-µm, and 0.1-µm filters that were then sealed and stored at −20 °C until transport back to the laboratory. Between 44,160 and 418,176 clones per station were picked and end sequenced from short-insert (1.0–2.2 kb) sequencing libraries made from DNA extracted from filters [19]. Data from these six Sorcerer II expedition legs (37 stations) were combined with the results from samples in the Sargasso Sea pilot study (four stations; GS00a–GS00d and GS01a–GS01c) [19]. The majority of the sequence data presented came from the 0.8- to 0.1-µm size fraction sample that concentrated mostly bacterial and archaeal microbial populations. Two samples (GS01a, GS01b) from the Sargasso Sea pilot study dataset and one GOS sample (GS25) came from other filter size fractions (Table 1).

Filtration and storage. Microbes were size fractionated by serial filtration through 3.0-µm, 0.8-µm, and 0.1-µm membrane filters (Supor membrane disc filter; Pall Life Sciences, http://www.pall.com), and finally through a Pellicon tangential flow filtration (Millipore, http://www.millipore.com) fitted with a Biomax-50 (polyethersulfone) cassette filter (50 kDa pore size) to concentrate a viral fraction to 100 ml. Filters were vacuum sealed with 5 ml sucrose lysis buffer (20 mM EDTA, 400 mM NaCl, 0.75 M sucrose, 50 mM Tris-HCl [pH 8.0]) and frozen to −20 °C on the vessel until shipment back to the Venter Institute, where they were transferred to a −80 °C freezer until DNA extraction. Glycerol was added (10% final concentration) as a cryoprotectant for the viral/phage sample.

DNA isolation. In the laboratory, the impact filters were aseptically cut into quarters for DNA extraction. Unused quarters of the filter were refrozen at −80 °C for storage. Quarters used for extraction were aseptically cut into small pieces and placed in individual 50-ml conical tubes. TE buffer (pH 8) containing 50 mM EGTA and 50 mM EDTA was added until filter pieces were barely covered. Lysozyme was added to a final concentration of 2.5 mg/ml, and the tubes were incubated at 37 °C for 1 h in a shaking water bath. Proteinase K was added to a final concentration of 200 µg/ml, and the samples were frozen in dry ice/ethanol followed by thawing at 55 °C. This freeze–thaw cycle was repeated once. SDS (final concentration of 1%) and an additional 200 µg/ml of proteinase K were added to the sample, and samples were incubated at 55 °C for 2 h with gentle agitation followed by three aqueous phenol extractions and one phenol/chloroform extraction. The supernatant was then precipitated with two volumes of 100% ethanol, and the DNA pellet was washed with 70% ethanol. Finally, the DNA was treated with CTAB to remove enzyme inhibitors. Size fraction samples not utilized in this study were archived for future analysis.

Library construction. DNA was randomly sheared via nebulization, end-polished with consecutive BAL31 nuclease and T4 DNA polymerase treatments, and size-selected using gel electrophoresis on 1% low-melting-point agarose. After ligation to BstXI adapters, DNA was purified by three rounds of gel electrophoresis to remove excess adapters, and the fragments were inserted into BstXI-linearized medium-copy pBR322 plasmid vectors. The resulting library was electroporated into E. coli. To ensure construction of high-quality random plasmid libraries with few to no clones with no inserts, and no clones with chimeric inserts, we used a series of vectors (pHOS) containing BstXI cloning sites that include several features: (1) the sequencing primer sites immediately flank the BstXI cloning site to avoid excessive resequencing of vector DNA; (2) elimination of strong promoters oriented toward the cloning site; and (3) the use of BstXI sites for cloning facilitates the preparation of libraries with a low incidence of no-insert clones and a high frequency of single inserts. Clones were sequenced from both ends to produce pairs of linked sequences representing ~820 bp at the end of each insert.

Template preparation. Libraries were transformed, and cells were plated onto large format (16 × 16 cm) diffusion plates prepared by layering 150 ml of fresh molten, antibiotic-free agar onto a previously set 50-ml layer of agar containing antibiotic. Colonies were picked for template preparation using the Qbot or QPix colony-picking robots (Genetix, http://www.genetix.com), inoculated into 384-well blocks containing liquid media, and incubated overnight with shaking. High-purity plasmid DNA was prepared using the DNA purification robotic workstation custom-built by Thermo CRS (http://www.thermo.com) and based on the alkaline lysis miniprep [92]. Bacterial cells were lysed, cell debris was removed by centrifugation, and plasmid DNA was recovered from the cleared lysate by isopropanol precipitation. DNA precipitate was washed with 70% ethanol, dried, and resuspended in 10 mM Tris HCl buffer containing a trace of blue dextran. The typical yield of plasmid DNA from this method is approximately 600–800 ng per clone, providing sufficient DNA for at least four sequencing reactions per template.

Automated cycle sequencing. Sequencing protocols were based on the di-deoxy sequencing method [93]. Two 384-well cycle-sequencing reaction plates were prepared from each plate of plasmid template DNA for opposite-end, paired-sequence reads. Sequencing reactions were completed using the Big Dye Terminator chemistry and standard M13 forward and reverse primers. Reaction mixtures, thermal cycling profiles, and electrophoresis conditions were optimized to reduce the volume of the Big Dye Terminator mix (Applied Biosystems, http://www.appliedbiosystems.com) and to extend read lengths on the AB3730xl sequencers (Applied Biosystems). Sequencing reactions were set up by the Biomek FX (Beckman Coulter, http://www.beckmancoulter.com) pipetting workstations. Robots were used to aliquot and combine templates with reaction mixes consisting of deoxy- and fluorescently labeled dideoxynucleotides, DNA polymerase, sequencing primers, and reaction buffer in a 5 µl volume. Bar-coding and tracking promoted error-free template and reaction mix transfer. After 30–40 consecutive cycles of amplification, reaction products were precipitated by isopropanol, dried at room temperature, and resuspended in water and transferred to one of the AB3730xl DNA analyzers. Set-up times were less than 1 h, and 12 runs per day were completed with average trimmed sequence read length of 822 bp.

Fosmid end sequencing. Fosmid libraries [24] were constructed using approximately 1 µg DNA that was sheared using bead beating to generate cuts in the DNA. The staggered ends or nicks were repaired by filling with dNTPs. A size selection process followed on a pulse field electrophoresis system with lambda ladder to select for 39–40 Kb fragments. The DNA was then recovered from a gel, ligated to the blunt-ended pCC1FOS vector, packaged into lambda packaging extracts, incubated with the host cells, and plated to select for the clones containing an insert. Sequencing was performed as described for plasmid ends.

Metagenomic assembly. Assembly was conducted with the Celera Assembler [21], with modifications as follows. The “genome length” was artificially set at the length of the dataset divided by 50 to allow unitigs of abundant organisms to be treated as unique, as previously described [19]. Several distinct assemblies were computed. In the primary assembly, all pairs of mated reads were tested to see whether the paired reads overlapped one another; if so, they were merged into a single pseudo-read that replaced the two original reads; further, only overlaps of 98% identity or higher were used to construct unitigs. A second assembly was conducted in the same fashion with the exception of using a 94% identity cutoff to construct unitigs. Finally, a series of assemblies at various stringencies was computed for subsets of the GOS data; in these assemblies, overlapping mates were not preassembled and the Celera Assembler code was modified slightly to allow for overlapping and multiple sequence alignment at lower stringency.

Construction of a low-identity overlap database. An all-against-all comparison of unassembled (but merged and duplicate-stripped) sequences from the combined dataset was performed using a modified version of the overlapper component of the Celera Assembler [21]. The code was modified to find overlap alignments (global alignments allowing free end gaps) starting from pairs of reads that share an identical substring of at least 14 bp. An alignment extension was then performed with match/mismatch scores set to yield a positive outcome if an overlap alignment was found with ≥65% identity. Overlaps involving alignments of ≥40 bp were retained for various analyses. For the GOS dataset described here, this process resulted in a dataset of 1.2 billion overlaps. Due to the 14-bp requirement and certain heuristics for early termination of apparently hopeless extensions, not all alignments at ≥65% were found. In addition, some of the lowest-identity overlaps are bound to be chance matches; however, this was a relatively uncommon event. Approximately one in $5 \times 10^{6}$ pairs of 800-bp random sequences (all sites independent, A = C = G = T = 25%) can be aligned to overlap ≥40 bp at ≥65% identity using the same procedure. At a 70% cutoff, the value is reduced to one in $4 \times 10^{7}$, and to one in $5 \times 10^{8}$ at a 75% cutoff.
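The seeding step can be illustrated with a minimal Python sketch. It is not the modified Celera overlapper; it only shows how candidate read pairs that share an identical 14-bp substring might be enumerated before any alignment extension. The function name `candidate_pairs` and the toy reads are hypothetical, and reverse-complement matches are ignored for brevity.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k=14):
    """Return pairs of read ids that share at least one identical k-mer.

    reads: dict mapping read id -> sequence (forward strand only; an
    assumption of this sketch, real overlappers also check the reverse strand).
    """
    index = defaultdict(set)  # k-mer -> ids of reads containing it
    for read_id, seq in reads.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(read_id)

    pairs = set()
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


toy_reads = {
    "r1": "ACGTACGTACGTACGTAAAA",
    "r2": "TTTTACGTACGTACGTACGT",
    "r3": "GGGGGGGGGGGGGGGGGGGG",
}
print(candidate_pairs(toy_reads))
```

Only the pairs returned here would go on to alignment extension, which is why overlaps lacking any exact 14-bp match cannot be found.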

Extreme assembly. Like many assembly algorithms, the extreme assembler proceeds in three phases: overlap, layout, and consensus. The overlap phase is provided by the all-against-all comparison described above. The consensus phase is performed by a version of the Celera Assembler, modified to accept higher rates of mismatch. The layout phase begins with a single sequencing read (“seed”) that is chosen at random or specified by the user and is considered the “current” read. The following steps are performed off one or both ends of the seed. (1) Starting from the current fragment end, add the fragment with the best overlap off that end and mark the current fragment as “used,” thus making the added fragment the new current fragment. (2) Mark as used any alternative overlap that would have resulted in a shorter extension. The simplest notion of “best overlap” is simply the one having the highest identity alignment, but more complicated criteria have certain advantages. A simple but useful refinement is to favor fragments whose other ends have overlaps over those which are dead ends. For an unsupervised extreme assembly, when the sequence extension terminates because there are no more overlaps, a new unused fragment is chosen as the next seed and the process is repeated until all fragments have been marked used.
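The layout phase can be sketched as a greedy walk. The following Python fragment is an illustration under simplifying assumptions, not the assembler's code: it extends in one direction only, ignores read orientation, and uses highest identity as the sole notion of “best overlap.” The data layout (`overlaps` as a dictionary of `(next_fragment, identity, extension_bp)` tuples) and the function name are hypothetical.

```python
def extreme_layout(seed, overlaps, used=None):
    """Greedy one-directional layout from a seed fragment.

    overlaps: dict mapping fragment id -> list of
              (next_fragment, identity, extension_bp) tuples (assumed layout).
    used:     optional set shared between calls so that later seeds skip
              fragments consumed by earlier layouts.
    """
    used = set() if used is None else used
    path = [seed]
    current = seed
    while True:
        used.add(current)  # step (1): the fragment we leave is marked used
        candidates = [o for o in overlaps.get(current, []) if o[0] not in used]
        if not candidates:
            break  # dead end: no more overlaps off this end
        best = max(candidates, key=lambda o: o[1])  # highest identity
        # Step (2): alternatives giving a shorter extension are marked used.
        for frag, _identity, extension in candidates:
            if frag != best[0] and extension < best[2]:
                used.add(frag)
        current = best[0]
        path.append(current)
    return path


toy_overlaps = {
    "a": [("b", 0.99, 300), ("c", 0.90, 100), ("d", 0.95, 500)],
    "b": [("e", 0.97, 400)],
}
consumed = set()
print(extreme_layout("a", toy_overlaps, consumed))
```

In an unsupervised run, the same `used` set would be passed to successive calls, each seeded with a fragment not yet in it.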

Construction of multiple SAR11 variants. Sequencing reads mated to SAR11-like 16S sequences but themselves outside of the ribosomal operon (n = 348) were used as seeds in independent extreme assemblies. Since the assemblies were independent, the results were highly redundant, with a given chain of overlapping fragments typically being used in multiple assemblies. A subset of 24 assemblies that shared no fragments over their first 20 kb was identified as follows. (1) Connected components were determined in a graph defined by nodes corresponding to extreme assemblies. If the assemblies shared at least one fragment in the first 20 kb of each assembly, the two nodes were connected by an edge. (2) A single assembly was chosen at random from each of the connected components. The consensus sequence over the 20-kb segment of each such representative was used as the reference for fragment recruitment.
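The two-step selection can be sketched in Python with a small union-find. This is an illustrative reconstruction, not the original tooling; the input layout (assembly name mapped to the set of fragment ids in its first 20 kb) and the function name `pick_representatives` are assumptions.

```python
import random


def pick_representatives(assemblies, rng=None):
    """Pick one assembly at random from each connected component.

    assemblies: dict mapping assembly name -> set of fragment ids used in
                its first 20 kb (hypothetical input layout).
    Two assemblies are linked when they share at least one fragment.
    """
    rng = rng or random.Random(0)
    names = sorted(assemblies)
    parent = {name: name for name in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_owner = {}  # fragment id -> first assembly seen using it
    for name in names:
        for fragment in assemblies[name]:
            if fragment in first_owner:
                parent[find(name)] = find(first_owner[fragment])
            else:
                first_owner[fragment] = name

    components = {}
    for name in names:
        components.setdefault(find(name), []).append(name)
    return [rng.choice(members) for members in components.values()]


toy = {"A": {1, 2}, "B": {2, 3}, "C": {4}, "D": {3, 9}}
print(pick_representatives(toy))
```

Here A, B, and D form one component through shared fragments 2 and 3, so only one of them is retained alongside C.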

Phylogeny. Phylogenies of sequences homologous to a given portion of a reference sequence (typically 500 bp) were determined in the following manner. A set of homologous fragments was identified based on fragment recruitment to the reference as described above. Fragments that fully spanned the segment of interest and had almost full-length alignments to the reference sequence of a user-defined percent identity (typically, 70%) were used for further analysis. A preliminary master–slave multiple sequence alignment of the recruited reads (slaves) to the reference segment (master) was performed with a modified version of the consensus module of the Celera Assembler. Based on this alignment, reads were trimmed to the portion aligning to the reference segment of interest. A refined multiple sequence alignment was then computed with MUSCLE [94]. Distance-based phylogenies were computed using the programs DNADIST and NEIGHBOR from the PHYLIP package [95] using default settings. Trees were visualized using HYPERTREE [96].

Measurement of library-to-library similarity. Based on the low-identity overlap database described above, the similarity of a library i to another library j at a given percent identity cutoff was computed as follows. For each sequence s of i, let $n_{s,i}$ = the number of overlaps to other fragments of i satisfying the cutoff; $n_{s,j}$ = the number of overlaps to fragments of j satisfying the cutoff; and $f_{s,i} = n_{s,i}/(n_{s,i} + n_{s,j})$ = fraction of reads overlapping s from i or j that are from i.

$$r_{i,i} = \sum_{s} f_{s,i} \tag{1}$$

$$r_{i,j} = \sum_{s} \left(1 - f_{s,i}\right) \tag{2}$$

$$s_{i,j} = \frac{0.5 \times \left(r_{i,j} + r_{j,i}\right)}{\sqrt{r_{i,i} \times r_{j,j}}} \tag{3}$$

$$S_{i,j} = S_{j,i} = \frac{2 s_{i,j}}{1 + s_{i,j}} \tag{4}$$

A read that can be overlapped to another at sufficiently high sequence identity was taken to indicate that they were from similar organisms, and, relatedly, that similar genes were present in the samples. Only reads with such overlaps contributed to the calculation. Other reads reflect genes or segments of genomes that were so lightly sampled (i.e., at such low abundance) that they were not informative regarding the similarity of two samples. Consequently, the analysis automatically corrects for differences in the amount of sequencing, and can be computed over sets of samples that vary considerably in diversity. The resulting measure of similarity $S_{i,j}$ takes on a value between 0 and 1, where 0 implies no overlaps between i and j, and 1 implies that a fragment from i and a fragment from j are as likely to overlap one another as are two fragments from i or two fragments from j. As with the Bray-Curtis coefficient [97], abundance of categories affects the computation. In an idealized situation where two libraries can each be divided into some number k of “species” at equal abundance, and the libraries have l of the species in common, the similarity statistic will approach l/k for large samplings; in this sense, $S_{i,j} = x$ indicates that the two samples share approximately a fraction x of their genetic material. It is frequently useful to define $D_{i,j} = 1 - S_{i,j}$, the “dissimilarity” or distance between two samples.
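As a worked illustration of Equations 1–4, the following Python sketch computes the similarity from per-read overlap counts. The input layout (one `(n_within, n_between)` tuple per read) and the function name are assumptions made for the example; reads with no overlaps are skipped, as described above.

```python
from math import sqrt


def library_similarity(counts_i, counts_j):
    """Similarity S between libraries i and j (Equations 1-4).

    counts_i: list of (n_s_i, n_s_j) for each read s of library i, i.e.
              overlaps within its own library and overlaps to the other one.
    counts_j: the same for reads of library j (own library first).
    """
    def r_values(counts):
        r_self = r_other = 0.0
        for n_own, n_other in counts:
            total = n_own + n_other
            if total == 0:
                continue  # reads without overlaps do not contribute
            f = n_own / total
            r_self += f          # Equation 1
            r_other += 1.0 - f   # Equation 2
        return r_self, r_other

    r_ii, r_ij = r_values(counts_i)
    r_jj, r_ji = r_values(counts_j)
    if r_ii == 0 or r_jj == 0:
        raise ValueError("no within-library overlaps; similarity undefined")
    s = 0.5 * (r_ij + r_ji) / sqrt(r_ii * r_jj)  # Equation 3
    return 2 * s / (1 + s)                       # Equation 4


# Idealized case: half of the "species" are shared between the libraries.
half_shared = [(3, 3), (3, 0)]
print(library_similarity(half_shared, half_shared))
```

In the idealized half-shared example each library has $r_{i,i} = 1.5$ and $r_{i,j} = 0.5$, so $s_{i,j} = 1/3$ and $S_{i,j} = 0.5$, matching the l/k interpretation.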

Ribotype clustering and identification of representatives. An all-against-all comparison of predicted 16S sequences was performed to determine the alignment between pairs of overlapping sequences using a version of an extremely fast bit-vector algorithm [98]. A hierarchical clustering was determined using percent-mismatch in the resulting alignments as the distance between pairs of sequences. Order of clustering and cluster identity scores were based on the average-linkage criterion, with distances between nonoverlapping partial sequences treated as missing data. Ribotypes were the maximal clusters with an identity score above the cutoff (typically 97%). Representative sequences were chosen for each cluster based on both length and highest average identity to other sequences in the cluster.

Taxonomic classification. Taxonomic classification of 16S sequences was conducted using phylogenetic techniques based on clade membership of similar sequences with 16S sequences with defined taxonomic membership. Representative sequences from clustered sequences were analyzed as described previously [19,99] and by addition into an ARB database of small subunit rDNAs [100,101]. Results were spot-checked against the Ribosomal Database Project II Classifier server [102] and the taxonomic labels of the best BLASTN hits against the nonredundant database at NCBI.

Fragment recruitment. Global ocean sequences were aligned to genomic sequences of different bacteria and phage using NCBI BLASTN [26]. The following blast parameters were designed to identify alignments as low as 55% identity that could contain large gaps: -F "m L" -U T -p blastn -e 1e-4 -r 8 -q -9 -z 3000000000 -X 150. Reads were filtered in several steps to identify the reads that were aligned over more or less their entire length. Reads had to be aligned for more than 300 bp at >30% identity with less than 25 bp of unaligned bases on either end, or reads had to be aligned over more than 100 bp at >30% identity with less than 20 bp of overhang off either end. Identity was calculated ignoring gaps. In some instances a read might be placed, but the mate would not be placed under these criteria. In such cases, if 80% or more of the mate were successfully aligned, then the mate would be rescued and considered successfully aligned.
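The length and overhang filter can be expressed as a small predicate. The Python sketch below is an illustration, not the authors' pipeline: the function names, the 0-based half-open read coordinates, and the exposed `min_identity` parameter (set to the 30% figure as read from the text) are assumptions.

```python
def passes_recruitment(read_len, aln_start, aln_end, identity,
                       min_identity=30.0):
    """Return True if an alignment covers the read nearly end to end.

    aln_start, aln_end: alignment span in read coordinates (0-based,
    half-open; an assumption of this sketch). identity is a percentage
    computed with gaps ignored.
    """
    aligned = aln_end - aln_start
    left_overhang = aln_start
    right_overhang = read_len - aln_end
    if identity <= min_identity:
        return False
    long_ok = aligned > 300 and left_overhang < 25 and right_overhang < 25
    short_ok = aligned > 100 and left_overhang < 20 and right_overhang < 20
    return long_ok or short_ok


def rescue_mate(mate_len, mate_aligned_bp):
    """Mate rescue: accept a mate if 80% or more of it is aligned."""
    return mate_aligned_bp >= 0.8 * mate_len


print(passes_recruitment(800, 10, 790, 72.0))   # long read, small overhangs
print(passes_recruitment(800, 100, 790, 90.0))  # 100 bp unaligned at one end
```

The mate-rescue helper would only be consulted when one read of a pair passes the predicate and the other does not.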

Generation of shredded artificial reads from finished genomes. Random pieces of DNA from the genome in question with a length between 1,800 and 2,500 bp were selected. For each piece a read length N1 was selected from the distribution of lengths using the GOS dataset. If that GOS sequence had a mate pair, then a second length N2 was again randomly selected. The length N1 was used to generate a read from the 5′ end of the DNA. The piece of DNA was then reverse complemented and if appropriate, a second length N2 was used to generate a second read. The relationship between these two reads was then recorded and used to produce a fasta file. This approach successfully mimics the types of reads found in the GOS data with similar rates of missing mates.
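A minimal Python sketch of this shredding procedure follows. It is illustrative only: the function names, the read-naming scheme, and the representation of the empirical length distribution as a list of `(length, has_mate)` tuples are assumptions, and FASTA output is omitted.

```python
import random

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def shred(genome, observed, n_pieces, rng=None):
    """Generate artificial (mated) reads from a finished genome.

    observed: list of (read_length, has_mate) tuples standing in for the
              empirical distribution of read lengths (hypothetical input).
    The genome must be at least 2,500 bp long.
    """
    rng = rng or random.Random(0)
    reads = []
    for k in range(n_pieces):
        size = rng.randint(1800, 2500)               # insert-sized piece
        start = rng.randrange(len(genome) - size + 1)
        piece = genome[start:start + size]
        n1, has_mate = rng.choice(observed)
        reads.append((f"piece{k}/1", piece[:n1]))    # read from the 5' end
        if has_mate:
            n2, _ = rng.choice(observed)             # second length drawn again
            reads.append((f"piece{k}/2", revcomp(piece)[:n2]))
    return reads


toy_genome = "ACGT" * 1000
for name, seq in shred(toy_genome, [(800, True), (750, False)], 3):
    print(name, len(seq))
```

Because the mate is generated only when the sampled length came from a mated read, the fraction of missing mates follows that of the supplied distribution.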

Abundance of proteorhodopsin variants. A total of 2,644 proteorhodopsin genes were identified from the clustered open reading frames derived from the GOS assembly [18]. These genes could be linked back to 3,608 GOS clones. Open reading frames were predicted from these clones as described in [18]. The peptide sequences were aligned with NCBI blastpgp with the following parameters: -j 5 -U T -e 10 -W 2 -v 5 -b 5000 -F "m L" -m 3. The search was performed with a previously described blue-absorbing proteorhodopsin protein BPR (gi|32699602) as the query. The amino acid associated with light absorption is found within a short


conserved motif RYVDWLLTVPL*IVEF, where the asterisk indicates the tuning amino acid [39–41]. In total, 1,938 clones were found to contain this motif. Clones and the sample metadata were then associated with the tuning amino acid to determine the relative abundance of the different amino acids at these positions. Clones could be associated with SAR11 if both mated sequencing reads (when available) were recruited to P. ubique HTCC1062.
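Locating the tuning residue can be illustrated with a short regular-expression sketch in Python. This is an assumption-laden simplification: the study identified the motif through blastpgp alignments, whereas the code below requires an exact match to the motif string, and the function names and toy peptides are hypothetical.

```python
import re
from collections import Counter

# Exact-match version of the motif; the single wildcard is the tuning residue.
MOTIF = re.compile(r"RYVDWLLTVPL(.)IVEF")


def tuning_residue(peptide):
    """Return the residue at the tuning position, or None if no motif."""
    match = MOTIF.search(peptide)
    return match.group(1) if match else None


def tuning_counts(peptides):
    """Tally tuning residues over a collection of peptide sequences."""
    found = (tuning_residue(p) for p in peptides)
    return Counter(r for r in found if r is not None)


toy_peptides = [
    "MKRYVDWLLTVPLLIVEFAA",
    "RYVDWLLTVPLQIVEF",
    "GGRYVDWLLTVPLLIVEFGG",
    "MKKLLA",
]
print(tuning_counts(toy_peptides))
```

Joining such tallies with per-sample metadata would give the relative abundance of each tuning residue by site.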

Site abundance estimates and comparisons. Given a set of genes identified on the GOS sequences, we can identify the scaffolds on which these genes were annotated. A vector indicating the number of sequences contributed by every sample is determined for every gene. This vector reflects the number of sequences from every sample that assembled into the scaffold on which the gene was identified after normalizing for the proportion of scaffold covered by the gene. For example, if a 10-kb scaffold contains a 1-kb gene, then each sample will contribute one GOS sequence for every ten GOS sequences it contributed to the entire scaffold. The vectors are then summed and normalized to account for either the total number of GOS sequences obtained from each sample or based on the number of typically single copy recA genes (identified as in [18]). Unless stated otherwise, recA was used to normalize abundance across samples. When comparisons using groups of samples were performed, the average value for the samples was compared.
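The per-gene vector and its normalization can be sketched in Python as follows. The function names and the dictionary-based layout are assumptions for illustration; the reference counts stand in for either total sequences per sample or recA counts per sample.

```python
def gene_abundance(scaffold_reads, gene_len, scaffold_len):
    """Per-sample contribution to one gene.

    scaffold_reads: dict mapping sample -> number of sequences the sample
    contributed to the scaffold carrying the gene. Each count is scaled by
    the proportion of the scaffold covered by the gene.
    """
    return {sample: n * gene_len / scaffold_len
            for sample, n in scaffold_reads.items()}


def normalized_abundance(gene_vectors, reference):
    """Sum per-gene vectors and divide by a per-sample reference count
    (e.g., recA abundance; supplied by the caller in this sketch)."""
    totals = {}
    for vector in gene_vectors:
        for sample, value in vector.items():
            totals[sample] = totals.get(sample, 0.0) + value
    return {sample: totals[sample] / reference[sample] for sample in totals}


# The example from the text: a 1-kb gene on a 10-kb scaffold.
vec = gene_abundance({"GS02": 10, "GS03": 30}, gene_len=1000, scaffold_len=10000)
print(vec)
print(normalized_abundance([vec, vec], {"GS02": 4, "GS03": 12}))
```

With the 1-kb gene on a 10-kb scaffold, a sample that contributed ten sequences to the scaffold is credited with one sequence for the gene.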

Oligonucleotide composition profile. A 1-D profile representing oligonucleotide frequencies was computed as follows. A sequence was converted into a series of overlapping 10,000-bp segments, each segment offset by 1,000 bp from the previous one, using perl and shell scripting. Dinucleotide frequencies were computed on each segment using a C program written for this purpose. Higher-order oligonucleotides were examined and gave similar results for the genomes of interest. Remaining calculations were performed using the R package [103]. Principal component analysis (function princomp with default settings) was applied to the matrix of frequencies per window position. The value of the first component for each position was normalized by the standard deviation of these values, and truncated to the range [−5, 5]. For visualizations, the resulting values were plotted at the center of each window.
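The windowing and dinucleotide-frequency steps can be sketched in Python. This stands in for the perl, shell, and C tooling mentioned above and is not that code; the function names are hypothetical, windows containing ambiguous bases simply skip those dinucleotides, and the principal component step performed in R is not reproduced here.

```python
from itertools import product

DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]  # AA, AC, ..., TT


def windows(seq, size=10000, step=1000):
    """Yield (start, segment) for overlapping windows along seq."""
    for start in range(0, max(len(seq) - size, 0) + 1, step):
        yield start, seq[start:start + size]


def dinucleotide_freqs(segment):
    """Frequencies of the 16 dinucleotides, in DINUCS order."""
    counts = dict.fromkeys(DINUCS, 0)
    for i in range(len(segment) - 1):
        pair = segment[i:i + 2]
        if pair in counts:  # skip pairs containing N or other symbols
            counts[pair] += 1
    total = sum(counts.values()) or 1
    return [counts[d] / total for d in DINUCS]


toy = "ACGT" * 3000  # 12,000 bp
matrix = [dinucleotide_freqs(segment) for _, segment in windows(toy)]
print(len(matrix), len(matrix[0]))
```

Each row of `matrix` corresponds to one window position and would be the input to the principal component analysis.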

Estimating frequency of large-scale translocations and inversions. The unrecruited mated sequencing reads of reads recruited to P. marinus MIT9312 at or above 80% identity were examined. An unrecruited mate indicated a potential translocation or inversion if it aligned to the MIT9312 genome in two and only two distinct alignments separated by at least 50 kb, if each aligned portion was at least 250 bp long, if there was less than 100 bp of unaligned sequence and no more than 100 bp of overlapping sequence between the two aligned portions in read coordinates, and if each aligned portion was anchored to one end of the sequencing read with less than 25 bp of unaligned sequence from each end. In total, 18 rearrangements were identified, six of which appear to be unique events.
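These criteria can be collected into a single predicate. The Python sketch below is an interpretation, not the original analysis code: the function name, the tuple layout, and the 0-based half-open coordinates are assumptions, genome coordinates are taken with start < end regardless of strand, and genome circularity is ignored.

```python
def is_candidate_rearrangement(read_len, alignments):
    """Check whether an unrecruited mate suggests a translocation or inversion.

    alignments: list of (read_start, read_end, genome_start, genome_end)
    tuples, one per distinct alignment of the read to the reference.
    """
    if len(alignments) != 2:
        return False  # two and only two distinct alignments
    first, second = sorted(alignments)  # order by position in the read
    if min(first[1] - first[0], second[1] - second[0]) < 250:
        return False  # each aligned portion at least 250 bp
    if first[0] >= 25 or read_len - second[1] >= 25:
        return False  # each portion anchored within 25 bp of a read end
    gap = second[0] - first[1]  # > 0: unaligned bases; < 0: overlap
    if gap >= 100 or -gap > 100:
        return False
    separation = max(first[2], second[2]) - min(first[3], second[3])
    return separation >= 50000  # at least 50 kb apart on the genome


split_read = [(5, 405, 1000, 1400), (410, 790, 200000, 200380)]
print(is_candidate_rearrangement(800, split_read))
```

A read passing this predicate is only a candidate; distinguishing unique events from repeated observations of the same rearrangement requires comparing breakpoints across reads.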

The rate of discovery was estimated by determining the number of rearrangements in a given volume of sequence. We estimated the volume of sequence that was potentially examined by identifying recruited mated sequencing reads that fit the “good” category (i.e., which were recruited in the correct orientation at the expected distance from each other). For a given read, if the mate was recruited at greater than or equal to 80% identity, then the expected amount of sequence examined should be the current (as opposed to mate) read length minus 500 bp. This produces an estimate of the search space to be ~47 Mbp. Given 18 rearrangements, this leads to an estimate of one rearrangement per 2.6 Mbp.
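The arithmetic behind this estimate can be written out explicitly. The symbol $L_r$ (length of read $r$) is introduced here only for illustration, and the 500-bp deduction is presumably the 250 bp required on each side of a detectable breakpoint, which is an inference rather than a statement from the text.

```latex
\[
\text{search space} \;\approx\; \sum_{r \,\in\, \text{good reads}} \left( L_r - 500\ \text{bp} \right) \;\approx\; 47\ \text{Mbp},
\qquad
\frac{47\ \text{Mbp}}{18\ \text{rearrangements}} \;\approx\; 2.6\ \text{Mbp per rearrangement}.
\]
```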

Quantification and assessment of sequences associated with gaps. GOS reads assigned to the “missing mate” category that were recruited at greater than 80% identity outside the gap in question were identified. The mates of these reads were then identified and clustering was attempted with Phrap (http://www.phrap.org). Reads that were incorporated end to end into the Phrap assemblies were identified. For most small gaps a single assembly included all the missing mate reads and identified the precise difference between the reference and the environmental sequences. For the hypervariable segments, most of the reads failed to assemble at all, and those that did showed greater sequence divergence than typically seen. In the case of SAR11-recruited reads, to increase the number of reads associated with the hypervariable gaps we identified reads that did not recruit to the P. ubique HTCC1062 but aligned in a single HSP (high-scoring pair) over at least 500 bp with one end unaligned because it extended into the hypervariable gap.

Data and tool release. To facilitate continued analysis of this and other metagenomic datasets, the tools presented here along with their source code will be available via the Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis (CAMERA) website (http://camera.calit2.net). The dataset and associated metadata will be accessible via CAMERA (using the dataset tag CAM_PUB_Rusch07a). Given the exceptional abundance of Burkholderia and Shewanella sequences in the first Sargasso Sea sample and the feeling that these may be contaminants, we are also providing a list of the scaffold IDs and sequencing read IDs associated with these organisms to facilitate analyses with or without the sequences. In addition to CAMERA, the GOS scaffolds and annotations will be available via the public sequence repositories such as NCBI (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj&cmd=Retrieve&dopt=Overview&list_uids=13694), and the reads will be available via the Trace Archive (http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?).

Supporting Information

Poster S1. Fragment Recruitment of GOS Data to Finished Microbial Genomes

Found at doi:10.1371/journal.pbio.0050077.sd001 (21 MB PDF).

Accession Numbers

The GenBank (http://www.ncbi.nlm.nih.gov/Genbank) accession number for proteorhodopsin protein BPR is gi|32699602.

Acknowledgments

We acknowledge the Department of Energy (DOE), Office of Science, and Office of Biological and Environmental Research (DE-FG02-02ER63453), the Gordon and Betty Moore Foundation, the Discovery Channel, and the J. Craig Venter Science Foundation for funding to undertake this study. We are also indebted to a large group of individuals and groups for facilitating our sampling and analysis. We thank the governments of Canada, Mexico, Honduras, Costa Rica, Panama, Ecuador, French Polynesia, and France for facilitating sampling activities. All sequencing data collected from waters of the above-named countries remain part of the genetic patrimony of the country from which they were obtained. Canada’s Bedford Institute of Oceanography provided a vessel and logistical support for sampling in Bedford Basin. The Universidad Nacional Autonoma de Mexico (UNAM) facilitated permitting and logistical arrangements and identified a team of scientists for collaboration. The scientists and staff of the Smithsonian Tropical Research Institute (STRI) hosted our visit in Panama. Representatives from Costa Rica’s Organization for Tropical Studies (Jorge Arturo Jimenez and Francisco Campos Rivera), the University of Costa Rica (Jorge Cortes), and the National Biodiversity Institute (INBio) provided assistance with planning, logistical arrangements, and scientific analysis. Our visit to the Galapagos Islands was facilitated by assistance from the Galapagos National Park Service Director, Washington Tapia, and the Charles Darwin Research Institute, especially Howard Snell and Eva Danulat. We especially thank Greg Estes (guide), Hector Chauz Campo (Institute of Oceanography of the Ecuador Navy), and a National Park Representative, Simon Ricardo Villemar Tigrero, for field assistance while in the Galapagos Islands. Martin Wilkalski (Princeton University) and Rod Mackie (University of Illinois) provided planning advice for the Galapagos sampling plan. We thank Matthew Charette (Woods Hole Oceanographic Institute) for nutrient data analysis. We also acknowledge the help of Michael Ferrari and Jennifer Clark for remote sensing data. The U.S. Department of State facilitated Governmental communications on multiple occasions. John Glass (J. Craig Venter Institute [JCVI]) provided valuable assistance in methods development. The dedicated efforts of the quality systems, library construction, template, and sequencing teams at the JCVI Joint Technology Center produced the high quality sequence data that was the basis of this paper. We thank Matthew LaPointe, Creative Director of JCVI, for assistance with figure design, and the JCVI information technology support team who facilitated many of the vessel related technical needs. Special thanks are due for Charles H. Howard, captain of the Sorcerer II, and fellow crew members Cyrus Foote and Brooke A. Dill for their time and effort in support of this research. We gratefully acknowledge Dr. Michael Sauri, who oversaw medical related issues for the crew of the Sorcerer II.

Author contributions. DBR, ALH, KB, HS, CAP, JFH, MF, and JCV conceived and designed the experiments. DBR, ALH, JMH, KB, BT, HBT, CS, JT, JF, CAP, and JCV performed the experiments. DBR, ALH, GS, KBH, SW, DW, JAE, KR, JEV, TU, YHR, MRF, KN, and RF analyzed the data. DBR, ALH, GS, KBH, SY, JMH, KR, KB, BT, HS, HBT, CS, JT, JF, CAP, KL, SK, JFH, TU, YHR, LIF, VS, GBR, LEE, DMK, SS, TP, EB, VG, GTC, MRF, RLS, MF, and JCV contributed reagents/materials/analysis tools. DBR, ALH, GS, KBH, SW, SY, JAE, RLS, KN, RF, MF, and JCV wrote the paper.

Funding. Funding for this study was received from the US Department of Energy, Office of Science, and Office of Biological and Environmental Research (DE-FG02-02ER63453), the Gordon and Betty Moore Foundation, the Discovery Channel, and the J. Craig Venter Science Foundation.

Competing interests. The authors have declared that no competing interests exist.

References

1. Whitman WB, Coleman DC, Wiebe WJ (1998) Prokaryotes: The unseen majority. Proc Natl Acad Sci U S A 95: 6578–6583.
2. Beja O, Koonin EV, Aravind L, Taylor LT, Seitz H, et al. (2002) Comparative genomic analysis of archaeal genotypic variants in a single population and in two different oceanic provinces. Appl Environ Microbiol 68: 335–345.
3. DeLong EF, Pace NR (2001) Environmental diversity of bacteria and archaea. Systematic Biol 50: 1–9.
4. Sogin ML, Morrison HG, Huber JA, Welch DM, Huse SM, et al. (2006) Microbial diversity in the deep sea and the underexplored “rare biosphere.” Proc Natl Acad Sci U S A 103: 12115–12120.
5. Garrity GM (2001) Bergey’s manual of systematic bacteriology. New York: Springer-Verlag.
6. Madigan M, Martinko JM, Parker J (2000) Brock biology of microorganisms. Upper Saddle River (NJ): Prentice Hall. 991 p.
7. Fuhrman JA, McCallum K, Davis AA (1992) Novel major archaebacterial group from marine plankton. Nature 356: 148–149.
8. Giovannoni SJ, Britschgi TB, Moyer CL, Field KG (1990) Genetic diversity in Sargasso Sea bacterioplankton. Nature 345: 60–63.
9. Pace NR (1997) A molecular view of microbial diversity and the biosphere. Science 276: 734–740.
10. Rappe MS, Giovannoni SJ (2003) The uncultured microbial majority. Ann Rev Microbiol 57: 369–394.
11. Stackebrandt E, Goebel BM (1994) Taxonomic note: A place for DNA-DNA reassociation and 16S rRNA sequence analysis in the present species definition in bacteriology. Int J Syst Bacteriol 44: 846–849.
12. Welch RA, Burland V, Plunkett G 3rd, Redford P, Roesch P, et al. (2002) Extensive mosaic structure revealed by the complete genome sequence of uropathogenic Escherichia coli. Proc Natl Acad Sci U S A 99: 17020–17024.
13. Linklater E (1972) The voyage of the Challenger. Garden City (NJ): Doubleday. 280 p.
14. Moseley HN (1879) Notes by a naturalist on the “Challenger,” being an account of various observations made during the voyage of H.M.S. “Challenger” round the world, in the years 1872–1876. London: Macmillan and Company. 540 p.
15. Thompson SCW, Murray SJ, Nares GS, Thompson FT (1895) Report on the scientific results of the voyage of H.M.S. Challenger during the years 1873–76 under the command of Captain George S. Nares, R.N., F.R.S. and the late Captain Frank Tourle Thomson, R.N. Prepared under the superintendence of the late Sir C. Wyville Thomson, 1885–1895. Edinburgh: printed for H.M. Stationery off. (by order of Her Majesty’s Government).
16. Fuhrman JA, McCallum K, Davis AA (1993) Phylogenetic diversity of subsurface marine microbial communities from the Atlantic and Pacific Oceans. Appl Environ Microbiol 59: 1294–1302.
17. Hewson I, Steele JA, Capone DG, Fuhrman JA (2006) Temporal and spatial scales of variation in bacterioplankton assemblages of oligotrophic surface waters. Mar Ecol Prog Ser 311: 67–77.
18. Yooseph S, Sutton G, Rusch DB, Halpern AL, Williamson SJ, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Expanding the universe of protein families. PLoS Biol 5: e16. doi:10.1371/journal.pbio.0050016
19. Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, et al. (2004) Environmental genome shotgun sequencing of the Sargasso Sea. Science 304: 66–74.
20. DeLong EF, Preston CM, Mincer T, Rich V, Hallam SJ, et al. (2006) Community genomics among stratified microbial assemblages in the ocean’s interior. Science 311: 496–503.
21. Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, et al. (2000) A whole-genome assembly of Drosophila. Science 287: 2196–2204.
22. Lander ES, Waterman MS (1988) Genomic mapping by fingerprinting random clones: A mathematical analysis. Genomics 2: 231–239.
23. Holt RA, Subramanian GM, Halpern A, Sutton GG, Charlab R, et al. (2002) The genome sequence of the malaria mosquito Anopheles gambiae. Science 298: 129–149.
24. Kim UJ, Shizuya H, Dejong PJ, Birren B, Simon MI (1992) Stable propagation of cosmid sized human DNA inserts in an F-factor based vector. Nucleic Acids Res 20: 1083–1085.
25. Stein JL, Marsh TL, Wu KY, Shizuya H, DeLong EF (1996) Characterization of uncultivated prokaryotes: Isolation and analysis of a 40-kilobase-pair genome fragment from a planktonic marine archaeon. J Bacteriol 178: 591–599.
26. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215: 403–410.
27. Schwartz S, Zhang Z, Frazer KA, Smit A, Riemer C, et al. (2000) PipMaker: A Web server for aligning two genomic DNA sequences. Genome Res 10: 577–586.

28. Coleman ML, Sullivan MB, Martiny AC, Steglich C, Barry K, et al. (2006) Genomic islands and the ecology and evolution of Prochlorococcus. Science 311: 1768–1770.
29. Giovannoni SJ, Tripp HJ, Givan S, Podar M, Vergin KL, et al. (2005) Genome streamlining in a cosmopolitan oceanic bacterium. Science 309: 1242–1245.
30. Hagstrom A, Pommier T, Rohwer F, Simu K, Stolte W, et al. (2002) Use of 16S ribosomal DNA for delineation of marine bacterioplankton species. Appl Environ Microbiol 68: 3628–3633.
31. Giovannoni S, Rappe M (2002) Evolution, diversity and molecular ecology of marine Prokaryotes. In: Kirchman DL, editor. Microbial ecology of the oceans. New York: Wiley-Liss. pp. 47–84.
32. Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, et al. (2005) Comparative metagenomics of microbial communities. Science 308: 554–557.
33. Haft DH, Selengut JD, White O (2003) The TIGRFAMs database of protein families. Nucleic Acids Res 31: 371–373.
34. Conkright M, Levitus S, Boyer T (1994) World ocean atlas 1994. Volume 1: Nutrients. Washington, D.C.: U.S. Department of Commerce.
35. Levitus S, Burgett R, Boyer T (1994) World ocean atlas 1994. Volume 3: Nutrients. Washington, D.C.: U.S. Department of Commerce.
36. Parekh P, Follows MJ, Boyle E (2004) Modeling the global ocean iron cycle. Global Biogeochem Cycles 18: GB1002.
37. Scanlan DJ, Wilson WH (1999) Application of molecular techniques to addressing the role of P as a key effector in marine ecosystems. Hydrobiologia 401: 149–175.
38. Moore L, Ostrowski M, Scanlan D, Feren K, Sweetsir T (2005) Ecotypic variation in phosphorus acquisition mechanisms within marine picocyanobacteria. Aquat Microb Ecol 39: 257–269.
39. Kelemen BR, Du M, Jensen RB (2003) Proteorhodopsin in living color: Diversity of spectral properties within living bacterial cells. Biochim Biophys Acta 1618: 25–32.
40. Man D, Wang W, Sabehi G, Aravind L, Post AF, et al. (2003) Diversification and spectral tuning in marine proteorhodopsins. EMBO J 22: 1725–1731.
41. Man-Aharonovich D, Sabehi G, Sineshchekov OA, Spudich EN, Spudich JL, et al. (2004) Characterization of RS29, a blue-green proteorhodopsin variant from the Red Sea. Photochem Photobiol Sci 3: 459–462.
42. Bielawski JP, Dunn KA, Sabehi G, Beja O (2004) Darwinian adaptation of proteorhodopsin to different light intensities in the marine environment. Proc Natl Acad Sci U S A 101: 14824–14829.
43. Johnsen S, Sosik H (2005) Shedding light on light in the ocean. Oceanus Mag 43: 24–28.
44. Braun C, Smirnov S (1993) Why is water blue. J Chem Educ 70: 612–615.
45. Beja O, Aravind L, Koonin EV, Suzuki MT, Hadd A, et al. (2000) Bacterial rhodopsin: Evidence for a new type of phototrophy in the sea. Science 289: 1902–1906.
46. de la Torre JR, Christianson LM, Beja O, Suzuki MT, Karl DM, et al. (2003) Proteorhodopsin genes are distributed among divergent marine bacterial taxa. Proc Natl Acad Sci U S A 100: 12830–12835.
47. Sabehi G, Beja O, Suzuki MT, Preston CM, DeLong EF (2004) Different SAR86 subgroups harbour divergent proteorhodopsins. Environ Microbiol 6: 903–910.
48. Frigaard NU, Martinez A, Mincer TJ, DeLong EF (2006) Proteorhodopsin lateral gene transfer between marine planktonic bacteria and archaea. Nature 439: 847–850.
49. Yokoyama S (2000) Phylogenetic analysis and experimental approaches to study color vision in vertebrates. Methods Enzymol 315: 312–325.
50. Read TD, Salzberg SL, Pop M, Shumway M, Umayam L, et al. (2002) Comparative genome sequencing for discovery of novel polymorphisms in Bacillus anthracis. Science 296: 2028–2033.
51. Brown MV, Fuhrman JA (2005) Marine bacterial microdiversity as revealed by internal transcribed spacer analysis. Aquat Microb Ecol 41: 15–23.
52. Rocap G, Distel DL, Waterbury JB, Chisholm SW (2002) Resolution of Prochlorococcus and Synechococcus ecotypes by using 16S-23S ribosomal DNA internal transcribed spacer sequences. Appl Environ Microbiol 68: 1180–1191.
53. Schleper C, DeLong EF, Preston CM, Feldman RA, Wu KY, et al. (1998) Genomic analysis reveals chromosomal variation in natural populations of the uncultured psychrophilic archaeon Cenarchaeum symbiosum. J Bacteriol 180: 5003–5009.
54. Rogers AR, Harpending H (1992) Population growth makes waves in the distribution of pairwise genetic differences. Mol Biol Evol 9: 552–569.

55. Rappe MS, Connon SA, Vergin KL, Giovannoni SJ (2002) Cultivation of the ubiquitous SAR11 marine bacterioplankton clade. Nature 418: 630–633.

56. Johnson ZI, Zinser ER, Coe A, McNulty NP, Woodward EM, et al. (2006) Niche partitioning among Prochlorococcus ecotypes along ocean-scale environmental gradients. Science 311: 1737–1740.
57. Liu WT, Marsh TL, Cheng H, Forney LJ (1997) Characterization of microbial diversity by determining terminal restriction fragment length polymorphisms of genes encoding 16S rRNA. Appl Environ Microbiol 63: 4516–4522.
58. Fisher MM, Triplett EW (1999) Automated approach for ribosomal intergenic spacer analysis of microbial diversity and its application to freshwater bacterial communities. Appl Environ Microbiol 65: 4630–4636.
59. Hess WR, Rocap G, Ting CS, Larimer F, Stilwagen S, et al. (2001) The photosynthetic apparatus of Prochlorococcus: Insights through comparative genomics. Photosynth Res 70: 53–71.
60. Martiny AC, Coleman ML, Chisholm SW (2006) Phosphate acquisition genes in Prochlorococcus ecotypes: Evidence for genome-wide adaptation. Proc Natl Acad Sci U S A 103: 12552–12557.
61. Moore LR, Chisholm SW (1999) Photophysiology of the marine cyanobacterium Prochlorococcus: Ecotypic differences among cultured isolates. Limnol Oceanogr 44: 628–638.
62. Moore LR, Rocap G, Chisholm SW (1998) Physiology and molecular phylogeny of coexisting Prochlorococcus ecotypes. Nature 393: 464–467.
63. Rocap G, Larimer FW, Lamerdin J, Malfatti S, Chain P, et al. (2003) Genome divergence in two Prochlorococcus ecotypes reflects oceanic niche differentiation. Nature 424: 1042–1047.
64. Sullivan MB, Coleman ML, Weigele P, Rohwer F, Chisholm SW (2005) Three Prochlorococcus cyanophage genomes: Signature features and ecological interpretations. PLoS Biol 3: e144.
65. Ting CS, Rocap G, King J, Chisholm SW (2002) Cyanobacterial photosynthesis in the oceans: The origins and significance of divergent light-harvesting strategies. Trends Microbiol 10: 134–142.
66. Zinser ER, Coe A, Johnson ZI, Martiny AC, Fuller NJ, et al. (2006) Prochlorococcus ecotype abundances in the North Atlantic Ocean as revealed by an improved quantitative PCR method. Appl Environ Microbiol 72: 723–732.
67. Wayne LG, Brenner DJ, Colwell RR, Grimont PAD, Kandler O, et al. (1987) Report of the ad hoc committee on reconciliation of approaches to bacterial systematics. Int J Syst Bacteriol 37: 463–464.
68. Thompson JR, Pacocha S, Pharino C, Klepac-Ceraj V, Hunt DE, et al. (2005) Genotypic diversity within a natural coastal bacterioplankton population. Science 307: 1311–1313.
69. Acinas SG, Klepac-Ceraj V, Hunt DE, Pharino C, Ceraj I, et al. (2004) Fine-scale phylogenetic architecture of a complex bacterial community. Nature 430: 551–554.
70. Lawrence JG, Ochman H (1998) Molecular archaeology of the Escherichia coli genome. Proc Natl Acad Sci U S A 95: 9413–9417.
71. Scheffer M, Rinaldi S, Huisman J, Weissing FJ (2003) Why plankton communities have no equilibrium: Solutions to the paradox. Hydrobiologia 491: 9–18.
72. Hutchinson GE (1961) The paradox of the plankton. Am Nat 95: 137–145.
73. Cohan F (2002) Concepts of bacterial biodiversity for the age of genomics. In: Fraser CM, Read TD, Nelson KE, editors. Microbial genomes. Totowa (New Jersey): Humana Press. pp. 175–194.
74. Hacker J, Blum-Oehler G, Hochhut B, Dobrindt U (2003) The molecular basis of infectious diseases: Pathogenicity islands and other mobile genetic elements. A review. Acta Microbiol Immunol Hung 50: 321–330.
75. McCann KS (2000) The diversity-stability debate. Nature 405: 228–233.
76. Fuller NJ, West NJ, Marie D, Yallop M, Rivlin T, et al. (2005) Dynamics of community structure and phosphate status of picocyanobacterial populations in the Gulf of Aqaba, Red Sea. Limnol Oceanogr 50: 363–375.
77. Gill SR, Pop M, Deboy RT, Eckburg PB, Turnbaugh PJ, et al. (2006) Metagenomic analysis of the human distal gut microbiome. Science 312: 1355–1359.
78. Brown MV, Schwalbach MS, Hewson I, Fuhrman JA (2005) Coupling 16S-ITS rDNA clone libraries and automated ribosomal intergenic spacer analysis to show marine microbial diversity: Development and application to a time series. Environ Microbiol 7: 1466–1479.
79. Garcia-Martinez J, Rodriguez-Valera F (2000) Microdiversity of uncultured marine prokaryotes: The SAR11 cluster and the marine Archaea of Group I. Mol Ecol 9: 935–948.
80. Jenkins BD, Steward GF, Short SM, Ward BB, Zehr JP (2004) Fingerprinting Diazotroph communities in the Chesapeake Bay by using a DNA macroarray. Appl Environ Microbiol 70: 1767–1776.
81. Legeckis R (1988) Upwelling off the Gulfs of Panama and Papagayo in the tropical Pacific during March 1985. J Geophys Res 93: 15485–15489.

82. McCreary JP, Lee HS, Enfield DB (1989) The response of the coastal ocean to strong offshore winds: With application to circulation in the gulfs of Tehuantepec and Papagayo. J Mar Res 47: 81–109.
83. Palacios DM (2003) Oceanographic conditions around the Galapagos Archipelago and their influence on Cetacean community structure. [PhD diss]. Corvallis: Oregon State University. 178 p.
84. Christian JR, Lewis MR, Karl DM (1997) Vertical fluxes of carbon, nitrogen, and phosphorus in the North Pacific Subtropical Gyre near Hawaii. J Geophys Res 102: 15667–15677.
85. Doney SC, Abbott MR, Cullen JJ, Karl DM, Rothstein L (2004) From genes to ecosystems: The ocean’s new frontier. Front Ecol Environ 2: 457–466.
86. McGillicuddy DJ, Anderson LA, Doney SC, Maltrud ME (2003) Eddy-driven sources and sinks of nutrients in the upper ocean: Results from a 0.1° resolution model of the North Atlantic. Global Biogeochem Cycles 17: 1035.
87. van der Staay SYM, van der Staay GWM, Guillou L, Vaulot D, Claustre H, et al. (2000) Abundance and diversity of prymnesiophytes in the picoplankton community from the equatorial Pacific Ocean inferred from 18S rDNA sequences. Limnol Oceanogr 45: 98–109.
88. Blanchot J, Charpy L, Borgne RL (1989) Size composition of particulate organic matter in the lagoon of Tikehau Atoll (Tuamotu Archipelago). Mar Biol 102: 329–339.
89. Torreton JP, Dufour P (1996) Bacterioplankton production determined by DNA synthesis, protein synthesis and frequency of dividing cells in Tuamotu atoll lagoons and surrounding ocean. Microb Ecol 32: 185–202.
90. Torreton JP, Dufour P (1996) Temporal and spatial stability of bacterioplankton biomass and productivity in an atoll lagoon. Aquat Microb Ecol 11: 251–261.
91. Rutala WA, Weber DJ (1997) Uses of inorganic hypochlorite (bleach) in health-care facilities. Clin Microbiol Rev 10: 597–610.
92. Sambrook J, Fritsch EF, Maniatis T (1989) Molecular cloning. A laboratory manual. Cold Spring Harbor (NY): Cold Spring Harbor Laboratory Press.
93. Sanger F, Nicklen S, Coulson AR (1977) DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A 74: 5463–5467.
94. Edgar RC (2004) MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32: 1792–1797.
95. Felsenstein J (1989) PHYLIP: Phylogeny Inference Package (Version 3.2). Cladistics 5: 164–166.
96. Bingham J, Sudarsanam S (2000) Visualizing large hierarchical clusters in hyperbolic space. Bioinformatics 16: 660–661.
97. Bray JR, Curtis JT (1957) An ordination of upland forest communities of southern Wisconsin. Ecol Monogr 27: 325–349.
98. Myers G (1999) A fast bit-vector algorithm for approximate string matching based on dynamic programming. J ACM 46: 395–415.
99. Penn K, Wu D, Eisen JA, Ward N (2006) Characterization of bacterial communities associated with deep-sea corals on Gulf of Alaska Seamounts. Appl Environ Microbiol 72: 1680–1683.
100. Ludwig W, Strunk O, Westram R, Richter L, Meier H, et al. (2004) ARB: A software environment for sequence data. Nucleic Acids Res 32: 1363–1371.
101. Hugenholtz P (2002) Exploring prokaryotic diversity in the genomic era. Genome Biol 3: REVIEWS0003.
102. Angly F, Rodriguez-Brito B, Bangor D, McNairnie P, Breitbart M, et al. (2005) PHACCS, an online tool for estimating the structure and diversity of uncultured viral communities using metagenomic information. BMC Bioinformatics 6: 41.
103. R Development Core Team (2004) R: A language and environment for statistical computing [computer program]. Vienna, Austria: R Foundation for Statistical Computing. http://www.R-project.org.
104. Gomez-Consarnau L, Gonzalez JM, Coll-Llado M, Gourdon P, Pascher T, et al. (2007) Light stimulates growth of proteorhodopsin-containing marine Flavobacteria. Nature 445: 210–213.

Note Added in Proof

Recently, Gomez-Consarnau et al. provided credible evidence for the biological role of proteorhodopsins [104]. These results indicate that proteorhodopsins blur the line between heterotrophic and autotrophic microbes by allowing a wide range of organisms to harness light energy for respiration and growth. This reinforces the notion that the differential distribution of proteorhodopsin variants identified here reflects functional adaptation to the wavelengths of available light. Furthermore, these adaptations may be driven by the makeup of the microbial community. Thus, these distributional differences could reflect competition between microbes for light resources.


The Sorcerer II Global Ocean Sampling Expedition: Expanding the Universe of Protein Families

Shibu Yooseph¹*, Granger Sutton¹, Douglas B. Rusch¹, Aaron L. Halpern¹, Shannon J. Williamson¹, Karin Remington¹, Jonathan A. Eisen¹,², Karla B. Heidelberg¹, Gerard Manning³, Weizhong Li⁴, Lukasz Jaroszewski⁴, Piotr Cieplak⁴, Christopher S. Miller⁵, Huiying Li⁵, Susan T. Mashiyama⁶, Marcin P. Joachimiak⁶, Christopher van Belle⁶, John-Marc Chandonia⁶,⁷, David A. Soergel⁶, Yufeng Zhai³, Kannan Natarajan⁸, Shaun Lee⁸, Benjamin J. Raphael⁹, Vineet Bafna⁸, Robert Friedman¹, Steven E. Brenner⁶, Adam Godzik⁴, David Eisenberg⁵, Jack E. Dixon⁸, Susan S. Taylor⁸, Robert L. Strausberg¹, Marvin Frazier¹, J. Craig Venter¹

1 J. Craig Venter Institute, Rockville, Maryland, United States of America, 2 University of California, Davis, California, United States of America, 3 Razavi-Newman Center for Bioinformatics, Salk Institute for Biological Studies, La Jolla, California, United States of America, 4 Burnham Institute for Medical Research, La Jolla, California, United States of America, 5 University of California Los Angeles–Department of Energy Institute for Genomics and Proteomics, Los Angeles, California, United States of America, 6 University of California Berkeley, Berkeley, California, United States of America, 7 Physical Biosciences Division, Lawrence Berkeley National Laboratory, Berkeley, California, United States of America, 8 University of California San Diego, San Diego, California, United States of America, 9 Brown University, Providence, Rhode Island, United States of America

Metagenomics projects based on shotgun sequencing of populations of microorganisms yield insight into protein families. We used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling (GOS) sequences. The GOS dataset covers nearly all known prokaryotic protein families. A total of 3,995 medium- and large-sized clusters consisting of only GOS sequences are identified, out of which 1,700 have no detectable homology to known families. The GOS-only clusters contain a higher than expected proportion of sequences of viral origin, thus reflecting a poor sampling of viral diversity until now. Protein domain distributions in the GOS dataset and current protein databases show distinct biases. Several protein domains that were previously categorized as kingdom specific are shown to have GOS examples in other kingdoms. About 6,000 sequences (ORFans) from the literature that heretofore lacked similarity to known proteins have matches in the GOS data. The GOS dataset is also used to improve remote homology detection. Overall, besides nearly doubling the number of current proteins, the predicted GOS proteins also add a great deal of diversity to known protein families and shed light on their evolution. These observations are illustrated using several protein families, including phosphatases, proteases, ultraviolet-irradiation DNA damage repair enzymes, glutamine synthetase, and RuBisCO. The diversity added by GOS data has implications for choosing targets for experimental structure characterization as part of structural genomics efforts. Our analysis indicates that new families are being discovered at a rate that is linear or almost linear with the addition of new sequences, implying that we are still far from discovering all protein families in nature.

Citation: Yooseph S, Sutton G, Rusch DB, Halpern AL, Williamson SJ, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Expanding the universe of protein families. PLoS Biol 5(3): e16. doi:10.1371/journal.pbio.0050016

Academic Editor: Sean Eddy, Washington University St. Louis, United States of America

Received March 24, 2006; Accepted August 15, 2006; Published March 13, 2007

Copyright: © 2007 Yooseph et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abbreviations: aa, amino acid; ENS, Ensembl; EST, expressed sequence tag; GO, Gene Ontology; GOS, Global Ocean Sampling; GS, glutamine synthetase; HMM, hidden Markov model; IDO, indoleamine 2,3-dioxygenase; NCBI, National Center for Biotechnology Information; ORF, open reading frame; PDB, Protein Data Bank; PG, prokaryotic genomes; PP2C, protein phosphatase 2C; PSI, Protein Structure Initiative; RLP, RuBisCO-like protein; TGI, TIGR gene indices; TC, trusted cutoff; UVDE, UV dimer endonuclease

* To whom correspondence should be addressed. E-mail: [email protected]

This article is part of the Global Ocean Sampling collection in PLoS Biology. The full collection is available online at http://collections.plos.org/plosbiology/gos-2007.php.


Introduction

Despite many efforts to classify and organize proteins [1–6] from both structural and functional perspectives, we are far from a clear understanding of the size and diversity of the protein universe [7–9]. Environmental shotgun sequencing projects, in which genetic sequences are sampled from communities of microorganisms [10–14], are poised to make a dramatic impact on our understanding of proteins and protein families. These studies are not limited to culturable organisms, and there are no selection biases for protein classes or organisms. These studies typically provide a gene-centric (as opposed to an organism-centric) view of the environment and allow the examination of questions related to protein family evolution and diversity. The protein predictions from some of these studies are characterized both by their sheer number and diversity. For instance, the recent Sargasso Sea study [10] resulted in 1.2 million protein predictions and identified new subfamilies for several known protein families.

Protein exploration starts by clustering proteins into groups or families of evolutionarily related sequences. The notion of a protein family, while biologically very relevant, is hard to realize precisely in mathematical terms, thereby making the large-scale computational clustering and classification problem nontrivial. Techniques for these problems typically rely on sequence similarity to group sequences. Proteins can be grouped into families based on the highly conserved structural units, called domains, that they contain [15,16]. Alternatively, proteins are grouped into families based on their full sequence [17,18]. Many of these classifications, together with various expert-curated databases [19] such as Swiss-Prot [20], Pfam [15,21], and TIGRFAM [22,23], or integrated efforts such as Uniprot [24] and InterPro [25], provide rich resources for protein annotation. However, a vast number of protein predictions remain unclassified both in terms of structure and function. Given varying rates of evolution, there is unlikely to be a single similarity threshold or even a small set of thresholds that can be used to define every protein family in nature. Consequently, estimates of the number of families that exist in nature vary considerably based on the different thresholds used and assumptions made in the classification process [26–29].

In this study, we explored proteins using a comprehensive dataset of publicly available sequences together with environmental sequence data generated by the Sorcerer II Global Ocean Sampling (GOS) expedition [30]. We used a novel clustering technique based on full-length sequence similarity both to predict proteins and to group related sequences. The goals were to understand the rate of discovery of protein families with the increasing number of protein predictions, explore novel families, and assess the impact of the environmental sequences from the expedition on known proteins and protein families. We used hidden Markov model (HMM) profiling to examine the relative biases in protein domain distributions in the GOS data and existing protein databases. This profiling was also used to assess the impact of the GOS data on target selection for protein structure characterization efforts. We carried out in-depth analyses on several protein families to validate our clustering approach and to understand the diversity and evolutionary information that the GOS data added; the families included ultraviolet (UV) irradiation DNA damage repair enzymes, phosphatases, proteases, and the metabolic enzymes glutamine synthetase and RuBisCO.

Author Summary

The rapidly emerging field of metagenomics seeks to examine the genomic content of communities of organisms to understand their roles and interactions in an ecosystem. Given the wide-ranging roles microbes play in many ecosystems, metagenomics studies of microbial communities will reveal insights into protein families and their evolution. Because most microbes will not grow in the laboratory using current cultivation techniques, scientists have turned to cultivation-independent techniques to study microbial diversity. One such technique, shotgun sequencing, allows random sampling of DNA sequences to examine the genomic material present in a microbial community. We used shotgun sequencing to examine microbial communities in water samples collected by the Sorcerer II Global Ocean Sampling (GOS) expedition. Our analysis predicted more than six million proteins in the GOS data, nearly twice the number of proteins present in current databases. These predictions add tremendous diversity to known protein families and cover nearly all known prokaryotic protein families. Some of the predicted proteins had no similarity to any currently known proteins and therefore represent new families. A higher than expected fraction of these novel families is predicted to be of viral origin. We also found that several protein domains that were previously thought to be kingdom specific have GOS examples in other kingdoms. Our analysis opens the door for a multitude of follow-up protein family analyses and indicates that we are a long way from sampling all the protein families that exist in nature.

Results/Discussion

Data Generation, Sequence Clustering, and HMM Profiling

We used the following publicly available datasets in this study (Table 1): the National Center for Biotechnology Information (NCBI)’s nonredundant protein database (NCBI-nr) [31,32], NCBI Prokaryotic Genomes (PG) [31,33], TIGR Gene Indices (TGI-EST) [34], and Ensembl (ENS) [35,36]. The rationale for including these datasets is discussed in Materials and Methods. All datasets were downloaded on February 10, 2005.

None of the above-mentioned databases contained sequences from the Sargasso Sea study [10], the largest environmental survey to date, and so we pooled reads from the Sargasso Sea study with the reads from the Sorcerer II GOS expedition [30], creating a combined set that we call the GOS dataset. The GOS dataset was assembled using the Celera Assembler [37] as described in [30] (see Materials and Methods). The GOS dataset was primarily generated from the 0.1 µm to 0.8 µm size filters and thus is expected to be mostly microbial [30]. The data also included a small set of sequences from a viral size (<0.1 µm) fraction (Table 1).

We identified open reading frames (ORFs) from the DNA sequences in the PG, TGI-EST, and GOS datasets. An ORF is commonly defined as a translated DNA sequence that begins with a start codon and ends with a stop codon. To accommodate partial DNA sequences, we extended this definition to allow an ORF to be bracketed by either a start codon or the start of the DNA sequence, and by either a stop codon or the end of the DNA sequence. ORFs were generated by considering translations of the DNA sequence in all six frames. For ORFs from the PG and TGI-EST datasets, we used the appropriate codon usage table for the known organism. For GOS ORFs from the assembled sequences, we used translation table 11 (the code for bacteria, archaea, and prokaryotic viruses) [31]. We did not include alternate codon translations in this analysis. For all datasets, only ORFs containing at least 60 amino acids (aa) were considered. Not all ORFs are proteins. In this paper, ORFs that have reasonable evidence for being proteins are called predicted proteins; other ORFs are called spurious ORFs.
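The extended ORF definition above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function names are hypothetical, the start-codon set (ATG, GTG, TTG) is an assumption covering only the most common bacterial starts, and the first codon is translated literally rather than as methionine. The codon-to-amino-acid mapping is the standard one, which translation table 11 shares.

```python
from itertools import product

# Codon -> amino acid mapping in the conventional TCAG order. Translation
# table 11 assigns the same amino acids as the standard code; the two tables
# differ in which codons may act as starts.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

START_CODONS = {"ATG", "GTG", "TTG"}  # assumption: common bacterial starts only
MIN_ORF_LENGTH = 60                   # minimum ORF length in amino acids


def reverse_complement(dna):
    """Return the reverse complement; non-ACGT symbols are left unchanged."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def orfs_in_frame(dna, min_length=MIN_ORF_LENGTH):
    """Translate one reading frame and collect ORFs of at least min_length aa."""
    orfs = []
    current = []
    in_orf = True  # the start of the sequence is an allowed left boundary
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        amino_acid = CODON_TABLE.get(codon, "X")  # "X" for ambiguous codons
        if amino_acid == "*":
            # A stop codon closes the current ORF, if one is open.
            if in_orf and len(current) >= min_length:
                orfs.append("".join(current))
            current, in_orf = [], False
        elif in_orf:
            current.append(amino_acid)
        elif codon in START_CODONS:
            # After a stop, a new ORF opens only at a start codon.
            current, in_orf = [amino_acid], True
    # The end of the sequence is an allowed right boundary.
    if in_orf and len(current) >= min_length:
        orfs.append("".join(current))
    return orfs


def six_frame_orfs(dna, min_length=MIN_ORF_LENGTH):
    """Collect ORFs from three frames on each strand."""
    dna = dna.upper()
    orfs = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            orfs.extend(orfs_in_frame(strand[offset:], min_length))
    return orfs
```

Because a fragment's ends count as valid boundaries, a short read with no stop codon in a given frame yields one ORF spanning that whole frame, which is why many ORFs produced this way are expected to be spurious.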

In summary, the total input data for this study (Table 1) consisted of 28,610,944 sequences from NCBI-nr, PG, TGI-EST, ENS, and GOS. All data and analysis results will be made publicly available (see Materials and Methods).

We used a sequence similarity clustering to group related sequences and subsequently predicted proteins from this grouping. This approach of protein prediction was adopted for two reasons. First, the GOS data make up a major portion of the dataset being analyzed, and a large fraction of GOS ORFs are fragmentary sequences. Traditional annotation pipelines/gene finders, which presume complete or near-complete genomic data, perform unsatisfactorily on this type of data. Second, protein prediction based on the comparison of ORFs to known protein sequences imposes limits on the protein families that can be explored. In particular, novel proteins that belong to known families will not be detected if they are sufficiently distant from known members of that family. This is the case even though there may be other novel proteins that can transitively link them to the known proteins. Similarly, truly novel protein families will also not be detected.

As the primary input to our clustering process, we computed the pairwise sequence similarity of the 28.6 million aa sequences in our dataset using an all-against-all BLAST search [38]. This required more than 1 million CPU hours on two large compute clusters (see Materials and Methods). The sequences were clustered in four steps (see Materials and Methods). In the first step, we identified a nonredundant set of sequences from the entire dataset using only pairwise matches with ≥98% similarity and involving ≥95% of the length of the shorter sequence. This step served the dual role of identifying highly conserved groups of sequences (where each group was represented by a nonredundant sequence) and removing redundancy in the dataset due to identical and near-identical sequences. Only nonredundant sequences were considered for further steps in our clustering procedure. In the second step, we identified core sets of similar sequences using only matches between two sequences involving ≥80% of the length of the longer sequence. We used a graph-theoretic procedure to identify dense subgraphs (the core sets) within a graph defined by these matches. While the match parameters we used in this step were more relaxed than those in the first step, we chose them to reduce the grouping of unrelated sequences while simultaneously reducing the unnecessary splitting of families. In the third step, these core sets were transformed into profiles, and we used a profile–profile method [39] to merge related core sets into larger groups. In the final step, we recruited sequences to core sets using sequence-profile matching (PSI-BLAST [40]) and BLAST matches to core set members. We required the match to involve ≥60% of the length of the sequence being recruited.

We identified and removed clusters containing likely spurious ORFs using two filters (see Materials and Methods). The first filter identified clusters containing shadow ORFs. The second filter identified clusters containing conserved but noncoding sequences, as indicated by a lack of selection at the codon level. Only clusters that remained after the two filtering steps and contained at least two nonredundant sequences are reported in this analysis.

We examined the distribution of known protein domains in the full dataset using profile HMMs [41] from the Pfam [15] and TIGRFAM [22] databases (see Materials and Methods). We labeled sequences that end up in clusters (containing at least two nonredundant sequences) or that have HMM matches as predicted proteins. The inclusion of the PG ORF set allowed for the evaluation of protein prediction using our clustering approach. A comparison of proteins predicted in the PG ORF set by our clustering against PG ORFs annotated as proteins by whole-genome annotation techniques revealed that our protein prediction method via clustering has a sensitivity of 83% and a specificity of 86% (see Materials and Methods). The HMM profiling allowed for the evaluation of our clustering technique’s grouping of sequences. We used Pfam models in two different ways for this assessment (see Materials and Methods) and make three observations. First, using a simple Pfam domain architecture-based evaluation, these clusters are mostly consistent as reflected by 93% of clusters having less than 2% unrelated pairs of sequences in them. Second, these clusters are quite conservative and can split domain families, with 58% of domain architectures being confined to single clusters and 88% of domain architectures having more than half of their occurrences in a single cluster. Third, the size distribution of these clusters is quite similar to the size distribution of clusters induced by Pfams.

Table 1. The Complete Dataset Consisted of Sequences from NCBI-nr, ENS, TGI-EST, PG, and GOS, for a Total of 28,610,944 Sequences

Dataset | Source | Number of Amino Acid Sequences | Mean Sequence Length | Brief Description
NCBI-nr | NCBI | 2,317,995 | 339 | Consists of protein sequences submitted to SWISS-PROT, PDB, PIR, and PRF, and also predicted proteins from both finished and unfinished genomes in GenBank, EMBL, and DDBJ.
PG ORFs | NCBI | 3,049,695 | 160 | ORFs identified from 222 prokaryotic genome projects. Organisms are listed in Protocol S1.
TGI-EST ORFs | TIGR Gene Index | 5,458,820 | 119 | ORFs identified from 72 datasets in which each dataset consists of EST assemblies. Organisms are listed in Protocol S1.
ENS | Ensembl | 361,668 | 466 | Sequences from 12 species, including human, mouse, rat, chimp, zebrafish, fruit fly, mosquito, honey bee, dog, two species of puffer fish, chicken, and worm.
GOS ORFs | J. Craig Venter Institute | 17,422,766 | 134 | ORFs identified from an assembly of 7.7 million reads. These reads include both the reads from the Sorcerer II GOS Expedition and the reads from the earlier Sargasso Sea study. Also included are 36,318 ORFs identified from an assembly of sequences collected from the viral size (< 0.1 μm) fraction of one sample.

doi:10.1371/journal.pbio.0050016.t001
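The length and similarity thresholds used in the first, second, and final clustering steps can be written as simple predicates over a pairwise match. The sketch below is illustrative only: the `Match` record, its field names, and the assumption that a single aligned length describes the match on both sequences are hypothetical simplifications, and the graph-theoretic and profile-based parts of the procedure are not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """Hypothetical summary of one pairwise BLAST match."""
    len_a: int          # length of the first sequence (aa)
    len_b: int          # length of the second sequence (aa)
    aligned_len: int    # aligned positions (assumed equal on both sequences)
    similarity: float   # percent similarity over the aligned region


def is_redundant_pair(m):
    """Step 1: >=98% similarity over >=95% of the shorter sequence."""
    shorter = min(m.len_a, m.len_b)
    return m.similarity >= 98.0 and 100 * m.aligned_len >= 95 * shorter


def is_core_set_edge(m):
    """Step 2: the match spans >=80% of the longer sequence."""
    longer = max(m.len_a, m.len_b)
    return 100 * m.aligned_len >= 80 * longer


def can_recruit(aligned_len, recruited_len):
    """Final step: the match spans >=60% of the sequence being recruited."""
    return 100 * aligned_len >= 60 * recruited_len
```

Anchoring the first test on the shorter sequence lets a fragment be absorbed by a longer near-identical sequence, whereas anchoring the second test on the longer sequence keeps a short fragment from linking two otherwise unrelated long sequences.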

Protein Prediction

Of the initial 28,610,944 sequences, we labeled 9,978,637 sequences (35%) as predicted proteins based on the clustering, of which nearly 60% are from GOS (Table 2). The HMM profiling labeled only an additional 226,743 (0.8%) sequences as predicted proteins, for a total of 10,205,380 predicted proteins. This indicates that our clustering method captures most of the sequences found by profile HMMs. For sequences both in clusters and with HMM matches, (on average) 73.5% of their length is covered by HMM matches. For sequences not in clusters but with HMM matches, this value is only 45.3%. Furthermore, while 64% of sequences in clusters have HMM matches, there are 3,550,901 sequences that are grouped into clusters but do not have HMM matches. Most of these clusters correspond either to families lacking profile HMMs or contain sequences that are too remote to match above the cutoffs used. The latter is an indication of the diversity added to known families that is not picked up by current profile HMMs.

Using our method, the predicted proteins constitute different fractions of the totals for the five datasets, with 87% for NCBI-nr, nearly 20% for both PG ORFs and TGI-EST ORFs, 92% for ENS, and 35% for GOS. The high rate of prediction for ENS is a reflection of the high degree of conservation of proteins across the metazoan genomes, whereas the prediction rates for PG ORFs and TGI-EST ORFs are similar to rates seen in other protein prediction approaches. The 13% of NCBI-nr sequences that we marked as spurious may constitute contaminants in the form of false predictions or organism-specific proteins. Nearly two-thirds of these sequences are labeled ‘‘hypotheticals,’’ ‘‘unnamed,’’ or ‘‘unknown.’’ This is more than twice the fraction of similarly labeled sequences (30%) in the full NCBI-nr dataset. Of the remaining one-third, half of them are less than 100 aa in length. This suggests that they are either fast-evolving short peptides, spurious predictions, or proteins that failed to meet the length-based thresholds in the clustering.

Based on the clustering and the HMM profiling, there is evidence for 6,123,395 proteins in the GOS dataset (Table 2). Given the fragmentary nature of the GOS ORFs (as a result of the GOS assembly [10,30]), it is not surprising that the average length of a GOS-predicted protein (199 aa) is smaller than the average length of predicted proteins in NCBI-nr (359 aa), PG ORFs (325 aa), TGI-EST ORFs (207 aa), and ENS (489 aa). The ratio of clustered ORFs to total ORFs is significantly higher for the GOS ORFs (34%) compared to PG ORFs (19%). This could be due to a large number of false-positive protein predictions in the GOS dataset. However, this is unlikely for a variety of reasons. Nearly 4.64 million GOS ORFs (26.6%) have significant BLAST matches (with an E-value ≤ 1 × 10^-10) to NCBI-nr sequences. The PG ORFs do not have a high false-positive rate compared to the submitted annotation for the prokaryotic genomes (see Materials and Methods). Most importantly, based on the fragmentary nature of GOS sequencing compared to PG sequencing, the number of shadow (spurious) ORFs ≥60 aa is significantly reduced (see Materials and Methods).

Some pairs of GOS-predicted proteins that belong to the same cluster are adjacent in the GOS assembly. While some of them correspond to tandem duplicate genes, an overwhelming fraction of the pairs are on mini-scaffolds [10], indicating that they are potentially pieces of the same protein (from the same clone) that we split into fragments. We estimate that this effect applies to 3% of GOS-predicted proteins. Sequencing errors and the use of the wrong translation table can also result in the ORF generation process producing split ORF fragments.

The combined set of predicted proteins in NCBI-nr, PG, TGI-EST, and ENS, as expected, has a lot of redundancy. For instance, most of the PG protein predictions are in NCBI-nr. Removing exact substrings of longer sequences (i.e., 100% identity) reduces this combined set to 3,167,979 predicted proteins. When we perform the same filtering on the GOS dataset, 5,654,638 predicted proteins remain. Thus, the GOS-predicted protein set is 1.8 times the size of the predicted protein set from current publicly available datasets. We used a simple BLAST-based scheme to assign kingdoms for the GOS sequences (see Materials and Methods). Of the sequences that we could annotate by kingdom, 63% of the sequences in the public datasets are from the eukaryotic kingdom, and 90.8% of the sequences in the GOS set are from the bacterial kingdom (Figure 1).

Table 2. Clustering and HMM Profiling Results Showing the Number of Predicted Proteins (Including Both Redundant and Nonredundant Sequences) in Each Dataset

Dataset | Original Set | Clustering (A) | HMM Profiling (B) | A ∩ B | A − B | B − A | Total Predicted Proteins (A ∪ B) | Mean Length of Sequence
NCBI-nr | 2,317,995 | 1,939,056 | 1,645,146 | 1,566,123 | 372,933 | 79,023 | 2,018,079 | 359
PG ORFs | 3,049,695 | 575,729 | 448,159 | 418,503 | 157,226 | 29,656 | 605,385 | 325
TGI-EST ORFs | 5,458,820 | 1,097,083 | 606,779 | 576,532 | 520,551 | 30,247 | 1,127,330 | 207
ENS | 361,668 | 319,855 | 253,007 | 241,671 | 78,184 | 11,336 | 331,191 | 489
GOS ORFs | 17,422,766 | 6,046,914 | 3,701,388 | 3,624,907 | 2,422,007 | 76,481 | 6,123,395 | 199
Total | 28,610,944 | 9,978,637 | 6,654,479 | 6,427,736 | 3,550,901 | 226,743 | 10,205,380 | -

A ∩ B denotes the number of predicted proteins common to both the clustering and the HMM profiling; A − B, the number of predicted proteins in clusters but not in the HMM profile set; B − A, the number of predicted proteins in the HMM profile set but not in clusters; and A ∪ B, the total number of predicted proteins in each dataset.
doi:10.1371/journal.pbio.0050016.t002
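The set columns of Table 2 are related by inclusion and exclusion, so the derived columns and the per-dataset prediction rates quoted in this section can be recomputed from the first four numeric columns. The sketch below does that; the counts are copied from Table 2, while the dictionary layout and the function name `summarize` are illustrative choices.

```python
# (original set, clustering A, HMM profiling B, A intersect B), from Table 2.
TABLE2 = {
    "NCBI-nr": (2_317_995, 1_939_056, 1_645_146, 1_566_123),
    "PG ORFs": (3_049_695, 575_729, 448_159, 418_503),
    "TGI-EST ORFs": (5_458_820, 1_097_083, 606_779, 576_532),
    "ENS": (361_668, 319_855, 253_007, 241_671),
    "GOS ORFs": (17_422_766, 6_046_914, 3_701_388, 3_624_907),
}


def summarize(original, a, b, both):
    """Derive A-B, B-A, the union, and the prediction rate for one dataset."""
    union = a + b - both          # inclusion-exclusion
    return {
        "a_only": a - both,       # in clusters, no HMM match
        "b_only": b - both,       # HMM match, not in clusters
        "union": union,           # total predicted proteins
        "rate": union / original, # fraction of input labeled as protein
    }


for name, counts in TABLE2.items():
    row = summarize(*counts)
    print(f"{name}: {row['union']:,} predicted ({100 * row['rate']:.1f}%)")
```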

Protein Clustering

The 9,978,637 protein sequences predicted by our clustering method are grouped into 297,254 clusters of size two or more, where size of a cluster is defined to be the number of nonredundant sequences in the cluster. There are 280,187 small clusters (size < 20), 12,992 medium clusters (size between 20 and 200), and 4,075 large clusters (size > 200). While the 17,067 medium- and large-sized clusters constitute only 6% of the total number of clusters, they account for 85% of all the sequences that are clustered (Table 3). Many of the largest clusters correspond to families that have functionally diversified and expanded (Table 4). While some large families, such as the HIV envelope glycoprotein family and the immunoglobulins, also reflect biases in sequence databases, many more, including ABC transporters, kinases, and short-chain dehydrogenases, reflect their expected abundance in nature.
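The 6% and 85% figures above follow directly from the size bins reported in Table 3. A short sketch of that arithmetic is given below; the counts are copied from Table 3, and the tuple layout and variable names are illustrative.

```python
# (smallest cluster size in bin, number of clusters, total sequences), Table 3.
TABLE3_BINS = [
    (2, 214_033, 756_269),
    (5, 48_348, 415_166),
    (10, 17_806, 350_918),
    (20, 7_255, 310_770),
    (50, 3_086, 337_296),
    (100, 2_631, 595_903),
    (200, 2_134, 1_036_567),
    (500, 799, 914_207),
    (1_000, 620, 1_503_116),
    (2_000, 542, 3_758_425),
]

total_clusters = sum(n for _, n, _ in TABLE3_BINS)
total_sequences = sum(s for _, _, s in TABLE3_BINS)

# Medium and large clusters are those with at least 20 nonredundant sequences.
big_clusters = sum(n for size, n, _ in TABLE3_BINS if size >= 20)
big_sequences = sum(s for size, _, s in TABLE3_BINS if size >= 20)

cluster_share = 100 * big_clusters / total_clusters
sequence_share = 100 * big_sequences / total_sequences
print(f"{big_clusters:,} clusters ({cluster_share:.1f}%) hold "
      f"{sequence_share:.1f}% of clustered sequences")
```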

Rate of Discovery of Protein Families

We examined the rate of discovery of protein families using our clustering method to determine whether our sampling of the protein universe is reaching saturation. We find that for the present number of sequences there is an approximately linear trend in the rate of discovery of clusters with the addition of new (i.e., nonredundant) sequences (Figure 2). Moreover, the observed distribution of cluster sizes is well approximated by a power law [42,43], and this observed power law can be used to predict the rate of growth of the number of clusters of a given size (see Materials and Methods). This rate is dependent on the value of the power law exponent and decreases with increasing cluster sizes. We find good agreement between the observed and predicted growth rates for different cluster sizes. The approximately linear relationship between the number of clusters and the number of protein sequences indicates that there are likely many more protein families (either novel or subfamiliesdistantly related to known families) remaining to be discovered.
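As an illustration of the two fits mentioned above, the sketch below implements ordinary least squares and applies it in two ways: to cluster counts against sequence counts (the linear trend), and to log-transformed size and frequency values (a rough estimate of a power-law exponent). Both inputs are synthetic numbers invented for the example, not values from this study, and log-log regression is only one simple way to estimate an exponent; it is not necessarily the procedure used in Materials and Methods.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least squares; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Synthetic example 1: clusters growing linearly with sequences added.
sequences = [1, 2, 3, 4, 5, 6, 7]
clusters = [0.027 * x + 0.01 for x in sequences]
slope, intercept, r2 = linear_fit(sequences, clusters)

# Synthetic example 2: cluster-size frequencies following count ~ size**-2.
sizes = [2, 4, 8, 16, 32]
counts = [1000.0 * s ** -2 for s in sizes]
exponent, _, _ = linear_fit([math.log(s) for s in sizes],
                            [math.log(c) for c in counts])
print(f"slope={slope:.3f}, R^2={r2:.3f}, power-law exponent={exponent:.2f}")
```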

GOS versus Known Prokaryotic versus Known Nonprokaryotic

We also examined the GOS coverage of known proteins and protein families. Based on the cell-size filtering performed while collecting the GOS samples, we expected that the sample would predominantly be a size-limited subset of prokaryotic organisms [30]. We studied the content of the 17,067 medium- and large-sized clusters across three groupings: (1) GOS, (2) known prokaryotic (PG together with bacterial and archaeal portions of NCBI-nr), and (3) known nonprokaryotic (TGI-EST and ENS together with viral and eukaryotic portions of NCBI-nr). The Venn diagram in Figure 3 shows the breakdown of these clusters by content (see Materials and Methods). The largest section contains GOS-only clusters (23.40%), emphasizing the significant novelty provided by the GOS data. The next section consists of clusters containing sequences from only the known nonprokaryotic grouping (20.78%), followed closely by the section containing clusters with sequences from all three groupings (20.23%). The large known nonprokaryotic–only grouping shows that our current GOS sampling methodology will not cover all protein families, and perhaps misses some protein families that are exclusive to higher eukaryotes. The large section of clusters that include all three groupings indicates a large core of well-conserved protein families across all domains of life. In contrast, the known prokaryotic protein families are almost entirely covered by the GOS data.

Figure 1. Proportion of Sequences for Each Kingdom

(A) The combined set of NCBI-nr, PG, TGI-EST, and ENS has 3,167,979 sequences. The eukaryotes account for the largest portion, which is more than twice the bacterial fraction.
(B) Predicted kingdom proportion of sequences in GOS. Out of the 5,654,638 GOS sequences, 5,058,757 are assigned kingdoms using a BLAST-based scheme. The bacterial kingdom forms by far the largest fraction in the GOS set.
doi:10.1371/journal.pbio.0050016.g001

Table 3. Cluster Size Distribution and the Distribution of Sequences in These Clusters

Cluster Size | Number of Clusters | Total Sequences | NCBI-nr | PG | TGI-EST | ENS | GOS
2–4 | 214,033 | 756,269 | 194,297 | 87,699 | 149,687 | 32,920 | 291,666
5–9 | 48,348 | 415,166 | 97,759 | 30,565 | 71,414 | 14,828 | 200,600
10–19 | 17,806 | 350,918 | 90,682 | 19,904 | 60,783 | 23,493 | 156,056
20–49 | 7,255 | 310,770 | 78,153 | 13,809 | 58,496 | 26,486 | 133,826
50–99 | 3,086 | 337,296 | 80,470 | 14,342 | 55,190 | 26,150 | 161,144
100–199 | 2,631 | 595,903 | 165,846 | 28,100 | 107,490 | 40,465 | 254,002
200–499 | 2,134 | 1,036,567 | 218,940 | 57,131 | 164,581 | 49,797 | 546,118
500–999 | 799 | 914,207 | 148,084 | 54,077 | 90,020 | 24,047 | 597,979
1,000–2,000 | 620 | 1,503,116 | 205,196 | 79,348 | 105,866 | 21,883 | 1,090,823
≥2,000 | 542 | 3,758,425 | 659,629 | 190,754 | 233,556 | 59,786 | 2,614,700
Total | 297,254 | 9,978,637 | 1,939,056 | 575,729 | 1,097,083 | 319,855 | 6,046,914

The size of a cluster is the number of nonredundant sequences in it. Column three shows the total number of sequences (both redundant and nonredundant) in these clusters. The succeeding columns show their breakdown by the five datasets. There are 17,067 medium- and large-size clusters.
doi:10.1371/journal.pbio.0050016.t003
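The three-way breakdown described above amounts to mapping each sequence to one of the three groupings and then counting clusters by the set of groupings they contain. The sketch below shows one way to do this; the function names, the kingdom labels, and the toy clusters are hypothetical, and the actual procedure is described in Materials and Methods.

```python
from collections import Counter


def grouping(dataset, kingdom=None):
    """Map a sequence's source dataset (and kingdom, for NCBI-nr) to a grouping."""
    if dataset == "GOS":
        return "GOS"
    if dataset == "PG":
        return "known prokaryotic"
    if dataset in ("TGI-EST", "ENS"):
        return "known nonprokaryotic"
    if dataset == "NCBI-nr":
        if kingdom in ("bacteria", "archaea"):
            return "known prokaryotic"
        if kingdom in ("viruses", "eukaryota"):
            return "known nonprokaryotic"
    raise ValueError(f"cannot assign grouping for {dataset!r}, {kingdom!r}")


def venn_breakdown(clusters):
    """Percentage of clusters in each Venn section.

    clusters maps a cluster ID to a list of (dataset, kingdom) pairs;
    a section is the frozenset of groupings present in the cluster.
    """
    sections = Counter(
        frozenset(grouping(d, k) for d, k in members)
        for members in clusters.values()
    )
    total = sum(sections.values())
    return {section: 100 * n / total for section, n in sections.items()}


# Toy input, invented for illustration only.
toy_clusters = {
    "c1": [("GOS", None), ("GOS", None)],
    "c2": [("GOS", None)],
    "c3": [("GOS", None), ("PG", None)],
    "c4": [("GOS", None), ("NCBI-nr", "bacteria"), ("ENS", None)],
}
breakdown = venn_breakdown(toy_clusters)
```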

Novelty Added by GOS Data

There are 3,995 medium and large clusters that contain only sequences from the GOS dataset. Some are divergent members of known families that failed to be merged by the clustering parameters used, or are too divergent to be detected by any current homology detection methods. The remaining clusters are completely novel families. In exploring the 3,995 GOS-only clusters, 44.9% of them contain sequences that have HMM matches, or BLAST matches to sequences in a more recent snapshot of NCBI-nr (downloaded in August 2005) than was used in this study. The recent NCBI-nr matches include phage sequences from cyanophages (P-SSM2 and P-SSM4) [44] and sequences from the SAR-11 genome (Candidatus pelagibacter ubique HTCC1062) [45]. We used profile–profile searches [39] to show that an additional 12.5% of the GOS-only clusters can be linked to profiles built from Protein Data Bank (PDB), COG, or Pfam. The 2,295 clusters with detected homology are referred to as Group I clusters. The remaining 1,700 (42.6%) GOS-only clusters with no detectable homology to known families are labeled as Group II clusters.

Table 4. List of the Top 25 Clusters from the Clustering Process

Cluster ID | Cluster Annotation | Nonredundant Sequences | Total Sequences | NCBI-nr | PG | TGI-EST | ENS | GOS
3510 | Immunoglobulin | 37,227 | 51,944 | 49,206 | 0 | 1,649 | 1,089 | 0
2568 | ABC transporter | 34,130 | 69,010 | 8,886 | 6,248 | 150 | 13 | 53,713
49 | Short chain dehydrogenase | 33,406 | 56,266 | 7,607 | 3,055 | 2,852 | 747 | 42,005
4294 | NAD dependent epimerase/dehydratase | 29,445 | 35,555 | 2,745 | 1,265 | 1,500 | 111 | 29,934
1239 | AMP-binding enzyme | 22,111 | 37,598 | 3,838 | 1,614 | 2,246 | 613 | 29,287
2630 | Envelope glycoprotein | 21,161 | 41,205 | 41,189 | 2 | 10 | 0 | 4
157 | Glycosyl transferases group 1 | 20,366 | 27,012 | 2,766 | 1,446 | 557 | 42 | 22,201
183 | Integral membrane protein | 17,627 | 33,079 | 2,154 | 1,298 | 1,198 | 95 | 28,334
530 | Aldehyde dehydrogenase | 15,851 | 30,929 | 3,116 | 1,349 | 1,589 | 388 | 24,487
1308 | Aminotransferase class-V and DegT/DnrJ/EryC1/StrS aminotransferase | 15,757 | 22,484 | 1,849 | 1,086 | 413 | 71 | 19,065
244 | Kinase family, including pknb, epk, c6 | 15,112 | 21,641 | 6,384 | 83 | 10,809 | 2,761 | 1,604
336 | Histidine kinase–, DNA gyrase B–, and HSP90-like ATPase | 14,724 | 23,355 | 3,809 | 2,469 | 54 | 4 | 17,019
357 | Tetratricopeptide repeat | 14,323 | 17,058 | 1,598 | 609 | 1,320 | 315 | 13,216
4325 | Alpha/Beta hydrolase fold | 13,806 | 20,886 | 2,828 | 1,334 | 1,625 | 196 | 14,903
113 | Aminotransferase class I and II | 13,006 | 22,186 | 2,931 | 1,534 | 1,239 | 120 | 16,362
333 | Zinc-binding dehydrogenase | 12,737 | 22,298 | 4,055 | 1,370 | 2,383 | 269 | 14,221
1315 | tRNA synthetases class I (I, L, M, and V) | 12,545 | 19,992 | 1,152 | 600 | 472 | 131 | 17,637
26 | Acyl-CoA dehydrogenase | 12,150 | 22,340 | 2,081 | 1,152 | 541 | 179 | 18,387
159 | ABC transporter and ABC transporter transmembrane | 11,984 | 17,650 | 2,697 | 1,442 | 797 | 170 | 12,544
3357 | Cytochrome P450 | 11,929 | 17,302 | 5,355 | 249 | 6,994 | 1,399 | 3,305
4556 | Response regulator | 11,928 | 21,903 | 5,387 | 3,320 | 348 | 5 | 12,843
1720 | TonB-dependent receptor | 11,890 | 17,080 | 1,789 | 1,090 | 34 | 2 | 14,165
514 | NADH dehydrogenase (various subunits) | 11,224 | 25,068 | 11,624 | 635 | 253 | 10 | 12,546
4235 | Glycosyl transferase family 2 | 10,954 | 13,593 | 1,236 | 724 | 74 | 14 | 11,545
186 | 7 transmembrane receptor | 10,654 | 22,252 | 13,943 | 0 | 1,475 | 6,829 | 5

Clusters were annotated using the most commonly matching Pfam domains. Many of these clusters correspond to families that have expanded and functionally diversified.
doi:10.1371/journal.pbio.0050016.t004

Figure 2. Rate of Discovery of Clusters as (Nonredundant) Sequences Are Added

The x-axis denotes the number of sequences (in millions) and the y-axis denotes the number of clusters (in thousands). Seven datasets with increasing numbers of (nonredundant) sequences are chosen as described in the text. The blue curve shows the number of core sets of size ≥3 for the seven datasets. Curves for core set sizes ≥5, ≥10, and ≥20 are also shown. Linear regression gives slopes 0.027 (R² = 0.999), 0.011 (R² = 0.999), 0.0053 (R² = 0.999), and 0.0024 (R² = 0.996) for size ≥3, size ≥5, size ≥10, and size ≥20, respectively.
doi:10.1371/journal.pbio.0050016.g002

We applied a guilt-by-association operon method to annotate the GOS-only clusters with a strategy that did not rely on direct sequence homology to known families. Function was inferred for the GOS-only clusters by examining their same-strand neighbors on the assembly (see Materials and Methods). Similar strategies have been successfully used to infer protein function in finished microbial genomes [46–48]. Despite minimal assembly of GOS reads, many scaffolds and mini-scaffolds contain at least partial fragments of more than one predicted ORF, thereby making this approach feasible. For 90 (5.3%) of the Group II clusters, and for 214 (9.3%) of the Group I clusters, at least one Gene Ontology (GO) [49] biological process term at p-value ≤ 0.05 can be inferred. The inferred functions and neighbors of some of these GOS-only clusters are highlighted in Table 5. We observed that for Group I clusters, the neighbor-inferred function is often bolstered by some information from weak homology to known sequences. While neighboring clusters as a whole are of diverse function, a number of GOS-only clusters seem to be next to clusters implicated in photosynthesis or electron transport. These GOS-only clusters could be of viral origin, as cyanophage genomes contain and express some photosynthetic genes that appear to be derived from their hosts [44,50,51]. In support of these observations, we identified five photosynthesis-related clusters containing hundreds to thousands of viral sequences, including psbA, psbD, petE, SpeD, and hli in the GOS data; furthermore, our nearest-neighbor analysis of these sequences reveals the presence of multiple viral proteins (unpublished data).

Although the majority of GOS-only sequences are bacterial, a higher than expected proportion of the GOS-only clusters are predicted to be of viral origin, implying that viral sequences and families are poorly explored relative to other microbes. To assign a kingdom to the GOS-only clusters, we first inferred the kingdom of neighboring sequences based on the taxonomy of the top four BLAST matches to the NCBI-nr database (see Materials and Methods). A possible kingdom was assigned to the GOS-only cluster if more than 50% of assignable neighboring sequences belong to the same kingdom. Viewed in this way, 11.8% of Group I clusters and 17.3% of Group II clusters with at least one kingdom-assigned neighbor have more than 50% viral neighbors (Figure 4). Only 3.3% and 3.4% of random samples of clusters with size distributions matching that of Group I and Group II clusters have more than 50% viral neighbors, while 7.7% of all clusters pass this criterion. A total of 547 GOS-only clusters contain sequences collected from the viral size fraction included in the GOS dataset. For these clusters, 38.9% of the Group I subset and 27.5% of the Group II subset with one or more kingdom-assigned neighbors would be inferred as viral, based on the conservative criteria of having more than 50% viral assignable neighbors. Several alternative kingdom assignment methods were tried (see Materials and Methods) and provide for a similar conclusion.

The GOS-only clusters also tend to be more AT-rich than sequences from a random size-matched sample of clusters (35.9% ± 8% GC content for Group II clusters versus 49.5% ± 11% GC content for sample). Phage genomes with a Prochlorococcus host [44] are also AT rich (37% average GC content). Our analysis of the graph constructed based on inferred operon linkages between all clusters indicates that the GOS-only clusters may constitute large sets of cotranscribed genes (see Materials and Methods).

The high proportion of potentially viral novel clusters observed here is reasonable, as 60%–80% of the ORFs in most finished marine phage genomes are not homologous to known protein sequences [52]. Viral metagenomics projects have reported an equally high fraction of novel ORFs [53], and a recent marine metagenomics project estimated that up to 21% of photic zone sequences could be of viral origin [51]. It has also been reported that 40% of ORFans (sequences that lack similarity to known proteins and predicted proteins) exist in close spatial proximity to each other in bacterial genomes, and this combined with proximity to integration signals has been used to suggest a viral horizontally transferred origin for many bacterial ORFans [54]. Others have noted a clustering of ORFans in genome islands and suggested they derive from a phage-related gene pool [55]. A recent analysis of genome islands from related Prochlorococcus found that phage-like genes and novel genes cohabit these dynamic areas of the genome [56]. In our GOS-only clusters, 37 of the 1,700 clusters with no detectable similarity (2.2%) have at least ten bacterial-classified and ten viral-classified neighboring ORFs. This is 6.2-fold higher than the rate seen for the size-matched sample of all clusters (six clusters, 0.35%). This would seem to add more support to a phage origin for at least some ORFans found in bacterial genomes.

Figure 3. Venn Diagram Showing Breakdown of the 17,067 Medium and Large Clusters by Three Categories: GOS, Known Prokaryotic, and Known Nonprokaryotic

doi:10.1371/journal.pbio.0050016.g003

Table 5. Neighbor-Based Inference of Function for Novel Clusters of GOS Sequences

Novel Cluster ID | Inferred Function: GO ID | Inferred Function: Biological Process | p-Value (a) | Neighboring Clusters with Contributing GO Annotation (Cluster ID: GO Annotation) | Other Neighbors of Interest (b) | Comments
8837 | GO:0006260 | DNA replication | 4.70 × 10^-4 | 812: ATPase involved in DNA replication; 2,655: DNA polymerase family B | Phage Mu Mom DNA modification enzyme; DNA methylase | Profile–profile match: DNA polymerase processivity factor
12519 | GO:0006118 | Electron transport | 4.54 × 10^-3 | 1,362: Cytochrome c oxidase subunit III; 1,771: SCO1/SenC, biogenesis of photosynthetic systems | - | Profile–profile match: PF03626, cytochrome c oxidase subunit IV; 3 predicted transmembrane helices
11010151 | GO:0017004 | Cytochrome complex assembly | ≤1.00 × 10^-5 | 8,136: Thioredoxin; 9,364: Cytochrome c biogenesis protein; 1,317: Cytochrome c assembly protein | - | >20 diverse profile–profile matches, one of which is cytochrome c biogenesis factor ccmH_2
18456 | GO:0009252 | Peptidoglycan biosynthesis | ≤1.00 × 10^-5 | 1,252: FAD binding domain; 10,764: UDP-N-acetylenolpyruvoylglucosamine reductase | Extracytoplasmic function (ECF) sigma factor 24; Viral RNA helicase | One predicted TM helix
14219 | GO:0009628 | Response to abiotic stimulus | 3.10 × 10^-4 | 5,936: Colicin V production protein | MatE multidrug efflux pump | Predicted soluble; exclusively neighbors to just two clusters.
11480 | GO:0015031 | Protein transport | 3.00 × 10^-4 | 4,177: MotA/TolQ/ExbB proton channel family; 9,569: Biopolymer transport protein ExbD/TolR | - | Four predicted TM helices; Tol proteins facilitate transport of colicins, iron, and phage DNA
14360 | GO:0006777 | Mo-molybdopterin cofactor biosynthesis | ≤1.00 × 10^-5 | 9,745: MoaC family; 9,948: MoaE protein; 255: Radical SAM superfamily | Sulfite oxidase; Predicted thioesterase | SAR11 blast match annotated as probable moaD; profile–profile matches to ThiS and molybdopterin converting factor; <.05% of sequences have PFAM match to ThiS family
8397 | GO:0017004 | Cytochrome complex assembly | ≤1.00 × 10^-5 | 8,136: Thioredoxin; 9,364: Uncharacterized cytochrome c biogenesis protein | SMC superfamily (homologous to ABC family) | Blast match to ‘‘periplasmic or inner membrane associated protein’’; two predicted TM helices; 0.7% of sequences have PFAM match to cytochrome c biogenesis protein
13909 | GO:0015979 | Photosynthesis | ≤1.00 × 10^-5 | 13,990: Photosystem II reaction centre N protein (psbN); 5,184: Photosynthetic reaction centre protein D1 (psbA); 7,664: Ferredoxin-dependent bilin reductase | - | Predicted soluble; single blast match to cyanophage P-SSM2 hypothetical protein; many phage proteins as minor neighbors

(a) p-Values were computed by simulating 100,000 neighbor cluster sets of equivalent size.
(b) Not all clusters could be mapped to a GO term.
doi:10.1371/journal.pbio.0050016.t005
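The neighbor-based kingdom assignment used in this section (a kingdom is assigned when more than 50% of the assignable neighboring sequences share it) can be sketched as a strict-majority vote. The function name and the use of `None` for neighbors without a kingdom call are illustrative assumptions; how each neighbor's kingdom is derived from its top BLAST matches is not shown.

```python
from collections import Counter


def assign_kingdom(neighbor_kingdoms):
    """Assign a kingdom if more than 50% of assignable neighbors share it.

    neighbor_kingdoms holds one label per neighboring sequence, with None
    for neighbors whose kingdom could not be inferred. Returns the majority
    kingdom, or None when no kingdom exceeds 50% of assignable neighbors.
    """
    assignable = [k for k in neighbor_kingdoms if k is not None]
    if not assignable:
        return None
    kingdom, count = Counter(assignable).most_common(1)[0]
    # Strict majority, compared in integers to avoid rounding issues.
    return kingdom if 2 * count > len(assignable) else None
```

For example, a cluster with neighbors labeled viral, viral, bacterial, and one unassignable neighbor would be called viral (two of three assignable neighbors), whereas an even split between viral and bacterial neighbors yields no assignment.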

If a sizable portion of the novel families in the GOS data are in fact of viral origin, it suggests that we are far from fully exploring the molecular diversity of viruses, a conclusion echoed in previous studies of viral metagenomes [53,57,58]. In studies of bacterial genomes, discovery of new ORFans shows no sign of reaching saturation [59]. Coverage of many phage families in the GOS data may be low, given that there are inherent differences in the abundance of their presumed bacterial hosts. These GOS-only clusters were operationally defined as having at least 20 nonredundant sequences. Reducing this threshold to ten nonredundant sequences adds 7,241 additional clusters. Whether this vast diversity represents new families or is a reflection of the inability to detect distant homology will require structural and biochemical studies, as well as continued development of computational methods to identify remotely related sequences.

Comparison of Domain Profiles in GOS and PG Datasets

We used HMM profiling to address the question of which biochemical and biological functions are expanded or contracted in GOS compared to the largely terrestrial genomes in PG. Significant differences are seen in 68% of domains (4,722 out of the 6,975 domains that match either GOS or PG; p-value < 0.001, chi-square test). These differences reflect several factors, including differing biochemical needs of oceanic life and taxonomic biases in the two datasets. An initial comparison of these domain profiles helps shed light on these factors. Of the GOS-only domains, 91% (964/1,056) are viral and/or eukaryotic specific (by Pfam annotation). Most of the remaining 92 domains are rare (63 domains have fewer than ten copies in GOS), are predominantly eukaryotic/viral, or are specific to narrow bacterial taxa without completed genome sequences. Most of the 879 PG-only domains are also rare (444 have ten or fewer members), and/or are restricted to tight lineages, such as Mycoplasma (104 matches to five domains) or largely extremophile archaeal-specific domains (1,254 matches to 99 domains). Highly PG-enriched domains also tend to belong in these categories. Many moderately skewed domains reflect the taxonomic skew between PG and GOS. For instance, we found that a set of six sarcosine oxidase-related domains is 4.8-fold enriched in GOS (Table 6). They are mostly found in α- and γ-proteobacteria, which are widespread in GOS. Normalizing to the taxonomic class level predicts a 1.8-fold enrichment in GOS, indicating that taxonomy alone cannot fully explain the prevalence of these proteins in oceanic bacteria.
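The domain comparison above relies on a chi-square test of counts in PG versus GOS. The following is a minimal sketch of such a test for a single 2×2 table, not the authors' procedure: the contingency layout (hits to one function versus all other domain matches, in each dataset) is an assumption, and for illustration it uses the grouped flagellum counts and the dataset totals reported in Table 6 rather than a single Pfam domain.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction)
    for the contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = 0.0
    for observed, row, col in ((a, row1, col1), (b, row1, col2),
                               (c, row2, col1), (d, row2, col2)):
        expected = row * col / n
        chi2 += (observed - expected) ** 2 / expected
    return chi2

def p_value_df1(chi2):
    """Upper-tail probability of a chi-square variate with 1 degree
    of freedom: P(X > x) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(chi2 / 2.0))

# Totals of domain matches, as stated in the Table 6 legend.
PG_TOTAL = 1_513_880
GOS_TOTAL = 8_935_364

# Flagellum row of Table 6 (grouped counts, used here only as an example).
pg_hits, gos_hits = 3_771, 12_988

chi2 = chi_square_2x2(pg_hits, PG_TOTAL - pg_hits,
                      gos_hits, GOS_TOTAL - gos_hits)
p = p_value_df1(chi2)

# 10.828 is the chi-square critical value for 1 degree of freedom at p = 0.001.
print("significant at p < 0.001:", chi2 > 10.828 and p < 0.001)
```

With datasets of this size almost any moderate skew is statistically significant, which is consistent with the text reporting significant differences for most domains and then turning to effect sizes (fold enrichment) to interpret them.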

Figure 4. Enrichment in the GOS-Only Set of Clusters for Viral Neighbors

Cluster sets from left to right are: I, GOS-only clusters with detectable BLAST, HMM, or profile-profile homology (Group I); II, GOS-only clusters with no detectable homology (Group II); I-S, a sample from all clusters chosen to have the same size distribution as Group I; II-S, a sample from all clusters chosen to have the same size distribution as Group II; I-V, a subset of clusters in Group I containing sequences collected from the viral size fraction; II-V, a subset of clusters in Group II from the viral size fraction; and all clusters. Notice that although predominantly bacterial, GOS-only clusters are assigned as viral based on their neighbors more often than the size-matched samples and the set of all clusters.
doi:10.1371/journal.pbio.0050016.g004

Table 6. Functions Skewed in Domain Representation between PG and GOS

Process | Number of HMMs | Number in PG | Number in GOS | GOS Enrichment
Sarcosine oxidase | 6 | 686 | 19,295 | 4.766
Oxidative stress | 5 | 524 | 9,804 | 3.170
Ubiquinone synthesis | 4 | 245 | 4,035 | 2.790
RecA | 1 | 215 | 3,728 | 2.938
Topoisomerase IV | 4 | 2,163 | 33,472 | 2.622
Photosynthesis | 41 | 919 | 13,889 | 2.561
DNA polymerase | 20 | 3,682 | 51,224 | 2.357
tRNA synthetases | 11 | 5,499 | 71,294 | 2.197
Transketolase | 4 | 2,127 | 26,440 | 2.106
DNA gyrase | 7 | 4,146 | 49,677 | 2.030
TCA cycle | 30 | 12,057 | 135,294 | 1.901
Shikimate metabolism | 8 | 2,393 | 24,316 | 1.722
DnaJ | 3 | 1,103 | 12,389 | 1.891
Universal ribosomal components (found in all three kingdoms) | 39 | 8,555 | 80,321 | 1.591
UVR exonuclease operon | 6 | 4,108 | 38,223 | 1.577
ABC transporter | 39 | 193,689 | 727,314 | 0.636
Flagellum | 38 | 3,771 | 12,988 | 0.584
Sugar transport | 7 | 3,601 | 4,453 | 0.210
Transposase | 13 | 4,354 | 4,365 | 0.170
Che operon (chemotaxis) | 7 | 1,142 | 1,119 | 0.166
Ethanolamine | 9 | 231 | 218 | 0.160
Hydrogenases | 16 | 1,179 | 1,061 | 0.152
Pilus | 14 | 700 | 623 | 0.151
PTS phosphotransferase system | 32 | 11,439 | 6,661 | 0.099
Gas vesicle | 6 | 49 | 19 | 0.066
Gr+ nonspore | 22 | 1,063 | 52 | 0.008
Gr+ spores | 15 | 503 | 0 | 0.000

Functionally related families of domains were grouped by GO terms or by inspection to sum up total domain counts in GOS and PG. There were 8,935,364 domain matches in the GOS data (corresponding to 3,701,388 sequences) and 1,513,880 domain matches in the PG data (corresponding to 448,159 sequences). The GOS enrichment ratio is computed from columns three and four, and then normalized to account for the 5.9 times the number of domain matches in GOS compared to PG.
doi:10.1371/journal.pbio.0050016.t006
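The enrichment column of Table 6 can be reproduced, to within rounding, from the legend's description. The sketch below assumes the normalization factor is the ratio of total domain matches (8,935,364 / 1,513,880, about 5.9); the function name is illustrative and the rows shown are a small sample of the table.

```python
# Total domain matches in each dataset, from the Table 6 legend.
GOS_TOTAL = 8_935_364
PG_TOTAL = 1_513_880

def gos_enrichment(pg_count, gos_count):
    """GOS/PG count ratio, normalized by the ratio of total domain matches.
    Values above 1 indicate enrichment in GOS, values below 1 depletion."""
    if pg_count <= 0:
        raise ValueError("PG count must be positive")
    return (gos_count / pg_count) / (GOS_TOTAL / PG_TOTAL)

# (Number in PG, Number in GOS) for a few rows of Table 6.
rows = {
    "Sarcosine oxidase": (686, 19_295),
    "Oxidative stress": (524, 9_804),
    "Flagellum": (3_771, 12_988),
    "Gr+ spores": (503, 0),
}

for process, (pg, gos) in rows.items():
    print(f"{process}: {gos_enrichment(pg, gos):.3f}")
```

Small differences in the last decimal place relative to the published column are to be expected, since the exact normalization constant used by the authors is only given as "5.9 times".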


Mysterious Lack of Characteristic Gram-Positive Domains

Gram-positive bacteria (Firmicutes and Actinobacteria) represent 26.7% of PG and ~12% of GOS [30]. Given the larger size of the GOS dataset, one might predict Gram-positive–specific domains to be ~2.4-fold enriched in GOS. Instead, the opposite is consistently seen. Of 15 firmicute-specific spore-associated domains, PG has 503 members, but GOS has none. For another 22 firmicute-restricted domains of varying or unknown function, the PG/GOS ratio is 1797:77 (Table 6). Hence, it appears that GOS Gram-positive lineages lack most of their characteristic protein domains. Two sequenced marine Gram-positives (Oceanobacillus iheyensis [60] and Bacillus sp. NRRL B-14911) have a large complement of these domains. However, another recently assembled genome from Sargasso Sea surface waters, the actinomycete Janibacter sp. HTCC2649, has just two of these domains, and may reveal a whole-genome context for this curious loss of characteristic domains.

Flagellae and Pili Are Selectively Lost from Oceanic Species

Flagellum components from both eubacteria and archaea are significantly underrepresented in the GOS dataset by about 2-fold (Table 6). Ironically, at a bacterial scale, swimming may be worthwhile on an almost dry surface, but not in open water. The chemotaxis (che) operon that often directs flagellar activity is also rare in GOS. Another directional appendage, the pilus, is even more reduced, though its taxonomic distribution (mostly in proteobacteria, predominantly γ-proteobacteria) would have predicted enrichment.

Skew in Core Cellular Pathways

While taxonomically specialized domains are likely to be skewed by taxonomic differences, core pathways found in many or all organisms paint a different picture. We used GO term mapping and text mining to group domains into major functions and to look for consistent skews across several domains. Several core functions, including DNA-associated proteins (DNA polymerase, gyrase, topoisomerase), ribosomal subunits shared by all three kingdoms, marker proteins such as recA and dnaJ, and TCA cycle enzymes all tend to be GOS enriched. This suggests that oceanic genomes may be more compact than sequenced genomes and so have a higher proportion of core pathways.

Characteristics and Kingdom Distribution of Known Protein Domains

A decade ago, databases were highly biased towards proteins of known function. Today, whole-genome sequencing and structural genomics efforts have presumably reduced the biases that are a result of targeted protein sequencing. We used the Pfam database to compare the characteristics and kingdom distribution of known protein domains in the GOS dataset to that of proteins in the publicly available datasets (NCBI-nr, PG, TGI-EST, and ENS). Such an effort can be used to assess biases in these datasets, help direct future sampling efforts (of underrepresented organisms, proteins, and protein families), make more informed generalizations about the protein universe, and provide important context for determination of protein evolutionary relationships (as biased sampling could indicate expected but missing sequences).

For this analysis we used the nonredundant datasets (at 100% identity) discussed in Figure 1. We refer to the set of 3,167,979 nonredundant sequences from NCBI-nr, PG, TGI-EST, and ENS as the public-100 set and the similarly filtered set of 5,654,638 sequences from the GOS data as the GOS-100 set.

About 70% of public-100 sequences and 56% of GOS-100 sequences significantly match at least one Pfam model. The most obvious difference between the sets is that the vast majority of GOS sequences are bacterial, and this has to be taken into account when comparing the numbers. Since different Pfam families appear with different frequencies in the kingdoms, we considered the results for each kingdom separately (Figure 5). We then evaluated all kingdoms together, with results normalized by relative abundance of members from the different kingdoms. A domain found commonly and exclusively in eukaryotes and abundant in public-100 would be expected to be found rarely in GOS-100. We used a conservative BLAST-based kingdom assignment method to assign kingdoms to the GOS sequences (see Materials and Methods).

Figure 5. Coverage of GOS-100 and Public-100 by Pfam and Relative Sizes of Pfam Families by Kingdom, Sorted by Size

The public-100 sequences are annotated using the NCBI taxonomy and the source public database annotations. GOS-100 sequences were given kingdom weights as described in Materials and Methods. For each kingdom, the fraction of sequences with ≥1 Pfam match is shown, while the ten largest Pfam families are shown as discrete sections whose size is proportional to the number of matches between that family and GOS-100 or public-100 sequences. Pfam families that are smaller than the ten largest are binned together in each column's bottom section. Pfam covers public-100 better than GOS-100 in all kingdoms, with the greatest difference occurring in the viral kingdom, where 89.1% of public-100 viral sequences match a Pfam domain, while only 27.5% of GOS-100 viral sequences have a match.
doi:10.1371/journal.pbio.0050016.g005

In each kingdom, sequences in GOS-100 are less likely to match a Pfam family than those in public-100 (Figure 5). For the cellular kingdoms, these differences are comparatively modest. While diversity of the GOS data accounts for some of this difference, it might also be explained in part by the fragmentary nature of the GOS sequences. Viruses tell a dramatic and different story. Of public-100 viral sequences, 89.1% match a Pfam domain, while only 27.5% of GOS-100 viral sequences have a match. This tremendous difference appears to be due to heavy enrichment of the public data for minor variants of a few protein families, indicated by the sizes of the ten most populous Pfams in each kingdom (Figure 5). Sequences from three Pfam families (envelope glycoprotein GP120, reverse transcriptase, and retroviral aspartyl protease) account for a third of all public viral sequences. By contrast, the most populous three families in the GOS-100 data (bacteriophage T4-like capsid assembly protein [Gp20], major capsid protein Gp23, and phage tail sheath protein) account for only about 7% of public-100 sequences. Such a difference may be due to intentional oversampling of proteins that come from disease-causing organisms in the public dataset.

While the total proportion of proteins with a Pfam hit is fairly similar between public-100 (70%) and GOS-100 (56%) datasets, there are considerable differences with regard to the distributions of protein families within these two datasets. The most highly represented Pfam families in GOS-100 compared to public-100 are shown in Table 7. Notably, we found that while many known viral families are absent in GOS-100, viral protein families dominate the list of the families more highly represented in GOS-100; this is presumably because of biases in the collection of previously known viral sequences. Surprisingly few bacterial families were among the most represented in GOS-100 compared with public-100. By contrast, we also observed that those families found more rarely in GOS-100 than public-100 were frequently bacterial (Table 7). This appears to be a result of the large number of key bacterial and viral pathogen proteins in public-100 that are comparatively less abundant in the oceanic samples and/or less intensively sampled.
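The Table 7 legend describes the representation ratio as the number of observed GOS-100 matches divided by the number expected from public-100 after correcting for kingdom proportions, expressed as a percentage. The sketch below illustrates one plausible form of that correction: scaling each kingdom's public-100 hits by the ratio of kingdom sizes in the two datasets. The actual calculation is in Materials and Methods and may differ; the function names and every number in the example are hypothetical toy values, not data from the paper.

```python
def expected_gos_hits(public_hits, public_sizes, gos_sizes):
    """Expected GOS-100 hits for one Pfam family, assuming hits scale with
    the per-kingdom size ratio between GOS-100 and public-100.
    All arguments are dicts keyed by kingdom name."""
    return sum(
        hits * gos_sizes[kingdom] / public_sizes[kingdom]
        for kingdom, hits in public_hits.items()
        if hits
    )

def representation_ratio(observed, expected):
    """Observed / expected, expressed as a percentage."""
    return 100.0 * observed / expected

# Hypothetical toy values: a family seen only in viruses in the public set.
public_hits = {"Bacteria": 0, "Eukaryota": 0, "Viruses": 50}
public_sizes = {"Bacteria": 1_500_000, "Eukaryota": 1_000_000, "Viruses": 100_000}
gos_sizes = {"Bacteria": 5_000_000, "Eukaryota": 150_000, "Viruses": 20_000}

expected = expected_gos_hits(public_hits, public_sizes, gos_sizes)
print(expected, representation_ratio(500, expected))
```

Under this scheme a family that is strictly viral in public-100 gets a small expected count in a predominantly bacterial dataset, so even a few hundred observed matches yield a ratio in the thousands of percent.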

GOS-100 Data Suggest That a Number of "Kingdom-Specific" Pfams Actually Are Represented in Multiple Kingdoms

Of the 7,868 Pfam models in Pfam 17.0, 4,050 match proteins from only a single kingdom in public-100. The additional sequences from GOS-100 reveal that some of these families actually have representatives in multiple kingdoms. Table 8 shows 12 families that have a Pfam match to at least one GOS-100 protein with an E-value ≤ 1 × 10^−10, and which we confidently assigned to a kingdom different from that of all the public-100 matches. Because our criteria for a "confident" kingdom assignment are conservative, there are only one or a few confident assignments for each Pfam domain to a "new" kingdom. Our "confident" criteria are especially difficult to meet in the case of kingdom-crossing, due to the votes contributed by the crossing protein (see Materials and Methods). Thus, many scaffolds have no confident kingdom assignment. Our examination of each of the scaffolds responsible for a determination of kingdom-crossing confirms that each one had both a highly significant match to the Pfam model in question and an overwhelming number of votes for the unexpected kingdom. These scaffold assemblies were also manually inspected. No clear anomalies were observed. In most instances, the assemblies in question were composed of a single unitig, and as such are high-confidence assemblies. Mate pair coverage and consistent depth of coverage provide further support for the correctness of those assemblies that are built from multiple unitigs. Examples of kingdom-crossing families include indoleamine 2,3-dioxygenase (IDO), MAM domain, and MYND finger [15], which have previously only been seen in eukaryotes, but we find them also to be present in bacteria. These Pfams now cross kingdoms, due either to their being more ancient than previously realized or to lateral transfer.

We explored the IDO family further. This family has representatives in vertebrates, invertebrates, and multiple fungal lineages [15,61] in public-100. Members of the IDO family are heme-binding, and mammalian IDOs catalyze the rate-limiting step in the catabolic breakdown of tryptophan [62], while family members in mollusks have a myoglobin function [63]. In mammals, IDO also appears to have a role in the immune system [62,64–66]. The IDO Pfam has matches to 66 proteins in public-100, all of which are eukaryotic. However, it also has matches to ten GOS-100 sequences that we confidently labeled as bacterial proteins and matches to 206 GOS-100 sequences for which a confident kingdom assignment could not be made (many of these are likely bacterial sequences due to the GOS sampling bias). To reconstruct a phylogeny of the IDO family, we searched a recent version of NCBI-nr (March 5, 2006) for IDO proteins that were not included in the public-100 dataset. The search identified two bacterial proteins from the whole genomes of the marine bacteria Erythrobacter litoralis and Nitrosococcus oceani, and 24 eukaryotic proteins (see Materials and Methods). The phylogeny shown in Figure 6 shows 54% bootstrap support for a separation of the clade containing exclusively public-100 and NCBI-nr 2006 eukaryotic sequences from a clade with the GOS-100 sequences as well as the two NCBI-nr E. litoralis and N. oceani sequences. We confirmed this feature of the tree topology with multiple other phylogeny reconstruction methods. Curiously, there is considerable intermixing of bacterial and eukaryotic sequences in the clade of GOS-100 sequences and the two NCBI-nr bacteria. A manual inspection of the scaffolds that contain the ten GOS-100 sequences (containing the IDO domain) that we confidently labeled as bacterial overwhelmingly supports the kingdom assignment. However, a manual inspection of the scaffolds that contain the ten GOS-100 sequences (containing the IDO domain) that we confidently labeled as eukaryotes presents a less convincing picture. These scaffolds are short, with most of them containing only two voting ORFs. Since the NCBI-nr version used in the public-100 set has IDO from eukaryotes only, the ORF with the IDO domain itself would cast four votes for eukaryotes. Thus, these GOS-100 eukaryotic labelings are not nearly as confident as the ones labeled bacterial.
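The argument about short scaffolds can be made concrete with a toy vote tally. The sketch below assumes that each ORF on a scaffold casts up to four kingdom votes (as the text implies for the IDO-containing ORF) and that a scaffold is labeled only when enough votes agree. The thresholds `min_votes` and `min_fraction` are hypothetical; the real "confident" criteria are defined in Materials and Methods and are not reproduced here.

```python
from collections import Counter

def assign_kingdom(orf_votes, min_votes=8, min_fraction=0.75):
    """Tally per-ORF kingdom votes on one scaffold.
    orf_votes: list of lists, one inner list of kingdom labels per ORF.
    Returns the winning kingdom, or None when the call is not confident.
    Both thresholds are illustrative assumptions."""
    tally = Counter(vote for orf in orf_votes for vote in orf)
    total = sum(tally.values())
    if total < min_votes:
        return None
    kingdom, count = tally.most_common(1)[0]
    return kingdom if count / total >= min_fraction else None

# Short scaffold: the IDO ORF (all public IDO hits are eukaryotic) plus one other ORF.
ido_orf = ["Eukaryota"] * 4
short_scaffold = [ido_orf, ["Eukaryota", "Bacteria", "Bacteria", "Eukaryota"]]

# Long scaffold: the same IDO ORF surrounded by ten clearly bacterial ORFs.
long_scaffold = [ido_orf] + [["Bacteria"] * 4 for _ in range(10)]

print(assign_kingdom(short_scaffold))      # label driven largely by the IDO ORF itself
print(assign_kingdom(short_scaffold[1:]))  # same scaffold without the IDO ORF's votes
print(assign_kingdom(long_scaffold))
```

On the short scaffold the eukaryotic label depends on the four votes of the IDO ORF itself, whereas on the long scaffold the bacterial neighbors outvote it. This is the asymmetry that makes the bacterial labelings more convincing than the eukaryotic ones.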

Table 7. Top Pfam Families Represented More Highly or Less Highly in GOS-100 than in Public-100

[Rotated table. Columns in the original: Category | Accession Number | Description | Public-100 Hits (Archaea, Bacteria, Eukaryota, Viruses, Unknown, Total) | GOS-100 Hits (Expected Based on Public-100, Observed, Observed/Expected) | Chi Square. The numeric columns could not be recovered reliably from the extracted text; the families in each half of the table are listed below.]

Families represented more highly
PF07068 | Major capsid protein Gp23
PF03420 | Prohead core protein protease
PF06841 | T4-like virus tail tube protein gp19
PF04451 | Iridovirus major capsid protein
PF07230 | Bacteriophage T4-like capsid assembly protein (Gp20)
PF01818 | Bacteriophage translational regulator
PF01231 | Indoleamine 2,3-dioxygenase
PF03322 | Gamma-butyrobetaine hydroxylase
PF04777 | Erv1/Alr family
PF05367 | Phage endonuclease I
PF04832 | SOUL heme-binding protein
PF03159 | XRN 5'-3' exonuclease N-terminus
PF06213 | Cobalamin biosynthesis protein CobT
PF01786 | Alternative oxidase
PF00274 | Fructose-bisphosphate aldolase class-I
PF03291 | mRNA capping enzyme
PF04724 | Glycosyltransferase family 17
PF00940 | DNA-dependent RNA polymerase
PF03030 | Inorganic H+ pyrophosphatase
PF02747 | Proliferating cell nuclear antigen, C-terminal domain

Families represented less highly
PF01617 | Surface antigen
PF00516 | Envelope glycoprotein GP120
PF00077 | Retroviral aspartyl protease
PF04650 | YSIRK type signal peptide
PF03507 | CagA exotoxin
PF03482 | sic protein
PF01308 | Chlamydia major outer membrane protein
PF02707 | Major outer sheath protein N-terminal region
PF00934 | PE family
PF00820 | Borrelia lipoprotein
PF02722 | Major outer sheath protein C-terminal region
PF00921 | Borrelia lipoprotein
PF02876 | Staphylococcal/streptococcal toxin, beta-grasp domain
PF01856 | Outer membrane protein
PF01123 | Staphylococcal/streptococcal toxin, OB-fold domain
PF02474 | Nodulation protein A (NodA)
PF06458 | MucBP domain
PF03323 | Bacillus/clostridium GerA spore germination protein
PF07548 | Chlamydia polymorphic membrane protein middle domain
PF02255 | PTS system, lactose/cellobiose-specific IIA subunit

Green indicates exclusively bacterial in public-100; blue, exclusively eukaryotic in public-100; red, exclusively viral in public-100. Expected number of matches in GOS-100 to each Pfam model was calculated as described in Materials and Methods. This calculation is based on the number of matches to each Pfam in public-100 and corrected for the different kingdom proportions in GOS-100 and public-100. For each Pfam model, the percentage representation ratio is the number of observed GOS-100 matches to that Pfam divided by the number expected, and expressed as a percentage. The top half of the table shows the top 20 most highly represented proteins that have representation ratios > 1,000% and have chi-squared p-value < 1 × 10^−303. Numbers of observed matches to these Pfams in public-100 are also indicated according to kingdom. A number of Pfams highly represented in GOS-100 appear to occur exclusively or almost exclusively in a particular kingdom in public-100. For example, Pfams that are characteristically viral in public-100 (colored in red) dominate the top of this list, and an intriguing protein family (IDO) with a known immune function in higher eukaryotes (blue) also appears. The bottom half of the table shows the 20 Pfam domains not observed in GOS-100 with the highest expectation based on public-100 (or equivalently, with the most significant chi-squared p-values). Thus, a large number of key bacterial and viral pathogen proteins in public-100 are not observed in the oceanic samples.
doi:10.1371/journal.pbio.0050016.t007

Structural Genomics Implications

Knowledge about global protein distributions can be used to inform priorities in related fields such as structural genomics. Structural genomics is an international effort to determine the 3-D shapes of all important biological macromolecules, with a primary focus on proteins [67–72]. Previous studies have shown that an efficient strategy for covering the protein structure universe is to choose protein targets for experimental structure characterization from among the largest families with unknown structure [73,74]. If the structure of one family member is determined, it may be used to accurately infer the fold of other family members, even if the sequence similarity between family members is too low to enable accurate structural modeling [75]. Therefore, large families are a focus of the production phase of the Protein Structure Initiative (PSI), the National Institutes of Health–funded structural genomics project that commenced in October 2005 [76].

In March 2005, 2,729 (36%) of 7,677 Pfam families had at least one member of known structure; these families could be used to infer folds for approximately 51% of all pre-GOS prokaryotic proteins (covering 44% of residues) [74]. The Pfam5000 strategy is to solve one structure from each of the largest remaining families, until a total of 5,000 families have at least one member with known structure [73]. As this strategy is similar to that being used at PSI centers to choose targets, projections based on the Pfam5000 should reflect PSI results. Completion of the Pfam5000, a tractable goal within the production phase of PSI, would enable accurate fold assignment for approximately 65% of all pre-GOS prokaryotic proteins. In the GOS-100 dataset, we observed that 46% of the proteins might currently be assigned a fold based on Pfam families of known structure (see Materials and Methods). Completion of the Pfam5000 would increase this coverage to 55%.
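The Pfam5000 strategy described above is a greedy selection by family size. The sketch below shows that selection and a simple fold-coverage estimate on toy data. It assumes each protein belongs to at most one family (real proteins can carry several domains), and the family names, sizes, and the small target of three families are hypothetical stand-ins for the real 5,000-family goal.

```python
def choose_targets(family_sizes, solved, target_total=5000):
    """Pick the largest families without a known structure until
    `target_total` families (already solved plus newly chosen) are covered."""
    unsolved = [fam for fam in family_sizes if fam not in solved]
    # Largest first; ties broken by name so the result is deterministic.
    unsolved.sort(key=lambda fam: (-family_sizes[fam], fam))
    n_needed = max(0, target_total - len(solved))
    return unsolved[:n_needed]

def fold_coverage(family_sizes, covered, total_proteins):
    """Fraction of proteins that fall in a family with a known structure."""
    return sum(family_sizes[fam] for fam in covered) / total_proteins

# Hypothetical toy data: five families, one already solved, 2,000 proteins overall
# (the total includes proteins with no Pfam match).
sizes = {"PF_A": 500, "PF_B": 300, "PF_C": 120, "PF_D": 60, "PF_E": 20}
solved = {"PF_B"}
total_proteins = 2000

targets = choose_targets(sizes, solved, target_total=3)
before = fold_coverage(sizes, solved, total_proteins)
after = fold_coverage(sizes, solved | set(targets), total_proteins)
print(targets, before, after)
```

Because family sizes are highly skewed, the first few targets chosen this way add far more coverage than later ones, which is why a fixed budget of structures is best spent on the largest unsolved families.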

Figure 6. Maximum Likelihood Phylogeny for the IDO Family

The phylogeny is based on an alignment of 93 sequences from GOS-100 and 51 sequences from public-100 and NCBI-nr from March 2006 that matched the IDO Pfam model and satisfied multiple alignment quality criteria. The IDO family is eukaryotic specific in public-100. The phylogeny shows a clade with all the GOS sequences, predicted to be bacterial (navy blue), eukaryotic (yellow), or unknown (gray), along with two sequences from the marine bacteria Erythrobacter litoralis and Nitrosococcus oceani (lime green) submitted to the sequence database after February 2005, and a public-only clade of only eukaryotic sequences (orange).
doi:10.1371/journal.pbio.0050016.g006

Table 8. New Multi-Kingdom Pfams

[Rotated table. Columns in the original: Kingdom Specificity in Public-100 Database | Pfam Accession Number | Model Description | Matches in Public-100/Matches in GOS-100 (Archaea, Bacteria, Eukaryota, Viruses, Unknown) | Pfam TC | Best Score for Match in Novel Kingdom | Best E-Value for Match in Novel Kingdom. The numeric columns could not be recovered reliably from the extracted text; the 12 families are listed below by their kingdom specificity in public-100.]

Eukaryota only
PF01231 | Indoleamine 2,3-dioxygenase
PF00629 | MAM domain
PF01753 | MYND finger
PF02089 | Palmitoyl protein thioesterase
PF05019 | Coenzyme Q (ubiquinone) biosynthesis protein Coq4
PF02919 | Eukaryotic DNA topoisomerase I, DNA binding fragment

Bacteria only
PF06945 | Protein of unknown function (DUF1289)
PF04234 | Copper resistance protein CopC

Archaea only
PF04967 | HTH DNA-binding domain
PF01911 | Ribosomal LX protein
PF01889 | Membrane protein of unknown function (DUF63)
PF06626 | Protein of unknown function (DUF1152)

Some Pfam domains observed exclusively in one kingdom in public-100 are found in a different kingdom in GOS-100. The number of sequences in the public dataset that match each Pfam model is listed above the number of sequences in GOS with a confident kingdom assignment and a highly significant match to the model. The TC bit score is provided for each model, together with the bit score and E-value of the best match to the model in an unexpected kingdom. For this analysis, Pfam matches are filtered with an E-value cutoff of 1 × 10^−10. In every case, the bit score is at least five bits greater than the TC for the model, because of the larger size of the GOS dataset relative to those used for creating the TC thresholds. In addition to passing the "confident" criteria (see Materials and Methods), the kingdom assignments are all confirmed by visual inspection of the BLAST kingdom vote distributions for the respective scaffolds.
doi:10.1371/journal.pbio.0050016.t008


The GOS sequences will affect Pfam in two ways: some will be classified in existing protein families, thus increasing the size of these families; others may eventually be classified into new GOS-specific families. Both of these will alter the relative sizes of different families, and thus their prioritization for structural genomics studies. We calculated the sizes for all Pfam families based on the number of occurrences of each family in the public-100 dataset. Proteins in GOS-100 were then added and the family sizes were recalculated. A total of 190 families that are not in the Pfam5000 based on public-100 are moved into the Pfam5000 after addition of the GOS data. The 30 largest such families are shown in Table 9. As 20 of the 30 families are annotated as domains of unknown function in Pfam, structural characterization might be helpful in identifying their cellular or molecular functions. Reshuffling the Pfam5000 to prioritize these 190 families would improve structural coverage of GOS sequences after completion of the Pfam5000 by almost 1% relative to the original Pfam5000 (from 55.4% to 56.1%), with only a small decrease in coverage of public-100 sequences (from 67.7% to 67.5%).
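The reprioritization described above amounts to ranking families twice, once by public-100 counts and once by pooled counts, and comparing the two top-N sets. A minimal sketch follows; it ignores families that already have a known structure (which the real Pfam5000 always includes), and the family names and counts are hypothetical.

```python
def top_families(sizes, n):
    """Names of the n largest families (ties broken by name)."""
    ranked = sorted(sizes, key=lambda fam: (-sizes[fam], fam))
    return set(ranked[:n])

def reshuffle(public_sizes, gos_sizes, n):
    """Return (entered, dropped): families that move into or out of the
    top-n set once GOS counts are pooled with public counts."""
    families = set(public_sizes) | set(gos_sizes)
    pooled = {fam: public_sizes.get(fam, 0) + gos_sizes.get(fam, 0)
              for fam in families}
    before = top_families(public_sizes, n)
    after = top_families(pooled, n)
    return after - before, before - after

# Hypothetical toy counts: DUF_X is small in public data but abundant in GOS.
public_sizes = {"FamA": 900, "FamB": 500, "FamC": 100, "DUF_X": 50}
gos_sizes = {"FamA": 100, "FamC": 50, "DUF_X": 800}

entered, dropped = reshuffle(public_sizes, gos_sizes, n=2)
print(entered, dropped)
```

In this toy case a poorly characterized family that is rare in public data but common in the ocean samples displaces a mid-sized public family, mirroring the observation that many of the families entering the Pfam5000 are domains of unknown function.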

The Pfam5000 would be further reprioritized by the classification of clusters of GOS sequences into Pfam. Assuming each cluster of pooled GOS-100 and public-100 sequences without a current Pfam match would be classified as a single Pfam family, 885 such families would replace existing families in the Pfam5000. These 885 clusters contain a total of 383,019 proteins in GOS-100 and public-100. The reprioritized Pfam5000 would also retain 1,183 families of unknown structure from the current Pfam5000; these families comprise a total of 1,040,330 proteins in GOS-100 and public-100.

Known Protein Families and Increased Diversity Due to GOS Data

Several protein families serve as examples to further highlight the diversity added by the GOS dataset. In this paper, we examined UV irradiation DNA damage repair enzymes, phosphatases, proteases, and the metabolic enzymes glutamine synthetase and RuBisCO (Table 10). The RecA family (unpublished data) and the kinase family [77] have also been explored in the context of the GOS data. There are more than 5,000 RecA and RecA-like sequences in the GOS dataset (Table 10). An analysis of the RecA phylogeny including the GOS data reveals several completely new RecA subfamilies. A detailed study of kinases in the GOS dataset demonstrated the power of additional sequence diversity in defining and exploring protein families [77]. The discovery of 16,248 GOS protein kinase–like enzymes enabled the definition and analysis of 20 distinct kinase-like families. The diverse sequences allowed the definition of key residues for each family, revealing novel core motifs within the entire superfamily, and predicted structural adaptations in individual families. These data enabled the fusion of choline and aminoglycoside kinases into a single family, whose sequence diversity is now seen to be at least as great as the eukaryotic protein kinases themselves.

Proteins Involved in the Repair of UV-Induced DNA Damage

Much of the attention in studies of the microbes in the world’s oceans has justifiably focused on phototrophy, such as that carried out by the proteorhodopsin proteins. Previously, in the Sargasso Sea study [10] it was shown that shotgun sequencing reveals a much greater diversity of proteorhodopsin-like proteins than was previously known from cloning and PCR studies. However, along with the potential benefits of phototrophy come many risks, such as the damage caused to cells by exposure to solar irradiation, especially the UV wavelengths. Organisms deal with the potential damage from UV irradiation in several ways, including protection (e.g., UV absorption), tolerance, and repair [78]. Our examination of the protein family clusters reveals that the GOS data provides an order of magnitude increase in the diversity (in both numbers and types) of homologs of proteins known to be involved in pathways specifically for repairing UV damage.

One aspect of the diversity of UV repair genes is seen in the overrepresentation of photolyase homologs in the GOS data (see Table 10). Photolyases are enzymes that chemically reverse the UV-generated inappropriate covalent bonds in cyclobutane pyrimidine dimers and 6–4 photoproducts [79]. The massive number of homologs of these proteins in the GOS data (11,569 GOS proteins in four clusters; see Table 10) is likely a reflection of their presence in diverse species and the existence of novel functions in this family. New repair functions could include repair of other forms of UV dimers (e.g., involving altered bases), use of novel wavelengths of light to provide the energy for repair, repair of RNA, or repair in different sequence contexts. In addition, some of these proteins may be involved in regulating circadian rhythms, as seen for photolyase homologs in various species. Our findings are consistent with the recent results of a comparative metagenomic survey of microbes from different depths that found an overabundance of photolyase-like proteins at the surface [51].

A good deal was known about the functions and diversity of photolyases prior to this project. However, much less is known about other UV damage–specific repair enzymes, and examination of the GOS data reveals a remarkable diversity of each of these. For example, prior to this project, there were only some 25 homologs of UV dimer endonucleases (UVDEs) available [80], and most of these were from Bacillus species. There are 420 homologs of UVDE (cluster 6239) in the GOS data representing many new subfamilies (Figure 7A and Materials and Methods). A similar pattern is seen for spore lyases (which repair a UV lesion specific to spores [81]) and the pyrimidine dimer endonuclease (DenV, which was originally identified in T4 phage [82]). We believe this will also be true for UV dimer glycosylases [83], but predictions of function for homologs of these genes are difficult since they are in a large superfamily of glycosylases.

Our analysis of the kingdom classification assignments suggests that the diversity of UV-specific repair pathways is seen for all types of organisms in the GOS samples. This apparently extends even to the viral world (e.g., 51 of the UVDE homologs are assigned putatively to viruses), suggesting that UV damage repair may be a critical function that phages provide for themselves and their hosts in ocean surface environments. Based on the sheer numbers of genes, their sequence diversity, and the diversity of types of organisms in which they are apparently found, we conclude that many novel UV damage–repair processes remain to be discovered in organisms from the ocean surface water.


Evidence of Reversible Phosphorylation in the Oceans

Reversible phosphorylation of proteins represents a major mechanism for cellular processes, including signal transduction, development, and cell division [84]. The activities of protein kinases and phosphatases serve as antagonistic regulators of the cellular response. Protein phosphatases are divided into three major groups based on substrate specificity [85]. The Mg2+- or Mn2+-dependent phosphoserine/phosphothreonine protein phosphatase family, exemplified by the human protein phosphatase 2C (PP2C), represents the smallest group in number. An understanding of their physiological roles has only recently begun to emerge. In eukaryotes, one of the major roles of PP2C activity is to reverse stress-induced kinase cascades [86–89].

Table 9. The 30 Largest Structural Genomics Target Families Added to the Pfam5000 Based on Inclusion of GOS Sequences

| Accession Number | Description | Family Size after GOS | Family Size before GOS |
|---|---|---|---|
| PF06213.2 | Cobalamin biosynthesis protein CobT | 2,188 | 33 |
| PF04244.3 | Deoxyribodipyrimidine photolyase-related protein | 1,628 | 51 |
| PF07021.1 | Methionine biosynthesis protein MetW | 1,305 | 50 |
| PF03420.3 | Prohead core protein protease | 1,234 | 11 |
| PF06347.2 | Protein of unknown function (DUF1058) | 1,114 | 40 |
| PF06439.1 | Domain of unknown function (DUF1080) | 1,021 | 48 |
| PF06253.1 | Trimethylamine methyltransferase (MTTB) | 942 | 38 |
| PF06242.1 | Protein of unknown function (DUF1013) | 915 | 36 |
| PF06841.2 | T4-like virus tail tube protein gp19 | 808 | 13 |
| PF05992.2 | SbmA/BacA-like family | 746 | 26 |
| PF04018.3 | Domain of unknown function (DUF368) | 720 | 54 |
| PF06230.1 | Protein of unknown function (DUF1009) | 703 | 38 |
| PF07583.1 | Protein of unknown function (DUF1549) | 703 | 58 |
| PF07864.1 | Protein of unknown function (DUF1651) | 539 | 20 |
| PF06539.1 | Protein of unknown function (DUF1112) | 529 | 38 |
| PF06684.1 | Protein of unknown function (DUF1185) | 519 | 32 |
| PF07586.1 | Protein of unknown function (DUF1552) | 491 | 21 |
| PF06844.1 | Protein of unknown function (DUF1244) | 470 | 32 |
| PF06938.1 | Protein of unknown function (DUF1285) | 451 | 27 |
| PF07075.1 | Protein of unknown function (DUF1343) | 441 | 49 |
| PF07587.1 | Protein of unknown function (DUF1553) | 439 | 58 |
| PF06041.1 | Bacterial protein of unknown function (DUF924) | 416 | 59 |
| PF03209.5 | PUCC protein | 415 | 48 |
| PF01996.6 | Protein of unknown function (DUF129) | 414 | 53 |
| PF06146.1 | Phosphate-starvation-inducible E | 393 | 44 |
| PF07627.1 | Protein of unknown function (DUF1588) | 372 | 31 |
| PF05610.1 | Protein of unknown function (DUF779) | 356 | 30 |
| PF06245.1 | Protein of unknown function (DUF1015) | 348 | 47 |
| PF06175.1 | tRNA-(MS[2]IO[6]A)-hydroxylase (MiaE) | 342 | 46 |
| PF01969.7 | Protein of unknown function (DUF111) | 337 | 60 |

The 30 largest families after inclusion of GOS data that were not among the 5000 largest families before inclusion of GOS data are shown here. Family size was calculated as the number of matches in public-100 (before GOS) and in the combined GOS-100 and public-100 datasets (after GOS).
doi:10.1371/journal.pbio.0050016.t009

Table 10. Clustering of Sequences in Families That Are Explored in This and Companion Papers

| Protein Family | Cluster ID | Nonredundant Sequences | Total Sequences | NCBI-nr | PG | TGI-EST | ENS | GOS |
|---|---|---|---|---|---|---|---|---|
| RecA | 1146 | 2,897 | 7,423 | 1,683 | 235 | 288 | 104 | 5,113 |
| UVDE | 6239 | 417 | 484 | 38 | 25 | 1 | 0 | 420 |
| Photolyase | 411 | 1,387 | 2,261 | 19 | 9 | 0 | 0 | 2,233 |
| | 1285 | 5,907 | 9,796 | 302 | 145 | 182 | 15 | 9,152 |
| | 3077 | 319 | 482 | 149 | 2 | 176 | 42 | 113 |
| | 3454 | 67 | 73 | 1 | 1 | 0 | 0 | 71 |
| Spore lyase | 5283 | 237 | 331 | 39 | 25 | 0 | 0 | 267 |
| PP2C phosphatase | 78 | 2,917 | 3,933 | 762 | 112 | 2,295 | 199 | 565 |
| | 3673 | 62 | 106 | 39 | 0 | 22 | 45 | 0 |
| | 9118 | 68 | 73 | 0 | 0 | 72 | 1 | 0 |
| | 11012181 | 36 | 69 | 34 | 0 | 15 | 20 | 0 |
| | 11021747 | 19 | 72 | 13 | 11 | 0 | 0 | 48 |
| | 11066319 | 19 | 38 | 14 | 0 | 15 | 9 | 0 |
| Glutamine synthetase (type I, II, III) | 3709 | 4,284 | 11,322 | 1,504 | 320 | 489 | 48 | 8,961 |
| | 3072 | 159 | 192 | 46 | 11 | 6 | 0 | 129 |
| | 4547 | 30 | 32 | 1 | 0 | 1 | 0 | 30 |
| RuBisCO (large subunit) | 3734 | 1,979 | 14,149 | 13,532 | 41 | 148 | 0 | 428 |

doi:10.1371/journal.pbio.0050016.t010
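As a quick consistency check on Table 10, the per-dataset counts of each cluster should add up to its total, and the GOS column summed over a family's clusters should reproduce the GOS counts quoted in the text (11,569 photolyase homologs, 613 PP2C-like sequences, 9,120 GS and GS-like sequences). The sketch below transcribes a subset of Table 10 rows; the variable and function names are illustrative choices, and the short family labels are abbreviations of the table's row names.

```python
# Rows transcribed from Table 10:
# (family, cluster ID, total, NCBI-nr, PG, TGI-EST, ENS, GOS)
TABLE10 = [
    ("Photolyase", 411, 2261, 19, 9, 0, 0, 2233),
    ("Photolyase", 1285, 9796, 302, 145, 182, 15, 9152),
    ("Photolyase", 3077, 482, 149, 2, 176, 42, 113),
    ("Photolyase", 3454, 73, 1, 1, 0, 0, 71),
    ("PP2C", 78, 3933, 762, 112, 2295, 199, 565),
    ("PP2C", 11021747, 72, 13, 11, 0, 0, 48),
    ("GS", 3709, 11322, 1504, 320, 489, 48, 8961),
    ("GS", 3072, 192, 46, 11, 6, 0, 129),
    ("GS", 4547, 32, 1, 0, 1, 0, 30),
]


def row_is_consistent(row):
    """True if the five per-dataset counts add up to the total column."""
    _family, _cluster, total, *per_dataset = row
    return sum(per_dataset) == total


def gos_total(family):
    """Sum the GOS column over all listed clusters of one family."""
    return sum(row[-1] for row in TABLE10 if row[0] == family)


print(all(row_is_consistent(row) for row in TABLE10))
for family in ("Photolyase", "PP2C", "GS"):
    print(family, gos_total(family))
```

The PP2C clusters omitted from the subset have no GOS members, so leaving them out does not change the PP2C sum.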

We identified 613 PP2C-like sequences in the GOS dataset, and they are grouped into two clusters (Table 10). These sequences contain at least seven motifs known to be important for phosphatase structure and function [90,91]. Invariant residues involved in metal binding (aspartate in motifs I, II, VIII) and phosphate ion binding (arginine in motif I) are highly conserved among the GOS sequences. Using the catalytic domain portion of these sequences we constructed a phylogeny showing that despite the overall conserved structure of the PP2C family of proteins, the known bacterial PP2C-like sequences group together with the GOS bacterial PP2C-like sequences (Figure 7B, Materials and Methods). Furthermore, the eukaryotic PP2Cs display a much greater degree of sequence divergence compared to the bacterial PP2C sequences.

Figure 7. Phylogenies Illustrating the Diversity Added by GOS Data to Known Families That We Examined

Kingdom assignments of the sequences are indicated by color: yellow, GOS-eukaryotic; navy blue, GOS-bacterial/archaeal; aqua, GOS-viral; orange, NCBI-nr–eukaryotic; lime green, NCBI-nr–bacterial/archaeal; pink, NCBI-nr–viral; gray, unclassified.
(A) Phylogeny of UVDE homologs.
(B) Phylogeny of PP2C-like sequences.
(C) Phylogeny of type II GS gene family. In addition to the large amount of diversity of bacterial type II GS in the GOS data, a large group of GOS viral sequences and eukaryotic GS co-occur at the top of the tree with the eukaryotic virus Acanthamoeba polyphaga mimivirus (shown in pink). The red stars indicate the locations of eight type II GS sequences found in the type I–type II GS gene pairs. They are located in different branches of the phylogenetic tree. The rest of the type II GS sequences were filtered out by the 98% identity cutoff.
(D) Phylogeny of the homologs of RuBisCO large subunit. A large portion of the RuBisCO sequences from the GOS data forms new branches that are distinct from the previously known RuBisCO sequences in the NCBI-nr database.
doi:10.1371/journal.pbio.0050016.g007

We also examined the combined dataset of PP2C-like phosphatases further for potential differences in amino acid composition between the bacterial and eukaryotic groups. We observed a striking distinction between the eukaryotic and bacterial PP2C-like phosphatases in motif II, where a histidine residue (His62 in human PP2Cα) is conserved in more than 90% of eukaryotic sequences, but not observed in the bacterial group. The bacterial PP2C group contains a methionine (at the corresponding position) in the majority of the cases (70%). This histidine residue is involved in the formation of a beta hairpin in the crystal structure of human PP2C [91]. Furthermore, His62 is proposed to act as a general acid for PP2C catalysis [92]. Both amino acids lie in the proximity of the phosphate-binding domain, but at this time it is unclear how the difference at this position would contribute to the overall structure and function of the two PP2C groups. Nonetheless, the large number of diverse PP2C-like phosphatases in this dataset allowed us to identify a previously unrecognized key difference between bacterial and eukaryotic PP2Cs.
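A comparison of this kind amounts to tabulating residue frequencies at one column of a multiple sequence alignment, separately for each group. The sketch below shows the idea on invented three-column toy alignments; the sequences, the column index, and the function name are assumptions made for illustration and do not come from the study's alignment.

```python
from collections import Counter


def residue_frequencies(aligned_seqs, column):
    """Fraction of each residue at one alignment column, ignoring gaps."""
    residues = [seq[column] for seq in aligned_seqs if seq[column] != "-"]
    if not residues:
        return {}
    counts = Counter(residues)
    return {aa: n / len(residues) for aa, n in counts.items()}


# Invented toy alignments; column 1 plays the role of the motif II position.
eukaryotic = ["GHD", "AHD", "GHE", "GHD", "SHD",
              "GHD", "GHD", "GHD", "GHD", "GQD"]
bacterial = ["GMD", "GMD", "AMD", "GLD", "GMD",
             "GVD", "GMD", "GMD", "GLD", "GMD"]

euk = residue_frequencies(eukaryotic, column=1)
bac = residue_frequencies(bacterial, column=1)
print(euk.get("H", 0.0), bac.get("H", 0.0), bac.get("M", 0.0))
```

In the toy data histidine dominates the eukaryotic column and is absent from the bacterial one, where methionine is the most common residue, which is the qualitative pattern reported above.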

Bacterial genes that perform closely related functions can be organized in close proximity to each other and often in functional units. Linked Ser/Thr kinase-phosphatase genetic units have been described in several bacterial species, including Streptococcus pneumoniae, Bacillus subtilis, and Mycobacterium tuberculosis [93–96]. Two major neighboring clusters are found to be associated with the set of PP2C-like phosphatases in the GOS bacterial group. We observed that one of these clusters contained a protein serine/threonine kinase domain as its most common Pfam domain. An additional neighboring cluster found to be associated with the GOS set of bacterial PP2Cs was identified as a set of sequences containing a PASTA (penicillin-binding protein and serine/threonine kinase–associated) domain. This domain is unique to bacterial species, and is believed to play important roles in regulating cell wall biosynthesis [97].

Our identification of a conserved group of unique PP2C-like phosphatases in the GOS dataset significantly increases the number and diversity of this enzyme family. This analysis of the NCBI-nr, PG ORFs, TGI-EST ORFs, and ENS datasets along with the sequences obtained from the GOS dataset significantly increases the overall number of PP2C-like sequences from that estimated just a year ago [98]. The presence of genes encoding bacterial serine/threonine kinase domains located adjacent to PP2Cs in the GOS data supports the notion that the process of reversible phosphorylation on Ser/Thr residues controls important physiological processes in bacteria.

Proteases in GOS Data

Proteases are a group of enzymes that degrade other proteins and, as such, play important roles in all organisms [99]. On the basis of their catalytic mechanism, proteases are divided into six distinct catalytic types: aspartic, cysteine, metallo, serine, threonine, and glutamic proteases [99]. They differ from each other by the presence of specific amino acids in the active site and by their mode of action. The MEROPS database [100] is a comprehensive source of information for this large divergent group of sequences and provides a widely accepted classification of proteases into families, based on amino acid sequence comparison, and then into clans based on the similarity of their 3-D structures.

We identified 222,738 potential proteases in the GOS dataset based on similarity to sequences in MEROPS (see Materials and Methods). According to our clustering method, 95% of these sequences are grouped into 190 clusters, with each cluster on average containing more than 1,100 GOS sequences. These sequences were compared to proteases in NCBI-nr. There are groups of proteases in NCBI-nr that are highly redundant. For example, there are a large number of viral proteases from HIV-1 and hepatitis C viruses that dominate the NCBI-nr protease set. Thus, we computed a nonredundant set of NCBI-nr proteases and, for the sake of consistency, a nonredundant set of proteases from the GOS set using the same parameters. Both sets are dominated by cysteine, metallo, and serine proteases. The GOS dataset is dominated by proteases belonging to the bacterial kingdom. That is not surprising, given the filter sizes used to collect the samples. In NCBI-nr the proteases are more evenly distributed between the bacterial and the eukaryotic kingdoms.

Our comparison of the protease clan distribution of the bacterial sequences in the NCBI-nr and GOS sets reveals that the distribution of clans is very similar for metallo- and serine proteases. However, the distribution of clans in aspartic and cysteine proteases is different in the two datasets. Among aspartic proteases, the most visible difference is the increased ratio of proteases of the AC clan and the decreased ratio in the AD clan. Proteases in the former clan are involved in bacterial cell wall production, while those in the latter clan are involved in pilin maturation and toxin secretion [99]. Among cysteine proteases, the most apparent difference is the decrease in the CA clan and an increase in the number of proteases from the PB(C) clan. Bacterial members of the CA clan are mostly involved in degradation of bacterial cell wall components and in various aspects of biofilm formation [99]. It is possible that both activities are less important for marine bacteria present in surface water. Proteases from the PB(C) clan are involved in activation (including self-activation) of enzymes from the acetyltransferase family. In fungi this family is involved in penicillin synthesis, while their function in bacteria is unknown [99].

We were unable to detect any caspases (members of the CD clan) in the GOS data. This is consistent with the apoptotic cell death mechanism being present only in multicellular eukaryotes, which, based on the filter sizes, are expected to be very rare in the GOS dataset.

Metabolic Enzymes in the GOS Data

To gain insights into the diversity of metabolism of the organisms in the sea, we studied the abundance and diversity of glutamine synthetase (GS) and ribulose 1,5-bisphosphate carboxylase/oxygenase (RuBisCO), two key enzymes in nitrogen and carbon metabolism.

GS is the central player of nitrogen metabolism in all organisms on earth. It is one of the oldest enzymes in evolution [101]. It converts ammonia and glutamate into glutamine that can be utilized by cells. GS can be classified into three types based on sequence [101]. Type I has been found only in bacteria, and it forms a dodecameric structure [102,103]. Type II has been found mainly in eukaryotes, and in some bacteria. Type III GS is less well studied, but has been found in some anaerobic bacteria and cyanobacteria. There are 18 active site residues in both bacterial and eukaryotic GS that play important roles in binding substrates and catalyzing the enzymatic reactions [104].

We found 9,120 GS and GS-like sequences in the GOS data (Table 10). Using profile HMMs [41,105] constructed from known GS sequences of different types, we were able to classify 4,350 sequences as type I GS, 1,021 sequences as type II GS, and 469 sequences as type III GS (see Materials and Methods).
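One common way to turn per-type profile HMM scores into a classification is to assign each sequence to the type whose model scores highest, provided that score clears a cutoff. The sketch below illustrates that rule only; it is an assumption about how such a classification could be done, not the procedure from Materials and Methods, and the function name, cutoffs, and scores are all hypothetical. The last lines use the counts reported above.

```python
def classify_gs(scores, cutoffs):
    """Return the GS type with the best score at or above its cutoff, else None."""
    passing = {gs_type: score for gs_type, score in scores.items()
               if score >= cutoffs[gs_type]}
    if not passing:
        return None
    return max(passing, key=passing.get)


# Hypothetical per-model bit-score cutoffs and example scores.
CUTOFFS = {"I": 100.0, "II": 100.0, "III": 100.0}
print(classify_gs({"I": 350.2, "II": 120.5, "III": 15.0}, CUTOFFS))
print(classify_gs({"I": 40.0, "II": 35.0, "III": 20.0}, CUTOFFS))

# Counts from the text: typed sequences versus all GS and GS-like sequences.
typed = 4350 + 1021 + 469
untyped = 9120 - typed
print(typed, untyped)
```

By the reported counts, 5,840 of the 9,120 GS and GS-like sequences received a type, leaving 3,280 without one.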

The number of type II GS sequences found in the GOS data is surprisingly high, since previously type II GS were considered to be mainly eukaryotic and very few eukaryotic organisms were expected to be included in the GOS sequencing (Figure 7C and Materials and Methods). We used gene neighbor analysis to classify the origin of GS genes by the nature of other proteins found on the same scaffold. Using this approach, most of the neighboring genes of the type II GS in the GOS data are identified as bacterial genes. The neighboring genes of the type II GS include nitrogen regulatory protein PII, signal transduction histidine kinase, NH3-dependent NAD+ synthetase, A/G-specific adenine glycosylase, coenzyme PQQ synthesis protein c, pyridoxine biosynthesis enzyme, aerobic-type carbon monoxide dehydrogenase, etc. We were able to assign more than 90% of the type II GS sequences in the GOS data to bacterial scaffolds based on a BLAST-based kingdom assignment method (see Materials and Methods). Both neighboring genes and kingdom assignments suggest that most of the type II GS sequences in the GOS data come from bacterial organisms. In comparison, the same type II GS profile HMM detects only 12 putative type II GS sequences from the PG dataset of 222 prokaryotic genomes. Within these, there are only seven unique type II GS sequences and six unique bacterial species represented. The reason why bacteria in the ocean have so many type II GS genes is unclear.

Two hypotheses have been raised to explain the origin of type II GS in bacterial genomes: lateral gene transfer from eukaryotic organisms [106] and gene duplication prior to the divergence of prokaryotes and eukaryotes [101]. The type II GS sequences in the predominantly bacterial GOS data are not only abundant, but also diverse and divergent from most known eukaryotic GS sequences (Figure 7C). This makes the hypothesis of lateral gene transfer less favorable. If the GS gene duplication preceded the prokaryote–eukaryote divergence according to the gene duplication hypothesis, it is possible that many oceanic organisms retained type II GS genes during evolution.

Interestingly, we found 19 cases where a type I GS gene is adjacent to a type II GS gene on the same scaffold. Both GS genes seem to be functional based on the high degree of conservation of active site residues. The same gene arrangement was observed previously in Frankia alni CpI1 [107]. The functional significance of maintaining two types of GS genes adjacent to one another in the genome remains to be elucidated. Most of the sequences of these GS genes are highly similar. We examined the geographic distribution of these adjacent GS sequences across all the GOS samples. They are mainly found in the samples taken from two sites. Their geographic distribution is significantly different from the distributions of types I and II GS across the samples. The high sequence similarity among the adjacent GS pairs and their geographic distribution suggest that these adjacent GS sequences may come from only a few closely related organisms. This is consistent with the protein sequence tree of type II GS, where the type II GS sequences from the GS gene pairs mainly reside in two distinct branches (Figure 7C).

The active site residues are very well conserved in all GS sequences in the GOS data, except one residue, Y179, which coordinates the ammonium-binding pocket. We observed substitutions of Y179 to phenylalanine in about half of the type II GS sequences. The activity of type I GS in some bacteria is regulated by adenylylation at residue Tyr397. In the GOS data, Tyr397 is relatively conserved in type I GS, with variations to phenylalanine and tryptophan in about half of the sequences. This indicates that the activity of some of the type I GS is not regulated by adenylylation, as shown previously in some Gram-positive bacteria [108,109].

RuBisCO is the key enzyme in carbon fixation. It is the most abundant enzyme on earth [110] and plays an important role in carbon metabolism and the CO2 cycle. RuBisCO can be classified into four forms. Form I has been found in both plants and bacteria, and has an octameric structure. Form II has been found in many bacteria, and it forms a dimer in Rhodospirillum rubrum. Form III is mainly found in archaea, and forms various oligomers. Form IV, also called the RuBisCO-like protein (RLP), has been recently discovered from bacterial genome-sequencing projects [111,112]. RLP represents a group of proteins that do not have RuBisCO activity, but resemble RuBisCO in both sequence and structure [111,113]. The functions of RLPs are largely unknown and seem to differ from each other.

In contrast to the large number of GS sequences, we identified only 428 sequences homologous to the RuBisCO large subunit in the GOS data. The small number of RuBisCO sequences may partly be due to the fact that larger-sized bacterial organisms were not included in the sequencing because of size filtering. However, it could also indicate that CO2 is not the major carbon source for these sequenced ocean organisms.

The RuBisCO homologs in the GOS data are more diverse than the currently known RuBisCOs (Figure 7D, Materials and Methods). Six of 19 active site residues (N123, K177, D198, F199, H327, and G404) are not well conserved in all sequences, suggesting that the proteins with these mutations may have evolved to have new functions, such as in the case of RLPs. From the studies of the RLPs from Chlorobium tepidum and B. subtilis [111,114], it has been shown that the active site of RuBisCO can accommodate different substrates and is potentially capable of evolving new catalytic functions [113,114]. On the other hand, two sequence motifs, helices αB and α8, that are not involved in substrate binding and catalytic activity are well conserved in the GOS RuBisCO sequences. The higher degree of conservation of these nonactive site residues than that of active site residues suggests that these motifs are important for their structure, function, or interaction with other proteins.

We found 47 (31 at 90% identity filtering) GOS sequences in the branch with known RLP sequences in a phylogenetic tree of RuBisCO (Figure 7D). In this phylogenetic tree, in addition to the clades for each of the four forms of RuBisCO, there are also new groups of 65 (58 at 90% identity filtering) GOS sequences that do not cluster with any known RuBisCO sequences. This indicates that there could be more than one type of RuBisCO-like protein existing in organisms. The novel groups of RuBisCO homologs in the GOS data also suggest that we have not fully explored the entire RuBisCO family of proteins (Figure 7D).

GOS Data and Remote Homology Detection

The addition of GOS sequences may help greatly in defining the range and diversity of many known protein families, both by addition of many new sequences and by the increased diversity of GOS sequences. Our comparison of HMM scores for GOS sequences with those from the other four datasets shows that GOS sequences consistently tend to have lower scores, which indicates additional diversity from that captured in the original HMM (Figure 8). The addition of GOS data into domain profiles may broaden the profile and allow it to detect additional remote family members in both GOS and other datasets. As a trial, we rebuilt the Pfam model PF01396, which describes a zinc finger domain within bacterial DNA topoisomerase. The original model finds 821 matches to 481 proteins in NCBI-nr. Our model that includes GOS sequences reveals 1,497 matches to 722 sequences, an increase of 50% in sequences and 82% in domains (most topoisomerases have three such domains, of which one is divergent and difficult to detect). Of these new matches, 104 are validated by the presence of additional topoisomerase domains, or they are annotated as topoisomerase, while most others are unannotated or similar to other DNA-modifying enzymes not previously thought to have zinc finger domains.

HMM profiles can be further exploited by using matches beyond the conservative trusted cutoff (TC) used in this study. For instance, the Pfam for the poxvirus A22 protein family has no GOS matches above the TC, but 137 matches with E-values of 1 × 10⁻³ to 1 × 10⁻¹⁰, containing a short conserved motif overlap with A22 proteins. Alignment of these matches shows an additional two short motifs in common with A22, establishing their homology, and using a profile HMM, we found a total of 269 family members in GOS and eight family members in NCBI-nr. Many members of this new family are surrounded by other novel clusters, or are in putative viral scaffolds, suggesting that these weak matches are an entry point into a new clade of viruses.
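The percentage increases quoted for the rebuilt PF01396 model follow directly from the reported counts in NCBI-nr (481 to 722 proteins, 821 to 1,497 domain matches). A minimal worked check, with an illustrative helper name:

```python
def percent_increase(before, after):
    """Relative increase from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


sequences = percent_increase(481, 722)   # proteins matched in NCBI-nr
domains = percent_increase(821, 1497)    # individual domain matches
print(round(sequences), round(domains))
```

Rounded to whole percentages, these give the 50% and 82% stated in the text.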

Figure 8. Distribution of Average HMM Score Difference between GOS and Public (NCBI-nr, MG, TGI-EST, and ENS)

Only matches to the full length of an HMM are considered, and only HMMs that have at least 100 matches to each of GOS and public databases are considered. This results in 1,686 HMMs whose average scores to GOS and public databases are considered. The mean of the distribution is −50, showing that GOS sequences tend to score lower than sequences in public, thereby reflecting diversity compared to sequences in public.
doi:10.1371/journal.pbio.0050016.g008

ORFans with Matches in GOS Data

Further evidence of the diversity added by GOS sequences is provided by their matches to ORFans. ORFans are sequences in current protein databases that do not have any recognizable homologs [117]. ORFan sequences (discounting those that may be spurious gene predictions) represent genes with organism-specific functions or very remote homologs of known families. They have the potential to shed light on how new proteins emerge and how old ones diversify.

We identified 84,911 ORFans (5,538 archaea, 35,292 bacteria, 37,427 eukaryotic, 5,314 virus, and 1,340 unclassified) from the NCBI-nr dataset using CD-HIT [116,117] and BLAST (see Materials and Methods). Of these, 6,044 have matches to GOS sequences using BLAST (E-value ≤ 1 × 10⁻⁶). Figure 9 shows the distribution of the matched ORFans grouped by organisms, number of their GOS matches, and the lowest E-value of the matches. We found matches to GOS sequences for 13%, 6.3%, 0.89%, and 8.9% of bacterial, archaeal, eukaryotic, and viral ORFans, respectively. While most of these ORFans have very few GOS matches, 626 of them have ≥20 GOS matches. The similarities between GOS sequences and eukaryotic ORFans are much weaker than those between GOS sequences and noneukaryotic ORFans. The average sequence identity between eukaryotic ORFans and their closest GOS matches is 38%. This is 6% lower than the identity between noneukaryotic ORFans and their closest GOS matches.

The ORFans that match GOS sequences are from approximately 600 organisms. Table 11 lists the 20 most populated organisms. Out of the 6,044 matched ORFans, approximately 2,000 are from these 20 organisms. For example, Rhodopirellula baltica SH 1, a marine bacterium, has 7,325 proteins deposited in NCBI-nr. We identified 1,418 ORFans in this organism, of which 322 have GOS matches. Another interesting example in this list is Escherichia coli. Although there are >20 different strains sequenced, 168 ORFans are identified in strain CFT073, and 67 of them have GOS matches. The only eukaryotic organism in this list is Candida albicans SC5314, a fungal human pathogen, which has 49 ORFans with GOS matches.

We examined a small but interesting subset of the ORFans that have 3-D structures deposited in PDB. Out of 65 PDB ORFans, GOS matches were found for eight (see Supporting Information for their PDB identifiers and names). They include four restriction endonucleases, three hypothetical proteins, and a glucosyltransferase.

GOS sequences can play an important role in identifying the functions of existing ORFans or in confirming protein predictions. For example, we found that the hypothetical protein AF1548, which is a PDB ORFan, has matches to 16 GOS sequences. A PSI-BLAST search with AF1548 as the query against a combined set of GOS and NCBI-nr identified several significant restriction endonucleases after three iterations. With the support of 3-D structure and multiple sequence alignment of AF1548 and its GOS matches, we predict that AF1548 and its GOS homologs are restriction endonucleases (Figure 10). When combined with an established consensus of active sites of the related endonuclease families [118], we predicted three catalytic residues.
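The counts in Table 11 can also be read as the fraction of each organism's ORFans that found a GOS match, which varies widely between organisms. The short sketch below computes that fraction for five rows transcribed from Table 11; the dictionary layout and function name are illustrative choices.

```python
# (total ORFans, ORFans matched by GOS) for selected rows of Table 11
TABLE11 = {
    "Rhodopirellula baltica SH 1": (1418, 322),
    "Shewanella oneidensis MR-1": (292, 206),
    "Synechococcus sp. WH 8102": (143, 116),
    "Escherichia coli CFT073": (168, 67),
    "Candida albicans SC5314": (1647, 49),
}


def matched_fraction(organism):
    """Fraction of an organism's ORFans that have at least one GOS match."""
    total, matched = TABLE11[organism]
    return matched / total


# Rank organisms from highest to lowest matched fraction.
ranked = sorted(TABLE11, key=matched_fraction, reverse=True)
for organism in ranked:
    print(f"{organism}: {100 * matched_fraction(organism):.1f}%")
```

Among these rows the marine cyanobacterium Synechococcus sp. WH 8102 has the largest matched fraction and the eukaryote Candida albicans the smallest, in line with the weaker GOS similarity reported for eukaryotic ORFans.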

Genome Sequencing Projects and Protein ExplorationWith respect to protein exploration and novel family

discovery, microbial sequencing offers more promise com-pared to sequencing more mammalian genomes. This isillustrated by Figure 11, where the number of clusters thatprotein predictions from various finished mammalian ge-nomes fall into was compared to the number of clusters thatsimilar-sized random subsets of microbial sequences fall into(see Materials and Methods). As the figure shows, the rate ofprotein family discovery is higher for microbes than formammals. Indeed, the rate of new family discovery isplateauing for mammalian sequences. This is not surprising,

Table 11. Top 20 Organisms with Most ORFans Matched by GOS

Organism Total Proteinsa Total ORFans ORFans Matched

Rhodopirellula baltica SH 1 7,325 1,418 322

Shewanella oneidensis MR-1 4,472 292 206

Cytophaga hutchinsonii 3,686 555 170

Bdellovibrio bacteriovorus HD100 3,587 753 152

Kineococcus radiotolerans SRS30216 4,559 1,070 125

Synechococcus sp. WH 8102 2,517 143 116

Burkholderia cepacia R18194 7,717 198 100

Aeropyrum pernix K1 1,841 1,312 95

Burkholderia cepacia R1808 7,915 292 94

Magnetospirillum magnetotacticum MS-1 10,146 826 92

Microbulbifer degradans 2–40 4,038 386 85

Burkholderia fungorum LB400 7,994 190 84

Desulfitobacterium hafniense DCB-2 4,389 758 75

Escherichia coli CFT073 5,379 168 67

Bradyrhizobium japonicum USDA 110 8,317 580 66

Acanthamoeba polyphaga mimivirus 911 304 57

Caulobacter crescentus CB15 3,737 333 56

Rubrivivax gelatinosus PM1 4,307 287 53

Mesorhizobium loti MAFF303099 7,272 370 53

Candida albicans SC5314 14,107 1,647 49

aTotal number of proteins of this organism deposited at NCBI; may have redundant entries.doi:10.1371/journal.pbio.0050016.t011

Figure 9. Pie Chart of ORFans That Had GOS Matches

ORFans are grouped by organism (left), number of their GOS matches (middle), and the lowest E-value to their GOS matches in negative logarithm form (right). For both middle and right charts, inner and outer circles represent noneukaryotic and eukaryotic ORFans, respectively. From the middle chart it is seen that 626 (= 404 + 180 + 21 + 21) ORFans form significant protein families with ≥20 GOS matches.
doi:10.1371/journal.pbio.0050016.g009


as mammalian divergence from a common ancestor is much more recent than microbial divergence from a common ancestor, which suggests that mammals will share a larger core set of less-diverged proteins. Microbial sequencing is also more cost effective than mammalian sequencing for acquiring protein sequences because microbial protein density is typically 80%–90% versus 1%–2% for mammals. This could be addressed with mammalian mRNA sequencing, but issues with acquiring rarely expressed mRNAs would need to be considered. There are, of course, other reasons to sequence mammalian genomes, such as understanding mammalian evolution and mammalian gene regulation.

Conclusions

The rate of protein family discovery is approximately linear in the (current) number of protein sequences. Additional sequencing, especially of microbial environments, is expected to reveal many more protein families and subfamilies. The potential for discovering new protein families is also supported by the GOS diversity seen at the nucleotide level across the different sampling sites [30]. Averaged over the sites, 14% of the GOS sequence reads from a site are unique (at 70% nucleotide identity) to that site [30].

The GOS data provide almost complete coverage of known prokaryotic protein families. In addition, they add a great deal of diversity to many known families and offer new insights into the evolution of these families. This is illustrated using several protein families, including UV damage–repair enzymes, phosphatases, proteases, glutamine synthetase, RuBisCO, RecA (unpublished data), and kinases [77]. Only a handful of protein families have been examined thus far, and many thousands more remain to be explored.

The protein analysis presented indicates that we are far from exploring the diversity of viruses. This is reflected in several of the analyses. The GOS-only clusters show an overrepresentation of sequences of viral origin. In addition, our domain analysis using HMM profiling shows a lower Pfam coverage of the GOS sequences in the viral kingdom compared to the other kingdoms. At least two of the protein families we explored in detail (UV repair enzymes and glutamine synthetase) contain abundant new viral additions. The extraordinary diversity of viruses in a variety of environmental settings is only now beginning to be understood [57,119–121]. A separate analysis of GOS microbial and viral sequences (unpublished data) shows that multiple viral protein clusters contain significant numbers of host-derived

Figure 11. Rate of Cluster Discovery for Mammals Compared to That for Microbes

The x-axis denotes the number of sequences (in thousands), and the y-axis denotes the number of clusters (in thousands). Five mammalian genomes are considered for the “Mammalian” dataset, and the plot shows the number of clusters that are hit when each additional genome is added. For the “Mammalian Random” dataset, the order of the sequences from the “Mammalian” dataset is randomized. For the NCBI-nr prokaryotic and GOS datasets, random subsets of size similar to that of the mammalian set are considered.
doi:10.1371/journal.pbio.0050016.g011

Figure 10. Structure and GOS Homologs of Hypothetical Protein AF1548

Yellow bars represent β-strands. Highlighted are predicted catalytic residues: 38D, 51E, and 53K.
doi:10.1371/journal.pbio.0050016.g010


proteins, suggesting that viral acquisition of host genes is quite widespread in the oceans.

Data generated by this GOS study and similar environmental shotgun sequencing studies present their own analysis challenges. Methods for various analyses (e.g., sequence alignment, profile construction, phylogeny inference, etc.) are generally designed and optimized to work with full sequences. They have to be tailored to analyze the mostly fragmentary sequences that are generated by these projects. Nevertheless, these data are a valuable source of new discoveries. These data have the potential to refine old hypotheses and make new observations about proteins and their evolution. Our preliminary exploration of the GOS data identified novel protein families and also showed that many ORFan sequences from current databases have homologs in these data. The diversity added by GOS data to protein families also allows for the building of better profile models and thereby improves remote homology detection. The discovery of kingdom-crossing protein families that were previously thought to be kingdom-specific presents evidence that the GOS project has excavated proteins of more ancient lineage than that previously known, or that have undergone lateral gene transfer. This is another example of how metagenomics studies are changing our understanding of protein sequences, their evolution, and their distribution across the various forms of life and environments. Biases in the currently published databases due to oversampling of some proteins or organisms are illuminated by environmental surveys that lack such biases. Such knowledge can help us make better predictions of the real distribution patterns of proteins in the natural world and indicate where increased sampling would be likely to uncover new families or family members of tremendous diversity (such as in the viral kingdom).

These data have other significant implications for the fields of protein evolution and protein structure prediction. Having several hundreds or even tens of thousands of diverse proteins from a family or examples of a specific protein fold should provide new approaches for developing protein structure prediction models. Development of algorithms that consider the alignments of all these family members/protein folds and analyze how amino acid sequence can vary without significantly altering the tertiary structure or function may provide insights that can be used to develop new ab initio methods for predicting protein structures. These same datasets could also be used to begin to understand how a protein evolves a new function. Finally, this large database of amino acid sequence data could help to better understand and predict the molecular interactions between proteins. For example, they may be used to predict the protein–protein interactions so critical for the formation of specific functional complexes within cells.

The GOS data also have implications for nearly all computational methods relying on sequence data. The increase in the number of known protein sequences presents challenges to many algorithms due to the increased volume of sequences. In most cases this increase in sequence data can be compensated for with additional CPU cycles, but it is also a foreshadowing of times to come as the pace of large-scale sequence-collecting accelerates. A related challenge is the increase in the diversity of protein families, with many new divergent clades present. With more protein similarity relationships falling into the twilight zone overlapping with random sequence similarity, the number of false positives for homology detection methods increases, making the true relationships more difficult to identify. Nevertheless, a deeper knowledge of protein sequence and family diversity introduces unprecedented opportunities to mine similarity relationships for clues on molecular function and molecular interactions as well as providing much expanded data for all methods utilizing homologous sequence information data.

The GOS dataset has demonstrated the usefulness of large-scale environmental shotgun sequencing projects in exploring proteins. These projects offer an unbiased view of proteins and protein families in an environmental sample. However, it should be noted that the GOS data reported here are limited to mostly ocean surface microbes. Even with this targeted sampling a tremendous amount of diversity is added to known families, and there is evidence for a large number of novel families. Additional data from larger filter sizes (that will sample more eukaryotes) coupled with metagenomic studies of different environments like soil, air, deep sea, etc. will help to achieve the ultimate goal of a whole-earth catalog for proteins.

Materials and Methods

Data description. NCBI-nr [31,32] is the single largest publicly available protein resource and includes protein sequences submitted to SWISS-PROT (curated protein database) [122], PDB (a database of amino acid sequences with solved structures) [123], PIR (Protein Information Resource) [124], and PRF (Protein Research Foundation). In addition, NCBI-nr also contains protein predictions from DNA sequences from both finished and unfinished genomes in GenBank [125], EMBL [126], and DNA Databank of Japan (DDBJ) [127]. The nonredundancy in NCBI-nr is only to the level of distinct sequences, and any two sequences of the same length and content are merged into a single entry. NCBI-nr contains partial protein sequences and is not a fully curated database. Therefore it also contains contaminants in the form of sequences that are falsely predicted to be proteins.

Expressed sequence tag (EST) databases also provide the potential to add a great deal of information to protein exploration and contain information that is not well represented in NCBI-nr. To this end, assemblies of EST sequences from the TIGR Gene Indices [34], an EST database, were included in this study. To minimize redundancy, only EST assemblies from those organisms for which the full genome is not yet known were included. The protein predictions on metazoan genomes that are fully sequenced and annotated were obtained by including the Ensembl database [35,36] in this study.

Both finished and unfinished sequences from prokaryotic genome projects submitted to NCBI were included. The protein predictions from the individual sequencing projects are submitted to NCBI-nr. Nevertheless, these genomes were included in this dataset both for the purpose of evaluating our approach and also for the purpose of identifying any proteins that were missed by the annotation process used in these projects.

Thus, the following publicly available datasets were used for this study, all downloaded on February 10, 2005: NCBI-nr, PG, TGI-EST, and ENS. The organisms in the PG set and the TGI-EST set are listed in Protocol S1.

Assembly of the GOS dataset. Initial assembly (construction of “unitigs”) was performed so that only overlaps of at least 98% DNA sequence identity and no conflicts with other overlaps were accepted. False assemblies at this phase of the assembler are extremely rare, even in the presence of complex datasets [37,128]. Paired-end (also known as mate-pair) data were then used to order, orient, and merge unitigs into the final assemblies, but only when two mate pairs or a single mate pair and an overlap between unitigs implied the same layout. In one respect, mate pair data was used more aggressively than is typical in assembly of a single genome in that depth-of-coverage information was largely ignored [10]. This potentially allows chimeric assemblies through a repeat within a genome or through an ortholog between genomes. Thus, a conclusion that relies on the correctness of a single assembly involving multiple unitigs should be considered


tentative until the assembly can be confirmed in some way. Assemblies involved in key results in this paper were subjected to expert manual review based on thickness of overlaps, presence of well-placed mate pairs across thin overlaps or across gaps between contigs, and consistency of depth of coverage.

Data release and availability. All the GOS protein predictions will be submitted to GenBank. In addition, all the data supporting this paper, including the clustering and the various analyses, will be made publicly available via the CAMERA project (Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis; http://camera.calit2.net), which is funded by the Gordon and Betty Moore Foundation.

All-against-all BLASTP search. We used two sets of computer resources. At the J. Craig Venter Institute, 125 dual 3.06-GHz Xeon processor systems with 2 GB of memory per system were used. Each system had 80 GB local storage and was connected by GBit ethernet with storage area network (SAN) I/O of ~24 GBit/sec and network attached storage (NAS) I/O of ~16 GBit/sec. A total of 466,366 CPU hours was used on this system. In addition, access to the National Energy Research Scientific Computing Center (NERSC) Seaborg computer cluster was available, including 380 nodes each with sixteen 375-MHz Power3 processors. The systems had between 16 GB and 64 GB of memory. Only 128 nodes were used at a time. A total of 588,298 CPU hours was used on this system. The dataset of 28.6 million sequences was searched against itself in a half-matrix using NCBI BLAST [38] with the following parameters: -F "m L" -U T -p blastp -e 1 × 10⁻¹⁰ -z 3 × 10⁹ -b 8000 -v 10. In this paper, similarity of an alignment is defined to be the fraction of aligned residues with a positive score according to the BLOSUM62 substitution matrix [129] used in the BLAST searches.

Identification of nonredundant sequences. Given a set of sequences S and a threshold T, a nonredundant subset S′ of S was identified by first partitioning S (using the threshold T) and then picking a representative from each partition. The set of representatives constitutes the nonredundant set S′. The process was implemented using the following graph-theoretic approach. A directed graph G = (V, E) is constructed with vertex set V and edge set E. Each vertex in V represents a sequence from S. A directed edge (u, v) ∈ E exists if sequence u is longer than sequence v and their sequence comparison satisfies the threshold T; for sequences of identical length, the sequence with the lexicographically larger id is considered the longer of the two. Note that G does not have any cycles. Source vertices (i.e., vertices with in-degree zero) are sorted in decreasing order of their out-degrees and processed in this order (from largest out-degree to smallest). A source vertex u is processed as follows: mark all vertices that have not been seen before and are reachable from vertex u as being redundant and mark vertex u as their representative.

We used two thresholds in this paper, 98% similarity and 100% identity. The former was used in the first stage of the clustering and the latter was used in the HMM profile analysis. For the 98% similarity threshold, two sequences satisfy the threshold if the following three criteria are met: (1) similarity of the match is at least 98%; (2) at least 95% of the shorter sequence is covered by the match; and (3) (match score)/(self score of shorter sequence) ≥ 95%.

For the 100% identity threshold, two sequences satisfy the threshold if their match identity is 100%.
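The redundancy-removal procedure described above can be sketched in a few lines of Python. This is an illustrative reading of the text, not the pipeline actually used: the `Match` record, the function names, and the tie-break among sources of equal out-degree (by id) are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """Pairwise match summary (hypothetical record, not a BLAST output format)."""
    similarity: float          # fraction of aligned residues with a positive score
    shorter_coverage: float    # fraction of the shorter sequence covered by the match
    score: float               # score of the match
    shorter_self_score: float  # self score of the shorter sequence


def satisfies_98(m: Match) -> bool:
    """The three criteria given in the text for the 98% similarity threshold."""
    return (m.similarity >= 0.98
            and m.shorter_coverage >= 0.95
            and m.score / m.shorter_self_score >= 0.95)


def nonredundant(lengths, matches, threshold=satisfies_98):
    """lengths: id -> sequence length; matches: frozenset({id1, id2}) -> Match.

    Returns a dict mapping each representative to the ids it makes redundant.
    """
    def rank(s):
        # Equal lengths: the lexicographically larger id counts as the longer one.
        return (lengths[s], s)

    out = {s: set() for s in lengths}
    indegree = {s: 0 for s in lengths}
    for pair, m in matches.items():
        if threshold(m):
            shorter, longer = sorted(pair, key=rank)
            out[longer].add(shorter)   # directed edge longer -> shorter
            indegree[shorter] += 1

    # Sources in decreasing order of out-degree (ties by id: an assumption).
    sources = sorted((s for s in lengths if indegree[s] == 0),
                     key=lambda s: (-len(out[s]), s))
    seen, representatives = set(), {}
    for u in sources:
        representatives[u] = set()
        stack = list(out[u])
        while stack:                   # everything reachable from u
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                representatives[u].add(v)
                stack.extend(out[v])
    return representatives
```

Because edges always run from a longer to a shorter sequence, the graph is acyclic and every non-source vertex is reachable from at least one source, so each sequence ends up either as a representative or assigned to exactly one.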

Description of the clustering algorithm. The starting point for the clustering was the set of pairwise sequence similarities identified using the all-against-all BLASTP compute. Because of both the volume and nature of the data, the clustering was carried out in four steps: redundancy removal, core set identification, core set merging, and final recruitment.

A set of nonredundant sequences (at 98% similarity) was identified using the procedure given in Materials and Methods (Identification of nonredundant sequences). Only the nonredundant sequences were considered in further steps of the clustering process.

The aim of the core set identification step was to identify core sets of highly related sequences. In graph-theoretic terms, this involves looking for dense subgraphs in a graph where the vertices correspond to sequences and an edge exists between two sequences if their sequence match satisfies some reasonable threshold (for instance, a 40% similarity match over 80% of at least one sequence, with the sequences clearly homologous based on the BLAST threshold). Dense subgraphs were identified by using a heuristic. This approach utilizes long edges. These are edges where the match threshold is computed relative to the longer sequence. This was done to prevent, as much as possible, unrelated proteins from being put into the same core set. If all the sequences were full length, using long edges would have offered a good solution to keeping unrelated proteins apart. However, the situation here is complicated by the presence of a large amount of fragmentary sequence data of varying lengths. This was dealt with somewhat by working with rather stringent match thresholds and a two-stage process to identify the core sets.

We used the concept of strict long edges and weak long edges. A strict long edge exists between two vertices (sequences) if their match has the following properties: (1) 90% of the longer sequence is involved in the match; (2) the match has 70% similarity; and (3) the score of the match is at least 60% of the self-score of the longer sequence. A weak long edge exists between two vertices (sequences) if their match has the following properties: (1) 80% of the longer sequence is involved in the match; (2) the match has 40% similarity; and (3) the score of the match is at least 30% of the self-score of the longer sequence.

Core set identification had two substages: large core initialization and core extension. The large core initialization step identified sets of sequences where these sets were of a reasonable size and the sequences in them were very similar to each other. Furthermore, these sets could be extended in the core extension step by adding related sequences. In the large core initialization step, a directed graph G was constructed on the sequences using strict long edges, with each long edge being directed from the longer to the shorter sequence. For each vertex v in G, let S(v) denote the friends set of v consisting of v and all neighbors that v has an out-going edge to.

Initially all the vertices in G are unmarked. Consider the set of all friends sets in the decreasing order of their size. For S(v) that is currently being considered, do the following: (1) initialize seed set A = S(v); (2) while there exists some v′ such that |S(v) ∩ S(v′)| ≥ k, set A = A ∪ S(v′) (note: k = 10 is chosen); (3) output set A and mark all vertices in A; and (4) update all friends sets to contain only unmarked vertices.
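A minimal sketch of the large core initialization step follows, assuming the friends sets have already been built from strict long edges. The function names are hypothetical, the thresholds are read as lower bounds ("at least"), and the overlap test is taken literally against the seed's own friends set S(v), with each other friends set merged at most once; testing against the growing set A would be an alternative reading.

```python
def strict_long_edge(coverage_of_longer, similarity, score, self_score_of_longer):
    """Strict long-edge test; the three thresholds are read as lower bounds."""
    return (coverage_of_longer >= 0.90
            and similarity >= 0.70
            and score >= 0.60 * self_score_of_longer)


def seed_sets(friends, k=10):
    """friends: vertex -> friends set S(v) (v plus its out-neighbours).

    Returns the seed sets in the order in which they are output.
    """
    friends = {v: set(s) for v, s in friends.items()}  # work on a copy
    seeds = []
    while True:
        live = [v for v in friends if friends[v]]
        if not live:
            break
        # Largest remaining friends set first (ties by id: an assumption).
        v = max(live, key=lambda x: (len(friends[x]), x))
        seed = set(friends[v])
        for w in live:
            if w != v and len(friends[v] & friends[w]) >= k:
                seed |= friends[w]
        seeds.append(seed)
        # Marking: drop the vertices of the seed from every friends set.
        for w in friends:
            friends[w] -= seed
    return seeds
```

The text does not state a minimum size for an output seed set, so this sketch emits every nonempty set, including singletons; a size cutoff could be applied afterwards.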

In the core extension step, we constructed a graph G using weak long edges. All vertices in seed sets (computed from the large core initialization step) were marked and the rest of the vertices unmarked. Each seed set was then greedily extended to be a core set by adding a currently unmarked vertex that has at least k neighbors (k = 10 is chosen) in the set; the added vertex was marked. After this process, a clique-finding heuristic was used to identify smaller cliques (of size at most k − 1) consisting of currently unmarked vertices; these were also extended to become core sets. A final step involved merging the computed core sets on the basis of weak edges connecting them.
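The greedy extension of a seed set can be illustrated as follows. This sketch covers only the extension of one seed set; the clique-finding heuristic and the final merging step are not shown. The adjacency representation (weak long edges treated as undirected) and the scan order are assumptions made for the example.

```python
def extend_core(seed, neighbors, marked, k=10):
    """Greedily grow one seed set into a core set.

    seed:      set of vertices from the large core initialization step
    neighbors: vertex -> set of vertices joined to it by a weak long edge
    marked:    set of already-assigned vertices; updated in place
    """
    core = set(seed)
    changed = True
    while changed:                       # repeat until no vertex qualifies
        changed = False
        for v in sorted(neighbors):      # fixed scan order (an assumption)
            if v not in marked and len(neighbors[v] & core) >= k:
                core.add(v)
                marked.add(v)
                changed = True
    return core
```

A vertex that fails the test early can qualify later, once enough of its neighbors have joined the core, which is why the scan is repeated until a full pass adds nothing.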

In the core set merging step, we constructed an FFAS (Fold and Function Assignment System) profile [39] for each core set using the longest sequence in the core set as query. FFAS was then used to carry out profile–profile comparisons in order to merge the core sets into larger sets of related sequences. Due to computational constraints imposed by the number of core sets, profiles were built on only core sets containing at least 20 sequences.

Final recruitment involved constructing a PSI-BLAST profile [40] on core sets of size 20 or more (using the longest sequence in the core set as query) and then using PSI-BLAST (-z 1 × 10⁹, -e 10) to recruit as yet unclustered sequences or small-sized clusters (size less than 20) to the larger core sets. For a sequence to be recruited, the sequence–profile match had to cover at least 60% of the length of the sequence with an E-value ≤ 1 × 10⁻⁷. In a final step, unclustered sequences were recruited to the clusters using their BLAST search results. A length-based threshold was used to determine if the sequence is to be recruited.

Identification of clusters containing shadow ORFs. A well-known problem in predicting coding intervals for DNA sequences is shadow ORFs. The key requirement that coding intervals not contain in-frame stop codons requires that coding intervals be subintervals of ORFs. Long ORFs are therefore obvious candidates to be coding intervals. Unfortunately, the constraints on the coding interval to be an ORF often cause subintervals and overlapping intervals of the coding interval to also be ORFs in one of the five other reading frames (two on the same strand and three on the opposite strand). These coincidental ORFs are called shadow ORFs since they are found in the shadow of the coding ORF. In rare cases (and more frequently in certain viruses) coding intervals in different reading frames can overlap but usually only slightly. Overwhelmingly, distinct coding intervals do not overlap. However, this constraint is not as strict for ORFs that contain a coding interval, as the exact extent of the coding interval is not known. Prokaryotes predominate in these data and are the focus of the ORF predictions. In prokaryotes, the 3′ end of an ORF is very likely to be part of the coding interval because a stop codon is a clear signal for the termination of both the ORF and the coding interval (this signal could be obscured by frameshift errors in sequencing). The 5′ end is more problematic because the true start codon is not so easily identified and so the longest ORF with a reasonable start codon is chosen and this may extend the ORF beyond the true coding interval. For this reason different criteria


were set for when ORFs have a significant overlap depending on the orientation (or the 5′ or 3′ ends) of the ORFs involved. Two ORFs on the same strand are considered overlapping if their intervals overlap by at least 100 bp. Two ORFs that are on the opposite strands are considered overlapping either if their intervals overlap by at least 50 bp and their 3′ ends are within each other's intervals, or if their intervals overlap by at least 150 bp and the 5′ end of one is in the interval of the other.
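The overlap criteria can be written as a small predicate. The following is a sketch under stated assumptions: the `Orf` record, its 1-based inclusive coordinates, and the function names are introduced here for illustration, and the bp thresholds are those quoted in the text above.

```python
from typing import NamedTuple


class Orf(NamedTuple):
    """ORF interval on a contig (hypothetical record; 1-based, inclusive)."""
    begin: int   # leftmost base
    end: int     # rightmost base
    strand: str  # "+" or "-"

    @property
    def five_prime(self) -> int:
        return self.begin if self.strand == "+" else self.end

    @property
    def three_prime(self) -> int:
        return self.end if self.strand == "+" else self.begin

    def contains(self, position: int) -> bool:
        return self.begin <= position <= self.end


def overlap_bp(a: Orf, b: Orf) -> int:
    """Number of bases shared by the two intervals (0 if disjoint)."""
    return max(0, min(a.end, b.end) - max(a.begin, b.begin) + 1)


def significant_overlap(a: Orf, b: Orf) -> bool:
    n = overlap_bp(a, b)
    if a.strand == b.strand:
        return n >= 100
    # Opposite strands: 3' ends inside each other's intervals ...
    tails = n >= 50 and a.contains(b.three_prime) and b.contains(a.three_prime)
    # ... or the 5' end of one ORF inside the interval of the other.
    heads = n >= 150 and (a.contains(b.five_prime) or b.contains(a.five_prime))
    return tails or heads
```

The lower threshold for the 3′-to-3′ configuration reflects the reasoning in the text: the 3′ end of a prokaryotic ORF is very likely to be coding, whereas the 5′ end may extend beyond the true start codon.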

ORFs for coding intervals are clustered based on sequence similarity. In most cases this sequence similarity is due to the ORFs evolving from a common ancestral sequence. Due to functional constraints on the protein being coded for by the ORF, some sequence similarity is retained. There are no known explicit constraints on the shadow ORFs to constrain drift from the ancestral sequences. However, the shadow ORFs still tend to cluster together for some obvious reasons. The drift has not yet obliterated the similarity. There are implicit constraints due to the functional constraints on the overlapping coding ORF. There are also other possible unknown functional constraints beyond the coding ORF. At first it was surmised that within shadow ORF clusters the diversity should be higher than for the coding ORF, but this did not prove to be a reliable signal. The apparent problem is that the shadow ORFs tend to be fractured into more clusters due to the introduction of stop codons that are not constrained because the shadow ORFs are noncoding. What rapidly became apparent is that the most reliable signal that a cluster was made up of shadow ORFs is that the cluster was smaller than the coding cluster containing the ORFs overlapping the shadow ORFs.

The basic rule for labeling a cluster as a shadow ORF cluster is that the size of the shadow ORF cluster is less than the size of another cluster that contained a significant proportion of the overlapping ORFs for the shadow ORF cluster. A specific set of rules was used to label shadow ORF clusters based on comparison to other clusters that contained ORFs overlapping ORFs in the shadow ORF cluster (called the overlapping cluster for this discussion). First, the overlapping cluster cannot be the same cluster as the shadow ORF cluster (there are sometimes overlapping ORFs within the same cluster due to frameshifts). Second, both the redundant and nonredundant sizes of the shadow ORF cluster must be smaller than the corresponding sizes of the overlapping cluster. Third, at least one-third of the shadow ORFs must have overlapping ORFs in the overlapping cluster. Fourth, less than one-half of the shadow ORFs are allowed to contain their overlapping ORFs (this test is rarely needed but did eliminate the vast majority of the very few obvious false positives that were found using these rules). Finally, the majority of the shadow ORFs that overlapped must overlap by more than half their length.
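The five rules can be collected into a single predicate over summary counts for a candidate cluster and one overlapping cluster. This is only a sketch: the `ClusterPair` record and its field names are hypothetical, and the denominators chosen for the fractions (all ORFs in the candidate cluster for the third and fourth rules, the overlapped ORFs for the last) are one reading of the text rather than a documented definition.

```python
from dataclasses import dataclass


@dataclass
class ClusterPair:
    """Counts for a candidate shadow cluster versus one overlapping cluster."""
    same_cluster: bool
    shadow_size: int          # all sequences in the candidate cluster
    shadow_nr_size: int       # nonredundant sequences in the candidate cluster
    other_size: int           # all sequences in the overlapping cluster
    other_nr_size: int        # nonredundant sequences in the overlapping cluster
    n_shadow_orfs: int        # ORFs in the candidate cluster
    n_overlapped: int         # of those, ORFs overlapped by an ORF of the other cluster
    n_containing: int         # of those, ORFs that contain their overlapping ORF
    n_overlap_over_half: int  # overlapped ORFs overlapped over more than half their length


def is_shadow_cluster(p: ClusterPair) -> bool:
    if p.same_cluster:                                  # rule 1
        return False
    if not (p.shadow_size < p.other_size
            and p.shadow_nr_size < p.other_nr_size):    # rule 2
        return False
    if 3 * p.n_overlapped < p.n_shadow_orfs:            # rule 3: at least one-third
        return False
    if 2 * p.n_containing >= p.n_shadow_orfs:           # rule 4: less than one-half
        return False
    return 2 * p.n_overlap_over_half > p.n_overlapped   # rule 5: a majority
```

Integer comparisons such as `3 * n_overlapped < n_shadow_orfs` avoid floating-point rounding at the exact one-third and one-half boundaries.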

When using this rule, 1,274,919 clusters were labeled as shadow ORF clusters, and 6,570,824 singletons were labeled as shadow ORFs. The rules need to be somewhat conservative so as not to eliminate coding clusters. To test these rules, clusters containing at least two NCBI-nr sequences were examined. Two sequences were used instead of one because occasional spurious shadow ORFs have been submitted to NCBI-nr. There were 989 shadow ORF clusters containing at least two NCBI-nr sequences and with more than one-tenth as many NCBI-nr sequences as the overlapping cluster. This was 0.86% of all clusters (114,331 in total) with at least two NCBI-nr sequences. Of these 989, a few were obvious mistakes, and the others involved very few NCBI-nr sequences of dubious curation, such as “hypothetical.” Just to be conservative, all of these 989 clusters were rescued and not labeled as shadow ORF clusters.

Ka/Ks test to determine if sequences in a cluster are under selective pressure. For a cluster containing conserved but noncoding sequences, it is expected that there is no selection at the codon level. We checked this by computing the ratio of nonsynonymous to synonymous substitutions (Ka/Ks test) [130,131] on the DNA sequences from which the ORFs in the cluster were derived. For most proteins, Ka/Ks ≪ 1, and for proteins that are under strong positive selection, Ka/Ks ≫ 1. A Ka/Ks value close to 1 is an indication that sequences are under no selective pressure and hence are unlikely to encode proteins [134,135]. Weakly selected but legitimate coding sequences can have a Ka/Ks value close to 1. These were identified to some extent by using a model in which different partitions of the codons experience different levels of selective pressure. A cluster was rejected only if no partition was found to be under purifying selection at the amino acid level.

The Ka/Ks test [130,131] was run only on those clusters (remaining after the shadow ORF filtering step) that did not contain sequences with HMM matches or have NCBI-nr sequences in them. Only the nonredundant sequences in a cluster were considered. Sequences in each of the clusters were aligned with MUSCLE [134]. For each cluster, a strongly aligning subset of sequences was selected for the Ka/Ks analysis. The codeml program from PAML [135,136] was run using model M0 to calculate an overall (i.e., branch- and position-independent) Ka/Ks value for the cluster. Clusters with Ka/Ks ≤ 0.5, indicating purifying selection and therefore very likely coding, were considered as passing the Ka/Ks filter. In addition, the remaining clusters were examined by running codeml with model M3. This partitioned the positions of the alignment into three classes that may be evolving differently (typically, a few positions may be under positive selection while the remainder of the sequence is conserved). A likelihood ratio test was applied to select clusters for which M3 explained the data significantly better than M0 [136]. If a cluster was thus selected, and if one of the resulting partitions had a Ka/Ks ≤ 0.5 and comprised at least 10% of the sequence, then that cluster was also considered as passing the Ka/Ks filter. All other clusters were marked as containing spurious ORFs.
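The decision logic of the filter can be sketched as below, taking the codeml outputs (the M0 Ka/Ks estimate, the two log-likelihoods, and the M3 site classes) as already-parsed inputs. Two details are assumptions not stated in the text: the significance level of 0.05, and the use of 4 degrees of freedom for the likelihood ratio test, which is the usual parameter difference between M3 with three classes and M0. For 4 degrees of freedom the chi-square survival function has the closed form exp(−x/2)(1 + x/2).

```python
import math


def chi2_sf_df4(x: float) -> float:
    """Chi-square survival function for 4 degrees of freedom (closed form)."""
    if x <= 0:
        return 1.0
    return math.exp(-x / 2) * (1 + x / 2)


def passes_kaks(omega_m0, lnl_m0, lnl_m3, m3_classes, alpha=0.05):
    """Return True if a cluster passes the Ka/Ks filter.

    omega_m0:   overall Ka/Ks estimated under model M0
    lnl_m0/m3:  log-likelihoods of the two models
    m3_classes: list of (proportion_of_sites, Ka/Ks) for the M3 classes
    alpha:      significance level (an assumption; not given in the text)
    """
    if omega_m0 <= 0.5:
        return True                    # purifying selection overall
    lrt = 2 * (lnl_m3 - lnl_m0)        # likelihood ratio statistic
    if chi2_sf_df4(lrt) >= alpha:
        return False                   # M3 not significantly better than M0
    # Some class with Ka/Ks <= 0.5 covering at least 10% of positions.
    return any(p >= 0.10 and w <= 0.5 for p, w in m3_classes)
```

For instance, a cluster with an overall Ka/Ks of 0.9 still passes if M3 fits significantly better and one class with Ka/Ks of 0.1 covers 20% of the alignment.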

Statistics for the various stages of the clustering process. The number of sequences that remain after redundancy removal (at 98% similarity) for each dataset is given in Table 12. Recall that the size of a cluster is the number of nonredundant sequences in it.

Number of core sets of size two or more totals 1,586,454; number of nonredundant sequences in core sets of size two or more totals 8,337,256; and total number of sequences in core sets of size two or more is 12,797,641.

Total number of clusters after profile merging and (PSI-BLAST and BLAST) recruitment is 1,871,434; number of clusters of size two or more totals 1,388,287; number of nonredundant sequences in clusters of size two or more totals 11,494,078; total number of sequences in clusters of size two or more is 16,565,015.

The final clustering statistics (after shadow ORF detection and Ka/Ks tests) are as follows: number of clusters of size two or more totals 297,254; number of nonredundant sequences in clusters of size two or more totals 6,212,610; total number of sequences in clusters of size two or more is 9,978,637.

In the final BLAST recruitment step, a pattern was seen involving highly compositionally biased sequences that recruited unrelated sequences to clusters. This was reflected in the pre- and post-BLAST recruitment numbers, where the postrecruitment sizes were more than three to four times the size of the prerecruitment numbers. There were 75 such clusters, and these were removed.

Searching sequences using profile HMMs. The full set of 7,868 Pfam release 17 models was used, along with additional nonredundant profiles from TIGRFAM (1,720 of 2,443 profiles; version 4.1). HMM profiling was carried out using a TimeLogic DeCypher system (Active Motif, Inc., http://www.activemotif.com) and took 327 hours in total (on an eight-card machine). A sequence was considered as matching a Pfam (fragment model) if its sequence score was above the TC score for that Pfam and had an E-value ≤ 1 × 10⁻³. It was considered as matching a TIGRFAM if the match had an E-value ≤ 1 × 10⁻⁷.

Evaluation of protein prediction via clustering. Our evaluation of protein prediction via the clustering shows a very favorable comparison to currently used protein prediction methods for prokaryotic genomes. We used the PG dataset for this evaluation (Table 2). Of the 3,049,695 PG ORFs, 575,729 sequences (19%) were clustered (the clustered set). Of the 614,100 predictions made by the genome projects, 600,911 sequences could be mapped to the PG ORF set (the submitted set); 93% of the unmapped sequences were <60 aa

Table 12. The Number of Sequences in NCBI-nr, PG ORFs, TGI-EST ORFs, ENS, and GOS ORFs prior to and after the Redundancy Removal Step of Our Clustering

| Data | Number of Amino Acid Sequences: Original Set | Number of Amino Acid Sequences: Nonredundant Set |
|---|---|---|
| NCBI-nr | 2,317,995 | 1,017,058 |
| PG ORFs | 3,049,695 | 2,424,016 |
| TGI-EST ORFs | 5,458,820 | 5,085,945 |
| ENS | 361,668 | 137,057 |
| GOS ORFs | 17,422,766 | 14,134,842 |
| Total | 28,610,944 | 22,798,918 |

doi:10.1371/journal.pbio.0050016.t012


(recall that the ORF calling procedure only produced ORFs of length ≥60 aa). The clustered set and submitted set had 493,756 ORFs in common. Of the 107,155 sequences that were only in the submitted set, 24,217 sequences (23%) had HMM matches. As with other unclustered HMM matches, most were weak or partial. These sequences had an average of only 48% of their lengths covered by HMMs. Of the remaining 82,938 sequences that did not have an HMM match, 13,724 (17%) were removed by the filters used, and the rest fell into clusters with only one nonredundant sequence (and thus were not labeled as predicted proteins by the clustering analysis). Based on NCBI-nr sequences in them, these clusters were mostly labeled as “hypothetical,” “unnamed,” or “unknown.” Our clustering method identified 81,973 ORFs not predicted by the genome projects, of which 16,042 (20%) were validated by HMM matches (with average HMM coverage of 69% of sequence length) and an additional 27,120 (33%) had significant BLAST matches (E-value ≤ 1 × 10⁻¹⁰) to sequences in NCBI-nr. Thus, if the submitted set is considered as truth, then protein prediction via clustering produces 493,756 true positives (TP), 81,973 false positives (FP), and 107,155 false negatives (FN), thereby having a sensitivity (TP/[TP + FN]) of 82% and specificity (TP/[TP + FP]) of 86%. However, if truth is considered as those sequences that are common to both the clustered and submitted sets in addition to those sequences with HMM matches, then our protein prediction method via clustering has 95% sensitivity and 89% specificity, while protein prediction by the prokaryotic genome projects has 97% sensitivity and 86% specificity.
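The arithmetic behind these figures can be checked directly from the counts quoted above. The sketch below uses the paper's own definitions (sensitivity as TP/[TP + FN], specificity as TP/[TP + FP]); the variable names are introduced here, and the construction of the second truth set (common sequences plus HMM-supported sequences from either side) is an interpretation of the sentence describing it.

```python
def sens_spec(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Sensitivity TP/(TP+FN) and specificity TP/(TP+FP), as defined in the text."""
    return tp / (tp + fn), tp / (tp + fp)


common = 493_756          # ORFs in both the clustered and submitted sets
only_clustered = 81_973   # predicted by clustering only
only_submitted = 107_155  # predicted by the genome projects only
hmm_clustered = 16_042    # clustering-only ORFs with HMM matches
hmm_submitted = 24_217    # submitted-only ORFs with HMM matches

# Case 1: the submitted set is taken as truth.
submitted_truth = sens_spec(common, only_clustered, only_submitted)

# Case 2: truth = common ORFs plus all HMM-supported ORFs.
clustering = sens_spec(common + hmm_clustered,
                       only_clustered - hmm_clustered,
                       hmm_submitted)
genome_projects = sens_spec(common + hmm_submitted,
                            only_submitted - hmm_submitted,
                            hmm_clustered)

for name, (sens, spec) in [("clustering vs submitted", submitted_truth),
                           ("clustering vs HMM-augmented", clustering),
                           ("genome projects vs HMM-augmented", genome_projects)]:
    print(f"{name}: sensitivity {sens:.1%}, specificity {spec:.1%}")
```

Note that the quantity called specificity here, TP/(TP + FP), is what is often termed precision or positive predictive value elsewhere.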

Evaluation of protein clustering. We used Pfams to evaluate the clustering method in two ways. For both evaluations the clustering was restricted to only those sequences with Pfam matches. It should be kept in mind that there are redundancies among Pfams in that there can be more than one Pfam for a homologous domain family (for instance, the kinase domain Pfams: PF00069 protein kinase domain and PF07714 protein tyrosine kinase), and these redundancies can affect the evaluation statistics reported below.

For the first evaluation, each sequence was represented by the set of Pfams that match it. This is referred to as the domain architecture for a sequence. While Pfams provide a domain-centric view of proteins, the domain architecture attempts to approximate the full sequence-based approach used here, and thus could be used to shed light on the general performance of the clustering. We measured how often unrelated sequences were present in a given cluster. Two sequences were defined to be unrelated if their domain architectures each had at least one Pfam that was not present in the other's domain architecture. Note that this measure did not penalize the case when the domain architecture of one sequence was a proper subset of the domain architecture of the other sequence. This was done to allow fragmentary sequences in clusters to be included in the evaluation as well (and also because it is not always easy to determine whether an amino acid sequence is fragmentary or not). For each cluster, we computed the percentage of sequence pairs that are unrelated under this measure. A total of 92% of the clusters had at most 2% unrelated pairs. Then we carried out an assessment of how many instances of a given domain architecture appear in a single cluster. A total of 58% of the domain architectures were confined to single clusters (i.e., 100% of their occurrences are in one cluster), and 88% of the domain architectures were such that >50% of their occurrences are in one cluster.
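The "unrelated" test on domain architectures amounts to a pair of set differences. The sketch below assumes each sequence is represented as a frozenset of Pfam accessions; `PF99999` is a made-up placeholder accession, and the function names are hypothetical.

```python
from itertools import combinations


def unrelated(arch_a: frozenset, arch_b: frozenset) -> bool:
    """True only if each architecture has a Pfam the other lacks.

    A proper-subset relationship (e.g. a fragment) is therefore not penalized.
    """
    return bool(arch_a - arch_b) and bool(arch_b - arch_a)


def percent_unrelated_pairs(architectures: list) -> float:
    """Percentage of sequence pairs in one cluster that are unrelated."""
    pairs = list(combinations(architectures, 2))
    if not pairs:
        return 0.0
    n_unrelated = sum(unrelated(a, b) for a, b in pairs)
    return 100.0 * n_unrelated / len(pairs)


# Toy cluster: a fragment, a two-domain protein containing it, and an outlier.
toy_cluster = [
    frozenset({"PF00069"}),
    frozenset({"PF00069", "PF99999"}),  # PF99999 is a placeholder accession
    frozenset({"PF07714"}),
]
toy_percent = percent_unrelated_pairs(toy_cluster)
```

In the toy cluster the first two sequences count as related because one architecture is a subset of the other, while both are unrelated to the third.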

For the second evaluation, we selected all sequences with Pfam matches, and each sequence was assigned to the Pfam that matches it with the highest score. With this assignment, the Pfams induce a partition on the sequences. The distribution of the number of sequences in clusters induced by the Pfams was compared to those of clusters from the clustering method. Figure 12A shows the comparison as a log–log plot of the number of sequences versus the number of clusters with at least that many sequences for the two cases. The plot shows that cluster size distributions are quite similar, with both the methods having an inflection point around 2,500. The difference between the two curves is that there are more big clusters (and also fewer small clusters) induced by the Pfams as compared to the clustering method. This can be explained by noting that two sequences that are in the same Pfam cluster can nevertheless be put into different clusters by the clustering method if they differ in their remaining portions.
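The Pfam-induced partition and the cumulative size counts plotted in Figure 12A can be sketched as follows. The input format (tuples of sequence ID, Pfam ID, score), the toy scores, and the function names are assumptions made for illustration only.

```python
from collections import Counter


def pfam_partition(hits):
    """Assign each sequence to its highest-scoring Pfam.

    hits: iterable of (sequence_id, pfam_id, score) tuples.
    Returns a dict mapping sequence_id to the chosen pfam_id.
    """
    best = {}
    for seq_id, pfam_id, score in hits:
        if seq_id not in best or score > best[seq_id][1]:
            best[seq_id] = (pfam_id, score)
    return {seq_id: pfam_id for seq_id, (pfam_id, _) in best.items()}


def clusters_of_size_at_least(cluster_sizes, x):
    """Number of clusters with at least x members (the y-axis before taking logs)."""
    return sum(1 for size in cluster_sizes if size >= x)


toy_hits = [
    ("s1", "PF00069", 120.0),
    ("s1", "PF07714", 80.0),
    ("s2", "PF00069", 95.0),
    ("s3", "PF07714", 60.0),
]
assignment = pfam_partition(toy_hits)
sizes = list(Counter(assignment.values()).values())
```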

Our clustering also shows a good correspondence with HMM profiling on the phylogenetic markers that we looked at. The clustering identifies 7,423, 12,553, and 13,657 sequences, respectively, for RecA (cluster ID 1146), Hsp70 (cluster ID 197), and RpoB (cluster ID 1187). HMM profiling identifies 5,292, 12,298, and 12,165 sequences, respectively, for these families. For each of these families, at least 94% of sequences (relative to the smaller set) are in common between clustering and HMM profiling.

Difference in ratio of predicted proteins to total ORFs for the PG set and the GOS set. The ratio of clustered ORFs to total ORFs is significantly higher for the GOS ORFs (0.3471) compared to the PG ORFs (0.1888). This can be explained by the fragmentary nature of the GOS data. For the large majority of the GOS data, the average sequence length is 920 bp compared to full-length genomes for the PG data. For the PG data, clustered ORFs have a mean length of 325 aa and a median length of 280 aa. Unclustered ORFs have a mean length of 119 aa and a median length of 87 aa. Assuming that the genomic GOS data has a similar underlying ORF structure to PG data, the effect that GOS fragmentation had on ORF lengths is estimated. Each reading frame will have a mixture of clustered and unclustered ORFs, but on average there will be 2 ORFs per reading frame per 920-bp GOS fragment, and both ORFs will be truncated. Assuming the truncation point for the ORF is uniformly distributed across the ORF, the truncated ORF will drop below the 60-aa threshold to be considered as an ORF with a probability of 60/(length of the ORF). Using the median length, the percentage of clustered ORFs dropping below the threshold due to truncation is 21%; for unclustered ORFs, it is 69%. Accounting for this truncation, the expected ratio of clustered ORFs to total ORFs for the GOS ORFs based on the PG ORFs would be 0.3708, which is very close to the observed value.
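The truncation argument can be checked numerically. The sketch below is one reading of the calculation, assuming that every ORF is truncated and that the PG ratio of 0.1888 is simply reweighted by the two survival probabilities; the function names are illustrative.

```python
MIN_ORF_AA = 60  # ORF length threshold used by the ORF calling procedure


def p_drop(orf_len_aa: float) -> float:
    """Probability that a uniformly truncated ORF falls below the threshold."""
    return min(1.0, MIN_ORF_AA / orf_len_aa)


def expected_clustered_ratio(pg_ratio: float,
                             clustered_len: float,
                             unclustered_len: float) -> float:
    """Expected clustered/total ratio after truncation, starting from the PG ratio."""
    kept_clustered = pg_ratio * (1.0 - p_drop(clustered_len))
    kept_unclustered = (1.0 - pg_ratio) * (1.0 - p_drop(unclustered_len))
    return kept_clustered / (kept_clustered + kept_unclustered)


# Median PG lengths from the text: 280 aa (clustered), 87 aa (unclustered).
ratio = expected_clustered_ratio(0.1888, 280, 87)
print(f"drop(clustered) = {p_drop(280):.2f}, drop(unclustered) = {p_drop(87):.2f}")
print(f"expected GOS ratio = {ratio:.4f}")
```

Under these assumptions the computed ratio agrees with the 0.3708 quoted above to within rounding.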

Kingdom assignment strategy and its evaluation. We used several

Figure 12. Log–Log Plots of Cluster Size Distributions

The x-axis is the logarithm of the cluster size X and the y-axis is the logarithm of the number of clusters of size at least X; logarithms are base 10.

(A) Plot comparing the sizes of clusters produced by our clustering approach (red) to those of clusters produced by Pfams (green). The curves track each other quite well, with both of them having an inflection point around cluster size 2,500 (approximately 3.4 on the x-axis). Each sequence is assigned to the highest scoring Pfam that it matches. Two sequences that are assigned to the same Pfam can nevertheless be assigned to different clusters by the full-sequence–based clustering approach if they differ in the remaining portion. This is especially true for commonly occurring domains that are present in different multidomain proteins. Thus, there tends to be a larger number of big clusters in the Pfam approach as compared to the full-sequence–based approach. Hence, the green curve is above the red curve at the higher sizes.

(B) Plot of the cluster size distributions for core sets (green) and for final clusters (red). Both curves have an inflection point around cluster size 2,500 (approximately 3.4 on the x-axis). Note that these plots give the cumulative distribution function (cdf), while the power law exponents reported in the text are for the number of clusters of size X (i.e., the probability density function [pdf]). The relationship between these exponents is $b_{pdf} = 1 + b_{cdf}$.

doi:10.1371/journal.pbio.0050016.g012


approaches to assign kingdoms for GOS sequences. They are all fundamentally based upon a strategy that takes into account the top BLAST matches of a GOS sequence to sequences in NCBI-nr, and then votes on a majority.

We evaluated a simple strict-majority voting scheme (of the top four BLAST matches) using the NCBI-nr set. First, the redundancy in NCBI-nr was removed using a two-staged process. A nonredundant set of NCBI-nr sequences was computed involving matches with 98% similarity over 95% of the length of the shorter sequence (using the procedure discussed in Materials and Methods [Identification of nonredundant sequences]). This set was made further nonredundant by considering matches involving 90% similarity over 95% of the length of the shorter sequence. The nonredundant sequences that remained after this step constituted the evaluation dataset S. For each sequence in S, its top four BLAST matches to other sequences in S (ignoring self-matches) were used to assign a kingdom for it (based on a strict majority rule). This predicted kingdom assignment for the sequence was compared to its actual kingdom. A correct classification is obtained for 93% of the sequences. The correct classification rate per kingdom is given in Table 13.
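A strict-majority vote over the top four matches can be sketched as below. How ties and sequences with fewer than four matches were handled is not stated in the text; this sketch assumes a kingdom must hold more than half of the votes actually cast, and returns `None` otherwise. The function name is hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def strict_majority_kingdom(top_hits: Sequence[str], k: int = 4) -> Optional[str]:
    """Kingdom held by a strict majority of the first k BLAST matches, else None.

    top_hits: kingdoms of the matches, best match first.
    Assumption: the majority is taken over the votes available (at most k).
    """
    votes = Counter(top_hits[:k])
    if not votes:
        return None
    kingdom, count = votes.most_common(1)[0]
    return kingdom if count > sum(votes.values()) / 2 else None


print(strict_majority_kingdom(["Bacteria", "Bacteria", "Bacteria", "Archaea"]))
print(strict_majority_kingdom(["Bacteria", "Bacteria", "Archaea", "Archaea"]))
```

With four votes, three or more must agree; a two-to-two split yields no assignment.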

While this evaluation shows that the BLAST-based voting scheme provides a reasonable handle on the kingdom assignment problem, there are caveats associated with it. The kingdom assignment for a set of query sequences is greatly influenced by the taxonomic groups from each kingdom that are represented in the reference dataset against which these queries are being compared. If certain taxa are only sparsely represented in the reference set, then, depending on their position in the tree of life, queries from these taxa can be misclassified (using a nearest-neighbor type approach based on BLAST matches). This explains why the archaeal classification rate is quite low compared to the others. Thus, the true classification rate for the GOS dataset based on this approach will also depend on the differences in taxonomic biases in the GOS dataset (query) and the NCBI-nr set (reference).

The kingdom proportion for the GOS dataset reported in Figure 1 is based on a kingdom assignment of scaffolds. Those GOS ORFs with BLAST matches to NCBI-nr were considered, and the top-four majority rule was used to assign a kingdom to each of them. Using the ORF coordinates on the scaffold, the fraction (of bp) of a scaffold assigned to each kingdom was computed. The scaffold was labeled as belonging to a kingdom if the fraction of the scaffold assigned to that kingdom was >50%. All ORFs on this scaffold were then assigned to the same kingdom.
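The scaffold-level rule can be sketched as a sum of base pairs per kingdom. The sketch assumes 1-based inclusive ORF coordinates, non-overlapping ORFs, and that the fraction is taken relative to the full scaffold length; none of these details are stated in the text, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Optional


def scaffold_kingdom(scaffold_len: int, orfs) -> Optional[str]:
    """Label a scaffold with the kingdom covering more than 50% of its bases.

    orfs: iterable of (start, end, kingdom_or_None), 1-based inclusive.
    Assumption: ORFs do not overlap, so their lengths can simply be summed.
    """
    covered_bp = defaultdict(int)
    for start, end, kingdom in orfs:
        if kingdom is not None:
            covered_bp[kingdom] += end - start + 1
    for kingdom, bp in covered_bp.items():
        if bp / scaffold_len > 0.5:
            return kingdom
    return None


toy_orfs = [(1, 400, "Bacteria"), (450, 700, "Bacteria"), (750, 900, "Archaea")]
label = scaffold_kingdom(1000, toy_orfs)  # Bacteria covers 651 of 1,000 bp
```

Every ORF on a labeled scaffold, including those without BLAST matches, would then inherit `label`.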

Cluster size distribution, the power law, and the rate of protein family discovery. Earlier studies of protein family sizes in single organisms [137–139] have suggested that P(d), the frequency of protein families of size d, satisfies a power law: that is, $P(d) \approx d^{-b}$ with exponent b reported between 2.68 and 4.02. Power laws have been used to model various biological systems, including protein–protein interaction networks and gene regulatory networks [42,43]. Figure 12B illustrates the distribution of the cluster sizes from our data on a log–log scale, a scale for which a power law distribution gives a line. In contrast to family size distributions reported in single organisms, the cluster sizes from our data are not well described by a single power law. Rather, there appear to be different power laws: one governs the size distribution of very large clusters, and another describes the rest. This behavior is observed both in the distribution of the core set sizes and also in the distribution of the final cluster sizes. We identified an inflection point for both the core set distribution and the final clusters at around size 2,500, and estimated the power law exponent b via linear regression separately in each size regime. For the core set distribution, the exponent b = 1.99 (R² = 0.994) for clusters of size ≤ 2,500, and b = 3.34 (R² = 0.996) for clusters of size > 2,500. For the final cluster sizes, the exponent b = 1.72 (R² = 0.995) for clusters of size ≤ 2,500, and b = 2.72 (R² = 0.995) for clusters of size > 2,500. The estimates for b are different for the core clusters compared to the final clusters, reflecting a larger number of medium and large clusters in the final clustering as a result of the cluster-merging and additional recruitment steps. A similar dichotomy between the size distributions of large and small protein families was observed in a study [140] of protein families contained in the ProDom, Protomap, and COG databases, where the exponent b reported was in the range of 1.83 to 1.98 for the 50 smallest clusters and 2.54 to 3.27 for the 500 largest clusters in these databases.
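Estimating a power law exponent by linear regression on a log–log scale can be sketched with ordinary least squares. This is a generic illustration on synthetic data, not the authors' code; in the setting above it would be applied once to the points at or below the inflection point and once to the points above it.

```python
import math


def fit_power_law_exponent(points):
    """Least-squares fit of log10(count) against log10(size).

    points: iterable of (size, count) pairs with positive values.
    Returns (b, r_squared) for the model count ~ size ** (-b).
    """
    xs = [math.log10(size) for size, count in points]
    ys = [math.log10(count) for size, count in points]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)
    return -slope, r_squared


# Synthetic data that follow count = 1e6 * size ** -2 exactly.
synthetic = [(d, 1e6 * d ** -2.0) for d in (3, 10, 30, 100, 300, 1000)]
b_hat, r2 = fit_power_law_exponent(synthetic)
```

When the fit is made on cumulative counts, as plotted in Figure 12, the fitted slope corresponds to the cdf exponent rather than the pdf exponent.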

Our clustering method was run separately on the following seven datasets: set 1 consisted of only NCBI-nr sequences; set 2 consisted of all sequences in NCBI-nr, ENS, TGI-EST, and PG; sets 3 through 6 consisted of set 2 in combination with a random subset of 20%, 40%, 60%, and 80% of the GOS sequences, respectively; set 7 consisted of set 2 in combination with all the GOS sequences. On each of the seven datasets, the redundancy removal (using the 98% similarity filter) was run, followed by the core set detection steps. Figure 2 shows the number of core sets of varying sizes (≥3, ≥5, ≥10, and ≥20) as a function of the number of nonredundant sequences for each dataset.

The observed linear growth in number of families with increase in sample size n is related to the power law distribution in the following way. We model protein families as a graph where each vertex corresponds to a protein sequence and an edge between two vertices indicates sequence similarity between the corresponding proteins. Consider a clustering (partitioning) of the vertices of a graph with n vertices such that the cluster sizes obey a power law distribution. Let $C_d(n)$ [respectively, $C_{\ge d}(n)$] denote the number of clusters of size d (respectively, ≥ d). Since the distribution of cluster sizes follows a power law, there exist constants a, b such that for all x ≤ n, $C_x(n) = a x^{-b}$.

As every vertex of the graph is a member of exactly one cluster,

$$
n = \sum_{x=1}^{n} x\,C_x(n) = \sum_{x=1}^{n} a\,x^{1-b} \approx
\begin{cases}
a\,\dfrac{n^{2-b} - 1}{2 - b}, & b \neq 2 \\[6pt]
a \ln n, & b = 2
\end{cases}
\tag{1}
$$

The number of clusters of size at least d is

$$
C_{\ge d}(n) = \sum_{x=d}^{n} C_x(n) \approx
\begin{cases}
a\,\dfrac{n^{1-b} - d^{1-b}}{1 - b}, & b \neq 1 \\[6pt]
a \ln n, & b = 1
\end{cases}
\tag{2}
$$

Combining the two equations, we obtain values (up to a multiplicative constant) for $C_{\ge d}(n)$ as shown in Table 14. In all cases with b > 1, the number of clusters $C_{\ge d}(n)$ increases as n increases, and as d decreases. Specifically, for b > 2, the growth is linear in n for all d, with slope decreasing as d increases. For 1 < b < 2, the growth is sublinear in n for all d.
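As a worked illustration of the b > 2 case, the two approximations can be combined by dropping the terms that vanish for large n (and assuming d is much smaller than n). The intermediate constant shown here is not given in the original text and is included only to make the linear dependence on n explicit:

$$
n \approx a\,\frac{n^{2-b} - 1}{2 - b} \approx \frac{a}{b - 2}
\quad\Longrightarrow\quad a \approx (b - 2)\,n ,
$$

$$
C_{\ge d}(n) \approx a\,\frac{n^{1-b} - d^{1-b}}{1 - b} \approx \frac{a\,d^{1-b}}{b - 1}
\approx \frac{b - 2}{b - 1}\cdot\frac{n}{d^{\,b-1}} .
$$

The last expression is linear in n, with a slope proportional to $d^{1-b}$ that shrinks as d grows.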

Note that while the observed distribution of protein family sizes is fit by two different power laws, one for clusters of size less than 2,500 with b = 1.99 and another for clusters of size greater than 2,500 with b = 3.34 for the current number of (nonredundant) sequences, the contribution of large families to the rate of growth is negligible compared to the small families.

The above formulas for $C_{\ge d}(n)$ also suggest the dependence of the rate of growth of clusters on the cluster size d. For example, in the case when b is very close to 2,

$$
C_{\ge d}(n) = \frac{m\,n}{d^{\,b-1}}
\tag{3}
$$

Table 13. BLAST-Based Classification Rate per Kingdom

Kingdom      Total Number    Correct Classification    Percent Correct
Eukaryota    440,951         422,173                   95.7
Bacteria     465,692         430,014                   92.3
Archaea      36,894          25,527                    69.2
Viruses      36,346          32,381                    89.0

doi:10.1371/journal.pbio.0050016.t013

Table 14. The Values for C≥d(n), the Number of Clusters of Size ≥ d, as a Function of the Power Law Exponent b and Constant a

b            a            C≥d(n)
b < 1        n^(b−1)      1
b = 1        1            ln n
1 < b < 2    n^(b−1)      (n/d)^(b−1)
b = 2        n/ln n       n/(d ln n)
b > 2        n            n/d^(b−1)

doi:10.1371/journal.pbio.0050016.t014


for some constant m. Thus, the rate of growth of cluster sizes is linear, and the slope m(d) of rate of growth is given by $m(d) = m\,d^{1-b}$. Figure 13 shows how well the observed rates of growth match the values predicted by this equation. A fit to a sublinear function (not shown) also gives similar results as in Figure 13.

GOS versus known prokaryotic versus known nonprokaryotic. Examples of top five clusters in the various categories (except GOS-only) are given below. The cluster identifiers are in parentheses.

Known prokaryotic only: (Cluster ID 1319) outer surface protein in Anaplasma ovis, Wolbachia, Ehrlichia canis; (Cluster ID 10911) nitrite reductase in uncultured bacterium; (Cluster ID 1266) outer membrane lipoprotein in Borrelia; (Cluster ID 8595) methyl-coenzyme M reductase subunit A in uncultured archaeon; (Cluster ID 2959) outer membrane protein in Helicobacter.

Known nonprokaryotic only: (Cluster ID 2226) Pol polyprotein HIV sequences; (Cluster ID 4023) maturase K; (Cluster ID 6257) NADH dehydrogenase subunit 2; (Cluster ID 8644) HIV protease; (Cluster ID 12196) MHC class I and II antigens.

GOS and known prokaryotic only: (Cluster ID 3369) carbamoyl transferase; (Cluster ID 688) apolipoprotein N-acyltransferase; (Cluster ID 3726) potassium uptake proteins; (Cluster ID 300) primosomal protein N′; (Cluster ID 4605) DNA polymerase III delta subunit.

GOS and known nonprokaryotic only: (Cluster ID 186) seven transmembrane helix receptors; (Cluster ID 2069) zinc finger proteins; (Cluster ID 3092) MAP kinase; (Cluster ID 1413) potential mitochondrial carrier proteins; (Cluster ID 233) pentatricopeptide (PPR) repeat-containing protein.

Known prokaryotic and known nonprokaryotic only: (Cluster ID 3510) immunoglobulin (and immunoglobulin-binding) proteins; (Cluster ID 600) expansin; (Cluster ID 50) pectin methylesterase; (Cluster ID 6492) lectin; (Cluster ID 986) BURP domain-containing protein.

GOS and known prokaryotic and known nonprokaryotic: (Cluster ID 2568) ABC transporters; (Cluster ID 49) short-chain dehydrogenases; (Cluster ID 4294) epimerases; (Cluster ID 1239) AMP-binding enzyme; (Cluster ID 2630) envelope glycoprotein.

Neighbor functional linkage methods. For the sequences in each GOS-only cluster, we determined if neighboring ORFs occurring on the same strand had a similar biological process in the GO [49]. If this shared biological process of the neighbors occurred statistically more often than expected by chance, that inferred a potential operon linkage and a biological process term for the GOS-only cluster. This approach weighted ORFs by sequence similarity to reduce the skewing effect of sequences from highly related organisms.

For definition of linked ORFs, we collected pairs of same-strand ORF protein predictions with intergenic distances less than 500 bp. Negative distances were possible if the 5′ end of the downstream ORF in the pair occurred 5′ to the 3′ end of the upstream ORF. We used a probability function to estimate the probability that two putative genes belong to the same operon given their intergenic distance [47]. Because sequences come from a variety of unknown organisms, the probability distribution was created by averaging properties of 33 randomly chosen divergent genomes. The exact choice of genomes did not greatly affect the ability of the distribution to separate experimentally determined same-operon gene pairs from adjacent, same-strand gene pairs in different known operons annotated in a version of RegulonDB downloaded on March 29, 2005 [141].
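Collecting candidate linked ORF pairs can be sketched as follows. The record layout, the convention that start < end on the forward coordinate axis, and the choice to pair consecutive ORFs within each (scaffold, strand) group are assumptions for illustration; the operon probability function of [47] is not reproduced here.

```python
from dataclasses import dataclass

MAX_INTERGENIC_BP = 500  # distance cutoff quoted in the text


@dataclass(frozen=True)
class Orf:
    scaffold: str
    strand: str  # "+" or "-"
    start: int   # 1-based, inclusive, start < end on the forward axis
    end: int


def linked_pairs(orfs):
    """Return (left, right, distance) for consecutive same-strand ORFs.

    The distance is the number of bases between the two ORFs and is
    negative when they overlap.
    """
    groups = {}
    for orf in orfs:
        groups.setdefault((orf.scaffold, orf.strand), []).append(orf)
    pairs = []
    for group in groups.values():
        group.sort(key=lambda o: o.start)
        for left, right in zip(group, group[1:]):
            distance = right.start - left.end - 1
            if distance < MAX_INTERGENIC_BP:
                pairs.append((left, right, distance))
    return pairs


example_orfs = [
    Orf("scf1", "+", 1, 300),
    Orf("scf1", "+", 290, 900),    # overlaps the first ORF by 11 bp
    Orf("scf1", "+", 2000, 2600),  # too far from the second ORF
    Orf("scf1", "-", 950, 1500),   # opposite strand, never paired here
]
example_pairs = linked_pairs(example_orfs)
```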

We measured the functional linkage between two protein clusters by searching for all occurrences of nearby pairs of ORFs belonging to the two clusters of interest. Sufficiently close pairs were more likely to be encoded in the same operon. We devised a scoring mechanism to reward those pairs of clusters for which many divergent examples of likely operon pairs existed in the set of ORF pairs. For each pair of clusters, a weight was applied to the contribution of each pair of ORFs, and this was proportional to how similar the pair of ORFs was to other example pairs. Thus, many near-identical pairs of ORFs, likely from the same or similar species, are not overrepresented in the final cluster pair score, while conserved examples of neighboring position from more divergent sequences contribute an increased weight. The score for each cluster pair is calculated as:

$$
S(C_1 C_2) = 1 - \prod_{i=1}^{n} \left[ 1 - \Pr\!\left(O_{g_1^i g_2^i} \mid \mathrm{dist}\right) \cdot w_1^i \cdot w_2^i \right]
\tag{4}
$$

where $S(C_1 C_2)$ is the linkage score of clusters $C_1$ and $C_2$. The probability $\Pr(O_{g_1^i g_2^i} \mid \mathrm{dist})$ that any two genes $g_1^i$ from $C_1$ and $g_2^i$ from $C_2$ are in an operon is dependent on the distance between them as calculated by [47], and is weighted according to the sequence weights $w_1^i$ and $w_2^i$ described below, for all example pairs i.

We calculated sequence weights in a manner similar to that used in progressive multiple sequence alignment [142]. Briefly, neighbor-joining trees were built for all clusters using the QuickJoin [143] and QuickTree programs [144] based on a distance matrix constructed from all-against-all BLAST scores within a cluster, normalized to self-scores. For those few clusters with more than 30,000 members, trees were not built. Instead, equal sequence weights for all members were assigned because of computational limitations. The root of each tree was placed at the midpoint of the tree by using the retree package in PHYLIP [145]. The individual sequence weights were then computed by summing the distance from each leaf to the root after dividing each branch's weight by the number of nodes in the subtree below it. Weights were normalized so that the sum of weights in any given tree was equal to 1.0. This weighting scheme is superior to one in which weights are normalized to the largest weight in the tree, one that does not weight sequences according to divergence, and one that only considers the number of example pairs seen (Figure 14). To compare the different scoring methods, pairs of clusters annotated with GO terms that contained adjacent ORFs in the data were gathered. These pairs were divided into functionally related and unrelated clusters based on a measure of GO term similarity (p-value ≤ 0.01) [146]. We evaluated scoring methods for the ability to recover

Figure 13. Log–Log Plot of Slopes m(d) of Linear Regression Fit to the Rate of Growth in Figure 2 for Different Values of Cluster Size d

According to the equation derived in the text, $m(d) = m\,d^{1-b}$ for some constant m. The best linear fit to log [m(d)] gives a line with slope −0.91 (R² = 0.98) that is close to the predicted value 1 − b = −0.99.

doi:10.1371/journal.pbio.0050016.g013

Figure 14. Receiver Operating Characteristic Curve Used to Evaluate Various Methods of Scoring Pairs of Clusters for Functional Similarity

Pairs of clusters with ≥1 example of neighboring ORFs and assigned GO terms were divided into a set of functionally related (true positive) and functionally unrelated (true negative) cluster pairs based on the similarity of their GO terms. The scoring methods evaluated are described in the text.

doi:10.1371/journal.pbio.0050016.g014


functionally similar pairs. In all analyses, linkages between clusters were ignored if there were fewer than five examples of cluster member ORFs adjacent to each other on a scaffold.
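The cluster-pair linkage score of Equation 4, together with the five-example minimum just described, can be sketched as below. The operon probabilities and sequence weights are taken as given inputs (their computation follows [47] and the tree-based weighting and is not reproduced); the function names and the toy numbers are illustrative only.

```python
MIN_EXAMPLES = 5  # linkages with fewer adjacent ORF pairs are ignored


def linkage_score(examples) -> float:
    """Equation 4: S = 1 - prod_i [1 - Pr(operon | dist)_i * w1_i * w2_i].

    examples: iterable of (p_operon, w1, w2), one tuple per neighboring ORF pair.
    """
    product = 1.0
    for p_operon, w1, w2 in examples:
        product *= 1.0 - p_operon * w1 * w2
    return 1.0 - product


def filtered_linkage_score(examples):
    """Score a cluster pair, or return None when it has too few examples."""
    examples = list(examples)
    if len(examples) < MIN_EXAMPLES:
        return None
    return linkage_score(examples)


toy_examples = [(0.9, 0.5, 0.5), (0.8, 0.5, 0.5)]
toy_score = linkage_score(toy_examples)  # 1 - (0.775 * 0.8)
```

Because each example contributes a factor below 1, additional well-supported and well-weighted examples can only raise the score.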

Function for novel families was inferred as follows.

(1) Assignment of GO terms to clusters. We downloaded the GO [49] database on September 21, 2005, from http://www.geneontology.org, along with the files gene_association.goa_uniprot and pfam2go.txt dated July 12, 2005. Only the biological process component of the ontology was considered. If a cluster had at least 10% of its redundant sequences annotated by the most abundant Pfam domain for that cluster, and that Pfam domain had a GO biological process term provided by the pfam2go mapping, then we assigned the cluster the GO term of its most abundant Pfam annotation. In addition, if at least 20% of a cluster's Uniprot GO annotations were the same, the cluster was assigned that GO term. For each cluster, redundant GO terms found on the same path to the root were removed.

(2) Identification of neighbors to GOS-only clusters. Neighbors of GOS-only clusters were defined as those clusters that had a cluster linkage score above a predetermined threshold (1 × 10⁻⁶) and had at least five examples of cluster members adjacent to each other in the data. These neighbors were then screened for those that had been annotated with a GO term by the process described above.

(3) Overrepresentation of neighbor GO terms. We attempted to define GO terms for a set of GOS-only neighbors that were statistically overrepresented. Because of the highly dependent nature of the terms in the GO, a simulation-based approach was chosen to determine which terms might be overrepresented. Annotated neighbors to a cluster of unknown function were identified as described above. For each annotated neighbor, counts for the associated GO term and all terms on the path to the root of the ontology were incremented. A total of 100,000 simulated neighbor lists of the same size as the true neighbor list were computed by selecting without replacement from those clusters with annotated GO terms, and an identical counting scheme was performed for each simulation. Overrepresentation of neighbor terms was calculated for each term on the ontology by asking how many times out of the 100,000 simulations the count for each GO term in the ontology met or exceeded the observed count for the actual neighbors. This fraction of simulations was interpreted as a p-value. If a term is unusually prevalent in the true observed neighbors, it should be relatively infrequent in the simulated data. For the purpose of the metric used here, “is-a” and “part-of” relationships were treated equally. In cases where a cluster had more than one GO term assigned to it, any redundant terms occurring on each other's path to the root were first removed. For any remaining clusters with nonredundant, multiple GO annotations, all possible lists of functions for each list of neighbor clusters were enumerated, and one function from each cluster was chosen. Each node in the ontology was assigned the maximum count observed from the enumerated function lists. We consistently applied this rule for the observed and simulated data.
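The simulation in step (3) can be sketched on a toy ontology. This is a simplified illustration, assuming one GO term per cluster and a single parent per term (the real GO is a directed acyclic graph); the term names, cluster names, and the reduced number of simulations are all hypothetical.

```python
import random
from collections import Counter


def term_counts(clusters, cluster_term, parents):
    """Count each cluster's term and every ancestor on its path to the root."""
    counts = Counter()
    for cluster in clusters:
        term = cluster_term[cluster]
        while term is not None:
            counts[term] += 1
            term = parents.get(term)
    return counts


def empirical_p_values(neighbors, cluster_term, parents, n_sim=1000, seed=0):
    """Fraction of simulations whose count meets or exceeds the observed count."""
    rng = random.Random(seed)
    observed = term_counts(neighbors, cluster_term, parents)
    annotated = sorted(cluster_term)
    met_or_exceeded = Counter()
    for _ in range(n_sim):
        # Draw a simulated neighbor list of the same size, without replacement.
        simulated = rng.sample(annotated, len(neighbors))
        sim_counts = term_counts(simulated, cluster_term, parents)
        for term, obs in observed.items():
            if sim_counts[term] >= obs:
                met_or_exceeded[term] += 1
    return {term: met_or_exceeded[term] / n_sim for term in observed}


# Toy ontology: terms A and B are both children of the root.
toy_parents = {"root": None, "A": "root", "B": "root"}
toy_cluster_term = {f"c{i}": ("A" if i <= 2 else "B") for i in range(1, 11)}
p_values = empirical_p_values(["c1", "c2"], toy_cluster_term, toy_parents)
```

In the toy data the root term is never informative, because every simulated list reaches it as often as the observed list does, whereas term A is rare among the annotated clusters and therefore receives a small p-value.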

The following descriptive measures of the novel GOS-only cluster set were obtained. Transmembrane helix prediction was carried out with the programs TMHMM [147] and SPLIT4 [148]. GC content was calculated as (G + C)/(G + C + A + T) bases for each ORF in a cluster, and averaged for each cluster within a set. The GC content, reported as the mean and standard deviation of the cluster averages, is as follows for each cluster set: Group I, 36.7% ± 8.0%; Group II, 35.9% ± 7.9%; Group I size-matched sample, 48.8% ± 11.1%; Group II size-matched sample, 49.5% ± 11.2%; Group I viral fraction, 37.8% ± 5.1%; Group II viral fraction, 37.3% ± 4.6%. To address the interconnectivity of the novel clusters within the context of all operon linkages, we constructed a graph with clusters as nodes and inferred operon linkages (with score ≥ 1 × 10⁻⁶) as edges. We then asked for every node in the set of novel clusters what was the cumulative fraction of novel nodes that could be reached within a varying edge distance from the starting node. The expectation of this fraction was calculated at each distance, and the procedure was repeated for the set of size-matched clusters (Figure 15).
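The interconnectivity measure can be sketched as a breadth-first search bounded by edge distance. The adjacency-list representation, the exclusion of the starting node from the fraction, and the use of a plain mean as the "expectation" are assumptions for illustration; the node names are made up.

```python
from collections import deque


def reachable_fraction(graph, start, targets, max_dist):
    """Fraction of target nodes (other than start) within max_dist edges of start."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_dist:
            continue  # do not expand beyond the distance limit
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    others = targets - {start}
    return len(seen & others) / len(others) if others else 0.0


def mean_reachable_fraction(graph, targets, max_dist):
    """Average of reachable_fraction over every starting node in targets."""
    return sum(reachable_fraction(graph, s, targets, max_dist) for s in targets) / len(targets)


# Toy graph: novel clusters n1 and n2 are linked through a known cluster k1.
toy_graph = {"n1": ["k1"], "k1": ["n1", "n2"], "n2": ["k1"], "n3": []}
novel = {"n1", "n2", "n3"}
within_one = mean_reachable_fraction(toy_graph, novel, 1)
within_two = mean_reachable_fraction(toy_graph, novel, 2)
```

Repeating the same calculation for a size-matched set of clusters gives the comparison curve.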

We tried three different BLAST-based approaches for kingdom assignment of ORFs. The first method, used in the analysis, required a majority of the four top BLAST matches to vote for the same kingdom (archaea, bacteria, eukaryota, or viruses; see Materials and Methods [Kingdom assignment strategy and its evaluation]). The second method required all eight top BLAST matches to vote for the same kingdom. The last method we used was the scaffold-based kingdom assignment described in Materials and Methods (Kingdom assignment strategy and its evaluation). Figure 16 shows the results of using these assignments to infer the kingdom of GOS-only clusters (Figure 16D–16F) and their neighboring ORFs (Figure 16A–16C). GOS-only clusters were assigned a kingdom only if >50% of their neighboring ORFs were assigned the same kingdom. The general trends observed are the same for each method, though the coverage decreases slightly for the more stringent methods.

Characteristics and kingdom distribution of known protein domains. For these analyses we used the predicted proteins from the public (NCBI-nr, PG, TGI-EST, and ENS) and GOS datasets. The public dataset contains multiple identical copies of some sequences due to overlaps between the source datasets. For example, many sequences in PG are also found in NCBI-nr. We filtered the public set at 100% identity to avoid overcounting these sequences. Because this filtering was necessary for the public dataset, the GOS dataset was also filtered at 100% identity. If two or more sequences were 100% identical at the residue level, but were of different lengths, only the longest sequence was kept. The resulting datasets of nonredundant proteins are referred to as public-100 and GOS-100.

We assigned each protein in public-100 to a kingdom based on the species annotations provided in the source datasets (NCBI-nr, Ensembl, TIGR, and PG). The NCBI taxonomy tree was used to determine the kingdom of each species. Of 3,167,979 protein sequences in public-100, 3,158,907 can be annotated by kingdom. The remaining 9,072 sequences are largely synthetic.

Determining the kingdom of origin of an environmental sequence can be difficult; while an unambiguous assignment can be made for some sequences, others can be assigned only tentatively or not at all. Therefore, we took a probabilistic approach (kingdom-weighting method), calculating “weights” or probabilities that each protein sequence originated from a given kingdom.

The top four BLAST matches (E-value < 1 × 10⁻¹⁰) of GOS ORFs to NCBI-nr were considered. The kingdom of origin for each match was determined. We pooled these “kingdom votes” for each scaffold, since (presuming accurate assembly) each scaffold must come from a single species and hence from a single kingdom. Each ORF on a scaffold contributed up to four votes. If an ORF had fewer than four BLAST matches with an E-value < 1 × 10⁻¹⁰, then it contributed fewer votes. ORFs with no BLAST matches contributed no votes.

In many cases, the votes were not unanimous, indicating that some uncertainty must be associated with any kingdom assignment. An additional source of uncertainty is the finite number of votes. We accounted for these statistical issues by applying the following procedure to each scaffold. First, two pseudocounts were added to the votes for the “unknown” kingdom to represent the uncertainty that remains even when votes are unanimous (especially when there are few votes). The frequency of votes for each kingdom was calculated. The vote frequency for a kingdom provides the maximum likelihood estimate of the kingdom probability (i.e., the vote frequency that would have been observed on a scaffold of similar composition but with infinitely many voting ORFs). However, that estimate may not be accurate or precise. Therefore, the multinomial standard deviation was calculated for each vote frequency p as SQRT[p × (1 − p)/(n − 1)], where n is the number of votes. A distance of two standard deviations from the mean corresponds roughly to a 95% confidence interval. Thus, two standard deviations were subtracted

Figure 15. Novel GOS-Only Clusters Are More Interconnected Than a Size-Matched Sample of Clusters

Red line, novel clusters; green line, size-matched sample; blue line (right axis), log₂ ratio of fraction novel clusters recovered divided by fraction sample clusters recovered.

doi:10.1371/journal.pbio.0050016.g015


from each vote frequency, and the result (or zero, if the result was negative) was called the “kingdom weight.” This “kingdom weight” is a conservative estimate. There is a 95% chance that the actual kingdom probability is greater.

The kingdom weights do not sum to one because of the standard deviation penalty. The difference between the sum of the kingdom weights and unity is a measure of the total uncertainty about the kingdom assignment. This is called the “unknown weight.”

Finally, we assigned each ORF the kingdom weights calculated for the scaffold as a whole. This procedure assigned kingdom weights to many ORFs with no BLAST matches. Overall, 4,745,649 (84%) of the 5,654,638 proteins in GOS-100 receive nonzero kingdom weights.

The kingdom weights calculated in this way provide a basis for estimating the proportion of sequences originating from each kingdom, $p_{GOS}(K)$. The weights over all sequences in GOS-100 were summed for each of the known kingdoms, and divided by the sum of the weights for all kingdoms (excluding the unknown weight). This procedure suggested that 96% of the sequences are bacterial, a somewhat higher proportion than is estimated by the method described in Materials and Methods (Kingdom assignment strategy and its evaluation). Similarly, kingdom proportions, $p_{GOS\text{–}Pfam}(K)$, were calculated for the subset of GOS-100 sequences that have a significant Pfam hit, and 97% are found to be bacterial.
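Turning per-sequence kingdom weights into overall kingdom proportions is a normalized sum. The sketch below assumes each sequence carries a dict of weights in which the key "unknown" holds the unknown weight; the key name, the toy weights, and the function name are illustrative.

```python
def kingdom_proportions(per_sequence_weights):
    """Sum weights per known kingdom and normalize, excluding the unknown weight."""
    totals = {}
    for weights in per_sequence_weights:
        for kingdom, weight in weights.items():
            if kingdom != "unknown":
                totals[kingdom] = totals.get(kingdom, 0.0) + weight
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {kingdom: total / grand_total for kingdom, total in totals.items()}


toy_weights = [
    {"Bacteria": 0.6, "unknown": 0.4},
    {"Bacteria": 0.3, "Eukaryota": 0.1, "unknown": 0.6},
]
proportions = kingdom_proportions(toy_weights)
```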

We used the kingdom weights directly in the analyses where possible (e.g., to calculate the expected kingdom distribution of a given set of proteins by summing the weights). However, it was necessary in some cases to use discrete assignments of a single kingdom to each ORF. A tentative assignment can be made for a given scaffold by choosing the kingdom with the highest weight. The possibility remains, in this case, that a fraction of the “unknown” weight should rightfully belong to a different kingdom. However, if a kingdom weight is greater than 0.5, then this danger is averted, and a “confident” assignment of the scaffold and its constituent ORFs to that kingdom can be made.

Given the uncertainty penalty above, achieving a kingdom weight greater than 0.5 generally requires overwhelming support for one kingdom over the others. In particular, on a given scaffold, at least eight unanimous votes for a kingdom are needed (i.e., two ORFs contributing four votes each) to make a confident assignment to that kingdom. Any disagreement between the votes increases the required number rapidly: for instance, 15 votes for a single kingdom are required to override four votes for other kingdoms.
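The two thresholds quoted above can be checked numerically. The sketch below searches for the smallest number of votes whose penalized weight exceeds 0.5. It assumes, as an interpretation of the text, that n includes the two pseudocounts; the helper names are hypothetical.

```python
import math


def top_weight(votes_for, votes_against=0, pseudocounts=2):
    """Kingdom weight of the leading kingdom (pseudocounts included in n)."""
    n = votes_for + votes_against + pseudocounts
    p = votes_for / n
    return max(0.0, p - 2 * math.sqrt(p * (1 - p) / (n - 1)))


def min_votes_for_confidence(votes_against=0, threshold=0.5):
    """Smallest vote count for one kingdom whose weight exceeds the threshold."""
    votes_for = 1
    while top_weight(votes_for, votes_against) <= threshold:
        votes_for += 1
    return votes_for


unanimous = min_votes_for_confidence(0)  # no dissenting votes
contested = min_votes_for_confidence(4)  # four votes for other kingdoms
```

With this reading of n, the search returns 8 and 15, in agreement with the figures given in the text.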

“Confident” kingdom assignments were made for 2,626,178 (46%) of the 5,654,638 proteins in GOS-100.

In the analysis that identified new multi-kingdom Pfams, we used the subset of confidently kingdom-annotated proteins. Here, a Pfam model was designated as “kingdom-specific” in public-100 if there were only matches to proteins in one particular kingdom, and no “unknown” matches. A Pfam model that was kingdom specific in public-100 was further designated as newly “multi-kingdom” if it had matches to one or more GOS-100 proteins that were confidently labeled as belonging to a kingdom different from that found in the public-100 matches. Also, we filtered Pfam matches with an E-value cutoff of $1 \times 10^{-10}$. In every case, the bit score is at least five bits greater than the trusted cutoff for the model. In addition to passing the “confident” criteria, the kingdom assignments were all confirmed by visual inspection of the BLAST kingdom vote distributions for the respective scaffolds. Because the criteria for a “confident” kingdom assignment were conservative, there were only one or a few confident assignments for each domain to a “new” kingdom. The “confident” criteria are especially difficult to meet in the case of kingdom-crossing due to the votes contributed by the crossing protein. For instance, because the IDO domain itself always contributes four votes for “Eukaryota,” at least 15 votes for “Bacteria” were required to call a scaffold “bacterial.” Thus, many scaffolds have no confident kingdom assignment.
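The designation rules above reduce to simple set operations. The following Python sketch shows the logic on toy data; the function name, the input dictionaries, and the model identifiers are hypothetical, and the E-value and bit-score filters are assumed to have been applied beforehand.

```python
def newly_multi_kingdom(public_hits, gos_confident_hits):
    """Return {pfam_id: set of new kingdoms} for kingdom-specific models.

    public_hits:        {pfam_id: set of kingdoms matched in public-100,
                         possibly including "unknown"}
    gos_confident_hits: {pfam_id: set of kingdoms of confidently
                         assigned GOS-100 proteins matching the model}
    """
    result = {}
    for pfam_id, kingdoms in public_hits.items():
        # Kingdom-specific: exactly one kingdom and no "unknown" matches.
        if len(kingdoms) != 1 or "unknown" in kingdoms:
            continue
        new_kingdoms = gos_confident_hits.get(pfam_id, set()) - kingdoms
        if new_kingdoms:
            result[pfam_id] = new_kingdoms
    return result


# Toy input; the kingdom sets are illustrative, not results from the paper.
public = {"PF_A": {"Eukaryota"}, "PF_B": {"Bacteria", "unknown"}}
gos = {"PF_A": {"Bacteria"}, "PF_B": {"Viruses"}}
found = newly_multi_kingdom(public, gos)
```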

We compared the relative diversities of protein families between GOS-100 and public-100 as represented by Pfam sequence models. In order to do this, the number of matches expected to be found for each Pfam model in the GOS-100 data was computed, assuming that the matches were distributed among the models in the same proportions that they were in the public-100 data. These “expected” match counts were compared with the observed counts to identify domains that are more diverse in GOS-100 than in public-100 and vice versa.

Because kingdoms differ in their protein usage, Pfam models match sequences from different kingdoms with different frequencies, and some models match sequences exclusively from one kingdom. Thus, to calculate the expected number of matches to a given Pfam in GOS-100 based on the number of matches observed in public-100, we corrected for the radically different kingdom composition of the two datasets.

The expected proportion of all Pfam matches in GOS-100 that are to a given model M was calculated as follows. First, we made a simplifying assumption that sequences from different kingdoms were equally likely to have a Pfam hit, and thus that the Pfam matches in GOS-100 would be distributed among the kingdoms according to the kingdom proportions calculated using the weighted method above (for instance, it is assumed that 97% of the matches would be to bacterial sequences). That is, the probability that a Pfam hit in GOS-100 is from kingdom K is approximately $p_{\mathrm{GOS\text{-}Pfam}}(K)$ (calculated for sequences in GOS-100 with at least one Pfam hit), for kingdoms K in {Archaea, Bacteria, Eukaryotes, Viruses}.

Figure 16. GOS-Only Clusters Are Enriched for Sequences of Viral Origin Independently of the Kingdom Assignment Method Employed

For each panel, clusters are as in Figure 4. For (A–C), a kingdom is assigned to each neighboring ORF within each cluster set; the percentage of all neighboring ORFs with a given kingdom assignment is plotted. For (D–F), a kingdom is assigned to each cluster if more than 50% of all that cluster’s neighbors with a kingdom assignment share the same assignment; the percentage of clusters in each set with a given assignment is plotted. In (A) and (D), a kingdom is assigned to a neighboring ORF by a majority vote of the top four BLAST matches to a protein in NCBI-nr (Materials and Methods). In (B) and (E), a kingdom is assigned if all eight highest-scoring BLAST matches agree in kingdom. In (C) and (F), all ORFs on a scaffold are assigned the same kingdom by voting among all ORFs with BLAST matches to NCBI-nr on that scaffold (Materials and Methods). In all graphs, only clusters with at least one assignable neighbor are considered. When compared to the size-matched controls, in all cases the GOS-only clusters show enrichment for viral sequences.

doi:10.1371/journal.pbio.0050016.g016

Second, we assumed that Pfam models match with the same relative rates within each kingdom in GOS-100 as they do in public-100. For instance, since twice as many SH3 domains as SH2 domains are found in public-100 eukaryotic sequences, the same ratio is expected to be found in GOS-100 eukaryotic sequences. Using the public-100 data, we calculated the frequency of matches for each Pfam model M within each kingdom, relative to the total number of Pfam matches to that kingdom. Pseudocounts of one were added to both the “match” and “no match” counts (i.e., using a uniform Dirichlet prior), to allow proper statistical treatment of families with few or no matches in the public databases for some kingdom. In Equation 5 below, $\mathrm{Obs}_{\mathrm{public}}(M,K)$ is the observed number of public-100 hits to M in K, and $\mathrm{Obs}_{\mathrm{public}}(K)$ is the observed number of public-100 hits to all models in K.

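$$p_{\mathrm{GOS\text{-}Pfam}}(M \mid K) \approx p_{\mathrm{pub\text{-}Pfam}}(M \mid K) = \frac{\mathrm{Obs}_{\mathrm{public}}(M,K) + 1}{\mathrm{Obs}_{\mathrm{public}}(K) + 2} \tag{5}$$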

By multiplying the conditional probability of each model given a kingdom by the respective kingdom probability ($p_{\mathrm{GOS\text{-}Pfam}}(K)$, calculated as described above in “Kingdom annotation of GOS-100 proteins: kingdom weighting method”), the proportions of Pfam matches in GOS-100 due to each combination of kingdom and Pfam model were then predicted. Finally, these predictions were summed across kingdoms to obtain the expected proportion of matches to each model.

$$p_{\mathrm{GOS\text{-}Pfam}}(M) = \sum_{K \in \{A,B,E,V\}} \left[ p_{\mathrm{GOS\text{-}Pfam}}(M \mid K)\, p_{\mathrm{GOS\text{-}Pfam}}(K) \right] \tag{6}$$

Relatively fewer GOS-100 sequences than public-100 sequences have a Pfam hit (likely because Pfam is based on sequences in the public databases). To avoid systematically overestimating the number of GOS-100 hits for each Pfam model due to this global effect, the predicted counts were based on the observed total number of Pfam matches to all models in GOS-100, and an attempt was made to predict only how these matches are distributed among models. Thus, the expected number of Pfam hits to a given model in GOS-100 is equal to the expected proportion of hits to that model, as calculated above, multiplied by the total number of Pfam hits. In the equation below, $\mathrm{Obs}_{\mathrm{GOS}}$ is the total number of Pfam hits to all models in GOS-100.

$$\text{Expected count of hits to } M \text{ in GOS-100} = p_{\mathrm{GOS\text{-}Pfam}}(M) \times \mathrm{Obs}_{\mathrm{GOS}} \tag{7}$$

In summary, calculation of the expected number of Pfam hits to a model M in GOS-100 for all kingdoms can be expressed in one equation as follows:

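$$\left( \sum_{K \in \{A,B,E,V\}} \left[ \frac{\mathrm{Obs}_{\mathrm{public}}(M,K) + 1}{\mathrm{Obs}_{\mathrm{public}}(K) + 2} \times p_{\mathrm{GOS\text{-}Pfam}}(K) \right] \right) \times \mathrm{Obs}_{\mathrm{GOS}} \tag{8}$$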

where $\mathrm{Obs}_{\mathrm{public}}(M,K)$ is the observed number of public-100 hits to model M in K, $\mathrm{Obs}_{\mathrm{public}}(K)$ is the observed number of public-100 hits to all models in K, $p_{\mathrm{GOS\text{-}Pfam}}(K)$ is the proportion of GOS-100 sequences that have at least one Pfam hit in K, and $\mathrm{Obs}_{\mathrm{GOS}}$ is the total number of Pfam hits to all models in GOS-100.
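The expected-count calculation of Equations 5 through 8 can be written as a short Python function. This is an illustrative sketch: the function name, the dictionary-based inputs, and the numbers in the example are assumptions chosen for clarity and are not taken from the study.

```python
KINGDOMS = ("Archaea", "Bacteria", "Eukaryota", "Viruses")


def expected_hits(model, obs_public_mk, obs_public_k, p_gos_k, obs_gos):
    """Expected number of GOS-100 hits to `model` (Equation 8).

    obs_public_mk: {(model, kingdom): public-100 hits to model in kingdom}
    obs_public_k:  {kingdom: public-100 hits to all models in kingdom}
    p_gos_k:       {kingdom: kingdom proportion among GOS-100 Pfam hits}
    obs_gos:       total Pfam hits to all models in GOS-100
    """
    proportion = 0.0
    for kingdom in KINGDOMS:
        # Equation 5: pseudocounts of one for "match" and "no match".
        p_m_given_k = (obs_public_mk.get((model, kingdom), 0) + 1) / (
            obs_public_k.get(kingdom, 0) + 2
        )
        # Equation 6: weight by the kingdom probability and sum.
        proportion += p_m_given_k * p_gos_k.get(kingdom, 0.0)
    # Equation 7: scale by the observed total number of GOS-100 hits.
    return proportion * obs_gos


# Toy counts (made up for illustration).
obs_mk = {("M", "Bacteria"): 9, ("M", "Eukaryota"): 99}
obs_k = {"Archaea": 98, "Bacteria": 998, "Eukaryota": 998, "Viruses": 98}
p_k = {"Archaea": 0.01, "Bacteria": 0.97, "Eukaryota": 0.01, "Viruses": 0.01}
expected = expected_hits("M", obs_mk, obs_k, p_k, obs_gos=1_000_000)
```

In this toy case the model is ten times more frequent among eukaryotic than bacterial public hits, yet the expected GOS-100 count is dominated by the bacterial term because of the kingdom correction.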

The ratio of the observed to the predicted number of hits for each Pfam model is a measure of the relative diversity of that Pfam family in GOS-100 compared to public-100, corrected for the differing kingdom proportions in the two datasets. We computed the significance of this ratio using the CHITEST function in Excel, which implements the standard Pearson’s Chi-square test with one degree of freedom and expresses the result as a probability. For many protein families, the difference in diversity between the two datasets was so pronounced that Excel reports a probability of zero due to numerical underflow, indicating a p-value less than $1 \times 10^{-303}$.
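For readers without Excel, a comparable test can be sketched with the Python standard library. The two-category layout used here (hits to the model versus hits to all other models) is an assumption about how one degree of freedom arises; the text does not describe the exact CHITEST input ranges. The function name and the counts are illustrative.

```python
import math


def chi_square_p_value(observed_hits, expected_hits, total_hits):
    """Pearson chi-square statistic and p-value with one degree of freedom.

    Assumption: two categories are compared, hits to the model and
    hits to all other models.
    """
    observed = (observed_hits, total_hits - observed_hits)
    expected = (expected_hits, total_hits - expected_hits)
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For one degree of freedom the upper tail equals erfc(sqrt(x / 2)).
    return statistic, math.erfc(math.sqrt(statistic / 2))


stat_small, p_small = chi_square_p_value(120, 100, 10_000)
stat_large, p_large = chi_square_p_value(5_000, 100, 10_000)
```

The second call illustrates the underflow mentioned above: for a sufficiently extreme statistic, double-precision arithmetic returns a probability of exactly zero.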

IDO analysis. The GOS-100 and public-100 sequences selected for the IDO family alignment matched the PF01231 Pfam fs model with a score above the trusted bit-score cutoff at the sequence level. In addition, the sequences were required to have a matching region spanning more than 50% of the Pfam IDO HMM model length. Next, all sequence matches to the Pfam IDO model from the NCBI-nr database downloaded on March 6, 2006, were added (these also satisfied the trusted score cutoff and model alignment span criteria). An additional 26 IDO sequences were found in the new sequence database relative to the GOS public sequence data freeze after filtering for identical and 1 aa different sequences and presence of first and last residues in the final trimmed alignment. Jevtrace (version 3.14) [149] was used to assess alignment quality, to remove sequences problematic for alignment, to remove sequence redundancy (at the 0-aa and 1-aa difference levels) while allowing for redundant nonoverlapping sequences, to trim the alignment to a block of aligned columns, to delete columns with more than 50% gaps, and to remove sequences with missing first or last residues. One sequence (GenBank ID 72038700) was likely a multidomain protein problematic for alignment and was removed manually. This set of procedures produced a block sequence alignment of 144 sequences and 231 characters. We aligned sequences with MUSCLE (version 3.52) [134] using default parameters. The final alignment was used to reconstruct phylogenies with a series of phylogeny reconstruction methods: PHYML [150], Tree-Puzzle [151], Weighbor [152], and the protpars program from the PHYLIP package (version 3.6a3) [145]. Bootstrapping was performed with the protpars program using 1,000 bootstrap replicates, each with 100 jumbles; the majority consensus tree was produced by the consense program in the PHYLIP package.
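Two of the trimming steps named above, deleting columns with more than 50% gaps and dropping sequences that lack a first or last residue, are easy to illustrate. The Python sketch below is not Jevtrace's implementation; the function names and the toy alignment are hypothetical.

```python
def drop_gappy_columns(alignment, max_gap_fraction=0.5, gap="-"):
    """Delete alignment columns whose gap fraction exceeds the limit.

    `alignment` maps sequence IDs to equal-length aligned strings.
    """
    if not alignment:
        return {}
    sequences = list(alignment.values())
    n_rows = len(sequences)
    keep = [
        col
        for col in range(len(sequences[0]))
        if sum(seq[col] == gap for seq in sequences) / n_rows <= max_gap_fraction
    ]
    return {
        seq_id: "".join(seq[col] for col in keep)
        for seq_id, seq in alignment.items()
    }


def has_terminal_residues(aligned_seq, gap="-"):
    """True if the first and last columns of the block are both residues."""
    return aligned_seq[0] != gap and aligned_seq[-1] != gap


# Toy alignment: the third column is gapped in two of three sequences.
toy = {"a": "MK-LV", "b": "M--LV", "c": "MKALV"}
trimmed = drop_gappy_columns(toy)
complete = {k: v for k, v in trimmed.items() if has_terminal_residues(v)}
```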

Structural genomics implications. The Pfam5000 families used in this study were chosen from among the manually curated (Pfam-A) families in Pfam version 17. We added 2,932 families with a structurally characterized representative as of October 27, 2005, to the Pfam5000 in descending order by family size, followed by 2,068 additional families without a structurally characterized representative, in descending order by family size. Pre-GOS family size was calculated as the number of sequences in public-100 that had a match to the Pfam family. Post-GOS family size was calculated as the number of sequences in public-100 and GOS-100 that matched each family. The results of the HMM profiling effort (using Pfams) were used for this analysis.
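The ordering rule for the Pfam5000 can be sketched as follows. The function name and tuple layout are hypothetical, and the sketch assumes that the largest families are taken first whenever more candidates exist than slots.

```python
def build_pfam5000(families, n_structural=2932, n_other=2068):
    """Order families as described in the text and return their IDs.

    `families` is a list of (family_id, size, has_structure) tuples.
    Structurally characterized families come first, each group in
    descending order of family size.
    """
    by_size = sorted(families, key=lambda fam: fam[1], reverse=True)
    structural = [fam[0] for fam in by_size if fam[2]][:n_structural]
    others = [fam[0] for fam in by_size if not fam[2]][:n_other]
    return structural + others


# Toy families (made-up IDs and sizes).
toy_families = [("F1", 10, False), ("F2", 50, True), ("F3", 30, False), ("F4", 20, True)]
selection = build_pfam5000(toy_families, n_structural=2, n_other=2)
```

Recomputing family sizes with and without GOS-100 sequences and rerunning the selection would give the pre-GOS and post-GOS versions of the list.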

Coverage of GOS-100 and public-100 sequences by both versions of the Pfam5000 was measured using the subset of families in Pfam 17 that were also in Pfam 16. This was done in order to enable direct comparison of coverage results with a previous study of coverage of fully sequenced bacterial and eukaryotic genomes [73]. The versions of Pfam are similar in size (Pfam 16 contains 7,677 families, and Pfam 17 contains 7,868 families).

Phylogeny construction for various families. For the UVDE family, sequences were aligned using MUSCLE [134] and a tree was built using QuickTree [144].

For the PP2C family, the catalytic domain portions of the sequences were identified and aligned using the PP2C Pfam model. Sequences that contained ≥70% nongaps in this alignment were used to generate a phylogenetic tree of all the PP2C-like sequences. The phylogeny was inferred using the protdist and neighbor-joining programs in PHYLIP [145]. We used 1,941 total PP2C-like sequences for the phylogenetic analysis. The breakdown was as follows: public eukaryotic sequences, 73%; public bacterial sequences, 14%; GOS-eukaryotic sequences, 2%; GOS-bacterial sequences, 10%; and GOS-viral and GOS-unknown sequences, less than 1% combined.

Figure 17. Content of Protease Types in NCBI-nr and GOS, and Kingdom Distribution of All Proteases

Due to the highly redundant nature of some NCBI-nr protease groups, nonredundant sets for both NCBI-nr and GOS are computed; these nonredundant sets are referred to as NCBI-nr60 and GOS60.

doi:10.1371/journal.pbio.0050016.g017

Figure 18. Content of Bacterial Protease Clans

doi:10.1371/journal.pbio.0050016.g018

For the type II GS family, sequences in GOS and NCBI-nr were searched with a type II GS HMM constructed from 17 previously known bacterial and eukaryotic type II GS sequences. Matching sequences from NCBI-nr and GOS were filtered separately for redundancy at 98% identity; the combined set of sequences was aligned and a neighbor-joining tree was constructed.

For the RuBisCO family, matching RuBisCO sequences from GOS and NCBI-nr were filtered separately for redundancy at 90% identity, resulting in 724 sequences in total. The 724 RuBisCO sequences were then aligned and a neighbor-joining tree was constructed.

Identification of proteases. We clustered sequences in the MEROPS Peptidase Database [100] using CD-HIT [116,117] at 40% similarity level. This resulted in 7,081 sequences, which were then divided into groups based on catalytic type and Clan identifier. These sequences were used as queries to search against a clustered version of NCBI-nr (clustered at 60% similarity threshold) using BLASTP (E-value ≤ $1 \times 10^{-10}$). A similar search was carried out against GOS (clustered at 60% similarity threshold). Figure 17 shows the content of protease types in NCBI-nr and GOS together with the kingdom distributions. Figure 18 shows the content of bacterial protease clans.

Metabolic enzymes in GOS. Hmmsearch from the HMMER package [105] was used to search the GOS sequences for different GS types. The GlnA TIGRFAM model was used for finding GSI sequences. The HMMs built from known examples of 17 GSII and 18 GSIII sequences from NCBI-nr were used to search the GOS sequences.

Identification of ORFans in NCBI-nr. ORFans are proteins that do not have any recognizable homologs in known protein databases. A straightforward way to identify ORFans is through all-against-all sequence comparison using relaxed match parameters. However, this is not computationally practical. An effective approach is to first remove the non-ORFans that can be easily found, and then to identify ORFans from the remaining sequences.

We identified non-ORFans by clustering the NCBI-nr with CD-HIT [116,117], an ultrafast sequence clustering program. A multistep iterated clustering was performed with a series of decreasing similarity thresholds. NCBI-nr was first clustered to NCBI-nr90, where sequences with >90% similarity were grouped. NCBI-nr90 was then clustered to NCBI-nr80/70/60/50 and finally NCBI-nr30. After each clustering stage, the total number of clusters of NCBI-nr was decreased and non-ORFans were identified. A one-step clustering from NCBI-nr directly to NCBI-nr30 can be performed. However, the multistep clustering is computationally more efficient.
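The idea of multistep clustering, in which only cluster representatives are carried forward to the next, looser threshold, can be illustrated with a toy greedy clusterer. This sketch does not reproduce CD-HIT: `difflib.SequenceMatcher` is used purely as a stand-in similarity measure, the function names are hypothetical, and input sequences are assumed to be distinct.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity in [0, 1]; CD-HIT uses its own identity measure."""
    return SequenceMatcher(None, a, b).ratio()


def greedy_cluster(sequences, threshold):
    """Greedy incremental clustering, longest sequences first."""
    clusters = {}  # representative -> list of members
    for seq in sorted(sequences, key=len, reverse=True):
        for rep in clusters:
            if similarity(seq, rep) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            clusters[seq] = [seq]  # no representative was close enough
    return clusters


def multistep_cluster(sequences, thresholds=(0.9, 0.8, 0.7, 0.6, 0.5, 0.3)):
    """Cluster at decreasing thresholds, passing only representatives on."""
    members = {seq: [seq] for seq in sequences}
    for threshold in thresholds:
        merged = {}
        for rep, joined in greedy_cluster(list(members), threshold).items():
            merged[rep] = [m for r in joined for m in members[r]]
        members = merged
    return members


# Toy sequences (not real proteins).
example = ["A" * 20, "A" * 19 + "C", "A" * 13 + "C" * 7, "W" * 20]
clusters = multistep_cluster(example)
singletons = [rep for rep, group in clusters.items() if len(group) == 1]
```

Because each later stage compares only the surviving representatives, the number of pairwise comparisons shrinks at every step, which is the efficiency gain the text refers to.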

At the 30% similarity level, all the NCBI-nr proteins were grouped into 391,833 clusters, including 259,571 singleton clusters. The proteins in nonsingleton clusters are by definition non-ORFans. However, proteins that remain as singletons are not necessarily ORFans, because their similarity to other proteins may not be reported for two reasons: (1) significant sequence similarity can be <30%; and (2) in order to prevent a cluster from being too diverse, CD-HIT, like all other clustering algorithms, may not add a sequence to that cluster even if the similarity between this sequence and a sequence in that cluster meets the similarity threshold.

The 259,571 singletons were compared to NCBI-nr with BLASTP [38] to identify real ORFans from them. The default low-complexity filter was enabled in the BLAST comparisons, and the similarity threshold in the form of an E-value was set to $1 \times 10^{-6}$. In the end, 84,911 proteins with at least 100 aa were identified as ORFans. About 100,000 short ORFans of less than 100 aa were removed from this study, because they may not be real proteins.
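The final filtering step can be sketched as a function over precomputed BLASTP results. The input layout and the function name are hypothetical, and treating a hit to the query itself as uninformative is an assumption of this sketch.

```python
def select_orfans(singletons, blast_hits, max_evalue=1e-6, min_length=100):
    """Return IDs of singleton proteins with no significant homolog.

    singletons: {protein_id: sequence}
    blast_hits: {protein_id: list of (subject_id, evalue)}; self-hits
                may be present and are ignored (an assumption here).
    """
    orfans = []
    for protein_id, sequence in singletons.items():
        if len(sequence) < min_length:
            continue  # short candidates were excluded from the study
        has_homolog = any(
            subject_id != protein_id and evalue <= max_evalue
            for subject_id, evalue in blast_hits.get(protein_id, [])
        )
        if not has_homolog:
            orfans.append(protein_id)
    return orfans


# Toy data (made-up IDs and E-values).
toy_singletons = {"p1": "M" * 120, "p2": "M" * 80, "p3": "K" * 150}
toy_hits = {"p1": [("p1", 0.0)], "p3": [("p3", 0.0), ("q9", 1e-20)]}
orfans = select_orfans(toy_singletons, toy_hits)
```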

Genome sequencing projects and rate of discovery. We used Ensembl sequences for Homo sapiens, Mus musculus, Rattus norvegicus, Canis familiaris, and Pan troglodytes. Their clustering information is shown in Table 15. When we considered the datasets in the order HS, HS+MM, HS+MM+RN, HS+MM+RN+CF, and HS+MM+RN+CF+PT, the numbers of distinct clusters were 10,536, 12,731, 13,605, 14,606, and 14,993, respectively. These numbers were compared against a random subset of NCBI-nr bacterial sequences (of a similar size) and also against a random subset of GOS sequences. We also randomized the order of the mammalian sequences to produce a dataset that was independent of the genome order being considered.
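The cumulative counting used here can be sketched as a discovery curve over cluster assignments, computed once in genome order and once after shuffling. The cluster IDs below are toy values, not Ensembl data, and the function name is hypothetical.

```python
import random


def discovery_curve(cluster_of_each_sequence):
    """Number of distinct clusters seen after each successive sequence."""
    seen = set()
    curve = []
    for cluster_id in cluster_of_each_sequence:
        seen.add(cluster_id)
        curve.append(len(seen))
    return curve


# Toy cluster assignments per genome (made-up IDs).
genomes = {"HS": [1, 2, 2, 3], "MM": [2, 3, 4], "RN": [4, 5]}
in_genome_order = [c for name in ("HS", "MM", "RN") for c in genomes[name]]
ordered_curve = discovery_curve(in_genome_order)

shuffled = in_genome_order[:]
random.Random(0).shuffle(shuffled)  # fixed seed only to make the toy repeatable
shuffled_curve = discovery_curve(shuffled)
```

Both curves end at the same total, but the shuffled curve no longer depends on which genome happened to be added first.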

Supporting Information

Protocol S1. Supplementary Information

Found at doi:10.1371/journal.pbio.0050016.sd001 (25 KB DOC).

Accession Numbers

All NCBI-nr sequences from February 10, 2005 were used in our analysis. Protocol S1 lists the GenBank (http://www.ncbi.nlm.nih.gov/Genbank) accession numbers of (1) the genomic sequences used in the PG set, (2) the sequences used in building GS profiles, and (3) the NCBI-nr sequences used in building the IDO phylogeny. The other GenBank sequences discussed in this paper are Bacillus sp. NRRL B-14911 (89089741), Janibacter sp. HTCC2649 (84385106), Erythrobacter litoralis (84785911), and Nitrosococcus oceani (76881875). The Pfam (http://pfam.cgb.ki.se) structures discussed in this paper are envelope glycoprotein GP120 (PF00516), reverse transcriptase (PF00078), retroviral aspartyl protease (PF00077), bacteriophage T4-like capsid assembly protein (Gp20) (PF07230), major capsid protein Gp23 (PF07068), phage tail sheath protein (PF04984), IDO (PF01231), poxvirus A22 protein family (PF04848), and PP2C (PF00481). The glutamine synthetase TIGRFAM (http://www.tigr.org/TIGRFAMs) used in the paper is GlnA: glutamine synthetase, type I (TIGR00653). The PDB (http://www.rcsb.org/pdb) identifiers and the names of the eight PDB ORFans with GOS matches are: restriction endonuclease MunI (1D02), restriction endonuclease BglI (1DMU), restriction endonuclease BstYI (1SDO), restriction endonuclease HincII (1TX3), alpha-glucosyltransferase (1Y8Z), hypothetical protein PA1492 (1T1J), putative protein (1T6T), and hypothetical protein AF1548 (1Y88).

Acknowledgments

We are indebted to a large group of individuals and groups for facilitating our sampling and analysis. We thank the governments of Canada, Mexico, Honduras, Costa Rica, Panama, and Ecuador and French Polynesia/France for facilitating sampling activities. All sequencing data collected from waters of the above-named countries remain part of the genetic patrimony of the country from which they were obtained. We also acknowledge TimeLogic (Active Motif, Inc.) and in particular Chris Hoover and Joe Salvatore for helping make the DeCypher system available to us; the Department of Energy for use of their NERSC Seaborg compute cluster; Marty Stout, Randy Doering, Tyler Osgood, Scott Collins, and Marshall Peterson (J. Craig Venter Institute) for help with the compute resources; Peter Davies and Saul Kravitz (J. Craig Venter Institute) for help with data accessibility issues; Kelvin Li and Nelson Axelrod (J. Craig Venter Institute) for discussions on data formats; K. Eric Wommack (University of Delaware, Newark) and the captain and crew of the R/V Cape Henlopen for their assistance in field collection of Chesapeake Bay virioplankton samples; John Glass (J. Craig Venter Institute) for assistance with the collection and processing of the virioplankton samples; Beth Hoyle and Laura Sheahan (J. Craig Venter Institute) for help with paper editing; and Matthew LaPointe and Jasmine Pollard (J. Craig Venter Institute) for help with figure formatting. STM, MPJ, CvB, DAS, and SEB acknowledge Kasper Hansen for statistical advice. We also acknowledge the reviewers for their valuable comments.

Table 15. Clustering Information for Ensembl Sequences for H. sapiens, M. musculus, R. norvegicus, C. familiaris, and P. troglodytes

| Genome | Number of Sequences from Ensembl | Number of Sequences in Clusters | Number of Clusters |
| H. sapiens | 33,860 | 31,268 | 10,536 |
| M. musculus | 32,442 | 30,025 | 9,734 |
| R. norvegicus | 28,545 | 27,486 | 9,485 |
| C. familiaris | 30,308 | 29,041 | 9,397 |
| P. troglodytes | 38,822 | 34,697 | 9,978 |

doi:10.1371/journal.pbio.0050016.t015

Author contributions. SY contributed to the design and implementation of the clustering process, and the subsequent analyses of the clusters; he also contributed to and coordinated all of the analyses in the paper, and wrote a large portion of the paper. GS contributed to the design and analysis of the clustering process, contributed ideas and analysis, and also wrote parts of the paper. DBR identified ORFs from the assemblies, performed the all-against-all BLAST searches, contributed to GOS kingdom assignment, and contributed analysis tools and ideas. ALH performed the assembly of GOS sequences, and contributed analysis tools and ideas. SW contributed to the analysis of viral sequences. KR contributed to project planning and paper writing. JAE performed the analysis of UV damage repair enzymes, and also contributed to paper writing. KBH, RF, and RLS contributed to project planning. GM performed the profile HMM searches, carried out the domain analysis, and contributed to paper writing. WL and AG carried out the ORFan analysis and contributed to paper writing. LJ contributed to the profile-profile search process. PC and AG carried out the analysis of proteases and contributed to paper writing. CSM, HL, and DE carried out the analysis of novel clusters and the analysis of metabolic enzymes, and contributed to paper writing. YZ contributed to the profile HMM searches and domain analysis. STM, MPJ, CvB, DAS, and SEB carried out the analysis of Pfam domain distributions in GOS and current proteins and the analysis of IDO, contributed to GOS kingdom assignment, and also contributed to paper writing. DAS and SEB also contributed to the Ka/Ks test. JMC and SEB carried out the analysis on the implications for structural genomics and contributed to paper writing. SL, KN, SST, and JED carried out the phosphatase analysis and contributed to paper writing. SST and JED also contributed to project planning. BJR and VB contributed to the analysis of cluster size distribution and family discovery rate, and contributed to paper writing. MF contributed to paper writing, project planning, and ideas for analysis. JCV conceived and coordinated the project, and supplied ideas.

Funding. The authors acknowledge the Department of Energy Genomics: GTL Program, Office of Science (DE-FG02-02ER63453), the Gordon and Betty Moore Foundation, the Discovery Channel and the J. Craig Venter Science Foundation for funding to undertake this study. GM acknowledges funding from the Razavi-Newman Center for Bioinformatics and was also supported by National Cancer Institute grant P30 CA014195. PC was partially supported by a Center for Proteolytic Pathways (CPP)–National Institutes of Health (NIH) grant 5U54 RR020843–02. CSM, HL, and DE acknowledge the support of DOE Biological and Environmental Research (BER). SL and JED were supported by research grants from NIH. BJR was supported by a Career Award at the Scientific Interface from the Burroughs Wellcome Fund. Support for the Brenner lab work was provided by NIH K22 HG00056 and an IBM Shared University Research grant. STM was supported by NIH Genomics Training Grant 5T32 HG00047. MPJ was supported by NIH P20 GM068136 and NIH K22 HG00056. CvB was supported in part by the Haas Scholars Program. DAS was supported by a Howard Hughes Medical Institute Predoctoral Fellowship. JMC was supported by NIH grant R01 GM073109, and by the US Department of Energy Genomics: GTL program through contract DE-AC02-05CH11231.

Competing interests. The authors have declared that no competing interests exist.

References

1. Tatusov RL, Galperin MY, Natale DA, Koonin EV (2000) The COG database: A tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res 28: 33–36.

2. Murzin AG, Brenner SE, Hubbard T, Chothia C (1995) SCOP: A structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247: 536–540.

3. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, et al. (1997) CATH—A hierarchic classification of protein domain structures. Structure 5: 1093–1108.

4. Thornton JM, Orengo CA, Todd AE, Pearl FM (1999) Protein folds, functions and evolution. J Mol Biol 293: 333–342.

5. Todd AE, Orengo CA, Thornton JM (2001) Evolution of function in protein superfamilies, from a structural perspective. J Mol Biol 307: 1113–1143.

6. Coulson AF, Moult J (2002) A unifold, mesofold, and superfold model of protein fold use. Proteins 46: 61–71.

7. Rost B (2002) Did evolution leap to create the protein universe? Curr Opin Struct Biol 12: 409–416.

8. Kinch LN, Grishin NV (2002) Evolution of protein structures and functions. Curr Opin Struct Biol 12: 400–408.

9. Galperin MY, Koonin EV (2000) Who’s your neighbor? New computational approaches for functional genomics. Nat Biotechnol 18: 609–613.

10. Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, et al. (2004) Environmental genome shotgun sequencing of the Sargasso Sea. Science 304: 66–74.

11. Tringe SG, Rubin EM (2005) Metagenomics: DNA sequencing of environmental samples. Nat Rev Genet 6: 805–814.

12. Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, et al. (2005) Comparative metagenomics of microbial communities. Science 308: 554–557.

13. Hallam SJ, Putnam N, Preston CM, Detter JC, Rokhsar D, et al. (2004) Reverse methanogenesis: Testing the hypothesis with environmental genomics. Science 305: 1457–1462.

14. Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, et al. (2004) Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 428: 37–43.

15. Bateman A, Coin L, Durbin R, Finn RD, Hollich V, et al. (2004) The Pfam protein families database. Nucleic Acids Res 32: D138–D141.

16. Corpet F, Gouzy J, Kahn D (1998) The ProDom database of protein domain families. Nucleic Acids Res 26: 323–326.

17. Sasson O, Vaaknin A, Fleischer H, Portugaly E, Bilu Y, et al. (2003) ProtoNet: Hierarchical classification of the protein space. Nucleic Acids Res 31: 348–352.

18. Pipenbacher P, Schliep A, Schneckener S, Schonhuth A, Schomburg D, et al. (2002) ProClust: Improved clustering of protein sequences with an extended graph-based approach. Bioinformatics 18: S182–S191.

19. Apweiler R, Bairoch A, Wu CH (2004) Protein sequence databases. Curr Opin Chem Biol 8: 76–80.

20. Gasteiger E, Jung E, Bairoch A (2001) SWISS-PROT: Connecting biomolecular knowledge via a protein database. Curr Issues Mol Biol 3: 47–55.

21. Sonnhammer EL, Eddy SR, Birney E, Bateman A, Durbin R (1998) Pfam: Multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Res 26: 320–322.

22. Haft DH, Selengut JD, White O (2003) The TIGRFAMs database of protein families. Nucleic Acids Res 31: 371–373.

23. Haft DH, Loftus BJ, Richardson DL, Yang F, Eisen JA, et al. (2001) TIGRFAMs: A protein family resource for the functional identification of proteins. Nucleic Acids Res 29: 41–43.

24. Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, et al. (2004) UniProt: The Universal Protein knowledgebase. Nucleic Acids Res 32: D115–D119.

25. Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, et al. (2005) InterPro, progress and status in 2005. Nucleic Acids Res 33: D201–D205.

26. Heger A, Holm L (2003) Exhaustive enumeration of protein domain families. J Mol Biol 328: 749–767.

27. Liu X, Fan K, Wang W (2004) The number of protein folds and their distribution over families in nature. Proteins 54: 491–499.

28. Kunin V, Cases I, Enright AJ, de Lorenzo V, Ouzounis CA (2003) Myriads of protein families, and still counting. Genome Biol 4: 401.

29. Bru C, Courcelle E, Carrere S, Beausse Y, Dalmar S, et al. (2005) The ProDom database of protein domain families: More emphasis on 3D. Nucleic Acids Res 33: D212–D215.

30. Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Northwest Atlantic through eastern tropical Pacific. PLoS Biol 5: e77. doi:10.1371/journal.pbio.0050077

31. Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, et al. (2006) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 34: D173–D180.

32. National Center for Biotechnology Information (2005) Blast db [database]. Washington (D.C.) National Center for Biotechnology Information. Available: ftp://ftp.ncbi.nih.gov/blast/db. Accessed 10 February 2005.

33. National Center for Biotechnology Information (2005) Microbial Genome Projects db [database]. Washington (D.C.) National Center for Biotechnology Information. Available: ftp://ftp.ncbi.nih.gov/genomes/Bacteria. Accessed 10 February 2005.

34. Quackenbush J, Liang F, Holt I, Pertea G, Upton J (2000) The TIGR gene indices: Reconstruction and representation of expressed gene sequences. Nucleic Acids Res 28: 141–145.

35. Birney E, Andrews D, Bevan P, Caccamo M, Cameron G, et al. (2004) Ensembl 2004. Nucleic Acids Res 32: D468–D470.

36. Birney E, Andrews TD, Bevan P, Caccamo M, Chen Y, et al. (2004) An overview of Ensembl. Genome Res 14: 925–928.

37. Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, et al. (2000) A whole-genome assembly of Drosophila. Science 287: 2196–2204.

38. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215: 403–410.

39. Rychlewski L, Jaroszewski L, Li W, Godzik A (2000) Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Sci 9: 232–241.

40. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, et al. (1997) Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res 25: 3389–3402.

41. Durbin R, Eddy SR, Krogh A, Mitchison G (1998) Biological sequence

PLoS Biology | www.plosbiology.org | S88 Special Section from March 2007 | Volume 5 | Issue 3 | e160464

Expanding the Protein Family Universe

Page 95: Plos Biology Venter Collection Low

analysis: Probabilistic models of proteins and nucleic acids. New York:Cambridge University Press. 356 p.

42. Barabasi AL, Albert R (1999) Emergence of scaling in random networks. Science 286: 509–512.

43. Barabasi AL, Oltvai ZN (2004) Network biology: Understanding the cell’s functional organization. Nat Rev Genet 5: 101–113.

44. Sullivan MB, Coleman ML, Weigele P, Rohwer F, Chisholm SW (2005) Three Prochlorococcus cyanophage genomes: Signature features and ecological interpretations. PLoS Biol 3: e144.

45. Giovannoni SJ, Tripp HJ, Givan S, Podar M, Vergin KL, et al. (2005) Genome streamlining in a cosmopolitan oceanic bacterium. Science 309: 1242–1245.

46. Strong M, Mallick P, Pellegrini M, Thompson MJ, Eisenberg D (2003) Inference of protein function and protein linkages in Mycobacterium tuberculosis based on prokaryotic genome organization: A combined computational approach. Genome Biol 4: R59.

47. Bowers PM, Pellegrini M, Thompson MJ, Fierro J, Yeates TO, et al. (2004) Prolinks: A database of protein functional linkages derived from coevolution. Genome Biol 5: R35.

48. von Mering C, Jensen LJ, Snel B, Hooper SD, Krupp M, et al. (2005) STRING: Known and predicted protein-protein associations, integrated and transferred across organisms. Nucleic Acids Res 33: D433–D437.

49. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, et al. (2000) Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25: 25–29.

50. Lindell D, Sullivan MB, Johnson ZI, Tolonen AC, Rohwer F, et al. (2004) Transfer of photosynthesis genes to and from Prochlorococcus viruses. Proc Natl Acad Sci U S A 101: 11013–11018.

51. DeLong EF, Preston CM, Mincer T, Rich V, Hallam SJ, et al. (2006) Community genomics among stratified microbial assemblages in the ocean’s interior. Science 311: 496–503.

52. Paul JH, Sullivan MB (2005) Marine phage genomics: What have we learned? Curr Opin Biotechnol 16: 299–307.

53. Edwards RA, Rohwer F (2005) Viral metagenomics. Nat Rev Microbiol 3: 504–510.

54. Daubin V, Ochman H (2004) Bacterial genomes as new gene homes: The genealogy of ORFans in E. coli. Genome Res 14: 1036–1042.

55. Hsiao WW, Ung K, Aeschliman D, Bryan J, Finlay BB, et al. (2005) Evidence of a large novel gene pool associated with prokaryotic genomic islands. PLoS Genet 1: e62.

56. Coleman ML, Sullivan MB, Martiny AC, Steglich C, Barry K, et al. (2006) Genomic islands and the ecology and evolution of Prochlorococcus. Science 311: 1768–1770.

57. Breitbart M, Salamon P, Andresen B, Mahaffy JM, Segall AM, et al. (2002) Genomic analysis of uncultured marine viral communities. Proc Natl Acad Sci U S A 99: 14250–14255.

58. Pedulla ML, Ford ME, Houtz JM, Karthikeyan T, Wadsworth C, et al. (2003) Origins of highly mosaic mycobacteriophage genomes. Cell 113: 171–182.

59. Wilson GA, Bertrand N, Patel Y, Hughes JB, Feil EJ, et al. (2005) Orphans as taxonomically restricted and ecologically important genes. Microbiology 151: 2499–2501.

60. Takami H, Takaki Y, Uchiyama I (2002) Genome sequence of Oceanobacillus iheyensis isolated from the Iheya Ridge and its unexpected adaptive capabilities to extreme environments. Nucleic Acids Res 30: 3927–3935.

61. Wellcome Trust Sanger Institute (2005) Pfam db [database]. Release 17. Cambridge (U.K.): Wellcome Trust Sanger Institute. Available: http://www.sanger.ac.uk/Software/Pfam.

62. Mellor AL, Munn DH (2004) IDO expression by dendritic cells: Tolerance and tryptophan catabolism. Nat Rev Immunol 4: 762–774.

63. Suzuki T, Yokouchi K, Kawamichi H, Yamamoto Y, Uda K, et al. (2003) Comparison of the sequences of Turbo and Sulculus indoleamine dioxygenase-like myoglobin genes. Gene 308: 89–94.

64. Fallarino F, Asselin-Paturel C, Vacca C, Bianchi R, Gizzi S, et al. (2004) Murine plasmacytoid dendritic cells initiate the immunosuppressive pathway of tryptophan catabolism in response to CD200 receptor engagement. J Immunol 173: 3748–3754.

65. Hayashi T, Beck L, Rossetto C, Gong X, Takikawa O, et al. (2004) Inhibition of experimental asthma by indoleamine 2,3-dioxygenase. J Clin Invest 114: 270–279.

66. Muller AJ, DuHadaway JB, Donover PS, Sutanto-Ward E, Prendergast GC (2005) Inhibition of indoleamine 2,3-dioxygenase, an immunoregulatory target of the cancer suppression gene Bin1, potentiates cancer chemotherapy. Nat Med 11: 312–319.

67. Burley SK, Bonanno JB (2003) Structural genomics. Methods Biochem Anal 44: 591–612.

68. Blundell TL, Mizuguchi K (2000) Structural genomics: An overview. Prog Biophys Mol Biol 73: 289–295.

69. Brenner SE (2001) A tour of structural genomics. Nat Rev Genet 2: 801–809.

70. Montelione GT (2001) Structural genomics: An approach to the protein folding problem. Proc Natl Acad Sci U S A 98: 13488–13489.

71. Chance MR, Bresnick AR, Burley SK, Jiang JS, Lima CD, et al. (2002) Structural genomics: A pipeline for providing structures for the biologist. Protein Sci 11: 723–738.

72. Chandonia JM, Brenner SE (2006) The impact of structural genomics: expectations and outcomes. Science 311: 347–351.

73. Chandonia JM, Brenner SE (2005) Implications of structural genomics target selection strategies: Pfam5000, whole genome, and random approaches. Proteins 58: 166–179.

74. Chandonia JM, Brenner SE (2005) Update on the Pfam5000 strategy for selection of structural genomics targets. Proceedings of the 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference, Shanghai, China 27: 751–755.

75. Baker D, Sali A (2001) Protein structure prediction and structural genomics. Science 294: 93–96.

76. Service R (2005) Structural biology. Structural genomics, round 2. Science 307: 1554–1558.

77. Kannan N, Taylor SS, Zhai Y, Venter JC, Manning G (2006) Structural and functional diversity of the microbial kinome. PLoS Biol 5: e17. doi:10.1371/journal.pbio.0050017

78. Friedberg E (1985) DNA repair. New York: W. H. Freeman and Co. 614 p.

79. Sancar GB (2000) Enzymatic photoreactivation: 50 years and counting. Mutat Res 451: 25–37.

80. Bowman KK, Sidik K, Smith CA, Taylor JS, Doetsch PW, et al. (1994) A new ATP-independent DNA endonuclease from Schizosaccharomyces pombe that recognizes cyclobutane pyrimidine dimers and 6–4 photoproducts. Nucleic Acids Res 22: 3026–3032.

81. Setlow P (2001) Resistance of spores of Bacillus species to ultraviolet light. Environ Mol Mutagen 38: 97–104.

82. Morikawa K, Ariyoshi M, Vassylyev D, Katayanagi K, Nakamura H, et al. (1994) Crystal structure of T4 endonuclease V. An excision repair enzyme for a pyrimidine dimer. Ann N Y Acad Sci 726: 198–207.

83. Piersen CE, Prince MA, Augustine ML, Dodson ML, Lloyd RS (1995) Purification and cloning of Micrococcus luteus ultraviolet endonuclease, an N-glycosylase/abasic lyase that proceeds via an imino enzyme-DNA intermediate. J Biol Chem 270: 23475–23484.

84. Hunter T (1995) Protein kinases and phosphatases: The yin and yang of protein phosphorylation and signaling. Cell 80: 225–236.

85. Kennelly PJ (2001) Protein phosphatases: A phylogenetic perspective. Chem Rev 101: 2291–2312.

86. Leroy C, Lee SE, Vaze MB, Ochsenbien F, Guerois R, et al. (2003) PP2C phosphatases Ptc2 and Ptc3 are required for DNA checkpoint inactivation after a double-strand break. Mol Cell 11: 827–835.

87. Meskiene I, Baudouin E, Schweighofer A, Liwosz A, Jonak C, et al. (2003) Stress-induced protein phosphatase 2C is a negative regulator of a mitogen-activated protein kinase. J Biol Chem 278: 18945–18952.

88. Takekawa M, Maeda T, Saito H (1998) Protein phosphatase 2Calpha inhibits the human stress-responsive p38 and JNK MAPK pathways. EMBO J 17: 4744–4752.

89. Warmka J, Hanneman J, Lee J, Amin D, Ota I (2001) Ptc1, a type 2C Ser/Thr phosphatase, inactivates the HOG pathway by dephosphorylating the mitogen-activated protein kinase Hog1. Mol Cell Biol 21: 51–60.

90. Bork P, Brown NP, Hegyi H, Schultz J (1996) The protein phosphatase 2C (PP2C) superfamily: Detection of bacterial homologues. Protein Sci 5: 1421–1425.

91. Das AK, Helps NR, Cohen PT, Barford D (1996) Crystal structure of the protein serine/threonine phosphatase 2C at 2.0 A resolution. EMBO J 15: 6798–6809.

92. Jackson MD, Fjeld CC, Denu JM (2003) Probing the function of conserved residues in the serine/threonine phosphatase PP2Calpha. Biochemistry 42: 8513–8521.

93. Novakova L, Saskova L, Pallova P, Janecek J, Novotna J, et al. (2005) Characterization of a eukaryotic type serine/threonine protein kinase and protein phosphatase of Streptococcus pneumoniae and identification of kinase substrates. FEBS J 272: 1243–1254.

94. Obuchowski M, Madec E, Delattre D, Boel G, Iwanicki A, et al. (2000) Characterization of PrpC from Bacillus subtilis, a member of the PPM phosphatase family. J Bacteriol 182: 5634–5638.

95. Boitel B, Ortiz-Lombardia M, Duran R, Pompeo F, Cole ST, et al. (2003) PknB kinase activity is regulated by phosphorylation in two Thr residues and dephosphorylation by PstP, the cognate phospho-Ser/Thr phosphatase, in Mycobacterium tuberculosis. Mol Microbiol 49: 1493–1508.

96. Chopra P, Singh B, Singh R, Vohra R, Koul A, et al. (2003) Phosphoprotein phosphatase of Mycobacterium tuberculosis dephosphorylates serine-threonine kinases PknA and PknB. Biochem Biophys Res Commun 311: 112–120.

97. Yeats C, Finn RD, Bateman A (2002) The PASTA domain: A beta-lactam-binding domain. Trends Biochem Sci 27: 438.

98. Schweighofer A, Hirt H, Meskiene I (2004) Plant PP2C phosphatases: Emerging functions in stress signaling. Trends Plant Sci 9: 236–243.

99. Barrett AJ, Rawlings ND, Woesner JF, editors (2004) Handbook of proteolytic enzymes. Amsterdam: Elsevier. 2,140 p.

100. Rawlings ND, Morton FR, Barrett AJ (2006) MEROPS: The peptidase database. Nucleic Acids Res 34: D270–D272.

101. Kumada Y, Benson DR, Hillemann D, Hosted TJ, Rochefort DA, et al. (1993) Evolution of the glutamine synthetase gene, one of the oldest existing and functioning genes. Proc Natl Acad Sci U S A 90: 3009–3013.

102. Valentine RC, Shapiro BM, Stadtman ER (1968) Regulation of glutamine synthetase. XII. Electron microscopy of the enzyme from Escherichia coli. Biochemistry 7: 2143–2152.

103. Almassy RJ, Janson CA, Hamlin R, Xuong NH, Eisenberg D (1986) Novel subunit-subunit interactions in the structure of glutamine synthetase. Nature 323: 304–309.

104. Eisenberg D, Gill HS, Pfluegl GM, Rotstein SH (2000) Structure-function relationships of glutamine synthetases. Biochim Biophys Acta 1477: 122–145.

105. Eddy SR (1998) Profile hidden Markov models. Bioinformatics 14: 755–763.

106. Carlson T, Chelm B (1986) Apparent eukaryotic origin of glutamine synthetase II from the bacterium Bradyrhizobium japonicum. Nature 322: 568–570.

107. Hosted TJ, Rochefort DA, Benson DR (1993) Close linkage of genes encoding glutamine synthetases I and II in Frankia alni CpI1. J Bacteriol 175: 3679–3684.

108. Deuel TF, Ginsburg A, Yeh J, Shelton E, Stadtman ER (1970) Bacillus subtilis glutamine synthetase. Purification and physical characterization. J Biol Chem 245: 5195–5205.

109. Fisher SH, Sonenshein AL (1984) Bacillus subtilis glutamine synthetase mutants pleiotropically altered in glucose catabolite repression. J Bacteriol 157: 612–621.

110. Ellis RJ (1979) The most abundant protein in the world. Trends Biochem Sci 4: 241–244.

111. Hanson TE, Tabita FR (2001) A ribulose-1,5-bisphosphate carboxylase/oxygenase (RubisCO)-like protein from Chlorobium tepidum that is involved with sulfur metabolism and the response to oxidative stress. Proc Natl Acad Sci U S A 98: 4397–4402.

112. Eisen JA, Nelson KE, Paulsen IT, Heidelberg JF, Wu M, et al. (2002) The complete genome sequence of Chlorobium tepidum TLS, a photosynthetic, anaerobic, green-sulfur bacterium. Proc Natl Acad Sci U S A 99: 9509–9514.

113. Li H, Sawaya MR, Tabita FR, Eisenberg D (2005) Crystal structure of a RuBisCO-like protein from the green sulfur bacterium Chlorobium tepidum. Structure (Camb) 13: 779–789.

114. Ashida H, Saito Y, Kojima C, Kobayashi K, Ogasawara N, et al. (2003) A functional link between RuBisCO-like protein of Bacillus and photosynthetic RuBisCO. Science 302: 286–290.

115. Fischer D, Eisenberg D (1999) Finding families for genomic ORFans. Bioinformatics 15: 759–762.

116. Li W, Jaroszewski L, Godzik A (2001) Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics 17: 282–283.

117. Li W, Jaroszewski L, Godzik A (2002) Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics 18: 77–82.

118. Bujnicki JM, Rychlewski L (2001) Identification of a PD-(D/E)XK-like domain with a novel configuration of the endonuclease active site in the methyl-directed restriction enzyme Mrr and its homologs. Gene 267: 183–191.

119. Breitbart M, Felts B, Kelley S, Mahaffy JM, Nulton J, et al. (2004) Diversity and population structure of a near-shore marine-sediment viral community. Proc Biol Sci 271: 565–574.

120. Breitbart M, Hewson I, Felts B, Mahaffy JM, Nulton J, et al. (2003) Metagenomic analyses of an uncultured viral community from human feces. J Bacteriol 185: 6220–6223.

121. Cann AJ, Fandrich SE, Heaphy S (2005) Analysis of the virus population present in equine faeces indicates the presence of hundreds of uncharacterized virus genomes. Virus Genes 30: 151–156.

122. Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, et al. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 31: 365–370.

123. Westbrook J, Feng Z, Chen L, Yang H, Berman HM (2003) The Protein Data Bank and structural genomics. Nucleic Acids Res 31: 489–491.

124. Wu CH, Yeh LS, Huang H, Arminski L, Castro-Alvear J, et al. (2003) The Protein Information Resource. Nucleic Acids Res 31: 345–347.

125. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL (2003) GenBank. Nucleic Acids Res 31: 23–27.

126. Stoesser G, Baker W, van den Broek A, Garcia-Pastor M, Kanz C, et al. (2003) The EMBL Nucleotide Sequence Database: Major new developments. Nucleic Acids Res 31: 17–22.

127. Miyazaki S, Sugawara H, Gojobori T, Tateno Y (2003) DNA Data Bank of Japan (DDBJ) in XML. Nucleic Acids Res 31: 13–16.

128. Celniker SE, Wheeler DA, Kronmiller B, Carlson JW, Halpern A, et al. (2002) Finishing a whole-genome shotgun: Release 3 of the Drosophila melanogaster euchromatic genome sequence. Genome Biol 3: RESEARCH0079.

129. Henikoff S, Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A 89: 10915–10919.

130. Ochman H (2002) Distinguishing the ORFs from the ELFs: Short bacterial genes and the annotation of genomes. Trends Genet 18: 335–337.

131. Nekrutenko A, Makova KD, Li WH (2002) The K(A)/K(S) ratio test for assessing the protein-coding potential of genomic regions: An empirical and simulation study. Genome Res 12: 198–202.

132. Li WH (1997) Molecular Evolution. Sunderland (MA): Sinauer Associates, Inc. 487 p.

133. Nei M, Kumar S (2000) Molecular evolution and phylogenetics. New York: Oxford University Press. 333 p.

134. Edgar RC (2004) MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32: 1792–1797.

135. Yang Z (1997) PAML: A program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci 13: 555–556.

136. Yang Z, Nielsen R, Goldman N, Pedersen AM (2000) Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155: 431–449.

137. Huynen MA, van Nimwegen E (1998) The frequency distribution of gene family sizes in complete genomes. Mol Biol Evol 15: 583–589.

138. Yanai I, Camacho CJ, DeLisi C (2000) Predictions of gene family distributions in microbial genomes: Evolution by gene duplication and modification. Phys Rev Lett 85: 2641–2644.

139. Qian J, Luscombe NM, Gerstein M (2001) Protein family and fold occurrence in genomes: Power-law behaviour and evolutionary model. J Mol Biol 313: 673–681.

140. Unger R, Uliel S, Havlin S (2003) Scaling law in sizes of protein sequence families: From super-families to orphan genes. Proteins 51: 569–576.

141. Salgado H, Gama-Castro S, Martinez-Antonio A, Diaz-Peredo E, Sanchez-Solano F, et al. (2004) RegulonDB (version 4.0): Transcriptional regulation, operon organization and growth conditions in Escherichia coli K-12. Nucleic Acids Res 32: D303–D306.

142. Thompson JD, Higgins DG, Gibson TJ (1994) CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22: 4673–4680.

143. Mailund T, Pedersen CN (2004) QuickJoin: Fast neighbour-joining tree reconstruction. Bioinformatics 20: 3261–3262.

144. Howe K, Bateman A, Durbin R (2002) QuickTree: Building huge neighbour-joining trees of protein sequences. Bioinformatics 18: 1546–1547.

145. Felsenstein J (2005) PHYLIP (Phylogeny Inference Package) 3.6 edition [computer program]. Seattle: Department of Genome Sciences, University of Washington, Seattle.

146. Lord PW, Stevens RD, Brass A, Goble CA (2003) Investigating semantic similarity measures across the Gene Ontology: The relationship between sequence and annotation. Bioinformatics 19: 1275–1283.

147. Krogh A, Larsson B, von Heijne G, Sonnhammer EL (2001) Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes. J Mol Biol 305: 567–580.

148. Juretic D, Zoranic L, Zucic D (2002) Basic charge clusters and predictions of membrane protein topology. J Chem Inf Comput Sci 42: 620–632.

149. Joachimiak MP, Cohen FE (2002) JEvTrace: Refinement and variations of the evolutionary trace in JAVA. Genome Biol 3: RESEARCH0077.

150. Guindon S, Gascuel O (2003) A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol 52: 696–704.

151. Schmidt HA, Strimmer K, Vingron M, von Haeseler A (2002) TREE-PUZZLE: Maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics 18: 502–504.

152. Bruno WJ, Socci ND, Halpern AL (2000) Weighted neighbor joining: A likelihood-based approach to distance-based phylogeny reconstruction. Mol Biol Evol 17: 189–197.


Structural and Functional Diversity of the Microbial Kinome

Natarajan Kannan 1,2, Susan S. Taylor 1,2, Yufeng Zhai 3, J. Craig Venter 4, Gerard Manning 3*

1 Department of Chemistry and Biochemistry, University of California San Diego, La Jolla, California, United States of America, 2 Howard Hughes Medical Institute, University of California San Diego, La Jolla, California, United States of America, 3 Razavi-Newman Center for Bioinformatics, Salk Institute for Biological Studies, La Jolla, California, United States of America, 4 J. Craig Venter Institute, Rockville, Maryland, United States of America

The eukaryotic protein kinase (ePK) domain mediates the majority of signaling and coordination of complex events in eukaryotes. By contrast, most bacterial signaling is thought to occur through structurally unrelated histidine kinases, though some ePK-like kinases (ELKs) and small molecule kinases are known in bacteria. Our analysis of the Global Ocean Sampling (GOS) dataset reveals that ELKs are as prevalent as histidine kinases and may play an equally important role in prokaryotic behavior. By combining GOS and public databases, we show that the ePK is just one subset of a diverse superfamily of enzymes built on a common protein kinase–like (PKL) fold. We explored this huge phylogenetic and functional space to cast light on the ancient evolution of this superfamily, its mechanistic core, and the structural basis for its observed diversity. We cataloged 27,677 ePKs and 18,699 ELKs, and classified them into 20 highly distinct families whose known members suggest regulatory functions. GOS data more than tripled the count of ELK sequences and enabled the discovery of novel families and classification and analysis of all ELKs. Comparison between and within families revealed ten key residues that are highly conserved across families. However, all but one of the ten residues has been eliminated in one family or another, indicating great functional plasticity. We show that loss of a catalytic lysine in two families is compensated by distinct mechanisms both involving other key motifs. This diverse superfamily serves as a model for further structural and functional analysis of enzyme evolution.

Citation: Kannan N, Taylor SS, Zhai Y, Venter JC, Manning G (2007) Structural and functional diversity of the microbial kinome. PLoS Biol 5(3): e17. doi:10.1371/journal.pbio.0050017

Academic Editor: Tony Pawson, Samuel Lunenfeld Research Institute, Canada

Received May 18, 2006; Accepted September 20, 2006; Published March 13, 2007

Copyright: © 2007 Kannan et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abbreviations: aa, amino acid; CAK, choline and aminoglycoside kinase; ChoK, choline kinase; ELK, ePK-like kinase; ePK, eukaryotic protein kinase; GOS, Global Ocean Sampling; HMM, hidden Markov model; NCBI, National Center for Biotechnology Information; PKL, protein kinase–like

* To whom correspondence should be addressed. E-mail: [email protected]

This article is part of the Global Ocean Sampling collection in PLoS Biology. The full collection is available online at http://collections.plos.org/plosbiology/gos-2007.php.

Introduction

The eukaryotic protein kinase (ePK) domain is the most abundant catalytic domain in eukaryotic genomes and mediates the control of most cellular processes by phosphorylation of a significant fraction of cellular proteins [1–3]. Most prokaryotic protein phosphorylation and signaling is thought to occur through structurally distinct histidine-aspartate kinases [4]. However, there is growing evidence for the existence and importance of different families of ePK-like kinases (ELKs) in prokaryotes [5–10]. ePKs and ELKs share the protein kinase–like (PKL) fold [11] and similar catalytic mechanisms, but ELKs generally display very low sequence identity (7%–17%) to ePKs and to each other. Crystal structures of ELKs such as aminoglycoside, choline, and Rio kinases reveal striking similarity to ePKs [12–14], and other ELKs have been defined by remote homology methods [6,15] and motif conservation [16]. Another set of even more divergent PKL kinases are undetectable by sequence methods, but retain structural and mechanistic conservation with ePKs. These include the phosphatidyl inositol kinases (PI3K) and related protein kinases, alpha kinases, the slime mold actin fragmin kinases, and the phosphatidyl inositol 5′ kinases [17–20].

These studies demonstrate that PKL kinases conserve both fold and catalytic mechanisms in the presence of tremendous sequence variation, which allows for an equivalent diversity in substrate binding and function. This makes the PKL fold a model system to investigate how sequence variation maps to functional specialization. Previous studies along these lines include the study of ePK-specific regulatory mechanisms, through ePK–ELK comparison [16], and the sequence determinants of functional specificity within one group (CMGC [CDK, MAPK, GSK3, and CLK kinases]) of ePKs [21].

Previous studies have been hampered by poor annotation and classification of ELK families and their low representation in sequence databases relative to ePKs. Recent large-scale microbial genomic sequencing, coupled with Global Ocean Sampling (GOS) metagenomic data, now allows a much more comprehensive analysis of these families. In particular, the GOS data provides more than 6 million new peptide sequences, mostly from marine bacteria [22,23], and more than triples the number of ELK sequences. Here, for the first time, we define the extent of 20 known and novel PKL families, define a set of ten key conserved residues within the catalytic domain, and explore specific elaborations that mediate the unique functions of distinct families. These highlight both underappreciated aspects of the catalytic core and unique family-specific features, which in several cases reveal correlated changes that map to concerted variations in structure and mechanism.

Results

Discovery and Classification of PKL Kinase Families

Kinase sequences were detected using hidden Markov model (HMM) profiles of known PKLs as well as with a motif model focused on key conserved PKL motifs [16,24]. Results of each approach were used to iteratively build, search, and refine new sets of HMMs, using both public and GOS data. Weak but significant sequence matches were used as seeds to define and elaborate novel families. The final result was 16,248 GOS sequences (Dataset S1) classified into 20 HMM-defined PKL families (Table 1; Figure 1; Dataset S2). A similar analysis of the National Center for Biotechnology Information nonredundant public database (NCBI-nr) revealed 24,924 ePK and 5,151 ELK sequences (Dataset S1). More than 1,400 of the NCBI ELK sequences were annotated as hypothetical or unknown, and several hundred more are misannotated or have no functional annotation. GOS data at least doubles the size of most families, and permits an in-depth analysis of family structure and conservation. Two families that are more than 10-fold enriched in GOS (CapK and HSK2) are found largely in α-proteobacteria, which are also highly enriched in GOS. Both CapK and HRK contain viral-specific subfamilies that are also greatly GOS enriched, indicating that differences in kinase distribution between databases are largely due to taxonomic biases. As expected, eukaryotic-specific families (ePK, Bub1, PI3K, AlphaK) are underrepresented in GOS.
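The enrichment statements above can be checked roughly against the counts in Table 1. The following is a minimal sketch, not the authors' procedure: it assumes enrichment means the raw ratio of GOS count to NCBI-nr count with no correction for database size (the text does not say how enrichment was normalized), it covers only a subset of the families in Table 1, and the function and variable names are hypothetical.

```python
# Counts (GOS, NCBI-nr) copied from Table 1 for a subset of families.
TABLE1_COUNTS = {
    "ePK": (2753, 24924),
    "pknB": (1525, 1047),
    "BLRK": (21, 28),
    "GLK": (38, 17),
    "HRK": (259, 78),
    "Bub1": (9, 112),
    "Bud32": (139, 123),
    "Rio": (133, 249),
    "KdoK": (389, 199),
    "CAK": (3997, 1427),
    "HSK2": (1649, 93),
    "FruK": (390, 136),
}


def gos_to_nr_ratio(counts):
    """Return {family: GOS count / NCBI-nr count} (raw, unnormalized)."""
    return {fam: gos / nr for fam, (gos, nr) in counts.items()}


def families_above(counts, threshold):
    """Families whose raw GOS/NR ratio exceeds `threshold`, highest first."""
    ratios = gos_to_nr_ratio(counts)
    hits = [fam for fam, r in ratios.items() if r > threshold]
    return sorted(hits, key=lambda fam: ratios[fam], reverse=True)


if __name__ == "__main__":
    # Print families from most to least GOS-enriched by raw ratio.
    ranked = sorted(gos_to_nr_ratio(TABLE1_COUNTS).items(),
                    key=lambda kv: kv[1], reverse=True)
    for fam, ratio in ranked:
        print(f"{fam:6s} {ratio:6.2f}")
```

Under this raw-ratio assumption, HSK2 is the only family in the subset above a 10-fold ratio, while ePK and Bub1 fall well below 1, in line with the text. CapK is omitted from this subset.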

Author Summary

The huge growth in sequence databases allows the characterization of every protein sequence by comparison with its relatives. Sequence comparisons can reveal both the key conserved functional motifs that define protein families and the variations specific to individual subfamilies, thus decorating any protein sequence with its evolutionary context. Inspired by the massive sequence trove from the Global Ocean Sampling project, the authors looked in depth at the protein kinase–like (PKL) superfamily. Eukaryotic protein kinases (ePKs) are the pre-eminent controllers of eukaryotic cell biology and among the best studied of enzymes. By contrast, their prokaryotic relatives are much more poorly known. The authors hoped to both characterize and better understand these prokaryotic enzymes, and also, by contrast, provide insight into the core mechanisms of the eukaryotic protein kinases. The authors used remote homology methods, and bootstrapped on their discoveries to detect more than 45,000 PKL sequences. These clustered into 20 major families, of which the ePKs were just one. Ten residues are conserved between these families: six were known to be important in catalysis, but four more (including three highly conserved in ePKs) are still poorly understood, despite their ancient conservation. Extensive family-specific features were found, including the surprising loss of all but one of the ten key residues in one family or another. The authors explored some of these losses and found several cases in which changes in one key motif substitute for changes in another, demonstrating the plasticity of these sequences. Similar approaches can be used to better understand any other family of protein sequences.

Functional Diversity of PKL Families

These 20 PKL families display great functional and sequence diversity, though common sequence motifs and functional themes recur. Some families are entirely uncharacterized, and few have been well studied, though most have some characterized members, many with known kinase activity. Their substrates include proteins and small molecules such as lipids, sugars, and amino acids, and they generally appear to have regulatory functions (Table 1). This is in contrast to the diversity of several other structurally unrelated, small-molecule kinase families that play largely metabolic roles [25]. Profile–profile alignments show clear but distant relationships between several families, which are enclosed by ovals in Figure 1. The ePK cluster includes pknB, which is highly similar to but distinct from ePK and is distinguished by its exclusive bacterial specificity, as opposed to the mostly eukaryotic ePK family. The other major cluster is centered on the large and divergent CAK (choline and aminoglycoside kinase) family, and includes three other families of small-molecule kinases. CAK itself is particularly diverse, containing subfamilies that are specific for choline/ethanolamine and aminoglycosides, as well as many novel subfamilies, some of which are specific to eukaryotic sublineages. A looser cluster is formed between the Rio and Bud32 families, which are universal among both eukaryotes and archaea, and the bacterial lipopolysaccharide kinase family KdoK. An additional four families (UbiB, revK, MalK, CapK) are distantly related to all three clusters, and are distinct from another set (PI3K, AlphaK, and IDHK), which have even less similarity to any other kinase; for PI3K and AlphaK, the relationship to kinases was determined by structural comparisons [11], while IDHK displays only conservation of the key residues and motifs found in all PKL kinases.

Figure 1. Sequence and Structure Based Clustering of PKL Families

Despite minimal sequence similarity, relationships between families can be estimated by profile–profile matching and alignments restricted to conserved motifs. Three main clusters of families are seen (shaded ovals): CAK, ePK, and KdoK. Four more families (towards bottom) are distantly related to these clusters, while three more (PI3K, AlphaK, IDHK, at bottom) have no sequence similarity outside a subset of key motifs. The area of each sphere represents the family size within GOS data.

doi:10.1371/journal.pbio.0050017.g001

Sequence similarity between these 20 families varies from very low (~20%) to almost undetectable. Sequence-profile methods are generally required to align families within the oval clusters of Figure 1, while alignments between clusters require profile–profile methods. The diversity of this collection is demonstrated by comparison with the automated sequence- and profile-based clustering of the overall GOS analysis [22], which assigns 93% of these sequences into 32 clusters, each of which is largely specific to one of our 20 families.
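To make the similarity figures concrete, the sketch below shows one common way to compute percent identity from a pairwise alignment. It is illustrative only: the text does not state which identity definition was used, so the choice here (identical residues divided by columns where neither sequence has a gap) is an assumption, and the function name and the toy sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where both sequences have a residue.

    Both inputs must be rows of the same alignment (equal length).
    Columns with a gap in either sequence are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if x == y:
            identical += 1
    if compared == 0:
        return 0.0  # no ungapped columns to compare
    return 100.0 * identical / compared


if __name__ == "__main__":
    # Toy aligned fragments, invented for illustration (not real kinases).
    print(percent_identity("GSGAFG-VYK", "GEGTYGTVWR"))
```

Other denominators (full alignment length, or the length of the shorter sequence) give different values for the same alignment, which matters when identities are as low as those reported here.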

Key Conserved Residues Unify Diverse Kinase Families

Comparison between all families reveals a set of ten key residues that not only account for one-third of the residues conserved within each family, but also are consistently

Table 1. The 20 PKL Families, Their Gene Counts in GOS and NCBI-nr, and Functional Notes

Family GOS Count/NR Count Description

ePK 2753/24924 Eukaryotic protein kinase, almost exclusively eukaryotic.

pknB 1525/1047 Bacterial-specific, enriched in several phyla, closely related to ePK.

BLRK 21/28 Bacterial leucine-rich kinase, an uncharacterized ePK-like family containing leucine-rich repeats. Mostly restricted to proteobacteria.

GLK 38/17 Glycosylase-linked kinase. Previously unannotated. Sequences from bacterial phyla fused to or neighbors of a DNA glycosylase domain. Archaeal members are neighbors of tRNA-associated genes.

HRK 259/78 Haspin-related kinase. Consists of eukaryotic haspin protein kinases [55] and two distinct sets of largely viral kinases, one of which lacks the GxGxxG and VAIK motifs, suggesting that they may be catalytically inactive and/or interfere with host kinase signaling.

Bub1 9/112 Pan-eukaryotic protein kinase, functions in mitotic spindle assembly.

Bud32 139/123 Universal/single copy gene in eukaryotes and archaea. In vertebrates it phosphorylates p53, while in S. cerevisiae it is involved in bud site selection, both unconserved processes [56]. Recently implicated in telomere regulation [57].

Rio 133/249 Universal eukaryotic/archaeal protein kinase, implicated in control of translation, a function that is highly conserved between eukaryotes and archaea [58]. Also found in some bacteria, particularly proteobacteria.

KdoK 389/199 Small family of bacterial kinases known to phosphorylate sugar moieties of LPS [59]. Reportedly autophosphorylates on tyrosine [60]. High sequence variation, even at key motifs, suggests diverse functions.

CAK 3997/1427 Choline and aminoglycoside kinases. Includes many novel subfamilies. Bacterial choline kinase (ChoK, licA) modifies LPS, enabling mucosal binding for several human-commensal and pathogenic bacteria [61]. Expression is controlled by phase variation. Metazoan choline kinases are involved in the production of phosphatidyl choline and acetylcholine, and metazoans also have related ethanolamine kinases. Aminoglycoside kinases (aminoglycoside phosphotransferases, APH) phosphorylate and inactivate aminoglycosides, antibiotics that target the bacterial ribosome [62]. They are produced as antidotes by aminoglycoside-producing bacteria, and by many of their targets.

HSK2 1649/93 One of several structurally distinct homoserine kinases, involved in threonine biosynthesis. Found mostly in α-proteobacteria, mirroring the distribution of HSK1 in γ-proteobacteria. Unlike HSK1, it does not chromosomally cluster with other threonine biosynthesis genes, but is usually linked to the lytB gene involved in isoprenoid biosynthesis, to RNAse H1 and to clusters of novel genes (NK, GM, unpublished data). These suggest additional functions for this family.

FruK 390/136 Fructosamine kinase. Initiates repair of aging proteins by phosphorylating residues damaged by glycosylation, leading

to their repair [63]. Found in most eukaryotes and many bacteria, and may also have sugar kinase activities [64].

MTRK 144/38 MethylThioRibose kinase. Involved in a sulphur salvage pathway of methionine synthesis. Expression is controlled by

methionine levels in K. pneumonia [65] and by starvation in B. subtilis [66]. Present in select bacteria and plants, but

not in higher eukaryotes.

UbiB 4110/623 UbiB (ABC1 in eukaryotes). Regulates the ubiquinone (co-enzyme Q) biosynthesis pathway in both prokaryotes and

yeast [67,68]. It is speculated to activate an unknown mono-oxygenase in the ubiquinone biosynthesis pathway, possi-

bly in response to aerobic induction. Ubiquitous in eukaryotes and widespread in bacteria.

MalK 29/80 Maltose kinase. Contains two members shown biochemically to be maltose kinases [69]. Most public members are

annotated as trehalose synthases, based on transitive annotation from a member that is fused to trehalose synthase.

RevK 116/77 Reverse kinase. Novel family that lacks the N-terminal ATP-binding GxGxxG loop, but on the C-terminus is usually fused

a P-type ATPase domain, including an ATP-binding GxxGxG motif. No functional annotation.

CapK 308/21 Capsule kinase. Uncharacterized family. Chromosomal neighbors in two bacterial phyla involved in capsule synthesis. A

viral subset lacks obvious motifs upstream of the H164xD motif, reminiscent of the viral HRK subfamily.

PI3K 79/702 Eukaryotic PI39 and PI49 lipid kinases and associated PIKK protein kinases.

AlphaK 13/100 Eukaryotic protein kinases with diverse functions.

IDHK 111/58 Isocitrate dehydrogenase kinase (AceK). Small, highly conserved family. Activates the glyoxylate bypass in E. coli, used

for survival on acetate or fatty acids, by phosphorylating and inhibiting isocitrate dehydrogenase [70].

doi:10.1371/journal.pbio.0050017.t001
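The gene counts in Table 1 lend themselves to a quick tabulation. The Python sketch below copies the GOS and NCBI-nr counts from the table and computes column totals and a per-family GOS-to-nr ratio. The names `COUNTS`, `gos_to_nr_ratio`, and `ranked` are illustrative choices for this example and are not part of the original analysis.

```python
# GOS and NCBI-nr gene counts per family, copied from Table 1 as (GOS, nr).
COUNTS = {
    "ePK": (2753, 24924), "pknB": (1525, 1047), "BLRK": (21, 28),
    "GLK": (38, 17), "HRK": (259, 78), "Bub1": (9, 112),
    "Bud32": (139, 123), "Rio": (133, 249), "KdoK": (389, 199),
    "CAK": (3997, 1427), "HSK2": (1649, 93), "FruK": (390, 136),
    "MTRK": (144, 38), "UbiB": (4110, 623), "MalK": (29, 80),
    "RevK": (116, 77), "CapK": (308, 21), "PI3K": (79, 702),
    "AlphaK": (13, 100), "IDHK": (111, 58),
}


def gos_to_nr_ratio(family):
    """Ratio of GOS to NCBI-nr gene counts for one family (hypothetical helper)."""
    gos, nr = COUNTS[family]
    return gos / nr


total_gos = sum(gos for gos, _ in COUNTS.values())
total_nr = sum(nr for _, nr in COUNTS.values())

# Families ordered from proportionally most GOS-rich to least.
ranked = sorted(COUNTS, key=gos_to_nr_ratio, reverse=True)

for family in ranked:
    gos, nr = COUNTS[family]
    print(f"{family:7s} GOS={gos:5d} nr={nr:6d} ratio={gos_to_nr_ratio(family):6.2f}")
print(f"total   GOS={total_gos} nr={total_nr}")
```

Sorting by ratio is only a convenience for seeing which families are proportionally better represented in GOS than in nr; it says nothing about the redundancy of either database.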

PLoS Biology | www.plosbiology.org | S93 Special Section from March 2007 | Volume 5 | Issue 3 | e170469

Microbial Kinome


conserved between families, constituting a core pattern of conservation that helps define this superfamily (Table 2, Figure 2, Figure 3). These residues are conserved across the major divisions of life, which diverged one to two billion years ago, and across diverse families, which presumably diverged even earlier. Thus, they are likely to mediate core functions of the catalytic domain rather than merely maintaining their structures. Six of these residues are known to be involved in ATP and substrate binding and catalysis (G52, K72, E91, D166, N171, D184; residues numbered based on PKA structure 1ATP except where otherwise noted; see Table 3). The full functions of the other four remain unclear, though three of them (H158, H164, and D220) are part of a hydrogen-bonding network that links the catalytically important DFG motif with substrate binding regions (Figure 2). The conservation of this network across diverse PKL structures suggested a role for this network in coupling DFG motif-associated conformational changes with substrate binding and release [16]. Despite this ancient conservation, different families of ePKs have lost individual members of this triad without destroying structure or catalytic function: H164 is changed to a tyrosine in PKA and many other AGC families; H158 is lost in most tyrosine kinases; and D220 is lost in the Pim family. The Pim1 structure retains an ePK-like structure, perhaps in part due to stabilization of the catalytic loop by the activation loop, a function normally performed by D220 [26], suggesting a novel mode of coupling ATP and substrate binding in this family. The individual loss of each member of this triad suggests that they have independent functions yet to be understood.

Sequence and Structural Diversity

Family-specific functions are mediated by features that are highly conserved within families, but that are divergent between families (Figure 4). Many family-selective residues map to the motifs surrounding the ten key residues, or to the divergent C-terminal substrate-binding region (Tables 2 and S1). The proximity of these residues to the active site suggests that they are key in selecting substrates or tuning mechanism of action. For instance, the 4-amino acid (aa) stretch between the HxD166 and N171 residues is highly conserved but distinct between families (Figure 4), and provides a discriminative signature that defines each family. Within ePKs, tyrosine and serine/threonine-specific kinases display distinct patterns of conservation within this 4-aa stretch [27]. Serine/threonine kinases conserve a [LI]KPx motif within this stretch, while tyrosine kinases conserve a [LI]AAR motif. These variations alter the surface electrostatics of the substrate-binding pocket, thereby contributing to substrate specificity [27].

The C-terminal region of ~100 aa following the DFG motif is highly divergent between families, apart from the conserved D220 at the beginning of the F-helix (Figure 2; Dataset S3). Secondary structure is generally predicted to be helical, but the poor sequence conservation and known structures [11] suggest that the overall orientation of the helices may be different between families. Notably, in the crystal structures of APH bound to its substrate, kanamycin [28], the relative positioning of the substrate-binding helices (αH–αI) is distinct from that of ePKs (Figure 2). The presence of unique patterns of conservation in each family (Table 2) also suggests that this region is involved in family-specific functions.

Several families contain sizeable (~30–100 aa) insert segments between core subdomains that are specific to clusters of families. Most CAK members have an insert segment between subdomains VIa and VIb. There is very little sequence similarity within this segment across CAK members, but structures of APH and ChoK indicate some structural similarity and highlight its role in substrate binding [28,29]. An equivalent insert is seen in the other CAK cluster families, FruK, HSK2, and MTRK. Similarly, KdoK and Rio contain an insert between subdomains II and III, which shows some sequence similarity between these families. In the Rio2 structure, this insert is disordered, but the presence of a conserved threonine suggests a possible regulatory role [14]. This region also contains an insert in the distinct UbiB family.
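The [LI]KPx and [LI]AAR signatures described in this section for the 4-aa stretch between D166 and N171 can be written as regular expressions over the catalytic loop. The Python sketch below is an illustration under stated assumptions, not the authors' method: the function name, the requirement that the stretch be flanked directly by D and N, and the short example peptides are all hypothetical choices made for this example.

```python
import re

# Four residues between D166 and N171 (PKA numbering), flanked by D and N.
# Both patterns are illustrative encodings of the motifs named in the text.
SER_THR_LOOP = re.compile(r"D([LI]KP.)N")   # [LI]KPx
TYR_LOOP = re.compile(r"D([LI]AAR)N")       # [LI]AAR


def classify_catalytic_loop(sequence):
    """Label a sequence by which catalytic-loop signature it contains."""
    if TYR_LOOP.search(sequence):
        return "tyrosine"
    if SER_THR_LOOP.search(sequence):
        return "serine/threonine"
    return "unclassified"


# Hypothetical example peptides, chosen only to exercise each branch.
EXAMPLES = ["YRDLKPEN", "HRDLAARN", "HRDLRAAN"]

for peptide in EXAMPLES:
    print(peptide, classify_catalytic_loop(peptide))
```

A strict pattern of this kind will miss variants of either motif, so in practice a profile-based score would be a more forgiving choice than an exact match.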

Figure 2. The Conserved Core and Variable Regions of the Catalytic Domain

The conserved core in three distinct families, namely ePK (PKA [52]), Rio (A. fulgidus Rio2 [14]), and CAK (APH(3′)-IIIa [12]). The conserved regions are shown in ribbon representation and the variable regions in surface representation. The illustrations were created in PyMOL (http://www.pymol.org). Some highly conserved residues (see Figure 3) and their associated interactions are shown.
doi:10.1371/journal.pbio.0050017.g002


Finally, the ePK, pknB, and HRK families contain an extended activation loop between subdomains VIII and IX. These kinases are generally activated by phosphorylation of this loop, the negative charge of which helps to coordinate key structural elements during the activation process, including a family-selective HRD arginine in the catalytic loop [30,31].

Mechanistic Diversity of the Catalytic Core

A surprising finding was that while ten key residues are conserved both within and between families, all but one of them were dispensable in one family or another (Figure 3), indicating that even catalytic residues are malleable in the appropriate context. Here we explore the effect of loss of the “catalytic lysine” K72, which typically positions the α and β phosphates of ATP (Figure 5A). Mutation of this lysine in ePKs is a common method to make inactive kinases [32]. Yet this residue is conserved as an arginine (R111ChoK) in most CAK subfamilies, as a methionine in the CAK-chloro subfamily, and as a threonine in the related HSK2 family (Figure 4).
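Labels such as R111ChoK alongside K72 depend on mapping a residue in one kinase onto the PKA numbering through an alignment. A minimal Python sketch of that bookkeeping follows. The two aligned strings and their start numbers are made-up toy data, not real PKA or ChoK sequence, and the function name `map_position` is hypothetical.

```python
def map_position(ref_aln, tgt_aln, ref_start, tgt_start, ref_pos):
    """Find the target residue aligned to reference residue number ref_pos.

    ref_aln and tgt_aln are equal-length aligned strings with '-' for gaps;
    ref_start and tgt_start are the numbers of the first residue in each.
    Returns (residue, number) for the target, or None if the target has a
    gap at that column or the reference position is not in the alignment.
    """
    if len(ref_aln) != len(tgt_aln):
        raise ValueError("aligned sequences must have equal length")
    ref_num = ref_start - 1
    tgt_num = tgt_start - 1
    for ref_char, tgt_char in zip(ref_aln, tgt_aln):
        if ref_char != "-":
            ref_num += 1
        if tgt_char != "-":
            tgt_num += 1
        if ref_char != "-" and ref_num == ref_pos:
            return None if tgt_char == "-" else (tgt_char, tgt_num)
    return None


# Toy alignment (invented): reference numbered from 69, target from 109.
REF = "VAIK--ILD"
TGT = "-LVRQGELN"

print(map_position(REF, TGT, 69, 109, 72))
```

With this toy data the reference lysine at position 72 lines up with an arginine numbered 111 in the target, which mirrors the style of the K72 and R111ChoK labels without reproducing the real alignment.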

In the two major CAK subfamilies with a conserved R72 (FadE and choline kinase [ChoK]), we see correlated changes in the glycine-rich and DFG motifs (Figure 4). Specifically, the Phe and Gly within the GxGxFG motif (F54 and G55) are changed to Ser/Thr and Asn, respectively (S86ChoK, N87ChoK), and G186 within the DFG motif is changed to E. Both the GxGxFG and DFG motifs are spatially proximal to K72 (Figure 5A). Thus, correlated changes in these two motifs could structurally account for the K-to-R change. Indeed, in the ChoK crystal structure [13], N55 protrudes into the ATP binding pocket, and hydrogen bonds to R72. In addition, the conserved E91 in helix C, which typically forms a salt bridge with K72, is hydrogen bonded (via a water molecule) to the covarying E186, thus linking these three correlated changes and stabilizing R72 in a unique conformation (Figure 5B). By contrast, the two solved APH structures (1ND4 and 2BKK) retain the “ancestral” sequence state with K72 and G186, and lack N55.

Mutation of R72 or E186 to alanine in ChoK reduces the catalytic rate by several fold [33]. To test the possible role of these residues in the ChoK catalytic mechanism, we modeled an ATP in the active site of ChoK (based on the nucleotide-bound structures of APH and PKA). This revealed that R72 partially occludes the ATP binding site and is likely to move upon ATP binding. Notably, a K72-to-R mutation in Erk2 [34] also exhibits a conformational change in R72 upon nucleotide

Figure 3. Conservation of Secondary Structure, Key Motifs, and Residues between Families

The ePK secondary structure is shown with standard annotations of subdomains [53] and structural elements. Subdomains I–IX are generally conserved in all PKLs. Key residues are bolded and numbered; dashed lines point to positions within secondary structure elements. The table below shows the conservation (% identity) of the ten key residues, showing their broad conservation across families, but the successful replacement of almost all of them in at least one family. Parentheses indicate changes to another conserved residue and dashes indicate unconserved positions. Key residues are numbered based on their position in PKA: G52, K72, E91, P104 (V in PKA), H158, H164 (Y in PKA), D166, N171, D184, and D220. More detailed figures are shown in Dataset S3.
doi:10.1371/journal.pbio.0050017.g003


binding (Figure 5C). A similar conformational change in ChoK upon ATP binding could result in formation of an R72–E91 salt bridge similar to the activation of ePKs (Figure 5A). In this conformation, R72 could potentially hydrogen bond to both E91 and the covarying E186 in ChoKs, which might explain the covariation of R72 and E186 in these families.

Variation on a Theme

Other CAK members display distinct coordinated changes at the G55, K72, and G186 positions. The chloro subfamily of CAK loses the positive charge at position 72 altogether, replacing it with methionine, and has concurrent changes to R55 and Q186 (Figure 4). This may reflect a shift of the positive charge from position 72 to 55, an event that also happened in Wnk kinases, the only functional ePK family that lacks K72. The conserved K55 of Wnks is required for catalysis and has been shown to interact with ATP similarly to K72 of PKA [35] (Figure 5D). Hence, two evolutionary inventions may have converted the same core motif residue from one function to another. In CAK-chloro, the unpaired E91 position loses its charge to become a conserved Phe. The function of this Phe is unknown, but is likely to be important since it is also conserved in HSK2, a related family, and the only other kinase family to conserve a Phe at the E91 position (Figure 4).

Evolution of Conformational Flexibility and Regulation in ePKs

The ePK catalytic domain is highly flexible and undergoes extensive conformational changes upon ATP binding [36]. In contrast, crystal structures of APH, solved in both ATP-bound and -unbound forms, revealed modest structural changes in the ATP-binding pocket [37]. This difference in conformational flexibility is reflected in the patterns of conservation at key positions within the ATP-binding glycine-rich loop (Figure 4). Specifically, two conserved glycines (G50 and G55), which contribute to the conformational flexibility of this loop in ePKs, are replaced by non-glycines in APH. These two glycines are absent in several PKL families (Figure 4) while G52, which is involved in catalysis, is present in most, suggesting that the conformational flexibility of the nucleotide-binding loop is a feature of selected PKL families such as ePKs. Since conformational flexibility allows for regulation, it is likely that modest structural changes associated with nucleotide binding gradually evolved into quite dramatic structural rearrangements required to ensure that key players in various signaling pathways act only at the right place and at the right time. The conserved glycine (G186) within the catalytically important DFG motif may likewise have evolved for regulatory functions in ePKs [38]. This glycine is highly conserved in the ePK cluster but is absent from most other

Table 2. Distribution of Residues That Are >90% Identical within Each Family

| Family | Key Residues | Motif-Associated | C-Term Unique | Semiconserved | Other | Total |
|---|---|---|---|---|---|---|
| ePK | 8 | 4 | 1 | 4 | 0 | 17 |
| pknB | 9 | 3 | 0 | 8 | 0 | 20 |
| BLRK | 8 | 12 | 4 | 4 | 18 | 46 |
| HRK | 5 | 2 | 0 | 1 | 0 | 8 |
| Bub1 | 8 | 7 | 3 | 5 | 4 | 27 |
| GLK | 7 | 1 | 2 | 2 | 0 | 12 |
| Bud32 | 10 | 6 | 3 | 2 | 2 | 23 |
| Rio | 9 | 4 | 0 | 1 | 0 | 14 |
| KdoK | 2 | 0 | 0 | 1 | 0 | 3 |
| CAK | 5 | 0 | 0 | 0 | 0 | 5 |
| HSK2 | 9 | 9 | 11 | 7 | 4 | 40 |
| FruK | 9 | 6 | 5 | 5 | 2 | 27 |
| MTRK | 9 | 10 | 4 | 2 | 8 | 33 |
| UbiB | 8 | 6 | 0 | 1 | 4 | 19 |
| MalK | 8 | 13 | 15 | 0 | 10 | 46 |
| revK | 5 | 1 | 2 | 0 | 5 | 13 |
| CapK | 1 | 0 | 0 | 0 | 0 | 1 |
| PI3K | 4 | 3 | 1 | 1 | 2 | 11 |
| AlphaK | 5 | 4 | 6 | 1 | 7 | 23 |
| IDHK | 9 | 17 | 10 | 1 | 23 | 60 |
| Total | 138 (31%) | 108 (24%) | 67 (15%) | 46 (10%) | 89 (20%) | 448 (100%) |

Across the ~250-aa domain, almost one-third of >90% conserved residues map to the ten key residues that are also conserved between families, and more than half map to these key residues or their surrounding motifs (GxGxxGxxxx, vaiK, E, vP, LxxLH, xxHxDxxxNxx, xxDxGxx, DLA; boldfacing indicates the ten key residues). An additional 15% are found in the largely unalignable region C-terminal of DxG, strongly suggesting family-specific functions, and 10% are semi-conserved, being found in some, but not most families. See Dataset S3 for details.
doi:10.1371/journal.pbio.0050017.t002
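The totals and percentages in the last row of Table 2 follow directly from the per-family counts. As a worked check, the Python sketch below re-enters the rows of the table and recomputes the column totals and their shares of the grand total. The names `TABLE2`, `CLASSES`, and `column_totals` are illustrative and not from the original analysis.

```python
# Residue classes, in the column order of Table 2.
CLASSES = ("key", "motif-associated", "c-term unique", "semiconserved", "other")

# Per-family counts of >90% identical residues, copied from Table 2.
TABLE2 = {
    "ePK": (8, 4, 1, 4, 0), "pknB": (9, 3, 0, 8, 0),
    "BLRK": (8, 12, 4, 4, 18), "HRK": (5, 2, 0, 1, 0),
    "Bub1": (8, 7, 3, 5, 4), "GLK": (7, 1, 2, 2, 0),
    "Bud32": (10, 6, 3, 2, 2), "Rio": (9, 4, 0, 1, 0),
    "KdoK": (2, 0, 0, 1, 0), "CAK": (5, 0, 0, 0, 0),
    "HSK2": (9, 9, 11, 7, 4), "FruK": (9, 6, 5, 5, 2),
    "MTRK": (9, 10, 4, 2, 8), "UbiB": (8, 6, 0, 1, 4),
    "MalK": (8, 13, 15, 0, 10), "revK": (5, 1, 2, 0, 5),
    "CapK": (1, 0, 0, 0, 0), "PI3K": (4, 3, 1, 1, 2),
    "AlphaK": (5, 4, 6, 1, 7), "IDHK": (9, 17, 10, 1, 23),
}


def column_totals(table):
    """Sum each residue class over all families."""
    return [sum(row[i] for row in table.values()) for i in range(len(CLASSES))]


totals = column_totals(TABLE2)
grand_total = sum(totals)
percentages = [round(100 * t / grand_total) for t in totals]

for name, total, pct in zip(CLASSES, totals, percentages):
    print(f"{name:17s} {total:4d} ({pct}%)")
print(f"{'total':17s} {grand_total:4d}")
```

The key-residue and motif-associated classes together make up 246 of 448 residues, which is the "more than half" stated in the table note.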

Figure 4. Sequence Logos Depicting Conservation of Core Motifs and Neighboring Sequences across Most Kinase Families and Selected CAK Subfamilies

Motifs are GxGxxGxxxx, VAIK, E, LxxLH, xxHxDxxxxNxx, xxDFGxx, and Dxx. The size of the letters corresponds to their information content [54]. Families with fewer than 100 members (BLRK, GLK) are omitted. The diverse CAK family is represented by four distinct subfamilies: APH contains many aminoglycoside resistance kinases and ChoK includes most ChoKs, while FadE and chloro are less well described. For the HRK family, the first two motif logos omit the viral subfamily that lacks these motifs.
doi:10.1371/journal.pbio.0050017.g004
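Letter height in a sequence logo reflects the information content of each alignment column. As a hedged illustration, the Python sketch below computes the uncorrected information content of a protein column, log2(20) minus the Shannon entropy of the observed residue frequencies. It is an assumption-laden simplification: it omits the small-sample correction that logo software may apply, it ignores gap characters, and the function name is hypothetical.

```python
import math
from collections import Counter


def information_content(column, alphabet_size=20):
    """Uncorrected information content (bits) of one alignment column.

    column is a string of one-letter residues; '-' marks a gap and is
    ignored. A column made only of gaps is given zero information.
    """
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    n = len(residues)
    counts = Counter(residues)
    entropy = -sum((k / n) * math.log2(k / n) for k in counts.values())
    return math.log2(alphabet_size) - entropy


# Toy columns (invented): fully conserved, split two ways, and all gaps.
for toy in ("GGGG", "KKRR", "----"):
    print(toy, round(information_content(toy), 3))
```

A fully conserved column reaches the maximum of log2(20) bits, and a column split evenly between two residues loses exactly one bit relative to that maximum.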


Figure 5. Mechanistic Diversity of the ATP-Binding Pocket.

(A) PKA showing structural interactions associated with K72 in active ATP-bound state. The salt bridge interaction between K72 and E91 is shown by dotted lines.
(B) Structural interactions associated with Arg111ChoK in ChoK.
(C) Conformational changes associated with Arg52Erk2 in the Erk2 mutant structure. Here, the arginine does not form a salt bridge interaction with Glu69Erk2 (E91), but moves closer towards Glu69Erk2 upon ATP binding.
(D) Inactive state of Wnk1: K72 is shifted over to the G-loop (K233Wnk1) and E91 (Glu268Wnk1) hydrogen bonds to a conserved Arg (R348Wnk1 within the HRD motif) in the catalytic loop.
(A–D) Residues conserved across all the major families are colored in magenta, while family-specific residues are colored in gold. Hydrogen bonds are indicated in dotted lines.
doi:10.1371/journal.pbio.0050017.g005

Table 3. Structural/Functional Role of Highly Conserved Residues

| Residue (PKA Number) | Structural Location | Structural/Functional Role |
|---|---|---|
| G52 | Glycine-rich loop | The backbone of this glycine coordinates the γ-phosphate of ATP and facilitates phosphoryl transfer [71]. |
| K72 | β3 strand | Hydrogen bonds to the α-oxygen and β phosphate of ATP [72]. |
| E91 | C-helix | Forms a salt bridge interaction with K72 and functions as a regulatory switch in ePKs [73]. |
| P104 | αC-β4 loop | Unknown function. Absent from ePKs. |
| H158 | C-terminus of E helix | Hydrogen bonds to D220 in the F-helix and is part of a hydrogen bond network that couples ATP and substrate-binding regions [16]. |
| H(Y164) | Catalytic loop αE-β6 | Hydrogen bonds to the backbone of the residue before the DFG-Asp and integrates substrate and ATP binding regions [16]. |
| D166 | Catalytic loop | Catalytic base [72]. |
| N171 | Catalytic loop | Coordinates the second Mg2+ ion and involved in phosphoryl transfer. |
| D184 | N-terminus of activation loop | Coordinates the first Mg2+ ion. |
| D220 | N-terminus of F helix | Hydrogen bonds to the backbone of the catalytic loop and positions this loop relative to substrate-binding regions [16]. |

doi:10.1371/journal.pbio.0050017.t003


PKL families. However, within the small subfamily of magnesium-dependent Mnk ePK kinases, G186 is changed to aspartate (DFD). In the Mnk2 crystal structure, this DFD motif adopts an “out” conformation in which F185 protrudes into the ATP-binding site. This is in contrast to the “in” conformation, where it packs up below the C-helix [39]. Mutation of the Mnk2 D186 “back” to glycine results in both in and out conformations of the DFG motif, supporting the role of G186 in DFG-associated conformational changes. Such conformational transitions may facilitate regulation of activity since the conformation of the catalytic aspartate is also changed during this transition [38]. This may also explain why the ePK-specific extended activation loop, which is phosphorylated and undergoes dramatic conformational changes, is directly attached to the DFG motif (Figure 6A).

In addition to the flexible catalytic core, the substrate-binding regions appear to have evolved for tight regulation of ePK activity. In particular, the conserved G helix, which was recently shown to undergo conformational changes upon substrate binding [40], is uniquely oriented in ePK/pknB (Figure 6A). Several ePK-conserved residues and motifs are at the interface between the G helix and the catalytic core (Figure 6B). These include the APE motif, located at the C-terminal end of the activation loop, a W-[SA]-X-[G] motif in the F-helix, and an arginine (R280), at the beginning of the I helix (Figure 6B). These three motifs structurally interact with each other and form a network that couples the substrate- and ATP-binding regions (Figure 6B). This network also involves conserved buried water molecules, which are known to contribute to the conformational flexibility of proteins [41]. Thus, this ePK/pknB-conserved network may also facilitate regulation by increasing the conformational flexibility of the substrate-binding regions [16].

Discussion

Data from the GOS voyage provide a huge increase in available sequences for most prokaryotic gene families, enabling new studies in discovery, classification, and evolutionary and structural analysis of a wide array of gene families. Even for a eukaryotic family such as ePK kinases, GOS provides insights by greatly increasing understanding of related PKL families. GOS increases the number of known ELK sequences more than 3-fold, and has enabled both the discovery of novel families of kinases and a detailed analysis of conservation patterns and subfamilies within known families. We believe that the GOS data, coupled with

Figure 6. ePK-Specific Motifs and Interactions in the Substrate-Binding Region

(A) The ePK-specific activation loop and G-helix are shown in PKA (PKA [52]). The corresponding regions are shown in Rio (A. fulgidus Rio2 [14]). The activation loop and G-helix are colored in red, and the core-conserved residues are shown in stick representation.
(B) The three ePK-specific motifs in the C-terminal substrate-binding lobe and their structural interactions are shown. Hydrogen bonds are indicated by dotted lines. The conserved buried water is shown in CPK representation.
doi:10.1371/journal.pbio.0050017.g006


the recent strong growth in whole-genome sequencing, provide the opportunity for similar insights into virtually every gene family with prokaryotic relatives.

PKL kinases are largely involved in regulatory functions, as opposed to the metabolic activities of other kinases with different folds [25]. The characteristics of this fold that led to the explosion of diverse regulatory functions of eukaryotic ePKs have also been exploited for many different functions within prokaryotes. While these kinases reflect only ~0.25% of genes in both GOS and microbial genomes (ePKs represent ~2% of eukaryotic genes [42]), indicating a simpler prokaryotic lifestyle, they now outnumber the count of ~12,000 histidine kinases that we observe in GOS [22], suggesting that ELKs may be at least as important in bacterial cellular regulation as the “canonical” histidine kinases.

PKL kinases cross huge phylogenetic and functional spaces while still retaining a common fold and biochemical function of ATP-dependent phosphorylation. The presence of Rio and Bud32 genes in all eukaryotic and archaeal genomes suggests that at least this cluster dates back to the common ancestor of these domains of life. Similarly, the presence of UbiB in all eukaryotes and most bacterial groups, the close similarity of pknB/ePK families, and the widespread bacterial/eukaryotic distribution of FruK suggest their origins before the emergence of eukaryotes, or from an early horizontal transfer. Their ancient divergence leaves little or no trace of their shared structure within their protein sequence other than at functional motifs, which include a set of ten key residues that are highly conserved across all PKLs.

Despite the huge attention paid to ePKs, four key residues (P104, H158, H164, D220), three of which are highly conserved in ePKs, are still functionally obscure and worthy of greater attention, both in ELKs and ePKs. Conversely, it appears that nine of the ten key residues have been eliminated or transformed in individual families while maintaining fold and function, showing that almost anything is malleable in evolution given the right context. That right context is frequently a set of additional changes in the family-specific motifs surrounding these key residues, and we see that in the case of K72, a substitution to arginine triggers a cascade of other core substitutions that serve to retain basic function, while a substitution to methionine involves a shift of the positive charge normally provided by K72 to another conserved residue, in both CAK-chloro and Wnk kinases. Other core changes are also seen independently in very distinct families, such as the G55-to-A change in UbiB and the chloro subfamily of CAK, or the E91-to-F change in both chloro and HSK2, suggesting that these kinases are sampling a limited space of functional replacements.

These families vary greatly in diversity. While the ePK family has expanded to scores of deeply conserved functions [42], other families, including Bud32, Rio, Bub1, and UbiB, usually have just one or a handful of members per genome, suggesting critical function but an inability to innovate. The largely prokaryotic CAK family is also functionally and structurally diverse, containing several known functions and many distinct subfamilies likely to have novel functions. The diversity of both CAK and KdoK sequences may be related to their involvement in antibiotic resistance and immune evasion, likely to be evolutionarily accelerated processes. Comparison of CAK to the related and more functionally constrained HSK2, FruK, and MTRK families may reveal adaptive changes such as the ePK-specific flexibility changes that may assist in its diversity of functions.

GOS data are rich in highly divergent viral sequences, and accordingly we find a number of new subfamilies of viral kinases, including two of the three subfamilies of HRK and a subfamily of CapK. In both cases we see loss of N-terminal–conserved elements, suggesting that these kinases may have alternative functions or even act as inactive competitors to host kinases.

These patterns of sequence conservation and diversity raise many questions that can only be fully addressed by structural methods. The combination of structural and phylogenetic insights for ChoK enabled insights that were not clear from the structure alone, and enabled us to reject other inferences from the crystal structure that were not conserved within this family, highlighting the value of combining these approaches. The relative ease of crystallization of PKL domains, the emergence of high-throughput structural genomics, and our understanding of the diversity of these families make them attractive targets for structure determination of selected members, and position this family as a model for analysis of deep structural and functional evolution.

Materials and Methods

Discovery and classification of kinase genes. Sequences used consisted of 17,422,766 open reading frames from GOS, 3,049,695 predicted open reading frames from prokaryotic genomes, and 2,317,995 protein sequences from NCBI-nr of February 10, 2005, as described [22]. Profile HMM searches were performed with a Time Logic Decypher system (Active Motif, http://timelogic.com) using in-house profiles for ePK, Haspin, Bub1, Bud32, Rio, ABC1 (UbiB), PI3K, and AlphaK domains, as well as Pfam profiles [43] for ChoK, APH, KdoK, and FruK, and TIGRFAM profiles [44] for HSK2 (thrB_alt), UbiB, and MTRK. A number (69) of additional ePK-annotated models from Superfamily 1.67 [45] were used to capture initial hits but not for further classification. Initial hits were clustered and re-run against all models, and each model was rebuilt and rerun three to seven times using ClustalW [46], MUSCLE [47], and hmmalign (http://hmmer.janelia.org) to align, followed by manual adjustment of alignments using Clustal and Pfaat [48] and model building with hmmbuild. Low-scoring members of each family (e > 1 × 10^-5) were used as seeds to build new putative families, and profile–profile and sequence–profile alignments were used to merge families into a minimal set (Dataset S2). A motif-based Markov chain Monte Carlo multiple alignment model [49] based on the conserved motifs of Figure 3 was run independently and used to verify HMM hits and seed new potential families for blast-based clustering, model building, and examination for conserved residues. Final family assignment was by scoring against the set of HMM models, with manual examination of sequences with borderline scores (e > 1 × 10^-5 or difference in e-values between best two models >.01).

Family annotations. Annotations of chromosomal neighbors used SMART [50] and a custom analysis of GOS neighbors ([22]; C. Miller, H. Li, D. Eisenberg, unpublished data). Annotation analysis was based on GenBank annotations and PubMed references. Taxonomic analysis used a mapping of GOS scaffolds to taxonomic groupings [22] and NCBI taxonomy tools.

Family alignments and logos. Residue conservation (Dataset S3) was counted from the final alignment using a custom script that omitted gap counts. These counts were then used to construct family logos using WebLogo (http://weblogo.berkeley.edu; [51]).
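The custom script mentioned above is not available here, so the following Python sketch only illustrates what gap-omitting conservation counting could look like. It assumes an alignment supplied as equal-length strings with '-' for gaps. The function names, the toy alignment, and the use of a strict greater-than test against the 0.9 threshold are assumptions made for the example.

```python
from collections import Counter


def column_profile(alignment, col, top_n=4):
    """Summarize one alignment column, excluding gaps from the fractions.

    Returns (top, n_residues, n_gaps), where top lists up to top_n
    (residue, fraction) pairs ordered from most to least common.
    """
    column = [seq[col] for seq in alignment]
    residues = [c for c in column if c != "-"]
    n_gaps = len(column) - len(residues)
    if not residues:
        return [], 0, n_gaps
    counts = Counter(residues)
    top = [(aa, n / len(residues)) for aa, n in counts.most_common(top_n)]
    return top, len(residues), n_gaps


def conserved_columns(alignment, threshold=0.9):
    """Indices of columns whose most common residue exceeds threshold."""
    hits = []
    for col in range(len(alignment[0])):
        top, n_residues, _ = column_profile(alignment, col)
        if n_residues and top[0][1] > threshold:
            hits.append(col)
    return hits


# Toy alignment (invented) of five three-residue sequences.
TOY = ["GKD", "GKD", "G-D", "GRE", "GKD"]

for col in range(3):
    print(col, column_profile(TOY, col))
print("conserved columns:", conserved_columns(TOY))
```

Dividing by the number of residues rather than the number of sequences is what keeps a gappy column from looking artificially unconserved, though very sparse columns would still need separate handling.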

Family comparisons. Relatedness between families was estimated using several methods. HMM–HMM alignments and scores were computed using PRC (http://supfam.org/PRC), and sequence–profile alignments using hmmalign were analyzed using custom scripts and by inspection. Both full-length and motif multiple alignments were also created and used for the family comparisons.

Supporting Information

Dataset S1. FastA-Formatted Sequence Files for Each of the 20 Kinase Families, Including Both GOS and Public Sequences

Found at doi:10.1371/journal.pbio.0050017.sd001 (10 MB BZ2).


Dataset S2. HMM Profiles for the 20 Kinase Families in HMMer Format

Found at doi:10.1371/journal.pbio.0050017.sd002 (2.4 MB HMM).

Dataset S3. Domain Profiles for 20 PKL Families

These 20 spreadsheets show the conservation profile at each residue of the kinase domain for each family, including annotations and classifications of individual residues. Each worksheet details the alignment of one kinase family to its HMM. Every row corresponds to a position within the alignment, listing the four most common amino acids (aa) in that row along with their fractional popularity. The number of aa’s and number of gaps at that position within the alignment is also listed. The “Notes” column annotates conservation status of selected residues and other notes, while the “>90% Conserved” column annotates those corresponding residues as to their class (Core, Motif, Motif-Associated, Semi-Conserved, C-terminal, Unique, or external to the kinase domain). A number of color highlights are used. (1) Positions with few aa’s in the alignment (typically inserts within the domain that are not of great interest) are shaded gray: typically dark gray for ≤20 aa at that position, and light gray for >20 but still low (the range varies depending on the depth of the alignment). Rows highlighted in gray have no highlights in any other columns and are assumed not to be part of the core domain. (2) Core motifs are highlighted in bold and blue. (3) The fractional count for the most popular aa is labeled green if 1, dark yellow if >0.9, and light yellow if >0.8 and <0.9.

Found at doi:10.1371/journal.pbio.0050017.sd003 (1.6 MB XLS).
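The per-position summary described for these worksheets can be illustrated with a minimal Python sketch. It is not the authors' script: it assumes the alignment is given as equal-length strings with "-" or "." marking gaps, and the function names (column_profile, highlight, alignment_profile) are hypothetical. Only the most-popular-residue color rule is modeled; the gray shading for sparse positions is left out because its thresholds vary with alignment depth.

```python
from collections import Counter

# Assumption: gaps are written as "-" or "." in the alignment strings.
GAP_CHARS = {"-", "."}


def column_profile(column, top_n=4):
    """Summarize one alignment column.

    Returns the top_n most common residues with their fraction among
    non-gap residues, plus the residue count and gap count.
    """
    residues = [c.upper() for c in column if c not in GAP_CHARS]
    n_gaps = len(column) - len(residues)
    total = len(residues)
    counts = Counter(residues)
    top = [(aa, n / total) for aa, n in counts.most_common(top_n)] if total else []
    return {"top": top, "n_aa": total, "n_gaps": n_gaps}


def highlight(top_fraction):
    """Color label for the most popular residue's fraction.

    Follows the stated rule literally: green if 1, dark yellow if > 0.9,
    light yellow if > 0.8 and < 0.9. A fraction of exactly 0.9 is not
    covered by the stated rule, so it gets no label here.
    """
    if top_fraction == 1:
        return "green"
    if top_fraction > 0.9:
        return "dark yellow"
    if 0.8 < top_fraction < 0.9:
        return "light yellow"
    return None


def alignment_profile(sequences, top_n=4):
    """Build one summary row per alignment position."""
    if len({len(s) for s in sequences}) > 1:
        raise ValueError("aligned sequences must all have the same length")
    return [column_profile("".join(col), top_n) for col in zip(*sequences)]


if __name__ == "__main__":
    toy_alignment = ["KDG", "KEG", "KD-", "KDA"]  # toy data, not from the datasets
    for position, row in enumerate(alignment_profile(toy_alignment), start=1):
        top_fraction = row["top"][0][1] if row["top"] else 0.0
        print(position, row, highlight(top_fraction))
```

Fractions here are computed over non-gap residues only; whether the spreadsheets normalize the same way is an assumption of this sketch.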

Accession Numbers

The Protein Databank (http://www.pdb.org) accession numbers for the structures discussed in this paper are PKA (1ATP), A. fulgidus Rio2 (1TQP), C. elegans choline kinase (1NW1), Erk2 (1GOL), Wnk1 (1T4H), and APH(3′)-IIIa (1J7L). The Pfam (http://pfam.cgb.ki.se) accession numbers for the families discussed in this paper are ChoK (PF01633.8), APH (PF01636.9), KdoK (PF06293.3), and FruK (PF03881.4). The TIGRFAM (http://www.tigr.org/TIGRFAMs) accession numbers for the families discussed in this paper are HSK2 (TIGR00938), UbiB (TIGR01982), and MTRK (TIGR01767).

Acknowledgments

We thank the Governments of Bermuda, Canada, Mexico, Honduras, Costa Rica, Panama, Ecuador, and French Polynesia for facilitating sampling activities. All sequencing data collected from waters of the above-named countries remain part of the genetic patrimony of the country from which they were obtained. We thank Chris Miller, Huiying Li, and David Eisenberg for access to analyses of chromosomal neighbors for function prediction, and Doug Rusch, Shibu Yooseph, and other members of the Venter Institute GOS team for taxonomic predictions, geographic analysis, and other data and tools. We thank Eric Scheeff and Tony Hunter for critical comments and structural insights, and Nina Haste for help with PyMOL.

Author contributions. JCV proposed and enabled collaboration. SST conceived of collaboration and provided structural insights and critical evaluation. NK and GM conceived and designed the experiments. NK, YZ, and GM performed the experiments. NK and GM analyzed the data. YZ and JCV contributed reagents/materials/analysis tools. NK and GM wrote the paper.

Funding. This work was supported by funding from the Razavi-Newman Center for Bioinformatics to GM and National Institutes of Health grant IP01DK54441 to SST. We gratefully acknowledge the US Department of Energy (DOE) Genomics: GTL Program, Office of Science (DE-FG02-02ER63453), the Gordon and Betty Moore Foundation, and the J. Craig Venter Science Foundation for funding of the GOS expedition.

Competing interests. The authors have declared that no competing interests exist.

References1. Manning G, Whyte DB, Martinez R, Hunter T, Sudarsanam S (2002) The

protein kinase complement of the human genome. Science 298: 1912–1934.2. Cohen P (2002) Protein kinases—The major drug targets of the twenty-first

century? Nat Rev Drug Discov 1: 309–315.3. Manning G, Plowman GD, Hunter T, Sudarsanam S (2002) Evolution of

protein kinase signaling from yeast toman. Trends Biochem Sci 27: 514–520.4. Parkinson JS (1993) Signal transduction schemes of bacteria. Cell 73:

857–871.5. Kennelly PJ, Potts M (1999) Life among the primitives: Protein O-

phosphatases in prokaryotes. Front Biosci 4: D372–D385.6. Leonard CJ, Aravind L, Koonin EV (1998) Novel families of putative

protein kinases in bacteria and archaea: Evolution of the ‘‘eukaryotic’’protein kinase superfamily. Genome Res 8: 1038–1047.

7. Krupa A, Srinivasan N (2005) Diversity in domain architectures of Ser/Thrkinases and their homologues in prokaryotes. BMC Genomics 6: 129.

8. Zhang CC (1996) Bacterial signalling involving eukaryotic-type proteinkinases. Mol Microbiol 20: 9–15.

9. Young TA, Delagoutte B, Endrizzi JA, Falick AM, Alber T (2003) Structureof Mycobacterium tuberculosis PknB supports a universal activation mecha-nism for Ser/Thr protein kinases. Nat Struct Biol 10: 168–174.

10. Kennelly PJ (2002) Protein kinases and protein phosphatases in prokar-yotes: A genomic perspective. FEMS Microbiol Lett 206: 1–8.

11. Scheeff ED, Bourne PE (2005) Structural evolution of the protein kinase-like superfamily. PLoS Comput Biol 1: e49.

12. Hon WC, McKay GA, Thompson PR, Sweet RM, Yang DS, et al. (1997)Structure of an enzyme required for aminoglycoside antibiotic resistancereveals homology to eukaryotic protein kinases. Cell 89: 887–895.

13. Peisach D, Gee P, Kent C, Xu Z (2003) The crystal structure of choline kinasereveals a eukaryotic protein kinase fold. Structure (Camb) 11: 703–713.

14. LaRonde-LeBlanc N, Wlodawer A (2004) Crystal structure of A. fulgidusRio2 defines a new family of serine protein kinases. Structure 12: 1585–1594.

15. Cheek S, Ginalski K, Zhang H, Grishin NV (2005) A comprehensive updateof the sequence and structure classification of kinases. BMC Struct Biol 5: 6.

16. Kannan N, Neuwald AF (2005) Did protein kinase regulatory mechanismsevolve through elaboration of a simple structural component? J Mol Biol351: 956–972.

17. Grishin NV (1999) Phosphatidylinositol phosphate kinase: A link betweenprotein kinase and glutathione synthase folds. J Mol Biol 291: 239–247.

18. Walker EH, Pacold ME, Perisic O, Stephens L, Hawkins PT, et al. (2000)Structural determinants of phosphoinositide 3-kinase inhibition bywortmannin, LY294002, quercetin, myricetin, and staurosporine. Mol Cell6: 909–919.

19. Steinbacher S, Hof P, Eichinger L, Schleicher M, Gettemans J, et al. (1999)The crystal structure of the Physarum polycephalum actin-fragmin kinase: Anatypical protein kinase with a specialized substrate-binding domain. EMBOJ 18: 2923–2929.

20. Yamaguchi H, Matsushita M, Nairn AC, Kuriyan J (2001) Crystal structureof the atypical protein kinase domain of a TRP channel with phospho-transferase activity. Mol Cell 7: 1047–1057.

21. Kannan N, Neuwald AF (2004) Evolutionary constraints associated withfunctional specificity of the CMGC protein kinases MAPK, CDK, GSK,SRPK, DYRK, and CK2falphag. Protein Sci 13: 2059–2077.

22. Yooseph S, Sutton G, Rusch DB, Halpern AL, Williamson SJ, et al. (2007)The Sorcerer II Global Ocean Sampling expedition: Expanding the universeof protein families. PLoS Biol 5: e16. doi:10.1371/journal.pbio.0050016

23. Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, et al. (2007)The Sorcerer II Gobal Ocean Sampling expedition: Northwest Atlanticthrough eastern tropical Pacific. PLoS Biol 5: e77. doi:10.1371/journal.pbio.0050077

24. Neuwald AF, Liu JS, Lawrence CE (1995) Gibbs motif sampling: Detectionof bacterial outer membrane protein repeats. Protein Sci 4: 1618–1632.

25. Cheek S, Zhang H, Grishin NV (2002) Sequence and structure classificationof kinases. J Mol Biol 320: 855–881.

26. Qian KC, Wang L, Hickey ER, Studts J, Barringer K, et al. (2005) Structuralbasis of constitutive activity and a unique nucleotide binding mode ofhuman Pim-1 kinase. J Biol Chem 280: 6130–6137.

27. Taylor SS, Radzio-Andzelm E, Hunter T (1995) How do protein kinasesdiscriminate between serine/threonine and tyrosine? Structural insightsfrom the insulin receptor protein-tyrosine kinase. FASEB J 9: 1255–1266.

28. Nurizzo D, Shewry SC, Perlin MH, Brown SA, Dholakia JN, et al. (2003) Thecrystal structure of aminoglycoside-39-phosphotransferase-IIa, an enzymeresponsible for antibiotic resistance. J Mol Biol 327: 491–506.

29. Thompson PR, Schwartzenhauer J, Hughes DW, Berghuis AM, Wright GD(1999) The COOH terminus of aminoglycoside phosphotransferase (39)-IIIais critical for antibiotic recognition and resistance. J Biol Chem 274: 30697–30706.

30. Nolen B, Taylor S, Ghosh G (2004) Regulation of protein kinases:Controlling activity through activation segment conformation. Mol Cell15: 661–675.

31. Boitel B, Ortiz-Lombardia M, Duran R, Pompeo F, Cole ST, et al. (2003)PknB kinase activity is regulated by phosphorylation in two Thr residuesand dephosphorylation by PstP, the cognate phospho-Ser/Thr phosphatase,in Mycobacterium tuberculosis. Mol Microbiol 49: 1493–1508.

32. Gibbs CS, Zoller MJ (1991) Rational scanning mutagenesis of a proteinkinase identifies functional regions involved in catalysis and substrateinteractions. J Biol Chem 266: 8923–8931.

Microbial Kinome

PLoS Biology | www.plosbiology.org | S101 Special Section from March 2007 | Volume 5 | Issue 3 | e170477

Page 108: Plos Biology Venter Collection Low

33. Yuan C, Kent C (2004) Identification of critical residues of choline kinaseA2 from Caenorhabditis elegans. J Biol Chem 279: 17801–17809.

34. Robinson MJ, Harkins PC, Zhang J, Baer R, Haycock JW, et al. (1996)Mutation of position 52 in ERK2 creates a nonproductive binding mode foradenosine 59-triphosphate. Biochemistry 35: 5641–5646.

35. Xu B, English JM, Wilsbacher JL, Stippec S, Goldsmith EJ, et al. (2000)WNK1, a novel mammalian serine/threonine protein kinase lacking thecatalytic lysine in subdomain II. J Biol Chem 275: 16795–16801.

36. Akamine P, Madhusudan, Wu J, Xuong NH, Ten Eyck LF, et al. (2003)Dynamic features of cAMP-dependent protein kinase revealed by apoen-zyme crystal structure. J Mol Biol 327: 159–171.

37. Thompson PR, Boehr DD, Berghuis AM, Wright GD (2002) Mechanism ofaminoglycoside antibiotic kinase APH(39)-IIIa: Role of the nucleotidepositioning loop. Biochemistry 41: 7001–7007.

38. Levinson NM, Kuchment O, Shen K, Young MA, Koldobskiy M, et al. (2006)A SRC-like inactive conformation in the abl tyrosine kinase domain. PLoSBiol 4: e144.

39. Jauch R, Jakel S, Netter C, Schreiter K, Aicher B, et al. (2005) Crystalstructures of the Mnk2 kinase domain reveal an inhibitory conformationand a zinc binding site. Structure 13: 1559–1568.

40. Dar AC, Dever TE, Sicheri F (2005) Higher-order substrate recognition ofeIF2alpha by the RNA-dependent protein kinase PKR. Cell 122: 887–900.

41. Fischer S, Verma CS (1999) Binding of buried structural water increases theflexibility of proteins. Proc Natl Acad Sci U S A 96: 9613–9615.

42. Goldberg JM, Manning G, Liu A, Fey P, Pilcher KE, et al. (2006) Thedictyostelium kinome—Analysis of the protein kinases from a simple modelorganism. PLoS Genet 2: e38.

43. Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, et al. (2002) The Pfamprotein families database. Nucleic Acids Res 30: 276–280.

44. Haft DH, Loftus BJ, Richardson DL, Yang F, Eisen JA, et al. (2001)TIGRFAMs: A protein family resource for the functional identification ofproteins. Nucleic Acids Res 29: 41–43.

45. Gough J, Karplus K, Hughey R, Chothia C (2001) Assignment of homologyto genome sequences using a library of hidden Markov models thatrepresent all proteins of known structure. J Mol Biol 313: 903–919.

46. Thompson JD, Higgins DG, Gibson TJ (1994) CLUSTAL W: Improving thesensitivity of progressive multiple sequence alignment through sequenceweighting, position-specific gap penalties and weight matrix choice.Nucleic Acids Res 22: 4673–4680.

47. Edgar RC (2004) MUSCLE: Multiple sequence alignment with high accuracyand high throughput. Nucleic Acids Res 32: 1792–1797.

48. Johnson JM, Mason K, Moallemi C, Xi H, Somaroo S, et al. (2003) Proteinfamily annotation in a multiple alignment viewer. Bioinformatics 19:544–545.

49. Neuwald AF, Liu JS (2004) Gapped alignment of protein sequence motifsthrough Monte Carlo optimization of a hidden Markov model. BMCBioinformatics 5: 157.

50. Schultz J, Copley RR, Doerks T, Ponting CP, Bork P (2000) SMART: A web-based tool for the study of genetically mobile domains. Nucleic Acids Res28: 231–234.

51. Crooks GE, Hon G, Chandonia JM, Brenner SE (2004) WebLogo: Asequence logo generator. Genome Res 14: 1188–1190.

52. Knighton DR, Zheng JH, Ten Eyck LF, Ashford VA, Xuong NH, et al. (1991)Crystal structure of the catalytic subunit of cyclic adenosine mono-phosphate-dependent protein kinase. Science 253: 407–414.

53. Hanks SK, Quinn AM, Hunter T (1988) The protein kinase family:Conserved features and deduced phylogeny of the catalytic domains.Science 241: 42–52.

54. Neuwald AF, Kannan N, Poleksic A, Hata N, Liu JS (2003) Ran’s C-terminal,

basic patch, and nucleotide exchange mechanisms in light of a canonicalstructure for Rab, Rho, Ras, and Ran GTPases. Genome Res 13: 673–692.

55. Higgins JM (2001) Haspin-like proteins: A new family of evolutionarilyconserved putative eukaryotic protein kinases. Protein Sci 10: 1677–1684.

56. Facchin S, Lopreiato R, Ruzzene M, Marin O, Sartori G, et al. (2003)Functional homology between yeast piD261/Bud32 and human PRPK: Bothphosphorylate p53 and PRPK partially complements piD261/Bud32deficiency. FEBS Lett 549: 63–66.

57. Downey M, Houlsworth R, Maringele L, Rollie A, Brehme M, et al. (2006) Agenome-wide screen identifies the evolutionarily conserved KEOPScomplex as a telomere regulator. Cell 124: 1155–1168.

58. LaRonde-LeBlanc N, Wlodawer A (2005) The RIO kinases: An atypicalprotein kinase family required for ribosome biogenesis and cell cycleprogression. Biochim Biophys Acta 1754: 14–24.

59. White KA, Lin S, Cotter RJ, Raetz CR (1999) A Haemophilus influenzae genethat encodes a membrane bound 3-deoxy-D-manno-octulosonic acid (Kdo)kinase. Possible involvement of kdo phosphorylation in bacterial virulence.J Biol Chem 274: 31391–31400.

60. Zhao X, Lam JS (2002) WaaP of Pseudomonas aeruginosa is a novel eukaryotictype protein-tyrosine kinase as well as a sugar kinase essential for thebiosynthesis of core lipopolysaccharide. J Biol Chem 277: 4722–4730.

61. Serino L, Virji M (2000) Phosphorylcholine decoration of lipopolysaccharidedifferentiates commensal Neisseriae from pathogenic strains: Identificationof licA-type genes in commensal Neisseriae. Mol Microbiol 35: 1550–1559.

62. Wright GD, Thompson PR (1999) Aminoglycoside phosphotransferases:Proteins, structure, and mechanism. Front Biosci 4: D9–D21.

63. Delpierre G, Collard F, Fortpied J, Van Schaftingen E (2002) Fructosamine3-kinase is involved in an intracellular deglycation pathway in humanerythrocytes. Biochem J 365: 801–808.

64. Fortpied J, Gemayel R, Stroobant V, van Schaftingen E (2005) Plantribulosamine/erythrulosamine 3-kinase, a putative protein-repair enzyme.Biochem J 388: 795–802.

65. Tower PA, Alexander DB, Johnson LL, Riscoe MK (1993) Regulation ofmethylthioribose kinase by methionine in Klebsiella pneumoniae. J GenMicrobiol 139: 1027–1031.

66. Sekowska A, Mulard L, Krogh S, Tse JK, Danchin A (2001) MtnK,methylthioribose kinase, is a starvation-induced protein in Bacillus subtilis.BMC Microbiol 1: 15.

67. Poon WW, Davis DE, Ha HT, Jonassen T, Rather PN, et al. (2000)Identification of Escherichia coli ubiB, a gene required for the firstmonooxygenase step in ubiquinone biosynthesis. J Bacteriol 182: 5139–5146.

68. Do TQ, Hsu AY, Jonassen T, Lee PT, Clarke CF (2001) A defect in coenzymeQ biosynthesis is responsible for the respiratory deficiency in Saccharomycescerevisiae abc1 mutants. J Biol Chem 276: 18161–18168.

69. Jarling M, Cauvet T, Grundmeier M, Kuhnert K, Pape H (2004) Isolation ofmak1 from Actinoplanes missouriensis and evidence that Pep2 fromStreptomyces coelicolor is a maltokinase. J Basic Microbiol 44: 360–373.

70. Cozzone AJ, El-Mansi M (2005) Control of isocitrate dehydrogenasecatalytic activity by protein phosphorylation in Escherichia coli. J MolMicrobiol Biotechnol 9: 132–146.

71. Aimes RT, Hemmer W, Taylor SS (2000) Serine-53 at the tip of the glycine-rich loop of cAMP-dependent protein kinase: Role in catalysis, P-sitespecificity, and interaction with inhibitors. Biochemistry 39: 8325–8332.

72. Johnson DA, Akamine P, Radzio-Andzelm E, Madhusudan M, Taylor SS(2001) Dynamics of cAMP-dependent protein kinase. Chem Rev 101:2243–2270.

73. Huse M, Kuriyan J (2002) The conformational plasticity of protein kinases.Cell 109: 275–282.
