BARM and BalticMicrobeDB, a reference metagenome and interface to meta-omic data for the Baltic Sea

Johannes Alneberg, John Sundh, Christin Bennke, Sara Beier, Daniel Lundin, Luisa W. Hugerth, Jarone Pinhassi, Veljo Kisand, Lasse Riemann, Klaus Jürgens, Matthias Labrenz, Anders F. Andersson

Affiliations

KTH Royal Institute of Technology, Science for Life Laboratory, School of Biotechnology, Stockholm, Sweden: Johannes Alneberg, Anders F. Andersson, Luisa W. Hugerth

Centre for Ecology and Evolution in Microbial Model Systems, Linnaeus University, Kalmar, Sweden: Daniel Lundin, Jarone Pinhassi

Science for Life Laboratory, Department of Biochemistry and Biophysics, Stockholm University, Solna, Sweden: John Sundh

Leibniz Institute for Baltic Sea Research, Warnemünde, Germany: Christin Bennke, Sara Beier, Klaus Jürgens, Matthias Labrenz

Centre for Translational Microbiome Research, Department of Molecular, Tumor and Cell Biology, Karolinska Institutet, Science for Life Laboratory, Solna, Sweden: Luisa W. Hugerth

University of Tartu, Institute of Technology, Tartu, Estonia: Veljo Kisand

Marine Biological Section, Department of Biology, University of Copenhagen, Helsingør, Denmark: Lasse Riemann

Abstract

The Baltic Sea is one of the world's largest brackish water bodies and is characterised by pronounced physicochemical gradients where microbes are the main biogeochemical catalysts. Meta-omic methods provide rich information on the composition of, and activities within, microbial ecosystems, but are computationally heavy to perform. We here present the BAltic Sea Reference Metagenome (BARM), complete with annotated genes to facilitate further studies with much less computational effort. The assembly is constructed using 2.6 billion metagenomic reads from 81 water samples, spanning both spatial and temporal dimensions, and contains 6.8 million genes that have been annotated for function and taxonomy. The assembly is useful as a reference, facilitating taxonomic and functional annotation of additional samples by simply mapping their reads against the assembly. This capability is demonstrated by the successful mapping and annotation of 24 external samples. In addition, we present a public web interface, BalticMicrobeDB, for interactive exploratory analysis of the dataset.

Background & Summary

The Baltic Sea is a semi-enclosed inland sea characterized by strong physicochemical gradients, in particular horizontal and vertical salinity and oxygen gradients, and pronounced seasonal dynamics [1]. This ecosystem is also heavily impacted by anthropogenic eutrophication, manifested in e.g. harmful phytoplankton blooms and large areas with anoxic bottom waters [2]. Due to their key roles in biogeochemical cycles, microbial communities are particularly interesting to study in this ecosystem [3–11]. One of the most comprehensive methods to characterize the taxonomic and functional composition of microbial communities is through metagenomics, and specifically by metagenomic assembly, which enables high precision and sensitivity for both taxonomic and functional annotation [12]. These annotations can be quantified in individual samples by mapping short reads from samples that either were included in the assembly or constitute external samples. For some microbiomes, particularly those associated with the human body, extensive sequencing efforts have been undertaken to construct reference gene catalogues that are publicly available and can be utilized by others [13–15]. Large-scale metagenomic datasets also exist for the global ocean, such as the Tara Oceans dataset [15]. However, although the brackish Baltic Sea is composed of a mixture of marine- and freshwater-like lineages [5,7,10], these are genetically distinct from their relatives [8], which hinders efficient read mapping to fresh- and marine-water metagenomes.

We here present a Baltic Sea metagenome co-assembly (BARM; BAltic sea Reference Metagenome) with annotated genes constructed from three sets of samples, selected to cover variation over geography, depth and season (Table 1, Fig. 1; Data Citation 1). After preprocessing of the reads, the 81 samples combined contained 586 billion bases in 2.6 billion read pairs. To allow the assembly of genes also from genomes having low abundance in individual samples, data from all samples were co-assembled. The resulting co-assembly consisted of 14 billion bases distributed over 22 million contigs. Out of these contigs, 2.4 million were longer than or equal to 1 kilobase. Functional and taxonomic annotation of genes is computationally demanding. For this reason, and since longer contigs were deemed to be more trustworthy, only genes found on the contigs > 1 kilobase were subjected to functional and taxonomic annotation; 6.8 million genes were found on these. For functional analysis, several database sources were chosen: Pfam [16], TIGRFAM (http://www.jcvi.org/cgi-bin/tigrfams/index.cgi), EggNOG [17] and dbCAN [18]. Additionally, enzyme commission (EC) numbers [19] were extracted based on the EggNOG assignments. Through mapping, the short reads were then used to quantify the individual genes over all the different samples, and the counts were summarized per annotation identifier (ID) for each respective annotation source. The mapping rates for the different sample groups and annotation sources are summarized in Fig. 2, where 24 samples from a published metagenomic study [20] (Data Citation 2) of the Baltic Sea are also included to illustrate the capability of BARM to work as a reference gene catalogue for the Baltic Sea.

Along with the dataset, a public web interface (BalticMicrobeDB) was constructed to facilitate exploratory analysis of the data (https://barm.scilifelab.se). Through this interface it is possible to view counts of functional and taxonomic annotations over the different sample groups. Moreover, it is possible to search for functional annotations based on their descriptive texts and choose to view or download the counts for only those matching the search query. The annotated assembly presented here is a rich resource for further exploitation of the published datasets, facilitated through the web interface, but could also function as a reference metagenome assembly for the Baltic Sea, decreasing the computational demands for the analysis of new metagenome and metatranscriptome samples, and serve as a reference for metaproteome analyses.

Methods

Sampling, DNA Extraction and Sequencing

The 37 surface-water (2 m depth) samples from the 2012 time series (March to December) from the Linnaeus Microbial Observatory (LMO) station, located 10 km off the east coast of Öland where the maximum depth is 47 m, have been described in Hugerth et al. (2015) [8] (Data Citation 3). Briefly, after prefiltration through 3.0 μm, DNA was extracted from 0.2 μm Sterivex™ cartridge filters (Millipore) using the protocol described in Riemann et al. (2000) [21] and sequenced on one HiSeq high-output flowcell with an average of 31.9 million paired-end reads per sample.

The 30 transect samples were taken during a cruise initiated by the Leibniz Institute for Baltic Sea Research, Warnemünde, on the R/V Alkor, carried out for the EU-BONUS BLUEPRINT project from June 4 to June 17, 2014. Samples for DNA analyses were collected using a compact CTD (profiling instrument that records conductivity, oxygen, temperature and depth) SBE 911 Plus with an SBE-rosette SBE 32 (Sea-Bird Electronics Inc., USA) equipped with 18 × 10 L FreeFlow-PWS-samplers (HYDRO-BIOS, Kiel, Germany). Water was sampled from oxic zones, in the range from 2 to 242 m depth, within the salinity gradient of the Baltic Sea. For DNA analysis, 1 L of seawater was directly filtered onto a 47 mm Durapore membrane filter with 0.2 µm pore size (GVWP04700, Merck Millipore, Darmstadt, Germany) by a vacuum of <300 mbar. Subsequently, the filters were folded, flash frozen using liquid nitrogen and stored at -80 °C until further processing. DNA was extracted using a modified protocol of the QIAamp DNA Mini Kit (51304, Qiagen, Hilden, Germany) with an initial bead-beating step and a cleanup and concentration process using the Zymo gDNA Clean and Concentrator Kit (D4010, Zymo Research Europe, Freiburg, Germany). The concentration and quality of the eluted DNA were verified by gel electrophoresis and the Bioanalyzer DNA 12000 kit (5067-1508, Agilent Technologies, Santa Clara, USA). The samples were sequenced at the National Genomics Infrastructure at Science for Life Laboratory, Stockholm, Sweden, using a full HiSeq 2500 high-output flowcell producing an average of 69.5 million paired-end reads per sample.

The redoxcline samples consist of samples from station Boknis Eck [22], located at the entrance of the Eckernförde Bay in the southwestern Baltic Sea, and from station TF0271 at the Gotland Deep in the eastern Gotland Basin. The Boknis Eck station was sampled on September 23, 2014 on the R/V Littorina during routine monitoring activities performed monthly by the GEOMAR Helmholtz Centre for Ocean Research Kiel. Due to windy conditions before the sampling day, the water at the Boknis Eck station was mixed over most of the water column and only the bottom water was sulfidic. Water was sampled from the mixed oxygenated layer and from the sulfidic bottom water, and was captured on 3 μm pore size membrane filters (Whatman, Maidstone, UK) followed by 0.2 μm pore size Sterivex-GV filters (Millipore, Billerica, Massachusetts, USA). The Gotland Basin was sampled during the cruise EMB087 on the R/V Elisabeth Mann Borgese on October 18 and October 26, 2014. The samples from October 18 were taken in the context of an experiment close to the oxic-anoxic interface from suboxic and anoxic water layers and were captured directly on 0.2 μm pore size Durapore membrane filters (Whatman, Maidstone, UK). The samples from October 26 were taken to cover different zones in the redox gradient (suboxic, oxic-anoxic interface, upper sulfidic, lower sulfidic) and were captured first on 3 μm pore size membrane filters (Whatman, Maidstone, UK) followed by 0.2 μm pore size Sterivex-GV filters.

DNA from water captured on 3 μm pore size membrane filters and 0.2 μm Sterivex-GV filters was extracted using the QIAamp DNA Mini Kit (Qiagen, Hilden, Germany): ATL buffer was added to filter pieces together with 200 μm low-binding Zirconium beads (OPS Diagnostics, Lebanon, NY, USA) and the suspension was vortexed for 5 minutes at maximum speed. Subsequently, proteinase K was added and the suspension was incubated for approximately 1 h at 56 °C before continuing DNA extraction by following the manufacturer's instructions. Nucleic acids from Gotland Basin water sampled on October 18 on 0.2 μm pore size membrane filters were extracted using the AllPrep DNA/RNA Mini Kit (Qiagen, Hilden, Germany). As before, filters were vortexed together with Zirconium beads in RLT buffer before continuing nucleic acid extraction by following the manufacturer's protocol. The concentration and quality of the eluted DNA were verified by gel electrophoresis. The samples were sequenced on a single HiSeq 2500 lane producing an average of 20.7 million paired-end reads per sample.

All sequencing libraries (including LMO) were prepared with the Rubicon ThruPlex kit (Rubicon Genomics, Ann Arbor, Michigan, USA) according to the instructions of the manufacturer.

Preprocessing and Assembly

The quality of the reads was checked and visualized with FastQC [23] through MultiQC [24], and low-quality bases were trimmed with cutadapt [25] using Phred score 15 as a cutoff. Adapter sequences were also removed using cutadapt, keeping only read pairs where both reads in the pair were longer than 31 bases. Preprocessed reads were then assembled using Megahit [26] version 1.0.2 with default parameters including kmers 21, 41, 61, 81 and 99.

Exclusively for the 30 samples from the transect cruise, genomic material (20 ng per L of seawater) from a known genome of Thermus thermophilus (strain HB8), which is not expected to be present in the Baltic Sea naturally, was added after filtration but prior to the DNA extraction, serving as an internal standard to enable absolute quantifications. Aligning all contigs from the metagenome assembly against this reference genome showed that 84.1% of the genome was recovered within contigs aligning with an average identity of 99.82%. These additional genome contigs were kept in the reference assembly, but reads aligning to the reference genome were filtered out before the quantification steps, and before uploading the processed reads to the European Nucleotide Archive (ENA) (Data Citation 4).
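The read-pair length criterion applied during trimming (both reads in a pair longer than 31 bases) can be illustrated with a minimal sketch. This is not the cutadapt implementation; the function name `keep_long_pairs` and the representation of reads as plain strings are assumptions made only for this example.

```python
from typing import Iterable, Iterator, Tuple

# Threshold taken from the text: both reads must be longer than 31 bases.
MIN_LENGTH = 31


def keep_long_pairs(
    pairs: Iterable[Tuple[str, str]], min_length: int = MIN_LENGTH
) -> Iterator[Tuple[str, str]]:
    """Yield only read pairs in which both mates are longer than min_length."""
    for forward, reverse in pairs:
        if len(forward) > min_length and len(reverse) > min_length:
            yield forward, reverse


# Hypothetical trimmed read pairs (sequences are placeholders).
example_pairs = [
    ("A" * 40, "C" * 35),  # both mates long enough: kept
    ("A" * 40, "C" * 31),  # reverse mate is exactly 31 bases: dropped
    ("A" * 20, "C" * 50),  # forward mate too short: dropped
]
surviving = list(keep_long_pairs(example_pairs))
```

Discarding the whole pair when either mate is too short keeps the forward and reverse files synchronized, which paired-end mappers generally require.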

Functional Annotation

Genes were predicted on all contigs using Prodigal [27] version 2.6.3 with the '--meta' tag, which potentially uses different coding tables for different contigs. Genes located on contigs longer than or equal to 1 kilobase, identified with the script toolbox/scripts/fasta_lengths.py, were used for functional and taxonomic annotation. For functional annotation, the databases EggNOG [17], Pfam [16], TIGRFAM (http://www.jcvi.org/cgi-bin/tigrfams/index.cgi) and dbCAN [18] were chosen. Furthermore, EC numbers [19] were extracted from the EggNOG annotations.

To annotate genes with EggNOG [17] ids, the EggNOG hmm file for all organisms, NOG.hmm.tar.gz, version 4.5 was downloaded from http://eggnogdb.embl.de/download/eggnog_4.5/data/NOG/. For performance reasons, hmmsearch was used instead of hmmscan [28], initially removing all hits with an E-value > 0.0001. To select a maximum of one annotation per gene, the hit with the highest score was chosen using the script toolbox/scripts/hmmer_filtering/keep_top_score.py. Information about each annotation was downloaded from http://eggnogdb.embl.de/download/eggnog_4.5/data/NOG/NOG.annotations.tsv.gz.

An Enzyme Commission (EC) number [19] was assigned to each EggNOG through the Uniprot [29] proteins included in the EggNOG model, if a majority of its EC-assigned members were assigned to that EC. Note that proteins could have multiple EC numbers assigned and therefore some EggNOGs were assigned multiple EC numbers. The files needed for the conversion were eggnog4.protein_id_conversion.tsv.gz (downloaded from http://eggnogdb.embl.de/download/eggnog_4.5/ on January 9th 2017) and NOG.members.tsv.gz (downloaded from http://eggnogdb.embl.de/download/eggnog_4.5/data/NOG/ on January 9th 2017). The protein id conversion file gives EC numbers per reference protein and the members file gives the reference proteins that build each model. The protein with taxa id 400682 and protein id "PAC" was removed from the protein id conversion file since it had 695 EC entries. The same was done for taxa id 7070 and protein id "TCOGS2", with 686 EC entries. The protein id with the third most entries had 6 entries and therefore the two others were deemed outliers. The suspected reason is that these entries belong to different genes for these genomes, but there was no way to resolve this, and the EC-number assignment for each EggNOG was deemed not to be affected by it. Given the assignment of EC numbers per EggNOG, the assignment per gene was done with toolbox/scripts/assign_ec_from_nog.py.

Annotation against the dbCAN [18] (DataBase for automated Carbohydrate-active enzyme ANnotation) database was performed using version 5 (downloaded from http://csbl.bmb.uga.edu/dbCAN/download.php). Following the instructions from dbCAN (downloaded from http://csbl.bmb.uga.edu/dbCAN/download/readme.txt), hmmscan [28] was used with the option --domtblout and the result was further treated with the recommended script hmmscan-parser.sh (reference of used script available within toolbox/third_party_scripts/dbcan/hmmscan-parser.sh) from dbCAN, requiring a covered fraction of the HMM larger than 0.3 and keeping long alignments (> 80 amino acids) if the E-value was less than 1e-5 and short alignments if the E-value was less than 1e-3. An additional script, toolbox/hmmer_filtering/dbcan_strict_filtering.py, was applied, choosing to follow recommendations for bacteria from dbCAN, keeping annotations with E-value less than 1e-18 and alignment coverage greater than 0.35. To allow for more than a single domain within a gene, any annotation which fulfilled these criteria was kept. Information about each annotation was collected (downloaded from http://csbl.bmb.uga.edu/dbCAN/download/FamInfo.txt).

Annotation against Pfam [16] version 30.0 was conducted with the script pfam_scan.pl supplied from the ftp://ftp.ebi.ac.uk/pub/databases/Pfam/Tools for version 28.0, using hmmer version 3.1b1 [28]. To allow for more than a single domain within a gene, any annotation which fulfilled these criteria was kept. Information about each annotation was collected as columns 1, 2 and 4 from the file pfamA.txt.gz downloaded from ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam30.0/database_files/ on January 11th 2017.

Annotation against TIGRFAM version 15 was performed using hmmsearch (v. 3.1b2) [28] with the --cut_tc argument to filter models by trusted cutoff. For each protein sequence, the best scoring HMM was selected using hmmparse.py, available at https://github.com/johnne/biotools/blob/master/scripts/hmmparse.py.
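Two of the selection rules described in this section, keeping the single best-scoring EggNOG hit per gene and assigning EC numbers to an EggNOG by majority among its EC-assigned members, can be sketched as follows. This is an illustrative sketch and not the code in keep_top_score.py or assign_ec_from_nog.py. The function names, the tuple and dictionary input layouts, and the reading of "majority" as strictly more than half are all assumptions.

```python
from collections import Counter

# E-value cutoff from the text: hits with E-value > 0.0001 are removed first.
MAX_EVALUE = 1e-4


def keep_top_score(hits):
    """Return {gene_id: model_id} keeping the highest-scoring hit per gene.

    hits: iterable of (gene_id, model_id, score, evalue) tuples (assumed layout).
    """
    best = {}
    for gene_id, model_id, score, evalue in hits:
        if evalue > MAX_EVALUE:
            continue
        if gene_id not in best or score > best[gene_id][1]:
            best[gene_id] = (model_id, score)
    return {gene_id: model_id for gene_id, (model_id, _) in best.items()}


def assign_ec_by_majority(members, protein_ecs):
    """Assign EC numbers to each orthologous group by majority vote.

    members: {nog_id: [protein_id, ...]} (assumed layout of the members file)
    protein_ecs: {protein_id: iterable of EC numbers} (assumed layout of the
        conversion file); proteins without EC numbers are absent or empty.
    An EC number is assigned when more than half of the EC-assigned members
    carry it, so a group can receive several EC numbers.
    """
    nog_ecs = {}
    for nog_id, proteins in members.items():
        annotated = [p for p in proteins if protein_ecs.get(p)]
        if not annotated:
            continue
        counts = Counter(ec for p in annotated for ec in set(protein_ecs[p]))
        chosen = sorted(ec for ec, n in counts.items() if n > len(annotated) / 2)
        if chosen:
            nog_ecs[nog_id] = chosen
    return nog_ecs
```

Only members that carry at least one EC number enter the denominator, following the phrase "a majority of its EC-assigned members" in the text.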

Taxonomic annotation

The method used to assign taxonomy was chosen in order to assign as many contigs as possible to a taxonomy while still keeping false positives to a low level. As the number of sequences in reference databases closely related to the genomes in our samples was expected to be low [8], amino acid sequences from the assembly were compared against amino acid sequences in the reference database, enabling higher sensitivity (due to the more conserved nature of amino acid sequences). This comparison was done using Diamond version 0.8.26 [30] with the parameters "--seg yes", "--sensitive" and "--top 10" against the NCBI nr database downloaded December 2nd 2016. The code used to assign taxonomy from the Diamond search was based on an original available in the DESMAN package [31], and the modified version of the code is available as the script toolbox/scripts/taxonomy_from_genes_to_contigs/lca_per_contig.py.

The assignment was done as follows: all reported hits from the Diamond search were given a weight based on the aligned fraction of the query and the percentage identity of the alignment. At each taxonomic level, if the sum of the weights for one taxon was greater than half the sum of all weights, the gene was assigned to that taxon as long as the percentage identity was high enough. The levels for the percentage identity were set to 40% at superkingdom level, 50% at phylum level, 60% at class level, 70% at order level, 80% at family level, 90% at genus level and 95% at species level. Taxonomic assignments were set per contig to the most detailed level where consensus for at least 50% of the weights of the preliminary gene assignments could be achieved. Genes without taxonomic annotation were ignored. The shared assignment was propagated to all genes present on that contig. In this way, all genes present on one contig will always share the taxonomic assignment. If no single superkingdom accounted for a majority of the gene assignment weights for a contig, the contig was left unassigned.
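The per-gene part of this weighted majority rule can be sketched as below. This is not the code in lca_per_contig.py. The text does not give the exact weight formula or state how the identity threshold interacts with the vote, so two assumptions are made here: the weight of a hit is its identity fraction multiplied by its aligned fraction, and a hit below the identity threshold of a level does not count towards any taxon at that level while still counting towards the total. The input layout and the function names are likewise hypothetical.

```python
from collections import defaultdict

LEVELS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]
# Minimum percentage identity per level, as listed in the text.
MIN_IDENTITY = dict(zip(LEVELS, [40, 50, 60, 70, 80, 90, 95]))


def hit_weight(hit):
    """Assumed weight: identity fraction times aligned fraction of the query."""
    return hit["identity"] / 100.0 * hit["aligned_fraction"]


def assign_gene(hits):
    """Return the list of taxa assigned to one gene, from superkingdom downwards.

    hits: list of dicts with keys 'lineage' ({level: taxon}), 'identity'
    (percent) and 'aligned_fraction' (0 to 1). Assignment stops at the first
    level where no taxon holds more than half of the total weight.
    """
    total = sum(hit_weight(h) for h in hits)
    assigned = []
    for level in LEVELS:
        per_taxon = defaultdict(float)
        for h in hits:
            taxon = h["lineage"].get(level)
            if taxon is not None and h["identity"] >= MIN_IDENTITY[level]:
                per_taxon[taxon] += hit_weight(h)
        if not per_taxon:
            break
        best_taxon, best_weight = max(per_taxon.items(), key=lambda item: item[1])
        if best_weight <= total / 2:
            break
        assigned.append(best_taxon)
    return assigned
```

The same vote, applied to the gene-level assignments on a contig instead of to the hits of a gene, would give the contig-level consensus described above.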

Quantification and Normalization

To use the metagenome assembly as a reference assembly, individual samples are functionally and taxonomically annotated by quantifying the different annotations present in the assembly. This is done by mapping all short reads against the assembly and quantifying genes, and thereby any associated annotation, with the number of reads mapping to them. More specifically, bowtie2 [32] version 2.2.6 was used with the parameter "--local" for mapping, duplicated reads were removed with picard version 1.118, bam-file sorting was done with Samtools [33] version 1.3, and the htseq-count script from htseq [34] version 0.6.1 was used to get raw counts per gene.

Counts per annotation were obtained by summing all counts for genes annotated with each respective annotation. When quantifying annotation types where multiple annotations were allowed for a single gene (dbCAN and Pfam), some genes contributed several times to the quantities. This was kept in order to facilitate analysis of differential abundance for the individual annotations. Along with raw counts of reads for each annotation type and taxonomy, a count normalized by gene length and number of reads was also calculated. Borrowing the formula and the term from transcriptomics, we calculate TPM (Transcripts Per Million) values [35]:

$$\mathrm{TPM}_g = \frac{r_g \cdot rl \cdot 10^6}{fl_g \cdot T}$$

$$T = \sum_{g \in G} \frac{r_g \cdot rl}{fl_g}$$

where $r_g$ is the number of reads mapped to gene $g$ from the sample, $rl$ is the average read length for the sample, $fl_g$ is the length of the gene and $G$ is the set of all genes. $T$ is a convenience variable for the indicated sum over all genes.
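The TPM formula and the per-annotation summation can be written out as a short sketch. The function names and the dictionary-based inputs are assumptions for illustration and do not reflect the layout of the published quantification tables.

```python
from collections import defaultdict


def tpm_per_gene(read_counts, gene_lengths, read_length):
    """Compute TPM_g = r_g * rl * 1e6 / (fl_g * T) for every gene.

    read_counts: {gene_id: r_g}, reads mapped to the gene in one sample
    gene_lengths: {gene_id: fl_g}, gene length in bases
    read_length: rl, the average read length for the sample
    """
    rates = {g: read_counts[g] * read_length / gene_lengths[g] for g in read_counts}
    total = sum(rates.values())  # T in the formula
    if total == 0:
        return {g: 0.0 for g in rates}
    return {g: rate * 1e6 / total for g, rate in rates.items()}


def sum_per_annotation(gene_values, gene_annotations):
    """Sum per-gene values over every annotation attached to each gene.

    A gene with several annotations (as allowed for dbCAN and Pfam)
    contributes its full value to each of them.
    """
    totals = defaultdict(float)
    for gene_id, value in gene_values.items():
        for annotation in gene_annotations.get(gene_id, []):
            totals[annotation] += value
    return dict(totals)


# Hypothetical two-gene sample with an average read length of 100 bases.
tpm = tpm_per_gene({"g1": 10, "g2": 30}, {"g1": 1000, "g2": 3000}, 100)
per_pfam = sum_per_annotation(tpm, {"g1": ["PF_A", "PF_B"], "g2": ["PF_A"]})
```

In this toy sample both genes have the same coverage, so each receives half of the million. Because the average read length appears in both the numerator and in T, it cancels within a sample.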

Code availability

Code used to preprocess reads, assemble contigs and annotate genes is publicly available at https://github.com/EnvGen/BLUEPRINT_pipeline, containing the pipeline definition of the workflows used; https://github.com/EnvGen/snakemake-workflows, where the snakemake rules are specified in order to build the command used for each step; and the branch BARM_publication of https://github.com/EnvGen/toolbox, for custom scripts. Scripts within the latter repository that have been used have been indicated throughout the text.

Data Records

The preprocessed sequencing reads from the Transect and Redoxcline samples were submitted to ENA hosted by EMBL-EBI under the study accession number PRJEB22997 (Data Citation 4). The raw reads from LMO were published elsewhere [8] and are accessible at NCBI (Data Citation 3). Contig, gene and protein sequences from the co-assembly of the Transect, Redoxcline and LMO samples, as well as quantification tables, contextual data for the samples, and the annotations for each gene are accessible on Figshare (Data Citation 1). The raw sequencing reads from the external samples used for evaluation were also published elsewhere [36] and are accessible at NCBI (Data Citation 2).

Technical Validation

The mapping rates for all samples included in the reference assembly are shown in Fig. 2, where the majority of samples included in the assembly reach a level above 80%. This serves as a validation of the completeness of the metagenome assembly. The fraction of reads that did not map to the co-assembly, and were hence not assembled past the 200 bases length cutoff, most likely originate from low-abundance species, or species with high intraspecies diversity generating fragmented assemblies.

The mapping rate of the external samples shows the capability of this assembly to serve as a reference metagenome assembly for the Baltic Sea. These external samples [36] were collected in a different year (2011) and at a station (58.82 N 17.63 E) separate from where the samples included in the assembly were taken. This represents a realistic scenario where BARM is used as a reference metagenome for the Baltic Sea. The mapping rates vary with the filter fractions, where reads originating from the largest (3.0-200 μm) and smallest (<0.1 μm) fractions displayed lower rates than the two intermediate fractions (0.1-0.8 μm and 0.8-3.0 μm), indicating that picoplankton are better represented in BARM than larger eukaryotic plankton and viruses.

Assignment rates for different annotation types, as shown in Fig. 3, are in the majority of cases below 10% of the total number of reads, which is expected since only genes on long contigs (representing 40% of the bases of the total assembly) were predicted and subjected to annotation. The fraction of reads annotated among reads mapping to genes included in the annotation procedure reaches well over 30% for Pfam and shows the generality of that database as compared to e.g. dbCAN, a much more niched resource, which reaches only around 2% of reads mapping to genes included in the annotation.

The functional annotation was further validated through an NMDS plot (Fig. 4) based on the EggNOG annotations of the transect data. Depth was found to be negatively correlated with the first dimension (Spearman's rank correlation ρ = -0.73, P = 5.4e-06) and salinity was negatively correlated with the second dimension (Spearman's rank correlation ρ = -0.77, P = 2.4e-06). These two environmental parameters have previously been found to correlate strongly with the microbial community in the Baltic Sea [5], which strengthens our trust in the EggNOG annotations. Furthermore, analyzing a single annotation with a known function, namely the photosynthetic reaction centre protein (PF00124), we could see a strong negative correlation with sampling depth over the thirty transect samples (Spearman correlation coefficient ρ = -0.87, P = 3.1e-10).

The taxonomic annotation was validated by inspecting the taxonomic profile of the transect samples. The same dominant prokaryotic taxonomic groups were observed as in previous pan-Baltic amplicon sequencing and metagenomic studies [5,7,10,11], and the overall trends were conserved, with an increase in Alpha- and Gammaproteobacteria and a decrease in Actinobacteria and Betaproteobacteria with increasing salinity levels (Fig. 5). Among the predicted proteins in BARM, 98% lacked hits with amino acid identities above 95%, hence potentially representing species for which sequenced genomes are lacking [37]. Furthermore, 31% of the sequences lacked significant hits (E-value > 1) and potentially correspond to novel protein families.
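The rank correlations reported in this section follow the usual definition of Spearman's ρ as the Pearson correlation of ranks. A dependency-free sketch is given below for readers who want to recompute ρ from the published quantification tables. The function names and the example values are hypothetical, and the P values quoted above are not reproduced by this sketch.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation computed on the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length of at least 2")
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    mean_x = sum(rank_x) / len(rank_x)
    mean_y = sum(rank_y) / len(rank_y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(rank_x, rank_y))
    spread_x = sum((a - mean_x) ** 2 for a in rank_x)
    spread_y = sum((b - mean_y) ** 2 for b in rank_y)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical depths (m) and normalized counts of one annotation.
depths = [2, 10, 50, 100, 242]
counts = [900.0, 700.0, 300.0, 40.0, 1.0]
rho = spearman_rho(depths, counts)  # strictly decreasing, so rho is close to -1
```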

Usage Notes

A publicly available repository at https://github.com/EnvGen/BARM_tools hosts instructions and a pipeline on how to quantify genes and their annotations within BARM for any kind of Baltic Sea metagenomic and metatranscriptomic samples. The web interface BalticMicrobeDB, available to the public at http://barm.scilifelab.se, can be used to explore and access data for the three sample sets that the assembly is based upon. At the index page, the user can choose whether to access functional annotations or taxonomic annotations. For the functional annotations, the user can select specific annotation sources and identifiers and select the sample groups for which the counts will be displayed. Furthermore, a text search over the identifiers and the descriptions of the annotations can be used to create a custom table of counts over the selected samples. For taxonomic annotations, counts for the top level superkingdom are first presented, but the user can unfold a taxonomic tree to select any taxon to view counts for.

Acknowledgements

This work resulted from the BONUS Blueprint project supported by BONUS (Art 185), funded jointly by the EU and the Swedish Research Council FORMAS, The Federal Ministry of Education and Research (BMBF), Estonian Research Council, and Danish Council for Independent Research. It was also funded by the Swedish Research Council VR (grant 2011-5689) through a grant to A.F.A. Computations were performed on resources provided by the Swedish National Infrastructure for Computing (SNIC) through the Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX). DNA sequencing was conducted at the Swedish National Genomics Infrastructure (NGI) at Science for Life Laboratory (SciLifeLab) in Stockholm. The Redoxcline samples were made possible through generous support from the Boknis Eck coastal time series station (http://www.bokniseck.de) run by the Chemical Oceanography Research Unit at GEOMAR.

Author Contributions

J.A., J.S. and A.F.A. designed the analysis workflow. J.A., D.L. and A.F.A. designed the BalticMicrobeDB structure. J.A. and J.S. implemented the analysis workflow. V.K. tested and improved the workflow. J.A. implemented BalticMicrobeDB. J.A. and J.S. conducted the analysis. C.B., S.B., M.L., and K.J. conducted sampling and extracted DNA. J.A. and A.F.A. wrote the manuscript with contributions from all authors. All authors read and approved the final version of the manuscript.

Competing interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to Anders F. Andersson.

Figures

Figure 1: A map showing the locations for all stations where samples were taken.

Figure 1. The three sample groups included in the assembly (Transect, LMO and Redoxcline) are displayed together with the external sample set [20] (External), all groups indicated with different markers. The colour of the marker indicates the salinity of the water sample while the size indicates the depth at which it was taken. The background color indicates depth (from white to dark blue), with contour lines drawn with 50 m intervals. The map was generated using the Marmap package [38] in R [39] with bathymetric data from the ETOPO1 dataset hosted on the NOAA server [40].

Figure 2: Mapping rates divided on different sample groups.

Figure 2. Mapping rates are calculated by Bowtie2 [32] as the "overall alignment rate". The three first sample groups, LMO 2012 (N=37, 0.2-3.0 μm), Baltic Transect 2014 (N=30, >0.2 μm) and Baltic Redoxcline 2014 (N=6, 0.2-3.0 μm; N=6, >3.0 μm; N=2, >0.2 μm), were included in the assembly, while the four last sample groups, External Samples <0.1 μm (N=6), External Samples 0.1-0.8 μm (N=6), External Samples 0.8-3.0 μm (N=6) and External Samples 3.0-200 μm (N=6), were not. The size intervals of the external samples indicate filter pore sizes used to tentatively separate viruses, free-living prokaryotes, and small and larger particles as well as Eukaryotic cells, respectively [36]. Created using Matplotlib [41] and Seaborn [42].

Figure 3: Fraction of reads mapping to genes annotated with respective database.

Figure 3. Only genes identified on contigs longer than 1 kilobase were subjected to annotation, defining the 'included genes' category. N=81 for all categories. Created using Matplotlib [41] and Seaborn [42].

Figure 4: NMDS of EggNOG annotations.

Figure 4. Non-metric dimensional scaling (NMDS) of the 30 samples included in the Transect sample group based on EggNOG annotation. Samples are colored and sized according to salinity and depth, respectively. Created using Matplotlib [41] and Seaborn [42].

Figure 5: Taxonomic profiles of the 10 transect samples obtained from surface waters.

Figure 5. Numbers on the x-axis indicate salinity, given in practical salinity units (PSU), and are sorted with increasing salinity to the right. Created using Matplotlib [41] and Seaborn [42].

References

1. Snoeijs-Leijonmalm, P. & Andrén, E. Why is the Baltic Sea so special to live in? Biological Oceanography of the Baltic Sea 23–84 (Springer, Dordrecht, 2017).

2. Blenckner, T., Österblom, H., Larsson, P., Andersson, A. & Elmgren, R. Baltic Sea ecosystem-based management under climate change: Synthesis and future challenges. Ambio 44 Suppl 3, 507–515 (2015).

3. Riemann, L. et al. The native bacterioplankton community in the central baltic sea is influenced by freshwater bacterial species. Appl. Environ. Microbiol. 74, 503–515 (2008).

4. Andersson, A. F., Riemann, L. & Bertilsson, S. Pyrosequencing reveals contrasting seasonal dynamics of taxa within Baltic Sea bacterioplankton communities. ISME J. 4, 171–181 (2010).

5. Herlemann, D. P. et al. Transitions in bacterial communities along the 2000 km salinity gradient of the Baltic Sea. ISME J. 5, 1571–1579 (2011).

6. Thureborn, P. et al. A metagenomics transect into the deepest point of the Baltic Sea reveals clear stratification of microbial functional capacities. PLoS One 8, e74983 (2013).

7. Dupont, C. L. et al. Functional tradeoffs underpin salinity-driven divergence in microbial community composition. PLoS One 9, e89549 (2014).

8. Hugerth, L. W. et al. Metagenome-assembled genomes uncover a global brackish microbiome. Genome Biol. 16, 279 (2015).

9. Lindh, M. V. et al. Disentangling seasonal bacterioplankton population dynamics by high-frequency sampling. Environ. Microbiol. 17, 2459–2476 (2015).

10. Hu, Y. O. O., Karlson, B., Charvet, S. & Andersson, A. F. Diversity of Pico- to Mesoplankton along the 2000 km Salinity Gradient of the Baltic Sea. Front. Microbiol. 7, 679 (2016).

11. Herlemann, D. P. R., Lundin, D., Andersson, A. F., Labrenz, M. & Jürgens, K. Phylogenetic Signals of Salinity and Season in Bacterial Community Composition Across the Salinity Gradient of the Baltic Sea. Front. Microbiol. 7, 1883 (2016).

12. Darzi, Y., Falony, G., Vieira-Silva, S. & Raes, J. Towards biome-specific analysis of meta-omics data. ISME J. 10, 1025–1028 (2016).

13. Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome. Nature 486, 207–214 (2012).

14. Li, J. et al. An integrated catalog of reference genes in the human gut microbiome. Nat. Biotechnol. 32, 834–841 (2014).

15. Sunagawa, S. et al. Structure and function of the global ocean microbiome. Science 348, 1261359 (2015).

16. Finn, R. D. et al. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res. 44, D279–85 (2016).

17. Huerta-Cepas, J. et al. EGGNOG 4.5: A hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res. 44, D286–D293 (2016).

18. Yin, Y. et al. dbCAN: a web resource for automated carbohydrate-active enzyme annotation. Nucleic Acids Res. 40, W445–51 (2012).

19. Webb, E. C. & Others. Enzyme nomenclature 1992. Recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology on the Nomenclature and Classification of Enzymes. (Academic Press, 1992).

20. Asplund-Samuelsson, J. et al. Diversity and Expression of Bacterial Metacaspases in an Aquatic Ecosystem. Front. Microbiol. 7, 1043 (2016).

21. Riemann, L., Steward, G. F. & Azam, F. Dynamics of bacterial community composition and activity during a mesocosm diatom bloom. Appl. Environ. Microbiol. 66, 578–587 (2000).

22. Bange, H. W. & Malien, F. Hydrochemistry from time series station Boknis Eck from 1957 to 2014. doi:10.1594/PANGAEA.855693 (2015).

23. Andrews, S. & Others. FastQC: a quality control tool for high throughput sequence data. (2010).

24. Ewels, P., Magnusson, M., Lundin, S. & Käller, M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics 32, 3047–3048 (2016).

25. Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17, 10–12 (2011).

26. Li, D., Liu, C.-M., Luo, R., Sadakane, K. & Lam, T.-W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31, 1674–1676 (2015).

27. Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119 (2010).

28. Eddy, S. R. Accelerated Profile HMM Searches. PLoS Comput. Biol. 7, e1002195 (2011).

29. UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res. 43, D204–12 (2015).

30. Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60 (2015).

31. Quince, C. et al. DESMAN: a new tool for de novo extraction of strains from metagenomes. Genome Biol. 18, 181 (2017).

32. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).

33. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

34. Anders, S., Pyl, P. T. & Huber, W. HTSeq - a Python framework to work with high-throughput sequencing data. Bioinformatics 31, 166–169 (2015).

35. Wagner, G. P., Kin, K. & Lynch, V. J. Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory Biosci. 131, 281–285 (2012).

36. Larsson, J. et al. Picocyanobacteria containing a novel pigment gene cluster dominate the brackish water Baltic Sea. ISME J. 8, 1892–1903 (2014).

37. Konstantinidis, K. T. & Tiedje, J. M. Towards a genome-based taxonomy for prokaryotes. J. Bacteriol. 187, 6258–6264 (2005).

38. Pante, E. & Simon-Bouhet, B. marmap: A package for importing, plotting and analyzing bathymetric and topographic data in R. PLoS One 8, e73051 (2013).

39. The R Core-team. R: A language and environment for statistical computing. (R Foundation for Statistical Computing, 2014).

40. Amante, C. & Eakins, B. W. ETOPO1 1 arc-minute global relief model: procedures, data sources and analysis. (US Department of Commerce, National Oceanic and Atmospheric Administration, National Environmental Satellite, Data, and Information Service, National Geophysical Data Center, Marine Geology and Geophysics Division Colorado, 2009).

41. Hunter, J. D. Matplotlib: A 2D Graphics Environment. Comput. Sci. Eng. 9, 90–95 (2007).

42. Waskom, M. et al. seaborn v0.7.0. Zenodo, https://doi.org/10.5281/zenodo.45133, (2016).

Data Citations

1. Alneberg, J. & Andersson, A. F. Figshare https://doi.org/10.6084/m9.figshare.c.3831631 (2017) preview-link: https://figshare.com/s/3171d2e7c0f470fc488a

2. NCBI BioProject PRJNA322246 (2016)

3. NCBI BioProject PRJNA273799 (2015)

4. European Nucleotide Archive PRJEB22997 (2017)
