Supplementary Material

for article “Genomic patterns of recombination, clonal divergence, and environment in deep-sea microbial populations”

by Konstantinos T. Konstantinidis and Edward F. DeLong

SUPPLEMENTARY MATERIAL INCLUDES:
1. Table S1. Identifying crenarchaeal vs. non-crenarchaeal fosmids in the 4,000m library.
2. Table S2. The crenarchaeal fosmids composing the genomic scaffold in Fig. 1.
3. Table S3. Vertical distribution of Crenarchaea in the Pacific Ocean and their genetic relatedness to their counterparts in the Sargasso Sea.
4. Table S4. Homologous recombination levels detected in different microbial natural populations.
5. Table S5. Recombination levels in natural populations based on LDhat software.
6. Figure S1. Coverage of the Cenarchaeum symbiosum genome by the fifty crenarchaeal fosmids used in this study.
7. Figure S2. Pelagibacter-like microorganisms form a distinct sequence-based population in Sargasso Sea metagenome.
8. Figure S3. Population structure of deep-sea Crenarchaea based on phylogenetic analysis of overlapping clone sequences.
9. Figure S4. Population structure of Prochlorococcus based on phylogenetic analysis of overlapping clone sequences.
10. Figure S5. Manual detection of recombination events within the Crenarchaeal population from 4,000m depth in the Pacific Ocean.
11. Figure S6. Dependence of the sensitivity in detecting recombination upon the diversity of the sequences examined.

Konstan and DeLong SUPPL - Nature Research...Table S3. Vertical distribution of Crenarchaea in the Pacific Ocean and their geneticrelatedness to their counterparts in the Sargasso



Table S1. Identifying crenarchaeal vs. non-crenarchaeal fosmids in the 4,000m library.

The genes annotated on a fosmid sequence were searched against the C. symbiosum genome (Hallam et al., 2006a); the amino acid identity of each match (2nd column) and the matching C. symbiosum gene (3rd column) are shown. A typical fosmid representative of the planktonic Crenarchaea (HF4000APKG8G15, top) and a non-crenarchaeal fosmid (HF4000ANIW137G21, bottom) are shown for comparison in terms of the conservation of their genes in C. symbiosum. Note the marked difference between the two fosmids in the degree of conservation against the C. symbiosum genome. Note also the consecutive numbers in the 3rd column, which indicate genes that are in close proximity in the C. symbiosum genome and reveal that the fosmids of the free-living Crenarchaea share large blocks (5-10 genes long) of gene synteny with their symbiotic relative, C. symbiosum.


Table S2. The crenarchaeal fosmids composing the genomic scaffold in Fig. 1.

The average nt identity (2nd column), approximate start (3rd column) and end (4th column) positions, and the length (5th column) of the total overlap of an individual fosmid sequence against the genomic scaffold used in figure 1 of the article are shown. Overlaps may represent either a single continuous Blastn match or the concatenated summary of more than one matching fragment, where the fragments were interrupted by small gene insertions or deletions (very rarely by rearrangements; see also Fig. 6) in the fosmid compared to the consensus genomic scaffold. ANI was calculated as the average nt identity of all Blastn matching fragments between an individual fosmid and the scaffold. Only matching fragments longer than 5 Kbp were considered. Also, identical overlaps (>99%) typically indicate areas of the scaffold composed of the corresponding fosmid sequence. The number of shotgun reads (6th column) mapping on each fosmid at a 98% nucleotide identity cut-off is also shown. Rare fosmids (HF4000APKG8I13 & HF4000APKG8O8) do not represent islands, since they share a large overlap with the scaffold (and hence, with other fosmids) and the nucleotide identity of the overlap is uniform. Blast settings were as described in the methods section for WGS reads.
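The ANI calculation described above can be illustrated with a minimal Python sketch. It assumes that the Blastn matches between one fosmid and the scaffold are already available as (percent identity, alignment length) pairs; the `Hit` type, the function name, and the example values are hypothetical and are not part of the original analysis. The sketch takes a simple unweighted mean of the fragment identities, which is one reading of "average nt identity"; a length-weighted mean would be an equally plausible alternative.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """One Blastn matching fragment (hypothetical container)."""
    identity: float  # percent nucleotide identity of the fragment
    length: int      # alignment length in bp


# Length cut-off taken from the caption: only fragments longer than 5 Kbp count.
MIN_LENGTH_BP = 5_000


def average_nt_identity(hits: List[Hit], min_length: int = MIN_LENGTH_BP) -> Optional[float]:
    """Unweighted mean identity of all fragments longer than min_length.

    Returns None when no fragment passes the length filter.
    """
    kept = [h for h in hits if h.length > min_length]
    if not kept:
        return None
    return sum(h.identity for h in kept) / len(kept)


# Illustrative values only; the 900 bp fragment is ignored by the filter.
example_hits = [Hit(96.0, 12_000), Hit(94.0, 8_000), Hit(80.0, 900)]
print(average_nt_identity(example_hits))
```

Fragments at or below the cut-off are dropped before averaging, so short, low-identity matches do not pull the estimate down.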


Table S3. Vertical distribution of Crenarchaea in the Pacific Ocean and their genetic relatedness to their counterparts in the Sargasso Sea.

The number of crenarchaeal sequences identified in the previously published (DeLong et al., 2006) end-sequences of seven fosmid libraries originating from different depths (1st column) in the Pacific Ocean is shown in the 2nd column. The average nucleotide (nt) identity of the crenarchaeal sequences from each depth against the 4,000m genomic scaffold (3rd column) as well as the number of crenarchaeal sequences that were at least 95% identical (nt level) to the crenarchaeal sequences from the Sargasso Sea (4th column) and the 4,000m scaffold (5th column) are also shown. Note that the crenarchaeal population in the Sargasso Sea at the time of sampling by Venter and colleagues (Venter et al., 2004) was highly related to the crenarchaeal population from 130 meters depth, and to a lesser extent to the population from 200 meters, in the Pacific Ocean. In contrast, Crenarchaea from shallower or deeper waters in the Pacific Ocean were more divergent (reflected by the low numbers of highly related WGS reads shown in the 4th column). Also, the deeper the sampling depth, the more related the crenarchaeal sequences were to the 4,000m scaffold (3rd column), revealing that the crenarchaeal populations are genetically discrete at different depths in the Pacific Ocean. The average identity of the crenarchaeal end-sequences from 4,000m depth against the genomic scaffold from the same depth (same DNA sample) may seem a bit low compared to what is shown in figure 2A (i.e., 88% ANI vs. ~95%). This is because our approach for identifying crenarchaeal end-sequences targeted divergent crenarchaeal populations and close crenarchaeal relatives in addition to the dominant crenarchaeal population at each depth (see methods for details).


Table S4. Homologous recombination levels detected in different microbial natural populations.

The GARD algorithm for recombination detection (Kosakovsky Pond et al., 2006) was run on gene alignments composed of fully overlapping 800 bp-long fosmid or WGS clone sequences representing natural populations. Three populations were evaluated: Crenarchaea from 4,000 meter depth in the Pacific Ocean (2nd column), Prochlorococcus in surface waters of the Sargasso Sea (3rd column), and Ferroplasma (Euryarchaea) in the Acid Mine Drainage (AMD) biofilm community (4th column). Delta AIC values higher than ~10 represent significant evidence that the evaluated gene has undergone recombination (personal communication with the author of GARD); the higher the Delta AIC value, the more dramatic the effect of recombination likely was (see also methods). Genes were randomly selected, provided that they included a similar number of sequences (four to seven), were single-copy, and were not mobile or hypothetical. Top: results based on all genes evaluated for each population. Bottom: results based on a subset of all genes that also showed comparable genetic diversity (see Average divergence row) and a similar number of sequences (see Average coverage row) in their alignments.

The latter data represent a more accurate and direct comparison of recombination frequency between populations because the sensitivity of recombination detection is limited between very closely related sequences (e.g., >98% identical) and the Ferroplasma genes showed significantly less divergence compared to the genes in the Crenarchaea or the Prochlorococcus populations (see Average divergence row, top; for the relationship describing the dependence of recombination detection on sequence divergence see Fig. S6). Notice the four times higher Delta AIC values in the genes of the AMD population compared to the deep-sea crenarchaeal one as well as the three-fold higher percentage of genes that have possibly undergone recombination in the AMD population (bottom results). Also, the frequency of recombination in the Prochlorococcus population was similar to, or only slightly lower than, that in the Crenarchaea population (top results). *ND: Not Determined.
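The per-population summaries discussed above (average Delta AIC and the percentage of genes with evidence of recombination) can be sketched as follows. This is only an illustration under stated assumptions: the function name and the example Delta AIC values are hypothetical, and the cut-off of 10 is the approximate threshold quoted in the caption, not an exact value.

```python
from typing import Dict, List

# Approximate significance threshold from the caption ("higher than ~10").
DELTA_AIC_CUTOFF = 10.0


def summarize_delta_aic(values: List[float], cutoff: float = DELTA_AIC_CUTOFF) -> Dict[str, float]:
    """Summarize GARD Delta AIC values for one population.

    Returns the number of genes, the mean Delta AIC, and the fraction of
    genes whose Delta AIC exceeds the cutoff.
    """
    if not values:
        raise ValueError("at least one Delta AIC value is required")
    n = len(values)
    above = sum(1 for v in values if v > cutoff)
    return {
        "n_genes": n,
        "mean_delta_aic": sum(values) / n,
        "fraction_recombinant": above / n,
    }


# Made-up values for four genes; two of them exceed the cutoff.
summary = summarize_delta_aic([2.0, 15.0, 40.0, 8.0])
print(summary)
```

Comparing these two numbers between populations is meaningful only when the gene sets have similar divergence and coverage, which is the reason for the "bottom" subset in the table.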


Table S5. Recombination levels in natural populations based on LDhat software.

The LDhat software (McVean et al., 2002) was run on 400 bp-long gene alignments composed of fully overlapping WGS sequences representing the Ferroplasma natural population. The fifteen genes evaluated were randomly selected provided that they were single copy, not hypothetical or mobile in function, and had about 20X coverage in the WGS dataset. θ values were calculated using the Watterson estimate for θ per site as calculated by LDhat. The LDhat estimates of ρ are per fragment (not per site); thus, the ρ/θ ratio per site for each gene was derived from the equation [(ρ/400)/θ]/2. All estimates were divided by 2 because the LDhat program assumes a eukaryotic model of bidirectional recombination, whereas the prokaryotic system studied here is more likely a unidirectional transfer similar to a gene conversion.

The average ρ/θ ratio for the fifteen genes evaluated (aver. 0.29, StDev = 0.3) is significantly lower than the estimate of Eppley and colleagues for the same population based on a different approach (2:1 to 4:1) (Eppley et al., 2007), but might be high enough to account for sexually reproducing populations according to a recent computer simulation study (Fraser et al., 2007). In any case, these results highlight the considerable variation in accuracy among the different methods for estimating recombination rates in natural microbial populations. The ρ/θ ratio for genes of the deep-dwelling Crenarchaea (aver. ~0.10), estimated using a very similar approach, was significantly smaller than that of the Ferroplasma genes, as expected based on our visual inspection for recombination events (Fig. S5) and the results of the GARD analysis (Fig. 5). Our estimate for Crenarchaea, however, is less reliable than that for Ferroplasma because LDhat requires at least ~20 sequences per gene for robust results (personal communication with the author of LDhat) and such high-coverage genes could not be found in the deep-sea dataset (the crenarchaeal genes evaluated showed 10-12X coverage). For this reason, results for the Ferroplasma genes were preferentially presented in the table.
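The conversion [(ρ/400)/θ]/2 described above can be written out as a short Python sketch. The function names and the example numbers are hypothetical. The Watterson estimator is given here in its standard textbook form (segregating sites divided by the harmonic number a_n and the alignment length); this is an assumption for illustration, and the values reported by LDhat itself may be computed with details not reproduced here.

```python
# Fragment length used in the caption (400 bp-long gene alignments).
FRAGMENT_LENGTH_BP = 400


def watterson_theta_per_site(segregating_sites: int, n_sequences: int,
                             length: int = FRAGMENT_LENGTH_BP) -> float:
    """Textbook Watterson estimate of theta per site (assumed form).

    theta_W = S / (a_n * L), with a_n = sum_{i=1}^{n-1} 1/i.
    """
    if n_sequences < 2:
        raise ValueError("at least two sequences are required")
    a_n = sum(1.0 / i for i in range(1, n_sequences))
    return segregating_sites / (a_n * length)


def rho_theta_per_site(rho_per_fragment: float, theta_per_site: float,
                       length: int = FRAGMENT_LENGTH_BP) -> float:
    """Per-site rho/theta ratio following the caption: [(rho/L)/theta]/2.

    The division by 2 reflects the caption's adjustment from a bidirectional
    (eukaryotic) recombination model to a unidirectional transfer.
    """
    if theta_per_site <= 0:
        raise ValueError("theta per site must be positive")
    return ((rho_per_fragment / length) / theta_per_site) / 2


# Made-up inputs: 12 segregating sites among 3 sequences, rho = 4 per fragment.
theta = watterson_theta_per_site(segregating_sites=12, n_sequences=3)
ratio = rho_theta_per_site(rho_per_fragment=4.0, theta_per_site=theta)
print(theta, ratio)
```

With these illustrative inputs, a_n = 1.5, so θ per site is 12/(1.5 × 400) = 0.02, and the ratio is [(4/400)/0.02]/2 = 0.25.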


Figure S1. Coverage of the Cenarchaeum symbiosum genome by the fifty crenarchaeal fosmids used in this study. The figure shows the sections of the C. symbiosum genome (Hallam et al., 2006b), a close relative of the planktonic Crenarchaea, that were shared with at least one of the fifty crenarchaeal fosmids (3rd circle, in blue; no blue bar means section not shared) as well as the number of fosmid clones sharing the corresponding section, i.e., the coverage (4th circle, in red; the height of the red bars is proportional to the number of fosmid clones). The analysis shows that the crenarchaeal fosmids covered the majority of the C. symbiosum genome, with the exception of the big hypothetical proteins that previous analyses have shown to be C. symbiosum-specific in comparisons to all sequenced microbial genomes and the crenarchaeal population in the Sargasso Sea metagenome (Hallam et al., 2006b). Further, the coverage of the C. symbiosum genome by the fosmid clones is relatively even, e.g., there is on average 2X coverage (standard deviation is 1.17) in the areas of the C. symbiosum genome shared with the planktonic Crenarchaea. Two exceptions to this pattern are noted at the 2 and 10 o’clock areas of the circular genome, where coverage is slightly higher, i.e., 5 to 6X. The C. symbiosum gene complement, annotated based on the Cluster of Orthologous Genes database (Tatusov et al., 2003) as described previously (Hallam et al., 2006b), is shown in the two outermost circles.


Figure S2. Pelagibacter-like microorganisms form a distinct sequence-based population in Sargasso Sea metagenome. The annotation of figure S2 is identical to that of figure 2 of the article and refers to the Pelagibacter-like population in the Sargasso Sea metagenome. The Pelagibacter-like contig #2217436, assembled from the Sargasso Sea metagenome (Venter et al., 2004), was used as the reference genomic sequence in the analysis. Note that, contrary to figure 2, which refers to the analysis of a reference genomic sequence against a single metagenomic sample, figure S2 refers to an analysis against all samples of the Sargasso Sea metagenome combined, which explains the higher coverage (y-axis) for the Pelagibacter-like population relative to SAR86 or Prochlorococcus in figure 2. The Pelagibacter-like case also provides an example of how our approach can be applied to a metagenomic library without any external reference sequence available, i.e., by using as a reference sequence a reliably assembled genomic fragment from within the metagenomic library.


Figure S3. Population structure of deep-sea Crenarchaea based on phylogenetic analysis of overlapping clone sequences. Phylogenetic trees of all available crenarchaeal fosmid and WGS sequences from 4,000 meter depth in the Pacific Ocean that were fully overlapping over the same 500nt-long sequence segment were built as described in the methods section. Clones were not pre-selected; the only requirement for using clones in the analysis was for the clone sequences to be fully overlapping over the same 500nt-long sequence segment. No clones were found at an intermediate genetic distance between the outgroup and the remaining clones, which is consistent with the genetic discontinuity identified in Fig. 2A; the outgroup was chosen to be the most closely related clone among the clones that were distantly related (i.e., not part of the dominant Crenarchaea population). The two genes shown had the highest coverage by clones in our combined shotgun and fosmid libraries. Ten additional fragments were evaluated in total and all produced very similar results to the ones shown here (data not shown).

Notice that all sequences were at least 90% identical at the nucleotide level (see scale bar); therefore, they are all members of the right-most sequence-based cluster shown in Fig. 2A.


Figure S4. Population structure of Prochlorococcus based on phylogenetic analysis of overlapping clone sequences. Phylogenetic trees of all available Prochlorococcus fosmid (from Pacific Ocean fosmid libraries; Coleman et al., 2006) or WGS (from Sargasso Sea metagenome; Venter et al., 2004; see figure key) clone sequences that were fully overlapping over the same 500nt-long sequence segment were built as described in the methods section. Clones were not pre-selected; the only requirement for using clones in the analysis was for the clone sequences to be fully overlapping over the same 500nt-long sequence segment. No clones were found at an intermediate genetic distance between the outgroup and the remaining clones, which is consistent with the genetic discontinuity identified in Fig. 2C; the outgroup was chosen to be the most closely related clone among the clones that were distantly related (i.e., not part of the dominant Prochlorococcus population). Ten additional fragments were evaluated in total and all produced very similar results to the ones shown here (data not shown).

Notice that all Sargasso Sea clone sequences were at least 90% identical at the nucleotide level (see scale bar); therefore, they are all members of the right-most sequence-based cluster shown in Fig. 2C. A couple of fosmids from the Pacific Ocean station ALOHA reported previously (Coleman et al., 2006) would also have been members of the same clusters based on their high identity to clones from the Sargasso Sea. These findings indicate that genetically identical populations may be shared between the Pacific Ocean and the Sargasso Sea at the areas sampled (panmictic population).


Figure S5. Manual detection of recombination events within the Crenarchaeal population from 4,000m depth in the Pacific Ocean. Recombination among overlapping sequences was assessed essentially as described previously (Tyson et al., 2004). Briefly, WGS and fosmid clone sequences were aligned against the consensus genomic scaffold, followed by manual inspection for recombination among the overlapping sequences. Alignments were viewed using the Jalview software, version 2.1 (Clamp et al., 2004). About 250 Kb of the genomic scaffold, representing different parts of the genome of the natural Crenarchaeal population, were evaluated in total, and there were, on average, 4-6 sequences at every position in the scaffold (4-6X coverage). Panel B shows an overview of the coverage of 150 Kb of the genomic scaffold by clone sequences; the detailed alignment of a potential recombination event found in the red-highlighted area is shown in Panel A. Notice that the last clone initially has the same SNP pattern as four other clones and then switches to the pattern that characterizes the remaining three clones in the alignment at around the position of the fourth SNP encountered, moving from left to right (i.e., the second Adenine; follow the red vs. the yellow colored box).

The analysis revealed, consistent with the results from the GARD algorithm (e.g., Fig. 5 of the article), several potential recombination events similar to the one shown in Panel A. However, the frequency of such events was significantly lower compared to what was found for the Ferroplasma population in the AMD biofilm community using a very comparable dataset and methodology.

A: Detailed alignment of the red-highlighted area shown in Panel B.

B: Overview window of the alignment of shotgun reads against the genomic scaffold (150 Kb) that represents the 4,000m Crenarchaeal population.


Figure S6. Dependence of the sensitivity in detecting recombination upon the diversity of the sequences examined. The GARD algorithm was run on ~100 randomly selected gene alignments representing the deep-sea Crenarchaea population and the Delta AIC value for each alignment (y-axis) is plotted against the average sequence divergence in the alignment (as calculated by GARD). The significant correlation between Delta AIC values and mean divergence (r-squared = 0.2, p value < 0.001) suggests that the sensitivity in detecting recombination depends on the degree of divergence of the sequences in the alignment, and that it is particularly weak for sequences of low diversity, i.e., < 2% nucleotide sequence divergence (see graph). For sequences showing higher than 2% sequence dissimilarity, no significant decrease in the sensitivity was detected. To account for the effect of sequence divergence in making direct comparisons of the recombination levels between the Ferroplasma and the deep-sea Crenarchaea populations, only genes with average sequence divergence between 2% and 6% were used in the results reported in Table S4, bottom (r-squared = 0.01 in this range of divergence, see red-marked datapoints).
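The two steps described in this caption, correlating Delta AIC with mean divergence and restricting the comparison to genes with 2% to 6% divergence, can be sketched in Python as follows. The function names and the data points are hypothetical. The r-squared here is the squared Pearson correlation coefficient, which is assumed to correspond to the statistic reported in the caption; the p value is not computed.

```python
from typing import List, Tuple

Gene = Tuple[float, float]  # (mean divergence as a fraction, Delta AIC)


def r_squared(xs: List[float], ys: List[float]) -> float:
    """Squared Pearson correlation between two equally long lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return (sxy * sxy) / (sxx * syy)


def in_divergence_window(genes: List[Gene], low: float = 0.02,
                         high: float = 0.06) -> List[Gene]:
    """Keep genes whose mean divergence lies within [low, high]."""
    return [g for g in genes if low <= g[0] <= high]


# Made-up data points for illustration only.
genes = [(0.01, 1.0), (0.03, 12.0), (0.05, 9.0), (0.07, 30.0)]
window = in_divergence_window(genes)
print(window)
print(r_squared([g[0] for g in genes], [g[1] for g in genes]))
```

Recomputing r-squared on the filtered subset is one way to check that the divergence effect is negligible in the chosen window, as the caption reports for the red-marked datapoints.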


REFERENCES

Clamp M, Cuff J, Searle SM, Barton GJ (2004). The Jalview Java alignment editor. Bioinformatics 20: 426-7.

Coleman ML, Sullivan MB, Martiny AC, Steglich C, Barry K, Delong EF et al (2006). Genomic islands and the ecology and evolution of Prochlorococcus. Science 311: 1768-70.

DeLong EF, Preston CM, Mincer T, Rich V, Hallam SJ, Frigaard NU et al (2006). Community genomics among stratified microbial assemblages in the ocean's interior. Science 311: 496-503.

Eppley JM, Tyson GW, Getz WM, Banfield JF (2007). Genetic exchange across a species boundary in the archaeal genus Ferroplasma. Genetics 177: 407-16.

Fraser C, Hanage WP, Spratt BG (2007). Recombination and the nature of bacterial speciation. Science 315: 476-80.

Hallam SJ, Konstantinidis KT, Putnam N, Schleper C, Watanabe Y, Sugahara J et al (2006a). Genomic analysis of the uncultivated marine crenarchaeote Cenarchaeum symbiosum. Proc Natl Acad Sci U S A 103: 18296-301.

Hallam SJ, Mincer TJ, Schleper C, Preston CM, Roberts K, Richardson PM et al (2006b). Pathways of carbon assimilation and ammonia oxidation suggested by environmental genomic analyses of marine Crenarchaeota. PLoS Biol 4: e95.

Kosakovsky Pond SL, Posada D, Gravenor MB, Woelk CH, Frost SD (2006). GARD: a genetic algorithm for recombination detection. Bioinformatics 22: 3096-8.

McVean G, Awadalla P, Fearnhead P (2002). A coalescent-based method for detecting and estimating recombination from gene sequences. Genetics 160: 1231-41.

Tatusov R, Fedorova N, Jackson J, Jacobs A, Kiryutin B, Koonin E et al (2003). The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4: 41.

Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM et al (2004). Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 428: 37-43.

Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, Eisen JA et al (2004). Environmental genome shotgun sequencing of the Sargasso Sea. Science 304: 66-74.