Current Biology
Supplemental Information
Genomic Expansion of Domain Archaea Highlights Roles for Organisms from New Phyla in Anaerobic Carbon Cycling
Cindy J. Castelle, Kelly C. Wrighton, Brian C. Thomas, Laura A. Hug, Christopher T. Brown, Michael J. Wilkins, Kyle R. Frischkorn, Susannah G. Tringe, Andrea Singh, Lye Meng Markillie, Ronald C. Taylor, Kenneth H. Williams, and Jillian F. Banfield
Figure S1, related to Figure 2. Maximum likelihood phylogeny of the 16S rRNA gene placing the Woesearchaeota and Pacearchaeota as novel phylum-level lineages. The final alignment contained 449 sequences, 60 from this study, and was generated using the SILVA SINA alignment tool and the SILVA reference alignment. The tree was constructed using RAxML-HPC under the GTRCAT model of evolution with 100 bootstrap resamplings. Bootstrap values greater than 50 are noted on the tree.
[Figure S1 image not reproduced. Labels recoverable from the tree: an outgroup branch marked "To bacteria"; the DPANN and TACK groupings; Parvarchaeota, Micrarchaeota, Aenigmarchaeota, Nanohaloarchaeota, Nanoarchaeota, Diapherotrites, Pacearchaeota, Woesearchaeota, Euryarchaeota, Halobacteria, Methanosarcinales, Aigarchaeota, Korarchaeota, Crenarchaeota, Thaumarchaeota, and several uncultured groups (for example SAGMEG, MSBL1, Miscellaneous Crenarchaeotic Group, Marine Benthic Group A). Legend: groundwater-associated Archaea; 4 m, 5 m, and 6 m sediment-associated Archaea; archaeal genome bins (AR4, AR5, AR6, AR10, AR11, AR13, AR15, AR16, AR17, AR18, AR20).]
Figure S2 related to Table 1. ESOM constructed using tetranucleotide frequency information for genome fragments with archaeal signatures > 5 kb in length from the groundwater metagenome assemblies. A, Boundaries (dark bands) separate clusters of fragments with similar signatures (each dot represents a 5 kb fragment). B, Map was colored based on GC content, coverage and scaffold connections to highlight the different genome bins, annotated from AR1 to AR21.
[Figure S2 image not reproduced. Panels A and B; genome bins labeled on the map: AR1, AR3, AR4, AR5, AR6, AR9, AR10, AR11, AR13, AR15, AR16, AR17, AR18, AR19, AR20, AR21.]
Figure S3 related to Table 1. Genome completeness estimated for all recovered genome bins based on 54 conserved single-copy genes (SCG). Numbers in circles correspond to the number of genes per genome matching the annotation. This figure and the FASTA files that support it can be accessed here: http://ggkbase.berkeley.edu/genome_summaries/107-genome_completeness_2.
[Figure S3 image not reproduced. Row labels (the 54 SCG): DNA repair and recombination protein RadA; chaperonin GroEL; DNA-directed RNA polymerase subunit B; methionine aminopeptidase; transcriptional antiterminator; preprotein translocase subunit SecY; ribonuclease HII; alanyl-, arginyl-, aspartyl-, cysteinyl-, glutaminyl-, glycyl-, histidyl-, isoleucyl-, leucyl-, lysyl-, methionyl-, phenylalanyl- (alpha and beta subunits), prolyl-, seryl-, threonyl-, tryptophanyl-, tyrosyl-, and valyl-tRNA synthetases; ribosomal proteins L1, L2, L3, L4, L5, L6P/L9E, L11, L13, L14, L15, L16/L10E, L18, L22, L24, S2, S3, S4, S5, S7, S8, S9, S10, S11, S12, S13, S15P/S13e, S17, and S19.
Column labels, in extracted order: AR10, AR20, AR15, AR5, AR19, AR1, AR18, AR4, AR9, AR6, AR17, AR3, AR16, AR11, AR21, AR13, grouped under Diapherotrites, Aenigmarchaeota, Pacearchaeota, and Woesearchaeota.
Row "Genome completeness based on 54 SCG", values in extracted order: 54, 27, 50, 49, 48, 42, 40, 54, 53, 47, 42, 41, 41, 41, 34, 33.
Per-gene counts for individual bins are not recoverable from the extracted text.]
Figure S4 related to Figure 4. Comparative metabolic analyses of the 16 DPANN archaeal genome bins generated by the interface ggkbase (http://ggkbase.berkeley.edu/). Genes identified belong to central carbon metabolism, proteins putatively involved in electron transfer, oxidative stress, stringent response (RelA/SpoT, described in the main text), proteins involved in cell wall and cell surface biosynthesis as well as transporters and flagella components. Also represented are the proteins involved in purine and pyrimidine metabolisms as well as amino acids metabolism (including aspartate, glutamate and glutamine). Numbers in circles correspond to the number of genes per genome matching the annotation. The lists are not mutually exclusive as a given protein can have more than one function or domain and was counted in each appropriate category. This figure and the FASTA files that support it can be accessed here: http://ggkbase.berkeley.edu/genome_summaries/104-genome_metabolism_ARCH2011_2014.
[Figure S4 image not reproduced. Rows are grouped under the panel headings Carbon Degradation, Glycogen (starch) Metabolism, Nucleotide Salvage pathway, Glycolysis, Gluconeogenesis, Pentose Phosphate pathway, Pyruvate metabolism, TCA Cycle, Fermentation, Hydrogenases, ETC, Oxidative stress, Transporters, Cell surface, Flagella, and Purine/Pyrimidine metabolism. Columns are the genome bins AR10, AR20, AR15, AR5, AR19, AR1, AR18, AR4, AR9, AR6, AR17, AR3, AR16, AR11, AR21, and AR13, grouped under Diapherotrites, Aenigmarchaeota, Pacearchaeota, and Woesearchaeota. Per-bin gene counts are not recoverable from the extracted text.]
Supplemental Table
Table S1 related to Table 1. Samples, sequencing and assembly information.
| Sample name | Date collected | Days since acetate | Filter size (µm) | Groundwater filtered (L) | # high-quality paired DNA reads (x10^6) | High-quality DNA sequence (Gbp) | DNA assembly length (Mbp) | DNA assembly length, scaffolds ≥5 Kbp (Mbp) | # high-quality RNA reads (x10^6) | High-quality RNA sequence (Gbp) |
|---|---|---|---|---|---|---|---|---|---|---|
| GWA1 | 8/25/11 | 0 | 0.1 | 142 | 65 | 8.95 | 618 | 207 | n/a | n/a |
| GWA2 | 8/25/11 | 0 | 0.2 | 142 | 258 | 33.64 | 2,045 | 700 | 69 | 28.21 |
| GWB1 | 9/3/11 | 9 | 0.1 | 100 | 116 | 15.63 | 853 | 286 | n/a | n/a |
| GWB2 | 9/3/11 | 9 | 0.2 | 100 | 86 | 10.71 | 580 | 197 | 77 | 31.43 |
| GWC1 | 9/16/11 | 22 | 0.1 | 55 | 102 | 13.82 | 611 | 193 | n/a | n/a |
| GWC2 | 9/16/11 | 22 | 0.2 | 55 | 4 | 0.30 | 1,573 | 487 | 73 | 30.00 |
| GWD1 | 9/30/11 | 36 | 0.1 | 100 | 271 | 36.51 | 305 | 72 | n/a | n/a |
| GWD2 | 9/30/11 | 36 | 0.2 | 100 | 85 | 11.70 | 426 | 131 | 78 | 31.05 |
| GWE1 | 10/31/11 | 67 | 0.1 | 45 | 73 | 9.88 | 277 | 96 | n/a | n/a |
| GWE2 | 10/31/11 | 67 | 0.2 | 45 | 78 | 10.78 | 460 | 149 | 77 | 32.31 |
| GWF1 | 11/28/11 | 95 | 0.1 | 100 | 63 | 8.71 | 370 | 128 | n/a | n/a |
| GWF2 | 11/28/11 | 95 | 0.2 | 100 | 214 | 29.02 | 1,175 | 436 | 78 | 32.64 |
| total - 0.1 µm filter | n/a | n/a | 0.1 | 542 | 691 | 93.49 | 3,035 | 981 | n/a | n/a |
| total - 0.2 µm filter | n/a | n/a | 0.2 | 542 | 725 | 96.15 | 6,260 | 2,099 | 452 | 185.65 |
| total - both filters | n/a | n/a | n/a | 542 | 1,416 | 190 | 9,295 | 3,081 | 452 | 185.65 |
Supplemental Experimental Procedures
The sampling, sequencing, assembly, and annotation of the sediment-associated microbial communities discussed here were described previously [S1, S2]. Briefly, DNA was extracted from a sediment core (well D04 [S1]) at three depths (4, 5, and 6 m) and sequenced using Illumina HiSeq paired-end technology. Assembly and annotation followed methods similar to those described below.
Groundwater field experiment and sample collection. The field experiment was carried out between August 25 and December 12, 2011 at the Rifle Integrated Field Research Challenge (IFRC) site adjacent to the Colorado River, Colorado, USA. For biostimulation experiments at the site, acetate was added to groundwater to provide approximately 15 mM acetate over the course of 72 days, as previously described [S3]. Microbial community samples (A, B, C, D, E, and F) were collected 0, 7, 20, 34, 65, and 93 days after the start of acetate addition to the aquifer, between August and November of 2011, so that a range of geochemical conditions from iron reduction to sulfate reduction was sampled (Figure 1). Acetate addition ceased after 72 days. Samples for measurement of aqueous geochemistry were taken 5 meters below ground surface. Ferrous iron and sulfide concentrations (Figure 1) were analyzed immediately after sampling using the HACH phenanthroline assay and a sulfide reagent kit, respectively (HACH, CO). Acetate and sulfate concentrations were determined using a Dionex ICS-2100 ion chromatograph equipped with an AS-18 guard and analytical column (details are given in Williams et al., 2011 [S3]). Microbial cells from pumped groundwater that passed through a 1.2 µm pre-filter (Pall, NY) were retained on 0.1 and 0.2 µm filters and flash-frozen in liquid nitrogen immediately upon collection for DNA extraction.
DNA extraction and sequencing. For genomic DNA extraction, approximately 1.5 g of each of the frozen filter samples (A, B, C, D, E, and F) was used, and DNA was extracted using a modified PowerSoil DNA Isolation Kit (MO-BIO) protocol as follows. The filter material was cut into strips and vortexed with the provided beads and 5 mL of PowerBead Solution. Tubes were then flash-frozen and thawed with intermittent vortexing. After thawing, the remaining PowerBead Solution was added and the manufacturer's protocol was followed, with the addition of a 30-minute incubation at 65°C with intermittent shaking. The resulting DNA eluate was concentrated by sodium acetate/ethanol precipitation with glycogen. DNA was resuspended in 50 µl of elution buffer from the PowerSoil Kit before being submitted for library preparation. Illumina HiSeq 2000 2x150 paired-end sequencing was conducted by the Joint Genome Institute. Sequencing amounts for samples A to F from 0.1 and 0.2 µm filters ranged from 9.5 to 40.7 Gbp (Table S1).
Assembly and functional annotations. Reads were quality trimmed using Sickle (https://github.com/najoshi/sickle) with default settings. Trimmed paired-end reads were assembled using IDBA_UD with default parameters [S4]. Genes on scaffolds >5 kb in length were predicted using Prodigal with the metagenome option [S5]. For each scaffold, we determined the GC content, coverage, genetic code, and profile of phylogenetic affiliation based on the best hit for each gene against the UniRef90 database [S6]. Predicted ORFs were run through a multidatabase search pipeline for functional prediction as previously described [S7]. In addition, UniRef90 and KEGG were searched back against the amino acid sequences to identify reciprocal best BLAST matches. Reciprocal best BLAST matches were filtered at a 300 bit score threshold. One-way BLAST matches were filtered at a 60 bit score threshold. Transfer RNA sequences were predicted using tRNAscan-SE [S8].
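The two bit-score thresholds can be illustrated with a short Python sketch. This is not the pipeline used in the study: the `Hit` record, the function names, the inclusive comparison (`>=`), and the choice to let a reciprocal hit scoring below 300 fall back to the one-way tier are all assumptions made for illustration. Only the forward hit's bit score is tested against the thresholds.

```python
from typing import Dict, List, NamedTuple

RECIPROCAL_MIN_BITS = 300.0  # reciprocal best-match threshold stated in the text
ONE_WAY_MIN_BITS = 60.0      # one-way match threshold stated in the text


class Hit(NamedTuple):
    """Hypothetical record for one BLAST hit (names are illustrative)."""
    query: str
    subject: str
    bitscore: float


def best_hits(hits: List[Hit]) -> Dict[str, Hit]:
    """Keep the highest-scoring hit for each query."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.query not in best or hit.bitscore > best[hit.query].bitscore:
            best[hit.query] = hit
    return best


def classify_hits(forward: List[Hit], reverse: List[Hit]) -> Dict[str, str]:
    """Label each query ORF as 'reciprocal', 'one-way', or 'unannotated'.

    forward: ORF -> database hits; reverse: database -> ORF hits.
    Assumption: thresholds are inclusive, and a reciprocal pair below the
    reciprocal threshold is still eligible for the one-way tier.
    """
    fwd = best_hits(forward)
    rev = best_hits(reverse)
    labels: Dict[str, str] = {}
    for query, hit in fwd.items():
        back = rev.get(hit.subject)
        is_reciprocal = back is not None and back.subject == query
        if is_reciprocal and hit.bitscore >= RECIPROCAL_MIN_BITS:
            labels[query] = "reciprocal"
        elif hit.bitscore >= ONE_WAY_MIN_BITS:
            labels[query] = "one-way"
        else:
            labels[query] = "unannotated"
    return labels
```

In practice the hit lists would be parsed from tabular BLAST output; that parsing step is omitted here.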
Genome binning and reconstruction. We identified segregated clusters of scaffolds corresponding to individual genomes based on scaffold tetranucleotide sequence composition. Clusters were identified using the Databionics implementation of emergent self-organizing map (ESOM) analysis [S9]. The primary map structure was established using 5-kb fragments (all fragments >10 kb were subdivided into 5-kb segments). Sequencing reads were mapped to scaffolds from genome bins of interest for multiple rounds of assembly curation. Mapping was carried out using Bowtie2 [S10]. Paired read mappings were visualized using Geneious 7.0.6 (Biomatters). Assembly errors that are likely introduced in the scaffolding step of the IDBA_UD assembly were identified based on local sequence inconsistencies. These were corrected by visual inspection or by alteration of the consensus sequence following insertion of Ns and remapping of unplaced read pairs. A similar approach was used to fill scaffolding gaps and extend scaffold ends.
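The input to the ESOM step, tetranucleotide frequency vectors computed on 5-kb segments, can be sketched in Python as follows. This is an illustrative simplification rather than the authors' code: the function names are hypothetical, trailing pieces shorter than 5 kb are simply dropped, reverse-complement tetramers are not merged, and tetramers containing ambiguous bases are ignored. The ESOM training itself is not shown.

```python
from collections import Counter
from itertools import product
from typing import List

WINDOW = 5000        # 5-kb segments, as stated in the text
SPLIT_ABOVE = 10000  # scaffolds longer than 10 kb are subdivided

# All 256 tetranucleotides in a fixed order, so vectors are comparable.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def split_scaffold(seq: str, window: int = WINDOW,
                   split_above: int = SPLIT_ABOVE) -> List[str]:
    """Return the scaffold whole, or cut it into full-length windows.

    Assumption: a trailing piece shorter than `window` is discarded.
    """
    if len(seq) <= split_above:
        return [seq]
    return [seq[i:i + window] for i in range(0, len(seq) - window + 1, window)]


def tetranucleotide_frequencies(seq: str) -> List[float]:
    """Relative frequency of each tetranucleotide, in KMERS order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[k] for k in KMERS)  # skips tetramers with N etc.
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[k] / total for k in KMERS]
```

Each segment then contributes one 256-dimensional vector; fragments with similar vectors are expected to cluster together on the map.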
Genome bin completion estimates and phylogenetic assignment. Our primary method for assessing genome completeness was based on the presence or absence of orthologous genes representing a core gene set that is widely conserved as single-copy genes among archaea [S11]. For ESOM bins containing scaffolds from more than one genome, scaffolds were assigned to specific organisms within the bin by coverage, phylogenetic identity, and GC content.
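A presence/absence completeness estimate of this kind reduces to simple counting. The Python sketch below assumes a per-bin dictionary of marker-gene copy numbers; the function name and the reporting of multi-copy markers are illustrative choices, and the default of 54 markers follows the gene set named in Figure S3.

```python
from typing import Dict, List, Tuple


def completeness(marker_counts: Dict[str, int],
                 n_markers: int = 54) -> Tuple[int, float, List[str]]:
    """Summarize marker-gene recovery for one genome bin.

    marker_counts maps each marker name to its copy number in the bin
    (0 if absent). Returns (markers present, percent of the marker set,
    markers found in more than one copy).
    """
    present = sum(1 for copies in marker_counts.values() if copies >= 1)
    multi_copy = sorted(name for name, copies in marker_counts.items()
                        if copies > 1)
    return present, 100.0 * present / n_markers, multi_copy
```

Markers found in more than one copy are worth inspecting by hand, since they can indicate either a genuine duplication or scaffolds from a second genome in the bin.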
Concatenated ribosomal protein phylogeny. A set of 15 syntenic ribosomal proteins was selected as a stable genome marker region based on published lateral gene transfer frequencies of zero (rpL2, 3, 4, 5, 6, 14, 15, 18, 22, 24 and rpS3, 8, 10, 17, 19) [S12, S13]. Scaffolds from the groundwater and sediment metagenomes containing at minimum 50% of the 15 archaeal genes were identified. An existing reference set of ribosomal proteins was augmented through mining the NCBI and Joint Genome Institute IMG databases for recently sequenced genomes from the Archaeal domain, as well as through identification and addition of the closest sequenced genome to each ribosomal protein sequence from the metagenomes in this study [S14]. The complete data set contained 312 taxa. Each individual ribosomal protein data set was aligned using Muscle version 3.8.31 [S15, S16] and then manually curated to remove end gaps and single taxon insertions. A maximum likelihood tree was constructed with RAxML_HPC under the PROTGAMMALG evolutionary model with 100 bootstraps [S17].
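The scaffold selection rule (at least 50% of the 15 marker proteins) can be written as a small Python filter. The gene identifiers and the input format, a mapping from scaffold name to annotated gene names, are assumptions for illustration; with 15 markers, a 50% minimum works out to 8 genes.

```python
import math
from typing import Dict, Iterable, List

# The 15 syntenic ribosomal proteins listed in the text.
RIBOSOMAL_MARKERS = frozenset([
    "rpL2", "rpL3", "rpL4", "rpL5", "rpL6", "rpL14", "rpL15", "rpL18",
    "rpL22", "rpL24", "rpS3", "rpS8", "rpS10", "rpS17", "rpS19",
])
MIN_FRACTION = 0.5  # "at minimum 50%" of the marker set


def scaffolds_for_phylogeny(scaffold_genes: Dict[str, Iterable[str]]) -> List[str]:
    """Return names of scaffolds carrying enough distinct marker genes."""
    needed = math.ceil(MIN_FRACTION * len(RIBOSOMAL_MARKERS))  # 8 of 15
    return sorted(
        name for name, genes in scaffold_genes.items()
        if len(RIBOSOMAL_MARKERS & set(genes)) >= needed
    )
```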
16S rRNA gene phylogenetic analyses. Phylogenetic placement of archaeal genome bins was further confirmed using full-length or near full-length 16S rRNA gene sequences (>600 bp) derived from groundwater dataset GWA2 (Sample A; 0.2 µm filter). 16S rRNA genes were identified through BLASTn of the metagenome scaffolds against a 16S rRNA reference database. Fifty-nine sequences longer than 600 bp were identified (Figure S1). 16S rRNA genes were excised from the scaffolds and aligned to the SILVA database using the SINA alignment tool (http://www.arb-silva.de/aligner) [S18], simultaneously identifying all SILVA sequences with >70% identity to the archaeal 16S rRNA genes from this study. The SILVA alignment of the metagenome-derived and related 16S rRNA sequences was streamlined by removing gap-dominated columns (>99% gaps) with GapStreeze v.2.1.0 (www.hiv.lanl.gov/content/sequence/GAPSTREEZE/gap.html). The alignment was further trimmed to remove ambiguously aligned regions. A maximum-likelihood tree was constructed with RAxML-HPC under the GTRCAT model with 100 bootstraps.
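The gap-column filtering step can be illustrated with a generic Python sketch. This is not GapStreeze itself; the function name, the treatment of both "-" and "." as gap characters, and the strict "more than 99%" comparison are assumptions.

```python
from typing import List


def strip_gappy_columns(alignment: List[str],
                        max_gap_fraction: float = 0.99,
                        gap_chars: str = "-.") -> List[str]:
    """Drop alignment columns whose gap fraction exceeds max_gap_fraction.

    alignment is a list of equal-length aligned sequences.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    n_seqs = len(alignment)
    keep = [
        col for col in range(length)
        if sum(seq[col] in gap_chars for seq in alignment) / n_seqs
        <= max_gap_fraction
    ]
    return ["".join(seq[col] for col in keep) for seq in alignment]
```

With a few hundred sequences, a 99% cutoff removes only columns in which nearly every sequence has a gap, which typically arise from rare insertions in the reference alignment.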
Protein phylogenetic analyses. For specific functional genes of interest, reference datasets were generated from sequences mined from NCBI databases. Alignments were generated using MUSCLE v. 3.8.31 [S15, S16] and curated manually, and phylogenies were inferred using PhyML [S19] under the LG+Gamma model of evolution with 100 bootstrap resamplings.
RNA Extraction. RNA was extracted using Invitrogen TRIzol® Reagent (cat#15596018), followed by genomic DNA removal and cleaning using Qiagen RNase-Free DNase Set kit (cat#79254) and Qiagen Mini RNeasy™ kit (cat#74104). An Agilent 2100 Bioanalyzer was used to assess the integrity of the RNA samples. Only RNA samples with an RNA Integrity Number between 8 and 10 were used.
RNA Sequencing. The Applied Biosystems SOLiD™ Total RNA-Seq kit (catalog number 4445374) was used to generate the cDNA template library. The SOLiD™ EZ Bead system was used to perform emulsion clonal bead amplification to generate bead templates for SOLiD™ platform sequencing. Samples were sequenced on the 5500XL SOLiD™ platform. The 75 bp sequences produced by the 5500XL SOLiD™ sequencer were mapped in color space using SOLiD™ LifeScope™ software version 2.5 with default parameters against the reference genome set of 2,302,715 separate gene FASTA entries. Corresponding GTF files were built to record the position of each gene on the set of artificial chromosomes.
Open-access database for genome analyses. A summary of ESOM bin size, GC content, and phylogenetic identity is located at http://ggkbase.berkeley.edu/GW2011_AR/organisms. All genomic information is publicly accessible via the website. We used the lists and genome summary functions to assess genome completeness and profile metabolic traits. ggKbase is designed around live data, whereby projects are continuously updated and improved (updates may include bin content and improvements to functional predictions; the project name, organism names and gene names remain consistent). All fasta files for our figures are hosted on this site. Raw data are available at the JGI website: http://genome.jgi.doe.gov/pages/dynamicOrganismDownload.jsf?organism=Terseqemediation_2.
Supplemental References
S1. Castelle, C. J., Hug, L. A., Wrighton, K. C., Thomas, B. C., Williams, K. H., Wu, D., Tringe, S. G., Singer, S. W., Eisen, J. A., and Banfield, J. F. (2013). Extraordinary phylogenetic diversity and metabolic versatility in aquifer sediment. Nat. Commun. 4, 2120.
S2. Hug, L. A., Castelle, C. J., Wrighton, K. C., Thomas, B. C., Sharon, I., Frischkorn, K. R., Williams, K. H., Tringe, S. G., and Banfield, J. F. (2013). Community genomic analyses constrain the distribution of metabolic traits across the Chloroflexi phylum and indicate roles in sediment carbon cycling. Microbiome 1, 22.
S3. Williams, K. H., Long, P. E., Davis, J. A., Wilkins, M. J., N’Guessan, A. L., Steefel, C. I., Yang, L., Newcomer, D., Spane, F. A., Kerkhof, L. J., et al. (2011). Acetate Availability and its Influence on Sustainable Bioremediation of Uranium-Contaminated Groundwater. Geomicrobiol. J. 28, 519-539.
S4. Peng, Y., Leung, H. C. M., Yiu, S. M., and Chin, F. Y. L. (2012). IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics 28, 1420-8.
S5. Hyatt, D., LoCascio, P. F., Hauser, L. J., and Uberbacher, E. C. (2012). Gene and translation initiation site prediction in metagenomic sequences. Bioinformatics 28, 2223-30.
S6. Suzek, B. E., Huang, H., McGarvey, P., Mazumder, R., and Wu, C. H. (2007). UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics 23, 1282-8.
S7. Wrighton, K. C., Thomas, B. C., Sharon, I., Miller, C. S., Castelle, C. J., VerBerkmoes, N. C., Wilkins, M. J., Hettich, R. L., Lipton, M. S., Williams, K. H., et al. (2012). Fermentation, hydrogen, and sulfur metabolism in multiple uncultivated bacterial phyla. Science 337, 1661-1665.
S8. Lowe, T. M., and Eddy, S. R. (1997). tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955-64.
S9. Dick, G. J., Andersson, A. F., Baker, B. J., Simmons, S. L., Thomas, B. C., Yelton, A. P., and Banfield, J. F. (2009). Community-wide analysis of microbial genome sequence signatures. Genome Biol. 10, R85.
S10. Langmead, B., and Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357-9.
S11. Puigbò, P., Wolf, Y. I., and Koonin, E. V. (2009). Search for a “Tree of Life” in the thicket of the phylogenetic forest. J. Biol. 8, 59.
S12. Sorek, R., Zhu, Y., Creevey, C. J., Francino, M. P., Bork, P., and Rubin, E. M. (2007). Genome-wide experimental determination of barriers to horizontal gene transfer. Science 318, 1449-52.
S13. Wu, D., Hartman, A., Ward, N., and Eisen, J. A. (2008). An automated phylogenetic tree-based small subunit rRNA taxonomy and alignment pipeline (STAP). PLoS One 3, e2566.
S14. Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990). Basic local alignment search tool. J. Mol. Biol. 215, 403-10.
S15. Edgar, R. C. (2004). MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5, 113.
S16. Edgar, R. C. (2004). MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792-7.
S17. Stamatakis, A. (2006). RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22, 2688-90.
S18. Pruesse, E., Quast, C., Knittel, K., Fuchs, B. M., Ludwig, W., Peplies, J., and Glöckner, F. O. (2007). SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res. 35, 7188-96.
S19. Guindon, S., and Gascuel, O. (2003). A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52, 696-704.