Current Biology

Supplemental Information

Genomic Expansion of Domain Archaea Highlights Roles for Organisms from New Phyla in Anaerobic Carbon Cycling

Cindy J. Castelle, Kelly C. Wrighton, Brian C. Thomas, Laura A. Hug, Christopher T. Brown, Michael J. Wilkins, Kyle R. Frischkorn, Susannah G. Tringe, Andrea Singh, Lye Meng Markillie, Ronald C. Taylor, Kenneth H. Williams, and Jillian F. Banfield

Figure S1, related to Figure 2. Maximum likelihood phylogeny of the 16S rRNA gene placing the Woesearchaeota and Pacearchaeota as novel phylum-level lineages. The final alignment contained 449 sequences, 60 from this study, and was generated using the SILVA SINA alignment tool and the SILVA reference alignment. The tree was constructed using RAxML-HPC under the GTRCAT model of evolution with 100 bootstrap resamplings. Bootstrap values greater than 50 are noted on the tree.

[Figure S1 image: phylogenetic tree, rooted "To bacteria". Bracket labels: DPANN, TACK. Clade labels: Parvarchaeota, Micrarchaeota, Aenigmarchaeota, Nanohaloarchaeota, Nanoarchaeota, Diapherotrites, Pacearchaeota, Woesearchaeota, Euryarchaeota, Halobacteria, Methanosarcinales, Aigarchaeota, Korarchaeota, Crenarchaeota, Thaumarchaeota (including marine Group I), SM1K20, SAGMEG, MSBL1, Marine Benthic Group A, Marine Benthic Group B, Miscellaneous Crenarchaeotic Group, Group C3, Group SCG, TMCG, MHVG, and Ancient Archaeal Group (AAG). Genome bins marked on the tree: AR4, AR5, AR6, AR10, AR11, AR13, AR15, AR16, AR17, AR18, AR20. Legend: groundwater-associated Archaea; 4 m, 5 m, and 6 m sediment-associated Archaea; archaeal genome bins.]

Figure S2 related to Table 1. ESOM constructed using tetranucleotide frequency information for genome fragments with archaeal signatures > 5 kb in length from the groundwater metagenome assemblies. A, Boundaries (dark bands) separate clusters of fragments with similar signatures (each dot represents a 5 kb fragment). B, Map was colored based on GC content, coverage and scaffold connections to highlight the different genome bins, annotated from AR1 to AR21.

[Figure S2 image: ESOM panels A and B. Genome bins labeled on the map: AR1, AR3, AR4, AR5, AR6, AR9, AR10, AR11, AR13, AR15, AR16, AR17, AR18, AR19, AR20, AR21.]

Figure S3 related to Table 1. Genome completeness estimated for all recovered genome bins based on 54 conserved single-copy genes (SCG). Numbers in circles correspond to the number of genes per genome matching the annotation. This figure and the FASTA files that support it can be accessed here: http://ggkbase.berkeley.edu/genome_summaries/107-genome_completeness_2.

[Figure S3 image: gene-count matrix of conserved single-copy genes (rows) by genome bin (columns). Column labels, in the order they appear: AR10, AR20, AR15, AR5, AR19, AR1, AR18, AR4, AR9, AR6, AR17, AR3, AR16, AR11, AR21, AR13, grouped under the phylum labels Diapherotrites, Aenigmarchaeota, Pacearchaeota, and Woesearchaeota. The summary row "Genome completeness based on 54 SCG" reads 54, 27, 50, 49, 48, 42, 40, 54, 53, 47, 42, 41, 41, 41, 34, 33. Row labels: DNA repair and recombination protein RadA; chaperonin GroEL; DNA-directed RNA polymerase subunit B; methionine aminopeptidase; transcriptional antiterminator; preprotein translocase subunit SecY; ribonuclease HII; alanyl-, arginyl-, aspartyl-, cysteinyl-, glutaminyl-, glycyl-, histidyl-, isoleucyl-, leucyl-, lysyl-, methionyl-, phenylalanyl- (alpha subunit), phenylalanyl- (beta subunit), prolyl-, seryl-, threonyl-, tryptophanyl-, tyrosyl-, and valyl-tRNA synthetase; ribosomal proteins L1, L2, L3, L4, L5, L6P/L9E, L11, L13, L14, L15, L16/L10E, L18, L22, L24, S2, S3, S4, S5, S7, S8, S9, S10, S11, S12, S13, S15P/S13e, S17, and S19.]

Figure S4 related to Figure 4. Comparative metabolic analyses of the 16 DPANN archaeal genome bins generated by the interface ggkbase (http://ggkbase.berkeley.edu/). Genes identified belong to central carbon metabolism, proteins putatively involved in electron transfer, oxidative stress, stringent response (RelA/SpoT, described in the main text), proteins involved in cell wall and cell surface biosynthesis, as well as transporters and flagella components. Also represented are the proteins involved in purine and pyrimidine metabolism and in amino acid metabolism (including aspartate, glutamate and glutamine). Numbers in circles correspond to the number of genes per genome matching the annotation. The lists are not mutually exclusive, as a given protein can have more than one function or domain and was counted in each appropriate category. This figure and the FASTA files that support it can be accessed here: http://ggkbase.berkeley.edu/genome_summaries/104-genome_metabolism_ARCH2011_2014.

[Figure S4 image: gene-count matrix of metabolic functions (rows) by genome bin (columns). Column labels, in the order they appear: AR10, AR20, AR15, AR5, AR19, AR1, AR18, AR4, AR9, AR6, AR17, AR3, AR16, AR11, AR21, AR13, grouped under the phylum labels Diapherotrites, Aenigmarchaeota, Pacearchaeota, and Woesearchaeota. Category labels: Carbon Degradation; Glycogen (starch) Metabolism; Nucleotide Salvage pathway; Glycolysis; Gluconeogenesis; Pentose Phosphate pathway; Pyruvate metabolism; TCA Cycle; Fermentation; Hydrogenases; ETC; Oxidative stress; Transporters; Cell surface; Flagella; Purine; Pyrimidine metabolism. Row labels, in the order they appear: proteases and peptidases; amylolytic degradation (amylase); glycosyl hydrolase; glycosyl transferase; glycogen (starch) metabolism; glycogen/starch synthase; nucleotide salvage pathway (AMP phosphorylase; ribose-1,5-bisP isomerase; RuBisCO); glycolysis (glucokinase; phosphoglucomutase; glucose-6-P/mannose-6-P isomerase; phosphofructokinase; fructose bisphosphate aldolase; triose phosphate isomerase; glyceraldehyde-3-phosphate DH; phosphoglycerate kinase; phosphoglyceromutase (PGM); enolase; pyruvate kinase); pyruvate phosphate dikinase/PEP synthase; gluconeogenesis (four key enzymes); pentose phosphate (6-phosphogluconolactonase; glucose-6-P-1-dehydrogenase; 6-phosphogluconate dehydrogenase; ribose-5-P isomerase; ribulose-P 3-epimerase; transaldolase; transketolase; ribose phosphate pyrophosphokinase; ribokinase); pyruvate (PFOR alpha; PFOR beta; malic enzyme); pyruvate dehydrogenase; phosphoenolpyruvate carboxylase; TCA cycle; acetate production/utilization (ACK-PTA); acetate production/utilization (alternate); aldehyde DH (ethanol production); alcohol DH (ethanol production); lactate dehydrogenase; NiFe hydrogenase; redox ferredoxin; blue copper protein; cytochrome oxidase subunit II; ATP synthase; pyrophosphatase (pumping); pyrophosphatase (non-pumping); sulfate metabolism; ggKbase oxidative stress; RelA/SpoT homolog (RSH); lytic murein transglycosylases; permease protein; heavy metal transporters; ABC transporter related protein; transport Sec system; transport sodium-hydrogen antiporter; transporters; lectins/pectins/concavalins/glucanases; cell surface PKD-domain; cell surface protein; cell wall hydrolase/autolysin; flagellar protein FlaI; flagellar protein FlaJ; preflagellin peptidase FlaK; flagellin-like protein; sugar rhamnose rmlA, rmlB, rmlC, rmlD; pyrimidine metabolism; purine metabolism; Asp/Glu/Gln metabolism.]

Supplemental Table

Table S1 related to Table 1. Samples, sequencing and assembly information.

| Sample name | Date collected | Days since acetate | Filter size (µm) | Groundwater filtered (L) | # high-quality paired DNA reads (x10^6) | High-quality DNA sequence (Gbp) | DNA assembly length (Mbp) | DNA assembly length, scaffolds ≥5 Kbp (Mbp) | # high-quality RNA reads (x10^6) | High-quality RNA sequence (Gbp) |
|---|---|---|---|---|---|---|---|---|---|---|
| GWA1 | 8/25/11 | 0 | 0.1 | 142 | 65 | 8.95 | 618 | 207 | n/a | n/a |
| GWA2 | 8/25/11 | 0 | 0.2 | 142 | 258 | 33.64 | 2,045 | 700 | 69 | 28.21 |
| GWB1 | 9/3/11 | 9 | 0.1 | 100 | 116 | 15.63 | 853 | 286 | n/a | n/a |
| GWB2 | 9/3/11 | 9 | 0.2 | 100 | 86 | 10.71 | 580 | 197 | 77 | 31.43 |
| GWC1 | 9/16/11 | 22 | 0.1 | 55 | 102 | 13.82 | 611 | 193 | n/a | n/a |
| GWC2 | 9/16/11 | 22 | 0.2 | 55 | 4 | 0.30 | 1,573 | 487 | 73 | 30.00 |
| GWD1 | 9/30/11 | 36 | 0.1 | 100 | 271 | 36.51 | 305 | 72 | n/a | n/a |
| GWD2 | 9/30/11 | 36 | 0.2 | 100 | 85 | 11.70 | 426 | 131 | 78 | 31.05 |
| GWE1 | 10/31/11 | 67 | 0.1 | 45 | 73 | 9.88 | 277 | 96 | n/a | n/a |
| GWE2 | 10/31/11 | 67 | 0.2 | 45 | 78 | 10.78 | 460 | 149 | 77 | 32.31 |
| GWF1 | 11/28/11 | 95 | 0.1 | 100 | 63 | 8.71 | 370 | 128 | n/a | n/a |
| GWF2 | 11/28/11 | 95 | 0.2 | 100 | 214 | 29.02 | 1,175 | 436 | 78 | 32.64 |
| total - 0.1 µm filter | n/a | n/a | 0.1 | 542 | 691 | 93.49 | 3,035 | 981 | n/a | n/a |
| total - 0.2 µm filter | n/a | n/a | 0.2 | 542 | 725 | 96.15 | 6,260 | 2,099 | 452 | 185.65 |
| total - both filters | n/a | n/a | n/a | 542 | 1,416 | 190 | 9,295 | 3,081 | 452 | 185.65 |

Supplemental Experimental Procedures

The sampling, sequencing, assembly, and annotation of the sediment-associated microbial communities discussed here were described previously [S1, S2]. Briefly, DNA was extracted from a sediment core (well D04 [S1]) at different depths (4, 5 and 6 m) and sequenced using Illumina HiSeq paired-end technology. Assembly and annotation followed methods similar to those described below.

Groundwater field experiment and sample collection. The field experiment was carried out between August 25 and December 12, 2011 at the Rifle Integrated Field Research Challenge (IFRC) site adjacent to the Colorado River, Colorado, USA. For biostimulation experiments at the site, acetate was added to provide approximately 15 mM acetate to the groundwater over the course of 72 days, as previously described [S3]. Microbial community samples (A, B, C, D, E, and F) were collected at 0, 7, 20, 34, 65, and 93 days after the start of acetate addition to the aquifer, between August and November of 2011, so that a range of geochemical conditions from iron reduction to sulfate reduction was sampled (Figure 1). Acetate addition ceased after 72 days. Samples for measurement of aqueous geochemistry were taken 5 meters below ground surface. Ferrous iron and sulfide concentrations (Figure 1) were analyzed immediately after sampling using the HACH phenanthroline assay and a sulfide reagent kit, respectively (HACH, CO). Acetate and sulfate concentrations were determined using a Dionex ICS-2100 ion chromatograph equipped with an AS-18 guard and analytical column (details are given in Williams et al., 2011 [S3]). Microbial cells from pumped groundwater that passed through a 1.2 µm pre-filter (Pall, NY) were retained on 0.1 and 0.2 µm filters and flash-frozen in liquid nitrogen immediately upon collection for DNA extraction.

DNA extraction and sequencing. Genomic DNA was extracted from approximately 1.5 g of each of the A, B, C, D, E, and F frozen filter samples using a modified PowerSoil DNA Isolation Kit (MO-BIO) protocol, as follows. The filter material was cut into strips and vortexed with the provided beads and 5 mL of PowerBead Solution. Tubes were then flash-frozen and thawed with intermittent vortexing. After thawing, the remaining PowerBead Solution was added and the manufacturer's protocol was followed, with the addition of a 30-minute incubation at 65°C with intermittent shaking. The resulting DNA eluate was concentrated by sodium acetate/ethanol precipitation with glycogen. DNA was resuspended in 50 µl of elution buffer from the PowerSoil Kit before being submitted for library preparation. Illumina HiSeq 2000 2 × 150 bp paired-end sequencing was conducted by the Joint Genome Institute. Sequencing amounts for samples A to F from 0.1 and 0.2 µm filters ranged from 9.5 to 40.7 Gbp (Table S1).

Assembly and functional annotations. Reads were quality trimmed using Sickle (https://github.com/najoshi/sickle) with default settings. Trimmed paired-end reads were assembled using IDBA_UD with default parameters [S4]. Genes on scaffolds >5 kb in length were predicted using Prodigal with the metagenome option [S5]. For each scaffold, we determined the GC content, coverage, genetic code, and profile of phylogenetic affiliation based on the best hit for each gene against the UniRef90 database [S6]. Predicted ORFs were run through a multidatabase search pipeline for functional prediction as previously described [S7]. In addition, UniRef90 and KEGG sequences were searched back against the predicted amino acid sequences to identify reciprocal best BLAST matches. Reciprocal best BLAST matches were filtered at a bit score threshold of 300, and one-way BLAST matches at a bit score threshold of 60. Transfer RNA sequences were predicted using tRNAscan-SE [S8].
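The two bit-score cut-offs can be illustrated with a minimal sketch. It assumes BLAST hits have already been parsed into (query, subject, bit score) tuples; the function names, the data layout, the inclusive comparison, and the choice to let a sub-threshold reciprocal hit fall back to the one-way filter are illustrative assumptions, not the pipeline described in [S7].

```python
from typing import Dict, Iterable, Tuple

Hit = Tuple[str, str, float]  # (query_id, subject_id, bit_score)

RECIPROCAL_MIN_BITS = 300.0  # reciprocal best-hit threshold (from the text)
ONE_WAY_MIN_BITS = 60.0      # one-way hit threshold (from the text)


def best_hits(hits: Iterable[Hit]) -> Dict[str, Tuple[str, float]]:
    """Keep the highest-scoring subject for each query."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, bits in hits:
        if query not in best or bits > best[query][1]:
            best[query] = (subject, bits)
    return best


def classify_hits(forward: Iterable[Hit],
                  reverse: Iterable[Hit]) -> Dict[str, Tuple[str, str]]:
    """Label each ORF's best database hit as 'reciprocal' or 'one-way'.

    forward: ORF -> database hits; reverse: database -> ORF hits.
    Assumption: thresholds are inclusive, and a reciprocal hit scoring
    below 300 bits is still eligible for the one-way filter.
    """
    fwd = best_hits(forward)
    rev = best_hits(reverse)
    calls: Dict[str, Tuple[str, str]] = {}
    for orf, (subject, bits) in fwd.items():
        is_reciprocal = subject in rev and rev[subject][0] == orf
        if is_reciprocal and bits >= RECIPROCAL_MIN_BITS:
            calls[orf] = (subject, "reciprocal")
        elif bits >= ONE_WAY_MIN_BITS:
            calls[orf] = (subject, "one-way")
    return calls
```

ORFs whose best hit scores below 60 bits receive no call in this sketch.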

Genome binning and reconstruction. We identified segregated clusters of scaffolds corresponding to individual genomes based on scaffold tetranucleotide sequence composition. Clusters were identified using the Databionics implementation of emergent self-organizing map (ESOM) analysis [S9]. The primary map structure was established using 5-kb fragments (all fragments >10 kb were subdivided into 5-kb segments). Sequencing reads were mapped to scaffolds from genome bins of interest for multiple rounds of assembly curation. Mapping was carried out using Bowtie2 [S10]. Paired read mappings were visualized using Geneious 7.0.6 (Biomatters). Assembly errors, likely introduced in the scaffolding step of the IDBA_UD assembly, were identified based on local sequence inconsistencies. These were corrected by visual inspection or by alteration of the consensus sequence following insertion of Ns and remapping of unplaced read pairs. A similar approach was used to fill scaffolding gaps and extend scaffold ends.
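The input to the ESOM, tetranucleotide frequencies on 5-kb fragments, can be sketched as follows. This is an illustration only: the function names are hypothetical, merging a short trailing remainder into the last window is an assumption, and reverse-complement tetranucleotides are not collapsed, which the actual workflow [S9] may handle differently.

```python
from collections import Counter
from itertools import product
from typing import Dict, List

WINDOW = 5_000        # fragment size used for the primary map (from the text)
SPLIT_ABOVE = 10_000  # scaffolds longer than this are subdivided (from the text)
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]  # 256 tetranucleotides


def fragment(seq: str) -> List[str]:
    """Split a scaffold into 5-kb windows if it exceeds 10 kb.

    Assumption: a trailing remainder shorter than 5 kb is merged into
    the last window rather than discarded.
    """
    if len(seq) <= SPLIT_ABOVE:
        return [seq]
    n_windows = len(seq) // WINDOW
    pieces = [seq[i * WINDOW:(i + 1) * WINDOW] for i in range(n_windows - 1)]
    pieces.append(seq[(n_windows - 1) * WINDOW:])  # last window plus remainder
    return pieces


def tetranucleotide_frequencies(seq: str) -> Dict[str, float]:
    """Return relative frequencies of all 256 tetranucleotides in seq.

    4-mers containing characters other than A, C, G, T (e.g. N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    valid = {k: counts.get(k, 0) for k in KMERS}
    total = sum(valid.values())
    return {k: (v / total if total else 0.0) for k, v in valid.items()}
```

Each fragment then contributes one 256-dimensional vector to the map.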

Genome bin completion estimates and phylogenetic assignment. Our primary method for assessing genome completeness was based on the presence or absence of orthologous genes representing a core gene set that is widely conserved as single-copy genes among archaea [S11]. For ESOM bins containing scaffolds from more than one genome, scaffolds were assigned to specific organisms within the bin by coverage, phylogenetic identity, and GC content.
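A presence/absence completeness estimate of this kind reduces to simple counting. The sketch below assumes per-bin marker counts are available as a dictionary; the function name and the reporting of multi-copy markers are illustrative choices, not taken from the ggKbase implementation. Figure S3 uses a set of 54 single-copy genes.

```python
from typing import Dict, Iterable


def completeness(marker_counts: Dict[str, int],
                 marker_set: Iterable[str]) -> Dict[str, float]:
    """Summarize how many markers from marker_set are present in one bin.

    marker_counts maps a marker name to the number of copies found.
    Markers found more than once are reported separately, since extra
    copies may indicate a mixed bin (an interpretive assumption).
    """
    markers = list(marker_set)
    if not markers:
        raise ValueError("marker_set must not be empty")
    present = sum(1 for m in markers if marker_counts.get(m, 0) >= 1)
    multi = sum(1 for m in markers if marker_counts.get(m, 0) > 1)
    return {
        "present": present,
        "total": len(markers),
        "percent_complete": 100.0 * present / len(markers),
        "multi_copy": multi,
    }
```

For example, a bin in which 27 of the 54 markers were detected would be reported as 50% complete under this measure.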

Concatenated ribosomal protein phylogeny. A set of 15 syntenic ribosomal proteins was selected as a stable genome marker region based on published lateral gene transfer frequencies of zero (rpL2, 3, 4, 5, 6, 14, 15, 18, 22, 24 and rpS3, 8, 10, 17, 19) [S12, S13]. Scaffolds from the groundwater and sediment metagenomes containing at least 50% of the 15 archaeal genes were identified. An existing reference set of ribosomal proteins was augmented by mining the NCBI and Joint Genome Institute IMG databases for recently sequenced genomes from the archaeal domain, and by identifying and adding the closest sequenced genome to each ribosomal protein sequence from the metagenomes in this study [S14]. The complete data set contained 312 taxa. Each individual ribosomal protein data set was aligned using MUSCLE version 3.8.31 [S15, S16] and then manually curated to remove end gaps and single-taxon insertions. A maximum likelihood tree was constructed with RAxML-HPC under the PROTGAMMALG evolutionary model with 100 bootstraps [S17].
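The scaffold selection and concatenation steps can be sketched as below. The marker names follow the list in the text; the function names, the input dictionaries, the alphabetical marker order, and the padding of missing markers with gap characters are assumptions made for illustration.

```python
from typing import Dict, List, Set

RIBOSOMAL_MARKERS = frozenset(
    [f"rpL{n}" for n in (2, 3, 4, 5, 6, 14, 15, 18, 22, 24)]
    + [f"rpS{n}" for n in (3, 8, 10, 17, 19)]
)
MIN_FRACTION = 0.5  # "at least 50% of the 15 archaeal genes" (from the text)


def scaffolds_for_tree(scaffold_genes: Dict[str, Set[str]]) -> List[str]:
    """Return scaffolds carrying at least half of the 15 marker genes."""
    keep = []
    for scaffold, genes in scaffold_genes.items():
        found = len(RIBOSOMAL_MARKERS & genes)
        if found / len(RIBOSOMAL_MARKERS) >= MIN_FRACTION:
            keep.append(scaffold)
    return sorted(keep)


def concatenate(alignments: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Join per-marker alignments (marker -> {taxon: aligned sequence}).

    Assumption: a taxon missing from one marker alignment is padded
    with gaps of that alignment's width.
    """
    taxa = sorted({t for aln in alignments.values() for t in aln})
    parts: Dict[str, List[str]] = {t: [] for t in taxa}
    for marker in sorted(alignments):
        aln = alignments[marker]
        if not aln:
            continue  # nothing to add for an empty alignment
        width = len(next(iter(aln.values())))
        for t in taxa:
            parts[t].append(aln.get(t, "-" * width))
    return {t: "".join(p) for t, p in parts.items()}
```

With 15 markers, the 50% cut-off means a scaffold needs at least 8 of them to be retained.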

16S rRNA gene phylogenetic analyses. Phylogenetic placement of archaeal genome bins was further confirmed using full-length or near full-length 16S rRNA gene sequences (>600 bp) derived from groundwater dataset gwa2 (Sample A; 0.2 µm filter). 16S rRNA genes were identified through BLASTn of the metagenome scaffolds against a 16S rRNA reference database. Fifty-nine sequences longer than 600 bp were identified (Figure S1). 16S rRNA genes were excised from the scaffolds and aligned to the SILVA database using the SINA aligner (http://www.arb-silva.de/aligner) [S18], simultaneously identifying all SILVA sequences with >70% identity to the archaeal 16S rRNA genes from this study. The SILVA alignment of the metagenome-derived and related 16S rRNA sequences was streamlined by removing conserved gaps (>99% gaps) with GapStreeze v.2.1.0 (www.hiv.lanl.gov/content/sequence/GAPSTREEZE/gap.html). The alignment was further trimmed to remove ambiguously aligned regions. A maximum-likelihood tree was constructed with RAxML-HPC under the GTRCAT model with 100 bootstraps.
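The gap-removal step amounts to dropping alignment columns in which more than 99% of sequences carry a gap. The following is a minimal sketch of that idea, not the GapStreeze implementation; the function name, the treatment of both "-" and "." as gap characters, and the dictionary input format are assumptions.

```python
from typing import Dict


def strip_gappy_columns(alignment: Dict[str, str],
                        max_gap_fraction: float = 0.99,
                        gap_chars: str = "-.") -> Dict[str, str]:
    """Remove columns whose gap fraction exceeds max_gap_fraction.

    alignment maps sequence names to aligned sequences of equal length.
    """
    seqs = list(alignment.values())
    if not seqs:
        return {}
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    n = len(seqs)
    keep = [
        i for i in range(length)
        if sum(s[i] in gap_chars for s in seqs) / n <= max_gap_fraction
    ]
    return {name: "".join(seq[i] for i in keep)
            for name, seq in alignment.items()}
```

In an alignment of a few hundred sequences, such as the 449 used for Figure S1, the 99% threshold keeps any column in which at least five sequences have a residue.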

Protein phylogenetic analyses. For specific functional genes of interest, reference datasets were generated from sequences mined from NCBI databases. Alignments were generated using MUSCLE v. 3.8.31 [S15, S16] and curated manually, and phylogenies were inferred using PhyML [S19] under the LG+Gamma model of evolution with 100 bootstrap resamplings.

RNA extraction. RNA was extracted using Invitrogen TRIzol® Reagent (cat#15596018), followed by genomic DNA removal and cleaning using the Qiagen RNase-Free DNase Set kit (cat#79254) and the Qiagen Mini RNeasy™ kit (cat#74104). An Agilent 2100 Bioanalyzer was used to assess the integrity of the RNA samples. Only RNA samples with an RNA Integrity Number between 8 and 10 were used.

RNA sequencing. The Applied Biosystems SOLiD™ Total RNA-Seq kit (catalog number 4445374) was used to generate the cDNA template library. The SOLiD™ EZ Bead system was used to perform emulsion clonal bead amplification to generate bead templates for SOLiD™ platform sequencing. Samples were sequenced on the 5500XL SOLiD™ platform. The 75 bp sequences produced by the 5500XL SOLiD™ sequencer were mapped in color space using SOLiD™ LifeScope™ software version 2.5 with default parameters against the reference genome set of 2,302,715 separate gene FASTA entries. Corresponding GTF files were built to record the position of each gene on the set of artificial chromosomes.

Open-access database for genome analyses. A summary of ESOM bin size, GC content, and phylogenetic identity is located at http://ggkbase.berkeley.edu/GW2011_AR/organisms. All genomic information is publicly accessible via the website. We used the ggKbase lists and genome summary functions to assess genome completeness and profile metabolic traits. ggKbase is designed around live data, whereby projects are continuously updated and improved (updates may include bin content and improvements to functional predictions; the project name, organism names and gene names remain consistent). All FASTA files for our figures are hosted on this site. Raw data are available at the JGI website: http://genome.jgi.doe.gov/pages/dynamicOrganismDownload.jsf?organism=Terseqemediation_2.

Supplemental References

S1. Castelle, C. J., Hug, L. A., Wrighton, K. C., Thomas, B. C., Williams, K. H., Wu, D., Tringe, S. G., Singer, S. W., Eisen, J. A., and Banfield, J. F. (2013). Extraordinary phylogenetic diversity and metabolic versatility in aquifer sediment. Nat. Commun. 4, 2120.

S2. Hug, L. A., Castelle, C. J., Wrighton, K. C., Thomas, B. C., Sharon, I., Frischkorn, K. R., Williams, K. H., Tringe, S. G., and Banfield, J. F. (2013). Community genomic analyses constrain the distribution of metabolic traits across the Chloroflexi phylum and indicate roles in sediment carbon cycling. Microbiome 1, 22.

S3. Williams, K. H., Long, P. E., Davis, J. A., Wilkins, M. J., N’Guessan, A. L., Steefel, C. I., Yang, L., Newcomer, D., Spane, F. A., Kerkhof, L. J., et al. (2011). Acetate Availability and its Influence on Sustainable Bioremediation of Uranium-Contaminated Groundwater. Geomicrobiol. J. 28, 519-539.

S4. Peng, Y., Leung, H. C. M., Yiu, S. M., and Chin, F. Y. L. (2012). IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics 28, 1420-8.

S5. Hyatt, D., LoCascio, P. F., Hauser, L. J., and Uberbacher, E. C. (2012). Gene and translation initiation site prediction in metagenomic sequences. Bioinformatics 28, 2223-30.

S6. Suzek, B. E., Huang, H., McGarvey, P., Mazumder, R., and Wu, C. H. (2007). UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics 23, 1282-8.

S7. Wrighton, K. C., Thomas, B. C., Sharon, I., Miller, C. S., Castelle, C. J., VerBerkmoes, N. C., Wilkins, M. J., Hettich, R. L., Lipton, M. S., Williams, K. H., et al. (2012). Fermentation, hydrogen, and sulfur metabolism in multiple uncultivated bacterial phyla. Science 337, 1661-1665.

S8. Lowe, T. M., and Eddy, S. R. (1997). tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955-64.

S9. Dick, G. J., Andersson, A. F., Baker, B. J., Simmons, S. L., Thomas, B. C., Yelton, A. P., and Banfield, J. F. (2009). Community-wide analysis of microbial genome sequence signatures. Genome Biol. 10, R85.

S10. Langmead, B., and Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357-9.

S11. Puigbò, P., Wolf, Y. I., and Koonin, E. V. (2009). Search for a “Tree of Life” in the thicket of the phylogenetic forest. J. Biol. 8, 59.

S12. Sorek, R., Zhu, Y., Creevey, C. J., Francino, M. P., Bork, P., and Rubin, E. M. (2007). Genome-wide experimental determination of barriers to horizontal gene transfer. Science 318, 1449-52.

S13. Wu, D., Hartman, A., Ward, N., and Eisen, J. A. (2008). An automated phylogenetic tree-based small subunit rRNA taxonomy and alignment pipeline (STAP). PLoS One 3, e2566.

S14. Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990). Basic local alignment search tool. J. Mol. Biol. 215, 403-10.

S15. Edgar, R. C. (2004). MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5, 113.

S16. Edgar, R. C. (2004). MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792-7.

S17. Stamatakis, A. (2006). RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22, 2688-90.

S18. Pruesse, E., Quast, C., Knittel, K., Fuchs, B. M., Ludwig, W., Peplies, J., and Glöckner, F. O. (2007). SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res. 35, 7188-96.

S19. Guindon, S., and Gascuel, O. (2003). A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52, 696-704.
