Accurate estimation of microbial communities using 16S tags
Julien Tremblay
16S rRNA as phylogenetic marker gene
Escherichia coli 16S rRNA
Primary and Secondary Structure
70S ribosome subunits:
• 30S: 16S rRNA, 21 proteins
• 50S: 5S rRNA, 23S rRNA, 34 proteins
Highly conserved between different species of bacteria and archaea
Figure credit: Falk Warnecke
16S rRNA in environmental microbiology (Sanger clone libraries)
Falk Warnecke
900-1100 bp length
Next generation sequencing (NGS)
• Illumina: 10-400M 150 bp reads/lane ($)
• 454: 0.5M 450 bp reads ($$)
[Figure: read length vs. throughput for the two platforms]
Game plan to survey microbial diversity
16S rRNA variable regions: V1 V2 V3 V4 V5 V6 V7 V8 V9
• Generate amplicons of a given variable region from the bacterial community (many millions of sequences)
• Reduce the dataset by dereplication/clustering (cluster sizes in the figure: x10,000, x800, x1,200, x200, x2,000, x1,000, x1, x10)
• Identification (BLAST, RDP classifier)
Amplicon tags = deeper, cheaper, faster
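The clustering step can be sketched as a greedy, centroid-based pass. This is only an illustration under simplifying assumptions: reads are compared position by position without gaps and must have equal length, whereas real clustering tools align the sequences. The function names and toy reads are hypothetical.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold=0.97):
    """Assign each read to the first centroid it matches at >= threshold.

    A read that matches no existing centroid starts a new cluster.
    Returns a list of (centroid, member_count) pairs.
    """
    clusters = []  # each entry is [centroid, count]
    for seq in seqs:
        for cluster in clusters:
            if identity(seq, cluster[0]) >= threshold:
                cluster[1] += 1
                break
        else:
            clusters.append([seq, 1])
    return [(centroid, count) for centroid, count in clusters]


# Toy reads (made up). A 0.9 cutoff is used only because they are 10 bases long.
reads = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAT", "TTTTGGGGCC"]
clusters = greedy_cluster(reads, threshold=0.9)
print(clusters)
```

Each cluster keeps one representative plus a count, which is the "x10,000, x800, ..." reduction the slide describes.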
Rare biosphere
[Figure: rank-abundance curve (axes: rank, abundance) running from high abundance to low abundance; the low-abundance tail is labeled “Rare biosphere”]
• High sequencing depth of NGS reveals “rare” OTUs
• Sequencing error? Chimeras? Background noise? Relatively small size of amplicons
• Rare bias sphere?
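A rank-abundance curve is simply the OTU counts sorted from most to least abundant. A minimal sketch, assuming each read has already been assigned an OTU label; the labels and the singleton cutoff below are illustrative assumptions, not values from the talk.

```python
from collections import Counter


def rank_abundance(otu_labels):
    """Return (rank, otu, count) tuples, most abundant OTU first."""
    counts = Counter(otu_labels)
    # Sort by count descending, then by name so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, otu, n) for rank, (otu, n) in enumerate(ordered, start=1)]


def rare_otus(otu_labels, max_count=1):
    """OTUs observed at most max_count times (hypothetical cutoff)."""
    return sorted(o for o, n in Counter(otu_labels).items() if n <= max_count)


# Toy data: one dominant OTU, one moderate OTU, two singletons.
labels = ["otu_a"] * 50 + ["otu_b"] * 5 + ["otu_c", "otu_d"]
for rank, otu, n in rank_abundance(labels):
    print(rank, otu, n)
print("rare:", rare_otus(labels))
```

Whether the singletons at the tail are real taxa or artifacts is exactly the question the slide raises; the code only shows where they sit on the curve.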
Control experiment: estimate the rare biosphere in a single strain of E. coli
Primer pairs: 27F/342R (V1 & V2) and 1114F/1392R (V8)
Is the rare biosphere an artifact of NGS error?
Kunin et al., (2009), Environ. Microbiol.
It should not be, if relatively stringent clustering parameters are applied
Subject to controversy – Is rare always real?
Quince et al., (2009), Nat. Methods
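A back-of-the-envelope calculation shows why sequencing error matters for rare OTUs and why the clustering threshold absorbs part of it. The sketch assumes independent errors at a uniform per-base rate, which is a simplification; the 1% rate is an illustrative assumption, not a figure from the cited papers.

```python
def prob_read_has_error(per_base_error: float, read_length: int) -> float:
    """P(at least one error), assuming independent, uniform per-base errors."""
    return 1.0 - (1.0 - per_base_error) ** read_length


def mismatches_tolerated(read_length: int, identity_percent: int) -> int:
    """Largest mismatch count that still satisfies the identity cutoff."""
    return read_length * (100 - identity_percent) // 100


# Assumed values for illustration only.
error_rate = 0.01
length = 345
print(f"P(read has >= 1 error): {prob_read_has_error(error_rate, length):.3f}")
print("mismatches allowed at 97%:", mismatches_tolerated(length, 97))
print("mismatches allowed at 100%:", mismatches_tolerated(length, 100))
```

Under these assumptions most reads carry at least one error, so clustering at 100% identity would split a single strain into many spurious OTUs, while a 97% cutoff tolerates a handful of mismatches per read.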
PyroTagger (for 454 amplicons)
Unzip, validate
Remove low-quality reads
Redundancy removal
PyroClust & Uclust
Remove chimeras
Sample comparison, post-processing
pyrotagger.jgi-psf.org
Classification and barcode separation
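Barcode separation can be sketched as grouping reads by a leading barcode sequence. This is a minimal illustration that assumes exact barcode matches at the start of each read; the barcodes and sample names are hypothetical and this is not PyroTagger's actual implementation.

```python
from collections import defaultdict


def demultiplex(reads, barcodes):
    """Group reads by their leading barcode and strip the barcode.

    reads: iterable of sequence strings
    barcodes: dict mapping barcode sequence -> sample name
    Reads with no exact barcode match go to the "unassigned" bin.
    """
    by_sample = defaultdict(list)
    for read in reads:
        for barcode, sample in barcodes.items():
            if read.startswith(barcode):
                by_sample[sample].append(read[len(barcode):])
                break
        else:
            by_sample["unassigned"].append(read)
    return dict(by_sample)


# Hypothetical barcodes and toy reads.
barcodes = {"ACGT": "sample_1", "TGCA": "sample_2"}
reads = ["ACGTGGGTTT", "TGCAAAACCC", "ACGTCCCAAA", "GGGGTTTTAA"]
print(demultiplex(reads, barcodes))
```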
• Sequences of cluster (OTU) representatives
• Blast vs GreenGenes and Silva databases, dereplicated at 99.5%
• Distribution of microbial phyla in the dataset
• Also see the QIIME pipeline
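The phylum distribution is obtained by summing cluster read counts per phylum and dividing by the total. A small sketch with made-up counts; the input layout (phylum, read count) is an assumption, not the pipeline's real output format.

```python
from collections import defaultdict


def phylum_fractions(clusters):
    """clusters: iterable of (phylum, read_count). Returns phylum -> fraction."""
    totals = defaultdict(int)
    for phylum, count in clusters:
        totals[phylum] += count
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {phylum: n / grand_total for phylum, n in totals.items()}


# Toy cluster table (counts are invented for illustration).
toy_clusters = [("Firmicutes", 60), ("Bacteroidetes", 30), ("Firmicutes", 10)]
for phylum, fraction in sorted(phylum_fractions(toy_clusters).items()):
    print(f"{phylum}: {fraction:.0%}")
```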
[Figure: stacked bar chart of relative abundance (0% to 100%) per sample (10GP1, 1PS1, 1PS2, 2A?1); legend: Proteobacteria, Metazoa, Firmicutes, Bacteroidetes, Spirochaetes]
Cluster | Read counts (samples 1-6, non-empty cells only) | % identity | Alignment length | Mismatches | Gaps | Query start | Query end | Hit start | Hit end | E-value | Score | Full name + ID | Taxonomy
Cluster1 | 6732 1 267 7 14532 | 97.7 | 345 | 8 | 0 | 1 | 345 | 1336 | 992 | 1.00E-176 | 620 | among geographically regions central Tibet geothermal spring mat clone DTM42210319 | Bacteria; Firmicutes; Clostridia; Clostridiales
Cluster2 | 1 3464 322 1981 1 9372 | 98.8 | 345 | 4 | 0 | 1 | 345 | 1247 | 903 | 0 | 652 | Microbial consortia fermentor methanogenic bioreactor clone EBR-02E-0436134800 | Bacteria; Firmicutes; Clostridia; Clostridiales; Coprococcus
Cluster3 | 6303 1 2330 7 726 | 100 | 345 | 0 | 0 | 1 | 345 | 1345 | 1001 | 0 | 684 | Bacteroides sp. str. 253c73683 | Bacteria; Bacteroidetes; Bacteroidetes (class); Bacteroidales; Bacteroidaceae; Bacteroides
Cluster4 | 4464 58 2750 2153 | 97.7 | 345 | 8 | 0 | 1 | 345 | 1338 | 994 | 1.00E-176 | 620 | Geobacillus sp. D195366419 | Bacteria; Firmicutes; Bacillales; Bacillaceae; Geobacillus
Cluster5 | 8836 4 5052 7 266 | 99.7 | 345 | 1 | 0 | 1 | 345 | 1382 | 1038 | 0 | 676 | Electricigen Enrichment MFC full-scale anaerobic bioreactor sludge treating brewery waste clone 31f06226581 | Bacteria; Bacteroidetes; Bacteroidales; Bacteroidaceae
Cluster7 | 4218 530 2304 1 4690 | 99.1 | 345 | 1 | 1 | 1 | 345 | 1273 | 931 | 0 | 654 | Thermoanaerobacterium saccharolyticum str. B6A356922 | Bacteria; Firmicutes; Clostridia; Thermoanaerobacterales; Thermoanaerobacterales Family III. Incertae Sedis
Cluster8 | 2628 55 7769 | 99.7 | 345 | 1 | 0 | 1 | 345 | 1151 | 807 | 0 | 676 | Portuguese dry smoked sausages (chouricos) type Ribatejano isolate str. Te16R100979 | Bacteria
Cluster9 | 1111 88 8102 13 | 96.5 | 345 | 10 | 2 | 1 | 345 | 1365 | 1023 | 3.00E-162 | 573 | Clostridium stercorarium str. DSM 8532T31278 | Bacteria; Firmicutes; Clostridia; Clostridiales; Clostridiaceae; Clostridium
Cluster10 | 1 648 12 1358 2837 | 93.3 | 345 | 23 | 0 | 1 | 345 | 1288 | 944 | 8.00E-141 | 502 | packed-bed reactor clone CFB-4204245 | Bacteria; Firmicutes; Clostridia
Cluster13 | 1737 1 128 115 3971 11 | 100 | 345 | 0 | 0 | 1 | 345 | 1388 | 1044 | 0 | 684 | Bacillus circulans str. X3345413 | Bacteria; Firmicutes; Bacillales; Bacillaceae; Bacillus
Cluster15 | 2676 1 467 2885 2 | 98.8 | 345 | 4 | 0 | 1 | 345 | 1277 | 933 | 0 | 652 | mesophilic anaerobic BSA digester clone BSA1B-05107466 | Bacteria; Firmicutes; Clostridia
Cluster17 | 4706 663 104 153 | 100 | 345 | 0 | 0 | 1 | 345 | 1326 | 982 | 0 | 684 | Bacillus sp. str. SL177336159 | Bacteria; Firmicutes; Bacillales; Bacillaceae; Bacillus
Cluster18 | 828 1 1629 29 118 | 96.8 | 347 | 9 | 1 | 1 | 345 | 1359 | 1013 | 8.00E-169 | 595 | Actinobaculum sp. P1 str. P2P_1983011 | Bacteria; Actinobacteria; Actinobacteridae; Actinomycetales; Actinomycineae; Actinomycetaceae
Cluster19 | 1353 23 388 960 | 99.4 | 345 | 2 | 0 | 1 | 345 | 1245 | 901 | 0 | 668 | Guguan hot spring isolate str. K1L1103882 | Bacteria; Firmicutes; Clostridia
Cluster20 | 2303 2 378 3214 | 100 | 345 | 0 | 0 | 1 | 345 | 1351 | 1007 | 0 | 684 | Clostridium cellulosi16059 | Bacteria; Firmicutes; Clostridia; Clostridiales; Clostridiaceae; Clostridium
Cluster21 | 1446 147 2777 | 97.7 | 345 | 7 | 1 | 1 | 345 | 1362 | 1019 | 3.00E-174 | 613 | Clostridiaceae bacterium SN021217060 | Bacteria; Firmicutes; Clostridia; Clostridiales
Cluster22 | 4062 138 8319 | 99.1 | 345 | 3 | 0 | 1 | 345 | 1261 | 917 | 0 | 660 | Clostridiaceae str. 80Wc28479 | Bacteria; Firmicutes; Clostridia; Clostridiales
Cluster23 | 2593 722 3 5204 | 98.3 | 345 | 6 | 0 | 1 | 345 | 1375 | 1031 | 0 | 636 | intestinal that activate dietary lignan secoisolariciresinol diglucoside human feces isolate ED-Mt61/PYG-s6anaerobic str. ED-Mt61/PYG-s6142216 | Bacteria
Cluster24 | 1098 1 1354 43 1065 | 100 | 345 | 0 | 0 | 1 | 345 | 1356 | 1012 | 0 | 684 | Klebsiella pneumoniae str. FIUMS1358763 | Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales; Enterobacteriaceae; Klebsiella
Cluster29 | 150 1321 12 1 | 100 | 345 | 0 | 0 | 1 | 345 | 1378 | 1034 | 0 | 684 | Pseudomonas indica38217 | Bacteria; Proteobacteria; Gammaproteobacteria; Pseudomonadales; Pseudomonadaceae
Cluster31 | 1203 479 4 28 | 97.1 | 345 | 10 | 0 | 1 | 345 | 1274 | 930 | 8.00E-172 | 605 | Clostridium sp. str. IMSNU 40011102328 | Bacteria; Firmicutes; Clostridia; Clostridiales; Clostridiaceae; Clostridium
Cluster33 | 86 247 98 1680 28 86 | 96.8 | 345 | 11 | 0 | 1 | 345 | 1347 | 1003 | 2.00E-169 | 597 | Symbiobacterium sp. str. KA13344098 | Bacteria; Firmicutes; Clostridia; Clostridiales; Clostridiales Family XVIII. Incertae Sedis
Cluster34 | 1079 322 7 139 | 100 | 345 | 0 | 0 | 1 | 345 | 1376 | 1032 | 0 | 684 | human fecal clone SJTU_G_05_26204458 | Bacteria; Proteobacteria; Gammaproteobacteria; Betaproteobacteria
Cluster35 | 625 165 1 2470 1 | 96.8 | 345 | 11 | 0 | 1 | 345 | 1366 | 1022 | 3.00E-159 | 563 | on -Arctic peninsula Svalbard Norway determined genes and rumen isolates reindeer fed pelleted concentrates (RF-80) clone AF11153291 | Bacteria; Firmicutes; Clostridia; Clostridiales; RF30; RF6
Cluster41 | 772 378 19 | 96.8 | 345 | 11 | 0 | 1 | 345 | 1379 | 1035 | 2.00E-169 | 597 | human fecal clone SJTU_C_03_72204117 | Bacteria; Bacteroidetes; Bacteroidales
Cluster44 | 353 6 1436 | 98.8 | 346 | 3 | 1 | 1 | 345 | 1348 | 1003 | 0 | 646 | Clostridium jejuense str. HY-35-12T104447 | Bacteria; Firmicutes; Clostridia; Clostridiales; Clostridiaceae; Clostridium
Cluster50 | 347 4 64 17 592 3 | 100 | 345 | 0 | 0 | 1 | 345 | 1382 | 1038 | 0 | 684 | Enterococcus casseliflavus str. eS852312359 | Bacteria; Firmicutes; Lactobacillales; Enterococcaceae; Enterococcus
Cluster53 | 354 33 1 543 | 100 | 345 | 0 | 0 | 1 | 345 | 1334 | 990 | 0 | 684 | Clostridium sp. str. BG-C66357585 | Bacteria; Firmicutes; Clostridia; Clostridiales; Clostridiaceae; Clostridium
Cluster67 | 490 2 758 | 97.7 | 345 | 8 | 0 | 1 | 345 | 1372 | 1028 | 1.00E-176 | 620 | mesophilic anaerobic digester clone G35_D8_H_B_E11222733 | Bacteria; Firmicutes; Clostridia; Clostridiales
[Figure labels: Clostridiaceae, Prevotellaceae, Thermoanaerobacterium, Peptostreptococcaceae, Actinobaculum, Thermoanaerobacteriales, Pseudomonas, Symbiobacterium, SJTU_B_02_45, Bacteroidaceae]
Illumina tags (itags)
• Typical 454 run: 450,000 – 500,000 reads
• “Typical” Illumina run:
  • GAIIx: 10,000,000 – 40,000,000 reads/lane
  • HiSeq: ~350,000,000 reads/lane
  • MiSeq (available soon): ~4,000,000 reads/lane
• Move 16S tag sequencing to the Illumina platform
• HiSeq = huge output compared to 454 (suitable for big projects with 1000+ indexes (barcodes)/libraries)
• MiSeq = moderately high throughput (more suitable?)
• Higher throughput requires a more efficient clustering algorithm (SeqObs).
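A quick calculation shows what these outputs mean per library when a lane is shared by many barcodes. The sketch assumes reads are split evenly across barcodes and that none are lost to filtering, which is optimistic; the GAIIx figure is the upper end of the quoted range.

```python
def reads_per_sample(reads_per_lane: int, n_barcodes: int) -> int:
    """Average reads per library if a lane is split evenly across barcodes."""
    if n_barcodes <= 0:
        raise ValueError("n_barcodes must be positive")
    return reads_per_lane // n_barcodes


# Per-lane outputs quoted on the slide.
platforms = {"GAIIx": 40_000_000, "HiSeq": 350_000_000, "MiSeq": 4_000_000}
for name, reads in platforms.items():
    print(name, reads_per_sample(reads, 1000), "reads per library at 1000 barcodes")
```

With 1000 barcodes, a HiSeq lane still leaves roughly a 454 run's worth of reads for every library, while a MiSeq run leaves a few thousand.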
Illumina tags (itags)
• 454 = “1” read
• Illumina = “2” reads => have to be assembled
• Both reads need to be of good quality
[Figure: a single 454 read (ACGTGGTACTACGTGAT…, ~200-220 bp) compared with two overlapping Illumina reads assembled into ~252 bp]
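Assembling an Illumina read pair amounts to reverse-complementing the second read and joining the two where they overlap. The sketch below requires an exact overlap and ignores quality scores, which is a strong simplification of what real paired-end assemblers do; the function names and the minimum overlap are assumptions.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(forward: str, reverse: str, min_overlap: int = 10):
    """Merge a read pair on the longest exact suffix/prefix overlap.

    Returns the merged sequence, or None if no overlap of at least
    min_overlap bases is found.
    """
    rc = reverse_complement(reverse)
    max_len = min(len(forward), len(rc))
    for size in range(max_len, min_overlap - 1, -1):
        if forward[-size:] == rc[:size]:
            return forward + rc[size:]
    return None


# Toy amplicon (made up): read 1 covers the first 18 bases,
# read 2 is sequenced from the other end and covers the last 15.
amplicon = "ACGTTGCAAGGCTTAACCGGTTAGC"
read1 = amplicon[:18]
read2 = reverse_complement(amplicon[10:])
print(merge_pair(read1, read2, min_overlap=5))
```

Because the merge depends on the overlapping ends agreeing, a low-quality tail on either read can prevent assembly, which is why both reads need to be of good quality.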
itags clustering
• Sort by alphabetical order and collapse reads at 100% identity (reduces dataset by 80%)
• Cluster at 97% identity
• Number of reads >> number of clusters
Credit: Edward Kirton, JGI
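Sorting makes 100% identity dereplication cheap: after an alphabetical sort, identical reads sit next to each other and can be counted in a single pass. A minimal sketch of that idea follows; it is not the SeqObs implementation, and the toy reads are made up.

```python
from itertools import groupby


def dereplicate_sorted(reads):
    """Collapse identical reads by sorting, then counting adjacent runs.

    Returns (sequence, count) pairs, most abundant first.
    """
    collapsed = [(seq, sum(1 for _ in run)) for seq, run in groupby(sorted(reads))]
    return sorted(collapsed, key=lambda item: (-item[1], item[0]))


reads = ["ACGT", "TTGA", "ACGT", "ACGT", "TTGA", "GGCC"]
unique = dereplicate_sorted(reads)
reduction = 1 - len(unique) / len(reads)
print(unique)
print(f"dataset reduced by {reduction:.0%}")
```

Only the unique sequences, with their counts, then need to go through the more expensive 97% clustering.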
Illumina rRNA Amplicon Sequencing
[Figure: number of sequences (0 to 30,000,000; also shown as number of reads in millions, 0 to 30) remaining at each pipeline stage: RAW, BARCODE, OVERLAP, ASSEM, CLUSTERS. Annotation: “Clustering happens here!” Credit: Edward Kirton, JGI]
SeqObs Datasize vs Runtimes
[Figure: runtime in minutes vs. millions of sequences (axis ticks 0 to 25)]
Benefits of parallelization
[Figure: processing time (min., ticks 50 to 200) vs. number of reads (millions, ticks 0 to 25). Credit: Edward Kirton, JGI]
MiSeq validation
• Exploratory experiments using 11 wetlands samples.
• Validate reproducibility between runs
MiSeq validation
• Beta diversity (UniFrac distances): Run 1 vs. Run 2
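Reproducibility between runs can be checked by computing a between-sample distance on OTU counts. UniFrac needs a phylogenetic tree, so the sketch below swaps in Bray-Curtis dissimilarity, a simpler count-based measure, purely to illustrate the comparison; the OTU names and counts are invented.

```python
def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity between two OTU count dicts (0 = identical)."""
    otus = set(counts_a) | set(counts_b)
    shared = sum(min(counts_a.get(o, 0), counts_b.get(o, 0)) for o in otus)
    total = sum(counts_a.values()) + sum(counts_b.values())
    if total == 0:
        return 0.0
    return 1.0 - 2.0 * shared / total


# Toy OTU tables for the same sample sequenced twice, plus an unrelated sample.
run1 = {"otu_a": 10, "otu_b": 5}
run2 = {"otu_a": 10, "otu_b": 5}
other = {"otu_c": 7}
print(bray_curtis(run1, run2))   # identical profiles
print(bray_curtis(run1, other))  # no shared OTUs
```

Replicate runs of the same sample should give values near 0, while unrelated samples approach 1.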
itags
Validating SeqObs output by comparing with PyroTagger results
Samples: synthetic communities, termite gut, surface sediments, compost, sludge
• 454: PyroTagger (V8 region)
• Illumina GAIIx: SeqObs pipeline (V4, V5 and V9 regions)
• Illumina MiSeq: SeqObs pipeline (V4 region)
Comparing 454 with Illumina
GAIIx vs 454 region
Comparing 454 with Illumina
• The choice of primer pair (and therefore of variable region) is likely to affect the results.
• In silico PCR on the 16S Greengenes database.
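In silico PCR can be sketched as searching each reference sequence for the forward primer and for the reverse complement of the reverse primer, then extracting what lies between them. The sketch uses exact IUPAC matching with no mismatches allowed, and the template and primers are toy sequences, not real 16S primers.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
_COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def primer_regex(primer):
    """Regex pattern that matches every sequence a degenerate primer encodes."""
    return "".join(IUPAC[base] for base in primer.upper())


def reverse_complement(seq):
    return seq.upper().translate(_COMPLEMENT)[::-1]


def in_silico_pcr(template, forward_primer, reverse_primer):
    """Return the amplicon (primers included) or None if either primer fails."""
    fwd = re.search(primer_regex(forward_primer), template)
    if fwd is None:
        return None
    # The reverse primer anneals to the opposite strand, so look for its
    # reverse complement downstream of the forward primer site.
    rev_site = primer_regex(reverse_complement(reverse_primer))
    rev = re.search(rev_site, template[fwd.end():])
    if rev is None:
        return None
    return template[fwd.start():fwd.end() + rev.end()]


template = "GGGACGTACCCCCCTTGCAGGGG"  # toy reference sequence
print(in_silico_pcr(template, "ACGTR", "CTGCAA"))
```

Running such a search over a whole reference database shows which taxa a given primer pair would miss, which is one way primer choice biases the observed community.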
itags – confidence level
[Figure: E values by read type]
• 454: 220 bp
• GAIIx: ~110 bp
• MiSeq 5’ reads: 150 bp
• MiSeq assembled reads: ~250 bp
Challenges
• Short size of amplicon
• What filtering parameters to use (stringency level)?
  • Balance between filter stringency and keeping as much data as we can
• Whole new dimension for rare biosphere?
• Handling large numbers of samples (tens of thousands)
• Cost of barcoded primers (will need lots of barcodes), handling
• Huge amount of samples: statistical models…
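The stringency trade-off can be made concrete by sweeping a quality cutoff and watching how much data survives. The sketch assumes a mean Phred score filter, which is only one of several possible filters and is not stated in the talk; the reads and scores are invented.

```python
def mean_quality(quals):
    """Mean Phred score of a read (0.0 for an empty read)."""
    return sum(quals) / len(quals) if quals else 0.0


def retained_fraction(reads, min_mean_quality):
    """Fraction of reads whose mean Phred score meets the cutoff.

    reads: list of (sequence, list_of_phred_scores) pairs.
    """
    if not reads:
        return 0.0
    kept = sum(1 for _, quals in reads if mean_quality(quals) >= min_mean_quality)
    return kept / len(reads)


toy_reads = [
    ("ACGT", [38, 37, 36, 35]),  # mean 36.5
    ("ACGA", [30, 30, 28, 24]),  # mean 28.0
    ("TCGA", [20, 18, 15, 11]),  # mean 16.0
    ("GGCA", [35, 33, 12, 8]),   # mean 22.0
]
for cutoff in (15, 20, 25, 30):
    print(cutoff, retained_fraction(toy_reads, cutoff))
```

Each step up in the cutoff discards more reads, so the useful question is where the loss of data starts to outweigh the reduction in spurious OTUs.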
Acknowledgments
• Susannah Tringe
• Edward Kirton
• Feng Chen
• Kanwar Singh
• Rob Knight lab (Univ. of Colorado)
Thanks!
16S rRNA
Dangl lab, UNC