Advancing Science with DNA Sequence
Natalia Ivanova
MGM Workshop
September 12, 2012
Metagenome analysis: use case
Minoan eruption and metagenomics
…it seemed as though the sea was being sucked backwards, as if it were being pushed back by the shaking of the land…Behind us were frightening dark clouds, rent by lightning twisted and hurled, opening to reveal huge figures of flame. These were like lightning, but bigger.
From Pliny the Younger’s Letter
Apart from Minoan eruption…
from Chernicoff & Stanley, Geology, 2007
Diagram by Gary Massoth/PMEL
Sampling sites
white mat
red mat
Key gradients white vs red:
Temperature 60 vs 18 °C
CO2 tension >99% vs <1%
This is what it looks like
Chimney material may be of biological origin
Standard JGI metagenome pipeline
DNA sample
DNA QC
SSU pyrotags
shotgun libraries
http://pyrotagger.jgi-psf.org
Community composition
Semi-quantitative – OTU abundance
Illumina long mate pair
Illumina standard
454 standard
454 long mate pair
Metagenome IMG/M-ER
contigs + unassembled reads
Community composition
Functional analysis
Assembly
Analysis
Pyrotag results – BLASTn against Greengenes database
[Figure: bar chart, "Pyrotags - phylum level, filtered at 0.1% of all clusters". Vertical axis: phylum; horizontal axis: % pyrotag clusters (0 to 40). Series: Kolumbo_volcano_white, Kolumbo_volcano_red.]
Phyla shown: Proteobacteria, Bacteroidetes, Planctomycetes, Chloroflexi, Marine_group_A, Thaumarchaeota, Unknown, OP11, Acidobacteria, Actinobacteria, Caldithrix_KSB1, OP3, OP8, WS3, Verrucomicrobia, Chlorobi, Gemmatimonadetes, MBMPE71, Nitrospirae, pMC2A209, ABY1_OD1, Thermoplasmata_Eury, Firmicutes, NKB19, VHS-B5-50, Spirochaetes, pMC1, Lentisphaerae, OP5, TM6, MAT-CR-M3-H11, Chlamydiae, TM7, DHVE3, Cyanobacteria, pMC2A384, pMC2A15, WS6, SM2F11, C2, BRC1, Thermotogae, Thermosulfidobacterium, EM3
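The 0.1% filter named in the chart title can be sketched as follows. This is a minimal illustration, assuming the filter drops any phylum whose share of all pyrotag clusters is below 0.1%; the function name `phylum_percentages` and the cluster counts are hypothetical and are not taken from the workshop data.

```python
def phylum_percentages(cluster_counts, min_pct=0.1):
    """Convert per-phylum cluster counts to percentages of all clusters
    and drop phyla below min_pct (assumed meaning of "filtered at 0.1%")."""
    total = sum(cluster_counts.values())
    if total == 0:
        return {}
    percentages = {p: 100.0 * n / total for p, n in cluster_counts.items()}
    return {p: pct for p, pct in percentages.items() if pct >= min_pct}


# Hypothetical counts, for illustration only.
example_counts = {"Proteobacteria": 9000, "Bacteroidetes": 995, "BRC1": 5}
kept = phylum_percentages(example_counts)
for phylum, pct in sorted(kept.items(), key=lambda kv: -kv[1]):
    print(f"{phylum}\t{pct:.2f}%")
```

In this made-up example BRC1 holds 0.05% of the clusters and is dropped, while the other two phyla are kept.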
PhyloDistribution results – BLASTp of metagenome CDSs against isolates in IMG
[Figure: bar chart, "PhyloDistribution of CDSs - phylum level, filtered at 0.1% abundance". Vertical axis: phylum; horizontal axis: % CDS hits (0% to 50%). Series: Kolumbo_volcano_white_grey, Kolumbo_volcano_red.]
Phyla shown: Proteobacteria, Bacteroidetes, Planctomycetes, Chloroflexi, Thaumarchaeota, Acidobacteria, Actinobacteria, Caldithrix_KSB1, Verrucomicrobia, Chlorobi, Thermoplasmata_Eury, Firmicutes, Spirochaetes, Cyanobacteria, Thermotogae, Thermosulfidobacterium, Lentisphaerae
Pyrotags vs PhyloDistribution – white mat
[Figure: bar chart by phylum; horizontal axis 0% to 50%. Series: Kolumbo_volcano_white_grey_PhyloDist, Kolumbo_volcano_white_grey_Pyro.]
Phyla shown: Bacteroidetes, Proteobacteria, Thermotogae, OP5, Planctomycetes, Thermosulfidobacteri, EM3, Thaumarchaeota, Thermoplasmata_Eury, Verrucomicrobia, Chlorobi, Unknown, Acidobacteria, Actinobacteria, Caldithrix_KSB1, Lentisphaerae, Marine_group_A
Big differences in abundance (an order of magnitude or more) of Bacteroidetes and Thermotogae
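An "order of magnitude or more" discrepancy between the two methods can be screened for with a log-ratio check. The sketch below is illustrative only: the function name `large_discrepancies` and all percentages are hypothetical, since the slide's numeric values are not recoverable from the figure.

```python
import math


def large_discrepancies(pyrotag_pct, phylodist_pct, min_fold=10.0):
    """Return phyla whose abundance differs by at least min_fold between
    the two methods, with the signed log10 ratio (pyrotag / PhyloDist)."""
    flagged = {}
    for phylum in sorted(set(pyrotag_pct) & set(phylodist_pct)):
        a, b = pyrotag_pct[phylum], phylodist_pct[phylum]
        if a <= 0 or b <= 0:
            continue  # ratio undefined; phyla absent from one method need separate handling
        log_ratio = math.log10(a / b)
        if abs(log_ratio) >= math.log10(min_fold):
            flagged[phylum] = log_ratio
    return flagged


# Hypothetical percentages, for illustration only (not the slide's values).
pyro = {"Bacteroidetes": 2.0, "Thermotogae": 0.5, "Proteobacteria": 30.0}
phylo = {"Bacteroidetes": 40.0, "Thermotogae": 8.0, "Proteobacteria": 25.0}
print(large_discrepancies(pyro, phylo))
```

A negative log ratio means the phylum is rarer in the pyrotag data than in the PhyloDistribution of CDSs.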
Possible explanations
• Amplification artifacts in pyrotags – well known for metagenome data
• Sequencing GC bias in the metagenome – low- and high-GC sequences (<30% and >65% GC) are underrepresented in Illumina data
• K-mer assembler problems: abundant populations may be underrepresented in the assembly if incorrect k-mer/coverage parameters are selected
• Primer bias in pyrotags (against Proteobacteria)?
PCR artifacts in metagenome data
454 technology includes an emulsion PCR (emPCR) step, which may lead to artificial overrepresentation of certain sequences
Reason: presence of free beads during the library prep step; escaped emPCR products bind to free beads and are disproportionately amplified
What about GC bias?
[Figure panels: Low GC (Brachyspira); Medium GC (Arcanobacterium); High GC (Cellulomonas)]
Question: how do you find average/max/min GC content for a clade?
Answer: IMG=>Genome Browser=>View Phylogenetically=>click on green + to select the clade, then “Add selected to Genome Cart”=>Compare Genomes=>Genome Statistics
Result:
Thermotogae GC percent: 41 average / 47 max / 31 min
Bacteroidetes GC percent: 42.5 average / 66 max / 31 min
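The GC question can also be checked with a few lines of code. The sketch below computes GC percent for a sequence and tests the clade statistics quoted above against the <30% and >65% ranges mentioned earlier for Illumina data. The function names are hypothetical, and treating those two cutoffs as hard thresholds is an assumption made for illustration.

```python
def gc_percent(sequence):
    """GC content of a nucleotide sequence in percent (ignores N and gaps)."""
    seq = sequence.upper()
    gc = sum(seq.count(base) for base in "GC")
    at = sum(seq.count(base) for base in "AT")
    if gc + at == 0:
        raise ValueError("no A/C/G/T bases in sequence")
    return 100.0 * gc / (gc + at)


def in_biased_range(gc, low=30.0, high=65.0):
    """True if GC% falls in the ranges described as underrepresented
    in Illumina data (<30% or >65%)."""
    return gc < low or gc > high


# Clade statistics quoted in the Result above: (min, average, max) GC percent.
clades = {"Thermotogae": (31, 41, 47), "Bacteroidetes": (31, 42.5, 66)}
for name, (gc_min, gc_avg, gc_max) in clades.items():
    flags = [in_biased_range(v) for v in (gc_min, gc_avg, gc_max)]
    print(name, dict(zip(("min", "avg", "max"), flags)))
```

With these cutoffs both clade averages fall inside the 30-65% window, and only the Bacteroidetes maximum (66%) lies just outside it, so GC bias alone looks like a weak candidate for a clade-wide discrepancy.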
Let’s take a closer look at the unassembled reads
                                              White mat      Red mat
454 reads total                                 299,975    1,429,091
Illumina reads total                         49,227,146   45,337,178
Assembled contigs                               195,590       88,776
N50, bp                                             659          869
Longest contig, bp                               28,145       75,483
Illumina reads mapped to assembly, % total         42.3         12.5
454 reads mapped to assembly, % total              62.1         15.3
[Figure: bar chart by phylum; horizontal axis 0 to 70. Series: 454, Illumina.]
Phyla shown: Actinobacteria, Aquificae, Bacteroidetes, Chlorobi, Chloroflexi, Cyanobacteria, Euryarchaeota, Firmicutes, Planctomycetes, Proteobacteria, Spirochaetes, Thaumarchaeota, Thermotogae, Verrucomicrobia
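Two quantities in the table above are easy to reproduce: N50 from a list of contig lengths, and the approximate number of reads left out of the assembly from the mapped percentage. The sketch below uses the standard N50 definition; the helper names and the contig lengths are hypothetical, while the read totals and percentages come from the table.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half
    of the total assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contigs given")


def unmapped_reads(total_reads, mapped_pct):
    """Approximate number of reads not mapped to the assembly."""
    return total_reads - round(total_reads * mapped_pct / 100)


# Hypothetical contig lengths, for illustration only.
print(n50([80, 70, 50, 40, 30, 20]))

# Read totals and mapped percentages from the table above.
print(unmapped_reads(49_227_146, 42.3))   # white mat, Illumina
print(unmapped_reads(45_337_178, 12.5))   # red mat, Illumina
```

Because the percentages are rounded to one decimal place, the unmapped counts are estimates rather than exact values.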
It’s pyrotag bias after all!
• JGI uses primer pair 946F-1492R
1492R primer: TACGCYTACCTTGTTACGACTT
Sequence in the metagenome: TACGGTTACCTTGTTACGACTT
• CG mismatch
• JGI did extensive testing on artificial communities – this problem not detected
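The mismatch can be located programmatically by comparing the degenerate primer with the metagenome sequence position by position, using the standard IUPAC nucleotide codes. This is a minimal sketch: the function name is hypothetical, and it assumes the two 22-nt sequences are written in the same orientation and are already aligned, as they appear on the slide.

```python
# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_mismatches(primer, target):
    """Return (position, primer_code, target_base) for every position where
    the target base is not covered by the (possibly degenerate) primer code.
    Positions are 1-based; both sequences must be aligned and equal length."""
    if len(primer) != len(target):
        raise ValueError("sequences must have the same length")
    return [
        (i, p, t)
        for i, (p, t) in enumerate(zip(primer.upper(), target.upper()), start=1)
        if t not in IUPAC[p]
    ]


primer_1492r = "TACGCYTACCTTGTTACGACTT"      # 1492R primer from the slide
metagenome_seq = "TACGGTTACCTTGTTACGACTT"    # sequence in the metagenome
print(primer_mismatches(primer_1492r, metagenome_seq))
```

The only uncovered position is 5 (primer C, metagenome G); the degenerate Y at position 6 covers the T in the metagenome sequence.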
Functional analysis: metagenome as a bag of functions
• Red mat is taxonomically more diverse
• Is it more diverse functionally?
               White mat   Red mat
COG clusters        3631      3402
Pfam clusters       3847      3505
Question: where do you find this information?
Answer: IMG=>Taxon Details=>Metagenome Statistics; Genes with Pfam=>Display as a list=>Export
[Figure: rarefaction curves. Horizontal axis: Specimens (10000 to 80000); vertical axis: Taxa (95% confidence) (0 to 3600).]
Rarefaction curves: white mat is expected to have ~4000 different Pfams; red mat ~3600
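For readers unfamiliar with rarefaction, the idea can be sketched with the classical hypergeometric formula: the expected number of distinct families seen when a fixed number of genes is drawn at random without replacement. This is an illustrative assumption, not necessarily the method or software behind the slide's curves, and the per-family gene counts are hypothetical.

```python
from math import comb


def expected_families(counts, sample_size):
    """Expected number of distinct families in a random subsample of
    sample_size genes drawn without replacement (hypergeometric rarefaction)."""
    total = sum(counts)
    if not 0 <= sample_size <= total:
        raise ValueError("sample_size must be between 0 and the total count")
    all_draws = comb(total, sample_size)
    # A family with n genes is missed with probability C(total-n, k) / C(total, k).
    return sum(1 - comb(total - n, sample_size) / all_draws for n in counts)


# Hypothetical gene counts per Pfam family, for illustration only.
pfam_counts = [50, 30, 10, 5, 3, 1, 1]
for k in (1, 10, 50, 100):
    print(k, round(expected_families(pfam_counts, k), 2))
```

Plotting the expected count against the subsample size gives the rarefaction curve; a curve that is still rising at the full sample size suggests more families would be found with deeper sequencing.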
Abundance Comparisons
Motility and chemotaxis genes are overrepresented in white mat (detected by both Pfams and COG Categories)
[Figure panels: white mat; red mat]
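One common way to ask whether a functional category is overrepresented in one sample is a two-proportion z-test on gene counts. The sketch below is an assumption made for illustration; it is not necessarily the statistic used by IMG's abundance comparison tools, and the function name is hypothetical.

```python
import math


def two_proportion_z(hits_a, total_a, hits_b, total_b):
    """Two-sided z-test for a difference between two proportions.
    Returns (z, p_value); z > 0 means the category is relatively
    more frequent in sample A."""
    p_a, p_b = hits_a / total_a, hits_b / total_b
    pooled = (hits_a + hits_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        return 0.0, 1.0  # category absent from (or universal in) both samples
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

For example, `two_proportion_z(300, 10000, 100, 10000)` compares a category found in 300 of 10,000 genes in one sample with 100 of 10,000 in the other (made-up counts). When many categories are tested at once, the p-values would also need a multiple-testing correction.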
Is motility/chemotaxis common to all organisms in white mat?
• Scenario 1: the function/pathway is overrepresented because it is present in all members of the community, possibly at higher copy number
• Scenario 2: the function/pathway is overrepresented because it is present in one clade, which is absent from the second sample
Question: can we distinguish between the two scenarios?
Answer: click on the gene count for protein family/functional category, add all genes to Gene Cart=>add scaffolds to Scaffold Cart=>PhyloDistribution of all scaffolds in the Scaffold Cart
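Once the phylum assignment of each scaffold carrying the genes of interest has been exported, the two scenarios can be told apart by looking at how those genes are spread over phyla. The sketch below is a rough heuristic with hypothetical names, data, and an arbitrary 0.8 dominance cutoff; it is not an IMG feature.

```python
from collections import Counter


def clade_shares(gene_phyla):
    """Fraction of a function's genes assigned to each phylum."""
    if not gene_phyla:
        raise ValueError("no genes given")
    counts = Counter(gene_phyla)
    total = sum(counts.values())
    return {phylum: n / total for phylum, n in counts.items()}


def likely_scenario(gene_phyla, dominance=0.8):
    """Return 2 if a single phylum holds at least `dominance` of the genes
    (function carried by one clade), otherwise 1 (function spread across
    the community). The 0.8 cutoff is arbitrary."""
    shares = clade_shares(gene_phyla)
    return 2 if max(shares.values()) >= dominance else 1


# Hypothetical phylum assignments for genes of one functional category.
spread = ["Proteobacteria"] * 40 + ["Bacteroidetes"] * 35 + ["Thermotogae"] * 25
single = ["Thermotogae"] * 90 + ["Proteobacteria"] * 10
print(likely_scenario(spread), likely_scenario(single))
```

A fuller analysis would compare each phylum's share of the function's genes with its share of all genes in the sample, since an abundant clade dominates most categories by default.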
Carbon fixation pathways
Conclusions
The two communities differ in composition; the white mat, sampled next to the hydrothermal vent, has lower complexity
Community composition as sampled by pyrotags and the metagenome may be quite different due to a number of biases
Some protein families/functional categories are more abundant because of different community composition, and not because they are more important