Divide and conquer applied to metagenomic DNA
C. Titus Brown ([email protected])
CSE / MMG, Michigan State University
A brief intro to shotgun assembly

Overlapping fragments:

It was the best of times, it was the wor
, it was the worst of times, it was the
isdom, it was the age of foolishness
mes, it was the age of wisdom, it was th

Assembled:

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness

…but for 2 bn+ fragments. Not subdivisible; not easy to distribute; memory intensive.
the quick brown fox jumped
jumped over the lazy dog
the quick brown fox jumped over the lazy dog
na na na, batman!
my chemical romance: na na na
Repeats do cause problems:
Assemble based on word overlaps:
Whole genome shotgun sequencing & assembly
Randomly fragment & sequence from DNA; reassemble computationally.
UMD assembly primer (cbcb.umd.edu)
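The word-overlap idea above can be sketched as a toy greedy assembler: repeatedly merge the pair of fragments with the longest suffix/prefix overlap. This is only an illustration of the concept, not how any real assembler works:

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(fragments, min_len=3):
    """Merge fragments greedily by best overlap until none remain."""
    frags = list(fragments)
    while len(frags) > 1:
        best = (0, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no remaining overlaps above the threshold
        merged = frags[i] + frags[j][n:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]
    return frags

print(greedy_assemble(["the quick brown fox jumped",
                       "jumped over the lazy dog"]))
# -> ['the quick brown fox jumped over the lazy dog']
```

The all-pairs overlap search here is quadratic in the number of fragments, which is exactly why this approach does not survive contact with 2 bn+ reads.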
How does assembly scale?
• Our assembly approach scales with the amount of genomic novelty present in the sample.
• For “sane” problems (microbes, human genome, etc.) this isn’t too bad, although challenging.
• For metagenomes, with millions of different species at different abundances, this is an intractable problem (so far)…
Great Plains Grand Challenge – Sampling sites

• Wisconsin
  – Native prairie (Goose Pond, Audubon)
  – Long term cultivation (corn)
  – Switchgrass rotation (previously corn)
  – Restored prairie (from 1998)
• Iowa
  – Native prairie (Morris prairie)
  – Long term cultivation (corn)
• Kansas
  – Native prairie (Konza prairie)
  – Long term cultivation (corn)
Iowa Native Prairie
Switchgrass (Wisconsin)
Iowa >100 yr tilled
Sampling strategy per site
Reference soil
Soil cores: 1 inch diameter, 4 inches deep
Total:
8 Reference metagenomes +
64 spatially separated cores (pyrotag sequencing)
[Sampling grid diagram: nested sampling scales of 10 m, 1 m, and 1 cm.]
454 Titanium Shotgun sequencing
Illumina shotgun sequencing
Soil Metagenome
Community composition
454 Titanium Pyrotag sequencing
What kinds of questions?
• What genes are present?
• What species are present?
• What are those species doing, physiologically speaking?
• How does “function” change with cultivation, CO2, fertilizer types, crop cycles, etc.?
We are at a “pre-question” stage, unfortunately…
Great Prairie Sequencing Summary – Illumina whole metagenome shotgun

[Bar chart: base pairs of sequencing (Gbp), 0–350, by platform (GAII vs HiSeq), for Iowa continuous corn, Iowa native prairie, Kansas cultivated corn, Kansas native prairie, Wisconsin continuous corn, Wisconsin native prairie, Wisconsin restored prairie, and Wisconsin switchgrass.]
The basic problem.
• Lots of metagenomic sequence data(200 GB Illumina for < $20k?)
• Assembly, especially metagenome assembly, scales poorly (due to high diversity).
• Standard assembly techniques don’t work well with sequences from multiple abundance genomes.
• Many people don’t have the computational resources needed to assemble (~1 TB of RAM or more), if they can assemble at all.
We can’t just throw more hardware at the problem…
Lincoln Stein
Hat tip to Narayan Desai / ANL
We don’t have enough resources or people to analyze data.
Data generation vs data analysis
It now costs about $10,000 to generate a 200 GB sequencing data set (DNA) in about a week.
(Think: resequencing human; sequencing expressed genes; sequencing metagenomes, etc.)
…x1000 sequencers
Many useful analyses do not scale linearly in RAM or CPU with the amount of data.
The challenge:
Massive (and increasing) data generation capacity, operating at a boutique level, with
algorithms that are wholly incapable of scaling to the data volume.
Note: cloud computing isn’t a solution to a sustained scaling problem!! (See: Moore’s Law slide)
Awesomeness
Easy stuff like Google Search
Life’s too short to tackle the easy problems – come to academia!
Assembly of shotgun sequence

Overlapping fragments:

It was the best of times, it was the wor
, it was the worst of times, it was the
isdom, it was the age of foolishness
mes, it was the age of wisdom, it was th

Assembled:

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness

…but for 2 bn+ fragments. Not subdivisible; not easy to distribute; memory intensive.
the quick brown fox jumped
jumped over the lazy dog
the quick brown fox jumped over the lazy dog
na na na, batman!
my chemical romance: na na na
Repeats do cause problems:
Assemble based on word overlaps:
Whole genome shotgun sequencing & assembly
Randomly fragment & sequence from DNA; reassemble computationally.
UMD assembly primer (cbcb.umd.edu)
K-mer graphs - overlaps
J.R. Miller et al. / Genomics (2010)
K-mer graphs - branching
Biology-based heuristics also come into play for decisions about which paths to follow.
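The k-mer graph construction can be sketched in a few lines: nodes are k-mers, edges connect k-mers that overlap by k-1 characters, and branch points (nodes with more than one successor) are where path decisions arise. A minimal illustration with a made-up pair of reads, not any assembler's actual data structure:

```python
from collections import defaultdict

def kmer_graph(reads, k):
    """Build a k-mer graph: each k-mer points to its (k-1)-overlap successors."""
    edges = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            edges[read[i:i+k]].add(read[i+1:i+k+1])
    return edges

# Two reads that agree on a prefix, then diverge -> a branch in the graph.
g = kmer_graph(["ATGCGT", "ATGCAT"], k=3)
branches = {node: succs for node, succs in g.items() if len(succs) > 1}
print(branches)
# "TGC" has two successors (GCG and GCA): the assembler must choose a path here.
```

Real de Bruijn assemblers additionally track coverage per edge and use it (plus the biology-based heuristics mentioned above) to decide which branch to follow.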
Great Prairie Sequencing Summary – Illumina whole metagenome shotgun

[Bar chart: base pairs of sequencing (Gbp), 0–350, by platform (GAII vs HiSeq), for Iowa continuous corn, Iowa native prairie, Kansas cultivated corn, Kansas native prairie, Wisconsin continuous corn, Wisconsin native prairie, Wisconsin restored prairie, and Wisconsin switchgrass.]
Billions and billions of …

>850:2:1:1943:15232/1 0
CCTGCCTGTGGAGCAGCCCACGCAGTTCGAGCTGATCATCAACCTCAAGACGGCCCAAGCCCTTGGCATCACGATT
>850:2:1:1943:15232/2 0
ACACCATTTAATCTTAGCCATAAAAGTTGTATAAGCATCAACGTTTTGTTTGTCTCAAAAAACGATTTTTTTTTTG
>850:2:1:1943:19543/1 0
ACTGTAGGTTTCTGGCTGCGTCCGACGATAGCAGCCCGCTCTGCCGACATTGTCA
>850:2:1:1945:16822/2 0
AGTCGACAGATCGACCTGAAGGAGGTGCCGGGAATTGAAGTCATCCAGGGCGCCGAGGAGAACTGATCGG
>850:2:1:1946:10202/2 0
AGCTTTTTCGCGCGCGTGAAAAAGCTTTGTCGATTTCTGGGTTTCGGCCTTCTCACAGTCACCGCCGAGGGCCGGG
>850:2:1:1947:6533/2 0
GGTCTCCGGACACACGAAGGCACGGCTCTCCGAGAAGCGGAGGATGTACTCGACCTCACGGCTGC
>850:2:1:1948:15431/1 0
ACCGCTTACTCGATGATGGAGCAAGGCAGAATCGACATGATTCTGAGCTCGCGTCCCGAAGATCGACGCGCGG
>850:2:1:1949:19998/1 0
AATTCAAAGTAGGCATTTTTGTTTTTGTAGGGTTGGCGATGTTAGGCGCGCTGGTCGTGCAATTC
>850:2:1:1950:4213/2 0
CCAACCGGGCCCTGGTCCTGCACGCCAACCTGTCCCCGCTGGTGG
>850:2:1:1950:1388/1 0
CAGCCGCAATGTTGGCATTCTTCAGCAGTTCGAGCGCCACAAAGCGGTCATTGTCTGAGGCTTCTGGG
Too much data – what can we do?
• Reduce the size of the data (either with an approximate or an exact approach)
• Divide & conquer: subdivide the problem.
• For exact data reduction or subdivision, need to grok the entire assembly graph structure.
• …but that is why assembly scales poorly in the first place.
Two exact data reduction techniques:
• Eliminate reads that do not connect to many other reads.
• Group reads by connectivity into different partitions of the entire graph.
For k-mer graph assemblers like Velvet and ABYSS, these are exact solutions.
Eliminating unconnected reads
“Graphsize filtering”
Subdividing reads by connection
“Partitioning”
Two exact data reduction techniques:
• Eliminate reads that do not connect to many other reads (“graphsize filtering”).
• Group reads by connectivity into different partitions of the entire graph (“partitioning”).
For k-mer graph assemblers like Velvet and ABYSS, these are exact solutions.
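The partitioning idea — grouping reads by graph connectivity — can be sketched with union-find over reads, joining any two reads that share a k-mer. This shows only the concept; the actual khmer implementation traverses a Bloom-filter graph rather than hashing reads directly:

```python
def partition(reads, k):
    """Group reads into connected components: reads sharing any k-mer
    belong to the same component and can be assembled independently."""
    parent = list(range(len(reads)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    seen = {}  # k-mer -> first read id that contained it
    for rid, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer = read[i:i+k]
            if kmer in seen:
                union(rid, seen[kmer])
            else:
                seen[kmer] = rid

    groups = {}
    for rid in range(len(reads)):
        groups.setdefault(find(rid), []).append(rid)
    return list(groups.values())

reads = ["ATGGCGTA", "GCGTACCA",   # share the k-mer GCGT -> one partition
         "TTTTAACC"]               # shares nothing -> its own partition
print(partition(reads, 4))
```

Each resulting partition can then be handed to Velvet or ABYSS on its own, which is what makes the result exact: no overlap ever crosses a partition boundary.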
Engineering overview
• Built a k-mer graph representation based on Bloom filters, a simple probabilistic data structure;
• With this, we can store graphs efficiently in memory, ~1-2 bytes/(unique) k-mer for arbitrary k.
• Also implemented efficient global traversal of extremely large graphs (5-20 bn nodes).
For details see source code (github.com/ctb/khmer), or online webinar: http://oreillynet.com/pub/e/1784
Store graph nodes in Bloom filter
Graph traversal is done in full k-mer space;
Presence/absence of individual nodes is kept in Bloom filter data structure (hash tables w/o collision tracking).
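A minimal sketch of that storage scheme: keep each k-mer's presence in a Bloom filter (a bit array plus a few hash functions, with no collision tracking), so membership queries can give false positives but never false negatives. The class name and parameters here are illustrative, not khmer's actual API:

```python
import hashlib

class KmerBloom:
    """Toy Bloom filter for k-mer presence/absence (no counts, no removal)."""

    def __init__(self, n_bits=2**20, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, kmer):
        # Derive n_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.n_hashes):
            h = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.n_bits

    def add(self, kmer):
        for p in self._positions(kmer):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, kmer):
        # All bits set -> probably present; any bit clear -> definitely absent.
        return all(self.bits[p // 8] >> (p % 8) & 1
                   for p in self._positions(kmer))

bf = KmerBloom()
bf.add("GTCGTAGTTCAGTTGGTTAGAACGCCGGCCTG")
```

Because only bits are stored (no k-mer strings, no collision lists), memory stays at a few bits per k-mer regardless of k, which is the property that makes ~1-2 bytes per unique k-mer achievable.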
Practical application
• Enables:– graph trimming (exact removal)– partitioning (exact subdivision)– abundance filtering
• … all for K <= 64, for 200+ gb sequence collections.
• All results (except for comparison) obtained using a single Amazon EC2 4xlarge node, 68 GB of RAM / 8 cores.
• Similar running times to using Velvet alone.
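The abundance-filtering step listed above can be illustrated with exact k-mer counting (the real system uses probabilistic counts in the Bloom-filter structure); the function name and the median-abundance cutoff here are illustrative choices, not the khmer API:

```python
from collections import Counter
from statistics import median

def abundance_filter(reads, k, min_abund=2):
    """Drop reads whose median k-mer abundance is below min_abund
    (e.g. likely error-dominated singleton reads)."""
    counts = Counter(r[i:i+k]
                     for r in reads
                     for i in range(len(r) - k + 1))
    return [r for r in reads
            if median(counts[r[i:i+k]]
                      for i in range(len(r) - k + 1)) >= min_abund]

# The duplicated read survives; the singleton is dropped.
print(abundance_filter(["ATGGCA", "ATGGCA", "ACGTAC"], k=3))
```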
We pre-filter data for assembly:
Does removing small graphs work?
Small data set (35m reads / 3.4 gb rhizosphere soil sample)
Filtered at k=32, assembled at k=33 with ABYSS
                      N contigs   Total bp   Largest contig
Unfiltered (35m)      130         223,341    61,766
Filtered (2m reads)   130         223,341    61,766
YES.
Does partitioning into disconnected graphs work?
Partitioned same data set (35m reads / 3.5 gb) into 45k partitions containing > 10 reads; assembled partitions separately (k0=32, k=33).
                      N contigs   Total bp   Largest contig
Unfiltered (35m)      130         223,341    61,766
Sum partitions        130         223,341    61,766
YES.
Data reduction for assembly / practical details
Reduction performed on machine with 16 gb of RAM.

Removing poorly connected reads: 35m -> 2m reads.
- Memory required reduced from 40 gb to 2 gb;
- Time reduced from 4 hrs to 20 minutes.

Partitioning reads into disconnected groups:
- Biggest group is 300k reads
- Memory required reduced from 40 gb to 500 mb;
- Time reduced from 4 hrs to < 5 minutes/group.
Does it work on bigger data sets?
35 m read data set partition sizes:
P1: 277,043 reads
P2: 5776 reads
P3: 4444 reads
P4: 3513 reads
P5: 2528 reads
P6: 2397 reads
…
Iowa continuous corn GA2 partitions (218.5 m reads):
P1: 204,582,365 reads
P2: 3583 reads
P3: 2917 reads
P4: 2463 reads
P5: 2435 reads
P6: 2316 reads
…
Problem: big data sets have one big partition!?
• Too big to handle on EC2.
• Assembles with low coverage.
• Contains 2.5 bn unique k-mers (~500 microbial genomes), at ~3-5x coverage
• As we sequence more deeply, the “lump” becomes a bigger percentage of reads => trouble!
  – Both for our approach,
  – And possibly for assembly in general (because it assembles more poorly than it should, for given coverage/size)
Why this lump?
1. Real biological connectivity (rRNA, conserved genes, etc.)
2. Bug in our software
3. Sequencing artifact or error
Why this lump?
1. Real biological connectivity? Probably not.
   - Increasing K from 32 to ~64 didn’t break up the lump: not biological.
2. Bug in our software? Probably not.
   - We have a second, completely separate approach & implementation that confirmed the lump (bleu, by Rosangela Canino-Koning).
3. Sequencing artifact or error? YES.
   - (Note, we do filter & quality trim all sequences already.)
“Good” vs “bad” assembly graph
Low density
High density
Non-biological levels of local graph connectivity:
Higher local graph density correlates with position in read
ARTIFACT
Trimming reads

• Trim at high “soddd”, sum of degree of degree distribution:
  – From each k-mer in each read, walk two k-mers in all directions in the graph;
  – If more than 3 k-mers can be found at exactly two steps, trim the remainder of the sequence.

Overly stringent; actually trimming the (k-1) connectivity graph by degree.
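A hypothetical sketch of that trimming rule, with the graph stood in for by a plain dict of k-mer -> neighbors (the real code walks the Bloom-filter graph, and the choice to keep the offending k-mer itself is an assumption here):

```python
def two_step_count(kmer, neighbors):
    """Count distinct k-mers reachable in exactly two graph steps."""
    reached = set()
    for n1 in neighbors.get(kmer, ()):
        for n2 in neighbors.get(n1, ()):
            reached.add(n2)
    return len(reached)

def density_trim(read, k, neighbors, max_soddd=3):
    """Truncate a read at the first k-mer whose two-step neighborhood
    exceeds max_soddd (a non-biological density level)."""
    for i in range(len(read) - k + 1):
        if two_step_count(read[i:i+k], neighbors) > max_soddd:
            return read[:i + k]  # keep up to the dense k-mer, drop the rest
    return read

# Toy adjacency: "AAA" fans out to 4 distinct k-mers within two steps,
# so a read passing through it gets trimmed there.
toy = {"AAA": ["AAC", "AAG"], "AAC": ["ACA", "ACC"], "AAG": ["AGA", "AGG"]}
print(density_trim("TAAACC", 3, toy))
# -> "TAAA"
```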
Trimmed read examples
>895:5:1:1986:16019/2
TGAGCACTACCTGCGGGCCGGGGACCGGGTCAGCCTGCTCGACCTGGGCCAACCGATGCGCC
>895:5:1:1995:6913/1
TTGCGCGCCATGAAGCGGTTAACGCGCTCGGTCCATAGCGCGATG
>895:5:1:1995:6913/2
GTTCATCGCGCTATGGACCGAGCGCGTTAACCGCTTCATGGCGCGCAAAGATCGGAAGAGCGTCGTGTAG
Preferential attachment due to bias
• Any sufficiently large collection of connected reads will have one or more reads containing an artifact;
• These artifacts will then connect that group of reads to all other groups possessing artifacts;
• …and all high-coverage contigs will amalgamate into a single graph.
Artifacts from sequencing falsely connect graphs
Groxel view of knot-like region / Arend Hintze
Density trimming breaks up the lump:
Old P1, soddd trimmed (204.6 m reads -> 179 m):

P1: 23,444,332 reads
P2: 60,703 reads
P3: 48,818 reads
P4: 39,755 reads
P5: 34,902 reads
P6: 33,284 reads
…
Untrimmed partitioning (218.5 m reads):

P1: 204,582,365 reads
P2: 3583 reads
P3: 2917 reads
P4: 2463 reads
P5: 2435 reads
P6: 2316 reads
…
What does density trimming do to assembly?
204 m reads in lump: assembles into 52,610 contigs; total 73.5 MB
180 m reads in trimmed lump: assembles into 57,135 contigs; total 83.6 MB
(all contigs > 1kb)
Filtered/partitioned @k=32, assembled @ k=33, expcov=auto, cov_cutoff=0
Wait, what?
• Yes, trimming these “knot-like” sequences improves the overall assembly!
• We remove 25.6 m reads and gain 10.1 MB!?
• Trend is same for ABySS, another k-mer graph assembler, as well.
So what’s going on?
• Current assemblers are bad at dealing with certain graph structures (“knots”).
• If we can untangle knots for them, that’s good, maybe?
• Or, by eliminating locations where reads from differently abundant contigs connect, repeat resolution improves?
• Happens with other k-mer graph assemblers (ABYSS), and with at least one other (non-metagenomic) data set.
OK, let’s assemble!
Iowa corn (HiSeq + GA2): 219.11 Gb of sequence assembles to:
148,053 contigs, in 220 MB;
max length 20,322;
max coverage ~10x
…all done on Amazon EC2, ~ 1 week for under $500.
Filtered/partitioned @k=32, assembled @ k=33, expcov=auto, cov_cutoff=0
Full Iowa corn / mapping stats
• 1,806,800,000 QC/trimmed reads (1.8 bn)
• 204,900,000 reads map to some contig (11%)
• 37,244,000 reads map to contigs > 1kb (2.1%)
> 1 kb contig is a stringent criterion!
Compare: 80% of MetaHIT reads map to contigs > 500 bp;
65%+ of rumen reads map to contigs > 1 kb.
Success, tentatively.
We are still evaluating assembly and assembly parameters; should be possible to improve in every way.
(~10 hrs to redo entire assembly, once partitioned.)
The main engineering point is that we can actually run this entire pipeline on a relatively small machine
(8 core/68 GB RAM)
We can do dozens of these in parallel on Amazon rental hardware.
And, from our preliminary results, we get ~ equivalent assembly results as if we were scaling our hardware.
Conclusions
• Engineering: can assemble large data sets.
• Scaling: can assemble on rented machines.
• Science: can optimize assembly for individual partitions.
• Science: retain low-abundance sequences.
Caveats
Quality of assembly??
• Illumina sequencing bias/error issue needs to be explored.
• Scaffolding with Velvet causes systematic problems
• Regardless of Illumina-specific issue, it’s good to have tools/approaches to look at structure of large graphs.
Future thoughts
• Our pre-filtering technique always has lower memory requirements than Velvet or other assemblers. So it is a good first step to try, even if it doesn’t reduce the problem significantly.
• Divide & conquer approach should allow more sophisticated (compute intensive) graph analysis approaches in the future.
• This approach enables (in theory) assembly of arbitrarily large amounts of metagenomic DNA sequence.
• Can k-mer filtering work for non-de Bruijn graph assemblers? (SGA, ALLPATHS-LG, …)
• mRNAseq and genome artifact filtering?
Kmer -> GTCGTAGTTCAGTTGGTTAGAACGCCGGCCTG

747:3:13:7042:16004/1  GATATCTGCAATATCCCGTTCGAATGGGGTCGTAGTTCAGTTGGTTAGAACGCCGGCCTGTCACGCCGGAGGCC
747:3:14:10559:9771/1  GAAATTCCGGTTTGATGCGGAGTCGTAGTTCAGTTGGTTAGAACGCCGGCCTGTCACGTCGGAGGTCGCGGGTTCG
747:3:14:17232:4498/1  CAAATTTGAGATCTGAGATCCCAGGGGTTTGCGGAGTCGTAGTTCAGTTGGTTAGAACGCCGGCCTGTCACGTCGG
747:3:15:7871:10206/1  TTTGCGGAGTCGTAGTTCAGTTGGTTAGAACGCCGGCCTGTCACGTCGGAGGTCGCGGGTTCGAGTCCCGTCGG
747:3:16:17865:15895/2 TCAGGAGACGCCAGGGCGGTCTGAGTTCTTCAGGGGTCGTAGTTCAGTTGGTTAGAACGCCGGCCTGTCACGCCGG
747:3:27:9549:13966/1  GGAGTCGTAGTTCAGTTGGTTAGAACGCCGGCCTGTCACGTCGGAGGTCGCGGGTTCGAGTCCCGTCGGCTCCGCC
747:3:30:10672:3136/1  GCGGGGTCGTAGTTCAGTTGGTTAGAACGCCGGCCTGTCACGCCGGAGGTCGCGAGTTCGAGTCTCGTCGGCCC
Better artifact filtering?
All paths lead to the same k-mers
[Histogram of k-mer traversal counts: x-axis, number of times a k-mer is traversed (0–25); y-axis, number of k-mers (0–90,000).]
Estimating sequencing return on investment
• To reach ~rumen depth of sampling of top abundance organisms, would need ~1-2 TB
5x Sequencing Coverage (931 GB)
10x Sequencing Coverage (1900 GB)
<1% Novel Sequence
Argonne National Laboratory Institute for Genomic and Systems Biology
Earth Microbiome Project
www.earthmicrobiome.org
• Goal – to systematically approach the problem of characterizing microbial life on earth
• Paradigm shift to analyzing communities from a microbe’s perspective:
• Strategy:
  – Explore microbes in environmental parameter space
  – Design ‘ideal’ strategy to interrogate these biomes
  – Acquire samples and sequence broad and deep both DNA, mRNA and rRNA
  – Define microbial community structure and the protein universe
• Gilbert et al., 2010a,b Standards in Genomic Science, open access
• Challenges
  – 2.4 Quadrillion Base Pairs (2.4 Petabases) = 8000 HiSeq 2000 runs.
– Global Environmental Sample Database (GESI): identification and selection of 200,000 environmental samples, soil, air, marine and freshwater, host-associated, etc.
– The standardization of sampling, sample prep and sample processing, cataloging and sample metadata – Genomic Standards Consortium can help!
– The coordination of thousands of “volunteer” scientists for site characterization, sample collecting and processing
Acknowledgements:

The k-mer gang:
• Adina Howe
• Jason Pell
• Rosangela Canino-Koning
• Qingpeng Zhang
• Arend Hintze
Collaborators:
• Jim Tiedje (“the godfather”)
• Janet Jansson, Rachel Mackelprang, Regina Lamendella, Susannah Tringe, and many others (JGI)
• Charles Ofria (MSU)
Funding: USDA NIFA; MSU, startup and iCER; DOE; BEACON/NSF STC; Amazon Education.